Posted: July 9, 2012
Abrogans; earliest glossary (from Wikipedia)

There are many semantic technology terms relevant to the context of a semantic technology installation [1]. Some of these are general terms related to language standards, ontologies or the dataset concept.

ABox
An ABox (for assertions, the basis for A in ABox) is an “assertion component”; that is, a fact associated with a terminological vocabulary within a knowledge base. ABox statements are TBox-compliant statements about instances belonging to the concepts of an ontology.
Adaptive ontology
An adaptive ontology is a conventional knowledge representational ontology that has added to it a number of specific best practices, including modeling the ABox and TBox constructs separately; information that relates specific types to different and appropriate display templates or visualization components; use of preferred labels for user interfaces, as well as alternative labels and hidden labels; defined concepts; and a design that adheres to the open world assumption.
Administrative ontology
Administrative ontologies govern internal application use and user interface interactions.
Annotation
An annotation, specifically as an annotation property, is a way to provide metadata or to describe vocabularies and properties used within an ontology. Annotations do not participate in reasoning or coherency testing for ontologies.
Atom
The name Atom applies to a pair of related standards. The Atom Syndication Format is an XML language used for web feeds, while the Atom Publishing Protocol (APP for short) is a simple HTTP-based protocol for creating and updating Web resources.
Attributes
These are the aspects, properties, features, characteristics, or parameters that objects (and classes) may have. They are the descriptive characteristics of a thing. Key-value pairs match an attribute with a value; the value may be a reference to another object, an actual value or a descriptive label or string. In an RDF statement, an attribute is expressed as a property (or predicate or relation). In intensional logic, all attributes or characteristics of similarly classifiable items define the membership in that set.
Axiom
An axiom is a premise or starting point of reasoning. In an ontology, each statement (assertion) is an axiom.
Binding
Binding is the creation of a simple reference to something larger, more complicated and frequently used; the simple reference can then be used instead of repeating the larger thing each time.
Class
A class is a collection of sets or instances (or sometimes other mathematical objects) which can be unambiguously defined by a property that all of its members share. In ontologies, classes may also be known as sets, collections, concepts, types of objects, or kinds of things.
Closed World Assumption
CWA is the presumption that what is not currently known to be true, is false. CWA also has a logical formalization. CWA is the most common logic applied to relational database systems, and is particularly useful for transaction-type systems. In knowledge management, the closed world assumption is used in at least two situations: 1) when the knowledge base is known to be complete (e.g., a corporate database containing records for every employee), and 2) when the knowledge base is known to be incomplete but a “best” definite answer must be derived from incomplete information. See contrast to the open world assumption.
Data Space
A data space may be personal, collective or topical, and is a virtual “container” for related information irrespective of storage location, schema or structure.
Dataset
An aggregation of similar kinds of things or items, consisting mostly of instance records.
DBpedia
A project that extracts structured content from Wikipedia, and then makes that data available as linked data. There are millions of entities characterized by DBpedia in this way. As such, DBpedia is one of the largest — and most central — hubs for linked data on the Web.
DOAP
DOAP (Description Of A Project) is an RDF schema and XML vocabulary to describe open-source projects.
Description logics
Description logics and their semantics traditionally split the treatment of concepts and their relationships from the treatment of instances and their attributes and roles, the latter expressed as fact assertions. The concept split is known as the TBox and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The instance split is known as the ABox and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with respect to the TBox concepts.
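As a minimal sketch of this split (using the open-source rdflib library in Python, with a made-up example vocabulary), TBox statements define concepts and their relationships while ABox statements assert facts about a particular instance:

```python
# A minimal sketch of the TBox/ABox split using rdflib and a hypothetical
# example vocabulary: schema statements vs. instance assertions.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# TBox: terminological statements about concepts and their relationships
g.add((EX.Mammal, RDF.type, RDFS.Class))
g.add((EX.Dog, RDFS.subClassOf, EX.Mammal))

# ABox: assertions about a particular instance, compliant with the TBox
g.add((EX.fido, RDF.type, EX.Dog))
g.add((EX.fido, RDFS.label, Literal("Fido")))

print(g.serialize(format="turtle"))
```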
Domain ontology
Domain (or content) ontologies embody more of the traditional ontology functions such as information interoperability, inferencing, reasoning and conceptual and knowledge capture of the applicable domain.
Entity
An individual object or member of a class; when affixed with a proper name or label, it is also known as a named entity (thus, named entities are a subset of all entities).
Entity–attribute–value model
EAV is a data model to describe entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. In the EAV data model, each attribute-value pair is a fact describing an entity. EAV systems trade off simplicity in the physical and logical structure of the data for complexity in their metadata, which, among other things, plays the role that database constraints and referential integrity do in standard database designs.
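A small illustration of the EAV pattern, using plain Python and hypothetical clinical-style data, shows how each row is a single entity–attribute–value fact and how sparse, entity-specific attributes are accommodated without schema changes:

```python
# Each row is one entity-attribute-value fact (hypothetical data), so different
# entities can carry entirely different sets of attributes without schema changes.
eav_rows = [
    ("patient:1", "name", "Jane Doe"),
    ("patient:1", "blood_pressure", "120/80"),
    ("patient:2", "name", "John Smith"),
    ("patient:2", "allergy", "penicillin"),   # an attribute patient:1 does not use
]

def attributes_of(entity, rows):
    """Collect the attribute-value pairs asserted for one entity."""
    return {attr: value for ent, attr, value in rows if ent == entity}

print(attributes_of("patient:1", eav_rows))
# {'name': 'Jane Doe', 'blood_pressure': '120/80'}
```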
Extensional
The extension of a class, concept, idea, or sign consists of the things to which it applies, in contrast with its intension. For example, the extension of the word “dog” is the set of all (past, present and future) dogs in the world. The extension is thus most akin to the set of instances or members that make up a class, rather than the characteristics that define its membership.
FOAF
FOAF (Friend of a Friend) is an RDF schema for machine-readable modeling of homepage-like profiles and social networks.
Folksonomy
A folksonomy is a user-generated set of open-ended labels called tags organized in some manner and used to categorize and retrieve Web content such as Web pages, photographs, and Web links.
GeoNames
GeoNames integrates geographical data such as names of places in various languages, elevation, population and others from various sources.
GRDDL
GRDDL is a markup format for Gleaning Resource Descriptions from Dialects of Languages; that is, for getting RDF data out of XML and XHTML documents using explicitly associated transformation algorithms, typically represented in XSLT.
High-level Subject
A high-level subject is both a subject proxy and category label used in a hierarchical subject classification scheme (taxonomy). Higher-level subjects are classes for more atomic subjects, with the height of the level representing broader or more aggregate classes.
Individual
See Instance.
Inferencing
Inference is the act or process of deriving logical conclusions from premises known or assumed to be true. The logic within and between statements in an ontology is the basis for inferring new conclusions from it, using software applications known as inference engines or reasoners.
Instance
Instances are the basic, “ground level” components of an ontology. An instance is an individual member of a class, and the term is also used synonymously with entity. The instances in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract instances such as numbers and words. An instance is also known as an individual, with member and entity also used somewhat interchangeably.
Instance record
An instance with one or more attributes also provided.
irON
irON (instance record and Object Notation) is an abstract notation and associated vocabulary for specifying RDF (Resource Description Framework) triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF.
Intensional
The intension of a class is what is intended as a definition of what characteristics its members should have; it is akin to a definition of a concept and what is intended for a class to contain. It is therefore like the schema aspects (or TBox) in an ontology.
Key-value pair
Also known as a name–value pair or attribute–value pair, a key-value pair is a fundamental, open-ended data representation. All or part of the data model may be expressed as a collection of tuples <attribute name, value> where each element is a key-value pair. The key is the defined attribute and the value may be a reference to another object or a literal string or value. In RDF triple terms, the subject is implied in a key-value pair by nature of the instance record at hand.
Kind
Used synonymously herein with class.
Knowledge base
A knowledge base (abbreviated KB or kb) is a special kind of database for knowledge management. A knowledge base provides a means for information to be collected, organized, shared, searched and utilized. Formally, the combination of a TBox and ABox is a knowledge base.
Linkage
A specification that relates an object or attribute name to its full URI (as required in the RDF language).
Linked data
Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model, and uses uniform resource identifiers (URIs) to name the data objects. The approach exposes the data for access via the HTTP protocol, while emphasizing data interconnections, interrelationships and context useful to both humans and machine agents.
Mapping
A considered correlation of objects in two different sources to one another, with the relation between the objects defined via a specific property. Linkage is a subset of possible mappings.
Member
Used synonymously herein with instance.
Metadata
Metadata (metacontent) is supplementary data that provides information about one or more aspects of the content at hand such as means of creation, purpose, when created or modified, author or provenance, where located, topic or subject matter, standards used, or other annotation characteristics. It is “data about data”, or the means by which data objects or aggregations can be described. Contrasted to an attribute, which is an individual characteristic intrinsic to a data object or instance, metadata is a description about that data, such as how or when created or by whom.
Metamodeling
Metamodeling is the analysis, construction and development of the frames, rules, constraints, models and theories applicable and useful for modeling a predefined class of problems.
Microdata
Microdata is a proposed specification used to nest semantics within existing content on web pages. Microdata is an attempt to provide a simpler way of annotating HTML elements with machine-readable tags than the similar approaches of using RDFa or microformats.
Microformats
A microformat (sometimes abbreviated μF or uF) is a piece of markup that allows expression of semantics in an HTML (or XHTML) web page. Programs can extract meaning from a web page that is marked up with one or more microformats.
Natural language processing
NLP is the process of a computer extracting meaningful information from natural language input and/or producing natural language output. NLP is one method for assigning structured data characterizations to text content for use in semantic technologies. (Hand assignment is another method.) Some of the specific NLP techniques and applications relevant to semantic technologies include automatic summarization, coreference resolution, machine translation, named entity recognition (NER), question answering, relationship extraction, topic segmentation and recognition, word segmentation, and word sense disambiguation, among others.
OBIE
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Ontology-based information extraction (OBIE) is the use of an ontology to inform a “tagger” or information extraction program when doing natural language processing. Input ontologies thus become the basis for generating metadata tags when tagging text or documents.
Ontology
An ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. Loosely defined, ontologies on the Web can have a broad range of formalism, or expressiveness or reasoning power.
Ontology-driven application
Ontology-driven applications (or ODapps) are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies), as supplemented by UI and instruction sets and validations and rules.
Open Semantic Framework
The open semantic framework, or OSF, is a combination of a layered architecture and an open-source, modular software stack. The stack combines many leading third-party software packages with open source semantic technology developments from Structured Dynamics.
Open World Assumption
OWA is a formal logic assumption that the truth-value of a statement is independent of whether or not it is known by any single observer or agent to be true. OWA is used in knowledge representation to codify the informal notion that in general no single agent or observer has complete knowledge, and therefore cannot make the closed world assumption. The OWA limits the kinds of inference and deductions an agent can make to those that follow from statements that are known to the agent to be true. OWA is useful when we represent knowledge within a system as we discover it, and where we cannot guarantee that we have discovered or will discover complete information. In the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system may be considered unknown, rather than wrong or false. Semantic Web languages such as OWL make the open world assumption. See contrast to the closed world assumption.
OPML
OPML (Outline Processor Markup Language) is an XML format for outlines, and is commonly used to exchange lists of web feeds between web feed aggregators.
OWL
The Web Ontology Language (OWL) is designed for defining and instantiating formal Web ontologies. An OWL ontology may include descriptions of classes, along with their related properties and instances. There are also a variety of OWL dialects.
Predicate
See Property.
Property
Properties are the ways in which classes and instances can be related to one another. Properties are thus a relationship, and are also known as predicates. Properties are used to define an attribute relation for an instance.
Punning
In computer science, punning refers to a programming technique that subverts or circumvents the type system of a programming language, by allowing a value of a certain type to be manipulated as a value of a different type. When used for ontologies, it means to treat a thing as both a class and an instance, with the use depending on context.
RDF
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model but which has come to be used as a general method of modeling information, through a variety of syntax formats. The RDF metadata model is based upon the idea of making statements about resources in the form of subject-predicate-object expressions, called triples in RDF terminology. The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.
RDFa
RDFa 1.0 is a set of extensions to XHTML that is a W3C Recommendation. RDFa uses attributes from meta and link elements, and generalizes them so that they are usable on all elements allowing annotation markup with semantics. A W3C Working draft is presently underway that expands RDFa into version 1.1 with HTML5 and SVG support, among other changes.
RDF Schema
RDFS or RDF Schema is an extensible knowledge representation language, providing basic elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources.
Reasoner
A semantic reasoner, reasoning engine, rules engine, or simply a reasoner, is a piece of software able to infer logical consequences from a set of asserted facts or axioms. The notion of a semantic reasoner generalizes that of an inference engine, by providing a richer set of mechanisms.
Reasoning
Reasoning is one of many logical tests using inference rules as commonly specified by means of an ontology language, often one based on description logics. Many reasoners use first-order predicate logic to perform reasoning; inference commonly proceeds by forward chaining or backward chaining.
Record
As used herein, a shorthand reference to an instance record.
Relation
Used synonymously herein with attribute.
RSS
RSS (an acronym for Really Simple Syndication) is a family of web feed formats used to publish frequently updated digital content, such as blogs, news feeds or podcasts.
schema.org
Schema.org is an initiative launched by the major search engines Bing, Google and Yahoo!, and later joined by Yandex, to create and support a common set of schemas for structured data markup on web pages. schema.org provides a starter set of schemas and extension mechanisms for adding to them, and supports markup in microdata, microformat and RDFa formats.
Semantic enterprise
An organization that uses semantic technologies and the languages and standards of the semantic Web, including RDF, RDFS, OWL, SPARQL and others to integrate existing information assets, using the best practices of linked data and the open world assumption, and targeting knowledge management applications.
Semantic technology
Semantic technologies are a combination of software and semantic specifications that encodes meanings separately from data and content files and separately from application code. This approach enables machines as well as people to understand, share and reason with data and specifications separately. With semantic technologies, adding, changing and implementing new relationships or interconnecting programs in a different way can be as simple as changing the external model that these programs share. New data can also be brought into the system and visualized or worked upon based on the existing schema. Semantic technologies provide an abstraction layer above existing IT technologies that enables bridging and interconnection of data, content, and processes.
Semantic Web
The Semantic Web is a collaborative movement led by the World Wide Web Consortium (W3C) that promotes common formats for data on the World Wide Web. By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web of unstructured documents into a “web of data”. It builds on the W3C’s Resource Description Framework (RDF).
Semset
A semset is the use of a series of alternate labels and terms to describe a concept or entity. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refer to the same concept.
SIOC
Semantically-Interlinked Online Communities Project (SIOC) is based on RDF and is an ontology defined using RDFS for interconnecting discussion methods such as blogs, forums and mailing lists to each other.
SKOS
SKOS or Simple Knowledge Organisation System is a family of formal languages designed for representation of thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary; it is built upon RDF and RDFS.
SKSI
Semantic Knowledge Source Integration provides a declarative mapping language and API between external sources of structured knowledge and the Cyc knowledge base.
SPARQL
SPARQL (pronounced “sparkle”) is an RDF query language; its name is a recursive acronym that stands for SPARQL Protocol and RDF Query Language.
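A minimal query sketch (again using rdflib in Python, with made-up example data) shows the basic SELECT pattern over subject–predicate–object statements:

```python
# A minimal SPARQL sketch over hypothetical data: select every
# predicate-object pair that describes ex:fido.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.fido, RDF.type, EX.Dog))
g.add((EX.fido, RDFS.label, Literal("Fido")))

query = """
SELECT ?p ?o
WHERE { <http://example.org/fido> ?p ?o . }
"""
for predicate, obj in g.query(query):
    print(predicate, obj)
```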
Statement
A statement is a “triple” in an ontology, which consists of a subject – predicate – object (S-P-O) assertion. By definition, each statement is a “fact” or axiom within an ontology.
Subject
A subject is always a noun or compound noun and is a reference or definition to a particular object, thing or topic, or groups of such items. Subjects are also often referred to as concepts or topics.
Subject extraction
Subject extraction is an automatic process for retrieving and selecting subject names from existing knowledge bases or data sets. Extraction methods involve parsing and tokenization, and then generally the application of one or more information extraction techniques or algorithms.
Subject proxy
A subject proxy is a canonical name or label for a particular object; other terms or controlled vocabularies may be mapped to this label to assist disambiguation. A subject proxy is always representative of its object but is not the object itself.
Tag
A tag is a keyword or term associated with or assigned to a piece of information (e.g., a picture, article, or video clip), thus describing the item and enabling keyword-based classification of information. Tags are usually chosen informally by either the creator or consumer of the item.
TBox
A TBox (for terminological knowledge, the basis for T in TBox) is a “terminological component”; that is, a conceptualization associated with a set of facts. TBox statements describe a conceptualization, a set of concepts and properties for these concepts. The TBox is sufficient to describe an ontology (best practice often suggests keeping a split between instance records — the ABox — and the TBox schema).
Taxonomy
In the context of knowledge systems, taxonomy is the hierarchical classification of entities of interest of an enterprise, organization or administration, used to classify documents, digital assets and other information. Taxonomies can cover virtually any type of physical or conceptual entities (products, processes, knowledge fields, human groups, etc.) at any level of granularity.
Topic
The topic (or theme) is the part of the proposition that is being talked about (predicated). In topic maps, the topic may represent any concept, from people, countries, and organizations to software modules, individual files, and events. Topics and subjects are closely related.
Topic Map
Topic maps are an ISO standard for the representation and interchange of knowledge. A topic map represents information using topics, associations (similar to a predicate relationship), and occurrences (which represent relationships between topics and information resources relevant to them), quite similar in concept to the RDF triple.
Triple
A basic statement in the RDF language, consisting of a subject – property – object construct, with the subject and property (and optionally the object) referenced by URIs.
Type
Used synonymously herein with class.
UMBEL
UMBEL, short for Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts, designed to provide common mapping points for relating different ontologies or schema to one another, and a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships distinct from exact identity or equivalence. This vocabulary is also designed for interoperable domain ontologies.
Upper ontology
An upper ontology (also known as a top-level ontology or foundation ontology) is an ontology that describes very general concepts that are the same across all knowledge domains. An important function of an upper ontology is to support very broad semantic interoperability between a large number of ontologies that are accessible ranking “under” this upper ontology.
Vocabulary
A vocabulary, in the sense of knowledge systems or ontologies, is a controlled vocabulary. Controlled vocabularies provide a way to organize knowledge for subsequent retrieval, and are used in subject indexing schemes, subject headings, thesauri, taxonomies and other forms of knowledge organization systems.
WordNet
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The database and software tools can be downloaded and used freely. Multiple language versions exist, and WordNet is a frequent reference structure for semantic applications.
YAGO
YAGO (“yet another great ontology”) is a knowledge base that places a WordNet structure on top of Wikipedia.

[1] This glossary is based on the one provided on the OSF TechWiki. For the latest version, please refer to this link.
Posted: July 2, 2012

Example Ontology (from Wikipedia)
Conventional IT Systems are Poorly Suited to Knowledge Applications

Frequently customers ask me why semantic technologies should be used instead of conventional information technologies. In the areas of knowledge representation (KR) and knowledge management (KM), there are compelling reasons and benefits for selecting semantic technologies over conventional approaches. This article attempts to summarize these rationales from a layperson perspective.

It is important to recognize that semantic technologies are orthogonal to the buzz around some other current technologies, including cloud computing and big data. Semantic technologies are also not limited to open data: they are just as useful for private or proprietary data. It is also important to note that semantic technologies do not imply some grand, shared schema for organizing all information. Semantic technologies are not “one ring to rule them all,” but rather a way to capture the world views of particular domains and groups of stakeholders. Lastly, semantic technologies done properly are not a replacement for existing information technologies, but rather an added layer that can leverage those assets for interoperability and to overcome the semantic barriers between existing information silos.

Nature of the World

The world is a messy place. Not only is it complicated and richly diverse, but our ways of describing and understanding it are made more complex by differences in language and culture.

We also know the world to be interconnected and interdependent. Effects of one change can propagate into subtle and unforeseen effects. And, not only is the world constantly changing, but so is our understanding of what exists in the world and how it affects and is affected by everything else.

This means we are always uncertain to a degree about how the world works and the dynamics of its working. Through education and research we continually strive to learn more about the world, but often in that process find what we thought was true is no longer so and even our own human existence is modifying our world in manifest ways.

Knowledge is very similar to this nature of the world. We find that knowledge is never complete and it can be found anywhere and everywhere. We capture and codify knowledge in structured, semi-structured and unstructured forms, ranging from “soft” to “hard” information. We find that the structure of knowledge evolves with the incorporation of more information.

We often see that knowledge is not absolute, but contextual. That does not mean that there is no such thing as truth, but that knowledge should be coherent, to reflect a logical consistency and structure that comports with our observations about the physical world. Knowledge, like the world, is constantly changing; we thus must constantly adapt to what we observe and learn.

Knowledge Representation, Not Transactions

These observations about the world and knowledge are not platitudes but important guideposts for how we should organize and manage information, the field known as “information technology.” For IT to truly serve the knowledge function, its logical bases should be consistent with the inherent nature of the world and knowledge.

By knowledge functions we mean those areas of various computer applications that come under the rubrics of search, business intelligence, competitive intelligence, planning, forecasting, data federation, data warehousing, knowledge management, enterprise information integration, master data management, knowledge representation, and so forth. These applications are distinctly different from the earliest and traditional concerns of IT systems: accounting and transactions.

A transaction system — such as calculating revenue based on seats on a plane, the plane’s occupancy, and various rate classes — is a closed system. We can count the seats, we know the number of customers on board, and we know their rate classes and payments. Much can be done with this information, including yield and profitability analysis and other conventional ways of accounting for costs or revenues or optimizations.

But, as noted, neither the world nor knowledge is a closed system. Trying to apply legacy IT approaches to knowledge problems is fraught with difficulties. That is the reason that for more than four decades enterprises have seen massive cost overruns and failed projects in applying conventional IT approaches to knowledge problems: traditional IT is fundamentally mismatched to the nature of the problems at hand.

What works efficiently for transactions and accounting is a miserable failure applied to knowledge problems. Traditional relational databases work best with structured data; are inflexible and fragile when the nature (schema) of the world changes; and thus require constant (and expensive) re-architecting in the face of new knowledge or new relationships.

Of course, often knowledge problems do consider fixed entities with fixed attributes to describe them. In these cases, relational data systems can continue to act as valuable contributors and data managers of entities and their attributes. But, in the role of organizing across schema or dealing with semantics and differences of definition and scope – that is, the common types of knowledge questions – a much different integration layer with a much different logic basis is demanded.

The New Open World Paradigm

The first change that is demanded is to shift the logic paradigm of how knowledge and the world are modeled. In contrast to the closed-world approach of transaction systems, IT systems based on the logical premise of the open world assumption (OWA) mean:

  • Lack of a given assertion does not imply whether it is true or false; it simply is not known
  • A lack of knowledge does not imply falsity
  • Everything is permitted until it is prohibited
  • Schema can be incremental without re-architecting prior schema (“extensible”), and
  • Information at various levels of incompleteness can be combined.

Much more can be said about OWA, including formal definitions of the logics underlying it [1], but even from the statements above, we can see that the right logic for most knowledge representation (KR) problems is the open world approach.
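As a rough illustration of this logic in practice (a sketch using the rdflib library in Python, with hypothetical facts), the absence of an assertion from a graph simply means the fact is not known, not that it is false:

```python
# A rough sketch of the open world intuition using rdflib and hypothetical facts.
# The knowledge base asserts where Jane works but says nothing about Joe;
# under OWA the absence of a statement means "unknown", not "false".
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.jane, EX.worksFor, EX.Acme))

def is_asserted(subject, predicate, obj):
    return (subject, predicate, obj) in g

print(is_asserted(EX.jane, EX.worksFor, EX.Acme))   # True: known to be true
print(is_asserted(EX.joe, EX.worksFor, EX.Acme))    # False: but under OWA this only
                                                    # means the fact is not (yet) known,
                                                    # not that Joe does not work for Acme
```

Under the closed world assumption, by contrast, that second False would be read as “Joe does not work for Acme.”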

This logic mismatch is perhaps the most fundamental cause of failures, cost overruns, and disappointing deliverables for KM and KR projects over the years. But, like the fingertip between the eyes that cannot be seen because it is too close at hand, the importance of this logic mismatch strangely continues to be overlooked.

Integrating All Forms of Information

Data exists in many forms and of many natures. As one classification scheme, there are:

  • Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
  • Semi-structured data — does not conform to the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
  • Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

Further, these types of data may be “soft”, such as social information or opinion, or “hard”, more akin to measurable facts or quantities.

These various forms may also be serialized in a variety of data formats or data transfer protocols, some using straight text with a myriad of syntax or markup vocabularies, ranging to scripts or forms encoded or binary.

Still further, any of these data forms may be organized according to a separate schema that describes the semantics and relationships within the data.

These variations further complicate the inherently diverse nature of the world and knowledge of it. A suitable data model for knowledge representation must therefore have the power to be able to capture the form, format, serialization or schema of any existing data within the diversity of these options.

The Resource Description Framework (RDF) data model has such capabilities [2]. Any extant data form or schema (from the simple to the complex) can be converted to the RDF data model. This capability enables RDF to act as a “universal solvent” for all information.

Once converted to this “canonical” form, RDF can then act as a single representation around which to design applications and other converters (for “round-tripping” to legacy systems, for example), as illustrated by this diagram:

Generic tools can then be driven by the RDF data model, which leads to fewer applications required and lower overall development costs.

Lastly, RDF can represent everything from simple assertions (“Jane runs fast”) to complex vocabularies and languages. It is in this latter role that RDF can begin to represent the complexity of an entire domain via what is called an “ontology” or “knowledge graph.”
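As a minimal sketch of this range (rdflib again, with a hypothetical employee record), the same RDF graph can hold both a simple assertion and triples converted from a flat legacy record:

```python
# A minimal sketch of RDF as a "canonical" target: a flat key-value record
# from a hypothetical legacy source is restated as RDF triples, alongside
# the simple assertion "Jane runs fast" from the text.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# the simple assertion: Jane runs fast
g.add((EX.Jane, EX.runs, Literal("fast")))

# a record from a relational or spreadsheet source, keyed by an id
record = {"id": "emp-42", "name": "Jane", "department": "Research"}
subject = EX[record["id"]]
for attribute, value in record.items():
    if attribute != "id":
        g.add((subject, EX[attribute], Literal(value)))

print(g.serialize(format="turtle"))
```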

Example Ontology Growth

Connections Create Graphs

When representing knowledge, more things and concepts get drawn into consideration. In turn, the relationships of these things lead to connections between them to capture the inherent interdependence and linkages of the world. As still more things get considered, more connections are made and proliferate.

This process naturally leads to a graph structure, with the things in the graphs represented as nodes and the relationships between them represented as connecting edges. More things and more connections lead to more structure. Insofar as this structure and its connections are coherent, the natural structure of the knowledge graph itself can help lead to more knowledge and understanding.

How one such graph may emerge is shown by this portion of the recently announced Google Knowledge Graph [3], showing female Nobel prize winners:

Unlike traditional data tables, graphs have a number of inherent benefits, particularly for knowledge representations. They provide:

  • A coherent way to navigate the knowledge space
  • Flexible entry points for each user to access that knowledge (since every node is a potential starting point)
  • Inferencing and reasoning structures about the space
  • Connections to related information
  • Ability to connect to any form of information
  • Concept mapping, and thus the ability to integrate external content
  • A framework to disambiguate concepts based on relations and context, and
  • A common vocabulary to drive content “tagging”.

Graphs are the natural structures for knowledge domains.

Network Analysis is the New Algebra

Once built, graphs offer some analytical capabilities not available through traditional means of information structure. Graph analysis is a rapidly emerging field, but already some unique measures of knowledge domains are now possible to gauge:

  • Influence
  • Relatedness
  • Proximity
  • Centrality
  • Inference
  • Clustering
  • Shortest paths
  • Diffusion.

As science is coming to appreciate, graphs can represent any extant structure or schema. This gives graphs a universal character in terms of analytic tools. Further, many structures can only be represented by graphs.
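A small sketch of these measures, using the networkx library over a toy graph (the node names here are purely illustrative), shows how centrality, proximity and clustering fall out of the node-and-edge structure:

```python
# A toy knowledge graph analyzed with networkx: centrality, shortest paths
# and clustering are computed directly from the node-and-edge structure.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Marie Curie", "Physics"), ("Marie Curie", "Chemistry"),
    ("Physics", "Nobel Prize"), ("Chemistry", "Nobel Prize"),
    ("Dorothy Hodgkin", "Chemistry"),
])

print(nx.degree_centrality(G))                            # relative "influence" of nodes
print(nx.shortest_path(G, "Dorothy Hodgkin", "Physics"))  # proximity between concepts
print(nx.clustering(G))                                   # local clustering of the graph
```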

Information and Interaction is Distributed

The nature of knowledge is such that relevant information is everywhere. Further, because of the interconnectedness of things, we can also appreciate that external information needs to be integrated with internal information. Meanwhile, the nature of the world is such that users and stakeholders may be anywhere.

These observations suggest a knowledge representation architecture that needs to be truly distributed. Both sources and users may be found in multiple locations.

In order to preserve existing information assets as much as possible (see further below) and to codify the earlier observation regarding the broad diversity of data formats, the resulting knowledge architecture should also attempt to put in place a thin layer or protocol that provides uniform access to any source or target node on the physical network. A thin, uniform abstraction layer – with appropriate access rights and security considerations – means knowledge networks may grow and expand at will at acceptable costs with minimal central coordination or overhead.

Properly designed, then, such architectures are not only necessary to represent the distributed nature of users and knowledge, but can also facilitate and contribute to knowledge development and exchange.
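As one possible sketch of such a thin access layer (here using Flask and rdflib, with a hypothetical /dataset endpoint, and setting aside the access-rights and security considerations noted above), a node simply exposes its local graph as RDF over HTTP for any other node to fetch and merge:

```python
# A very thin sketch of a uniform access layer: any node on the network
# can expose its local knowledge as RDF over plain HTTP. The endpoint name
# and data are hypothetical.
from flask import Flask, Response
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.dataset1, EX.title, Literal("Local knowledge base")))

app = Flask(__name__)

@app.route("/dataset")
def dataset():
    # serialize the local graph so any remote agent can fetch and merge it
    return Response(g.serialize(format="turtle"), mimetype="text/turtle")

if __name__ == "__main__":
    app.run(port=8080)
```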

The Web is the Perfect Medium

The items above suggest the Web as an appropriate protocol for distributed access and information exchange. When combined with the following considerations, it becomes clear that the Web is the perfect medium for knowledge networks:

  • Potentially, all information may be accessed via the Web
  • All information may be given unique Web identifiers (URIs)
  • All Web tools are available for use and integration
  • All Web information may be integrated
  • Web-oriented architectures (WOA) have proven:
      • Scalability
      • Robustness
      • Substitutability
  • Most Web technologies are open source.

It is not surprising that the largest extant knowledge networks on the globe – such as Google, Wikipedia, Amazon and Facebook – are Web-based. These pioneers have demonstrated the wisdom of WOA for cost-effective scalability and universal access.

The combination of RDF with Web identifiers also means that any and all information from a given knowledge repository may be exposed and made available to others as linked data. This approach makes the Web a global, universal database. And it is in keeping with the general benefits of integrating external information sources.

Leveraging – Not Replacing – Existing IT Assets

Existing IT assets represent massive sunk costs, legacy knowledge and expertise, and (often) stakeholder consensus. Yet, these systems are still largely stovepiped.

Strategies that counsel replacement of existing IT systems risk wasting existing assets and are therefore unlikely to be adopted. Ways must be found to leverage the value already embodied in these systems, while promoting interoperability and integration.

The beauty of semantic technologies – properly designed and deployed in a Web-oriented architecture – is that a thin interoperability layer may be placed over existing IT assets to achieve these aims. The knowledge graph structure may be used to provide the semantic mappings between schema, while the Web service framework that is part of the WOA provides the source conversion to the canonical RDF data model.

Via these approaches, prior investments in knowledge, information and IT assets may be preserved while enabling interoperability. The existing systems can continue to provide the functionality for which they were originally designed and deployed. Meanwhile, the KR-related aspects may be exposed and integrated with other knowledge assets on the physical network.

Democratizing the Knowledge Function

These kinds of approaches represent a fundamental shift in power and roles with respect to IT in the enterprise. IT departments and their bottlenecks in writing queries and bespoke application development can now be bypassed; the departments may be relegated to more appropriate support roles. Developers and consultants can now devote more of their time to developing generic applications driven by graph structures [4].

In turn, the consumers of knowledge applications – namely subject matter experts, employees, partners and stakeholders – now become the active contributors to the graphs themselves, focusing on reconciling terminology and ensuring adequate entity and concept coverage. Knowledge graphs are relatively straightforward structures to build and maintain. Those that rely on them can also be those that have the lead role in building and maintaining them.

Thus, graph-driven applications can be made generic by function with broader and more diverse information visualization capabilities. Simple instructions in the graphs can indicate what types of information can be displayed with what kind of widget. Graph-driven applications also mean that those closest to the knowledge problems will also be those directly augmenting the graphs. These changes act to democratize the knowledge function, and lower overall IT costs and risks.

Seven Pillars of the Semantic Enterprise

Elsewhere we have discussed the specific components that go into enabling the development of a semantic enterprise, what we have termed the seven pillars [5]. Most of these points have been covered to one degree or another in the discussion above.

There are off-the-shelf starter kits for enterprises to embrace to begin this process. The major starting requirements are to develop appropriate knowledge graphs (ontologies) for the given domain and to convert existing information assets into appropriate interoperable RDF form.

Beyond that, enterprise staff may be readily trained in the use and growth of the graphs, and in the staging and conversion of data. With an appropriate technology transfer component, these semantic technology systems can be maintained solely by the enterprise itself without further outside assistance.

Summary of Semantic Technology Benefits

Unlike conventional IT systems with their closed-world approach, semantic technologies that adhere to these guidelines can be deployed incrementally at lower cost and with lower risk. Further, we have seen that semantic technologies offer an excellent integration approach, with no need to re-do schema because of changed circumstances. The approach further leverages existing information assets and brings the responsibility for the knowledge function more directly to its users and consumers.

Semantic technologies are thus well-suited for knowledge applications. With their graph structures and the ability to capture semantic differences and meanings, these technologies can also accommodate multiple viewpoints and stakeholders. There are also excellent capabilities to relate all available information – from documents and images and metadata to tables and databases – into a common footing.

These advantages will immediately accrue through better integration and interoperability of diverse information assets. But, for early adopters, perhaps the most immediate benefit will come from visible leadership in embracing these enabling technologies in advance of what will surely become the preferred approach to knowledge problems.

Note: there is a version of this article on Slideshare.

[1] For more on the open world assumption (OWA), see the various entries on this topic on Michael Bergman’s AI3:::Adaptive Information blog. This link is a good search string to discover more.
[2] M.K. Bergman, 2009. Advantages and Myths of RDF, white paper from Structured Dynamics LLC, April 22, 2009, 13 pp. See http://www.mkbergman.com/wp-content/themes/ai3v2/files/2009Posts/Advantages_Myths_RDF_090422.pdf.
[4] For the most comprehensive discussion of graph-driven apps, see M.K. Bergman, 2011. “Ontology-Driven Apps Using Generic Applications,” posted on the AI3:::Adaptive Information blog, March 7, 2011. You may also search on that blog for ‘ODapps’ to see related content.
[5] M.K. Bergman, 2010. “Seven Pillars of the Open Semantic Enterprise,” in AI3:::Adaptive Information blog, January 12, 2010; see http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/.
Posted: June 4, 2012

Popular ‘Timeline of Information History’ Expands by 30%

Since its first release four years ago, the History of Information Timeline has been one of the most consistently popular aspects of this site. It is an interactive timeline of the most significant events and developments in the innovation and management of information and documents throughout human history.

Recent requests for use by others and my own references to it caused me to review its entries and add to it. Over the past few weeks I have expanded its coverage by some 30%. There are now about 115 entries in the period ranging from ca 30,000 BC (cave paintings) to ca 2003 AD (3D printing). Most additions have been to update the past twenty or thirty years:

A Timeline of Information History

The timeline works via a fast scroll at the bottom; every entry when clicked produces a short info-panel, as shown.

All entries are also coded by an icon scheme of about 20 different categories:

  • Book Forms and Bookmaking
  • Calendars
  • Copyrights and Legal
  • Infographics and Statistics
  • Libraries
  • Maps
  • Math and Symbology
  • Mechanization
  • Networks
  • New Formats or Document Forms
  • Organizing Information
  • Pre-writing
  • Paper and Papermaking
  • Printing
  • Science and Technology
  • Scripts and Alphabets
  • Standardization and Typography
  • Theory
  • Timelines
  • Writing

You may learn more about this timeline and the technology behind it by referring to the original announcement.

Posted: May 28, 2012

Mandelbrot animation based on a static number of iterations per pixel; from Wikipedia
Insights from Commonalities and Theory

One of the main reasons I am such a big fan of RDF as a canonical data model is its ability to capture information in structured, semi-structured and unstructured form [1]. These sources are conventionally defined as:

  • Structured data — information presented according to a defined data model, often found in relational databases or other forms of tabular data
  • Semi-structured data — does not conform with the formal structure of data models, but contains tags or other markers to denote fields within the content. Markup languages embedded in text are a common form of such sources
  • Unstructured data — information content, generally oriented to text, that lacks an explicit data model or schema; structured information can be obtained from it via data mining or information extraction.

A major trend I have written about for some time is the emergence of the structured Web: that is, the exposing of structure from these varied sources in order for more information to be interconnected and made interoperable. I have posited — really a view shared by many — that the structured Web is an intermediate point in the evolution of the Web from one of documents to one where meaningful semantics occurs [2].

It is clear in my writings — indeed in the very name of my company, Structured Dynamics — that structure plays a major role in our thinking. The use and reliance on this term, though, raises the question: just what is structure in an informational sense? We’ll find it helpful to approach the question of what structure is from first principles. And this, in turn, may also provide insight into how structure and information are in fact inextricably entwined.

A General Definition of Structure

According to Wikipedia, structure is a fundamental notion, of tangible or intangible character, that refers to the recognition, observation, nature, or permanence of patterns and relationships of entities. The concept may refer to an object, such as a built structure, or an attribute, such as the structure of society.

Koch curve; example of snowflake fractal; from Wikipedia

Structure may be abstract, or it may be concrete. Its realm ranges from the physical to ideas and concepts. As a term, “structure” seems to be ubiquitous to every domain. Structure may be found across every conceivable scale, from the most minute and minuscule to the cosmic. Even realms without any physical aspect at all — such as ideas and languages and beliefs — are perceived by many to have structure. We apply the term to any circumstance in which things are arranged or connected to one another, as a means to describe the organization or relationships of things. We seem to know structure when we see it, and to be able to discern structure of very many kinds against unstructured or random backgrounds.

In this way structure quite resembles patterns, perhaps could even be used synonymously with that term. Other closely related concepts include order, organization, design or form. When expressed, structure, particularly that of a recognizably ordered or symmetrical nature, is often called beautiful.

One aspect of structure, I think, that provides the key to its roles and importance is that it can be expressed in shortened form as a mathematical statement. One could even be so bold as to say that mathematics is the language of structure. This observation is one of the threads that will help us tie structure to information.

The Patterned Basis of Nature

The natural world is replete with structure. Patterns in nature are regularities of visual form found in the natural world. Each such pattern can be modeled mathematically. Typical mathematical forms in nature include fractals, spirals, flows, waves, lattices, arrays, Golden ratios, tilings, Fibonacci sequences, and power laws. We see them in such structures as clouds, trees, leaves, river networks, fault lines, mountain ranges, craters, animal spots and stripes, shells, lightning bolts, coastlines, flowers, fruits, skeletons, cracks, growth rings, heartbeats and rates, earthquakes, veining, snow flakes, crystals, blood and pulmonary vessels, ocean waves, turbulence, bee hives, dunes and DNA.

Self-similarity in a Mandelbrot set; from Wikipedia

The mathematical expression of structures in nature is frequently repeated or recursive, often in a self-organizing manner. The swirls of a snail’s shell reflect a Fibonacci sequence, while natural landscapes or lifeforms often have a fractal aspect (as expressed by some of the figures in this article). Fractals are typically self-similar patterns, generally involving some fractional or ratioed formula that is recursively applied. Another way to define them is as detailed patterns that repeat themselves at progressively finer scales.

Even though these patterns can often be expressed simply and mathematically, and they often repeat themselves, their starting conditions can lead to tremendous variability and a lack of predictability. This makes them chaotic, as studied under chaos theory, though their patterns are often discernible.
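A compact sketch of this recursive, highly variable behavior is the Mandelbrot iteration itself (the basis of images like the one accompanying this post): a trivially simple formula, applied recursively, whose outcomes differ enormously depending on the starting point.

```python
# The classic Mandelbrot iteration: count how many applications of
# z -> z*z + c stay bounded for a given starting point c.
def mandelbrot_iterations(c, max_iter=50):
    """Return how many iterations of z -> z*z + c stay bounded, up to max_iter."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2:          # once |z| exceeds 2 the orbit always escapes
            return n
    return max_iter

# points inside the set never escape (the loop returns max_iter);
# points outside escape, some within only a few iterations
for c in (0 + 0j, -1 + 0j, 0.5 + 0.5j, 2 + 2j):
    print(c, mandelbrot_iterations(c))
```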

While we certainly see randomness in statistics, quantum physics and Brownian motion, it is also striking how much of what gives nature its beauty is structure. As a force separate and apart from the random, there appears to be something in structure that guides the expression of what is nature and what is so pleasing to behold. Self-similar and repeated structures across the widest variety of spatial scales seem to be an abiding aspect of nature.

Structure in Language

Such forms of repeated patterns or structure are also inherent in that most unique of human capabilities, language. As a symbolic species [3], we first used symbols as a way to represent the ideas of things. Simple markings, drawings and ideograms grew into more complicated structures such as alphabets and languages. The languages themselves came to embrace still further structure via sentence structures, document structures, and structures for organizing and categorizing multiple documents. In fact, one of the most popular aspects of this blog site is its Timeline of Information History — worth your look — that shows the progression of structure in information throughout human history.

Grammar is often understood as the rules or structure that governs language. It is composed of syntax, including punctuation, traditionally understood as the sentence structure of languages, and morphology, which is the structural understanding of a language’s linguistic units, such as words, affixes, parts of speech, intonation or context. There is a whole field of linguistic typology that studies and classifies languages according to their structural features. But grammar is hardly the limit to language structure.

Semantics, the meaning of language, used to be held separate from grammar or structure. But via the advent of the thesaurus, and then linguistic databases such as WordNet and more recently concept graphs that relate words and terms into connected understandings, we also have now come to understand that semantics also has structure. Indeed, these very structural aspects are now opening up to us techniques and methods — generally classified under the heading of natural language processing (NLP) — for extracting meaningful structure from the very basis of written or spoken language.

It is the marriage of the computer with language that is illuminating these understandings of structure in language. And that opening, in turn, is enabling us to capture and model the basis of human language discourse in ways that can be codified, characterized, shared and analyzed. Machine learning and processing is now enabling us to complete the virtual circle of language. From its roots in symbols, we are now able to extract and understand those very same symbols in order to derive information and knowledge from our daily discourse. We are doing this by gleaning the structure of language, which in turn enables us to relate it to all other forms of structured information.

Common Threads Via Patterns

The continuation of structure from nature to language extends across all aspects of human endeavor. I remember excitedly describing to a colleague more than a decade ago what likely is a pedestrian observation: pattern matching is a common task in many fields. (I had observed that pattern matching in very different forms was a standard practice in most areas of industry and commerce.) My “insight” was that this commonality was not widely understood, which meant that pattern matching techniques in one field were not often exploited or seen as transferable to other domains.

In computer science, pattern matching is the act of checking some sequence of tokens for the presence of the constituents of some pattern. It is closely related to the idea of pattern recognition, which is the characterization of some discernible and repeated sequence. These techniques, as noted, are widely applied, with each field tending to have its own favorite algorithms. Common applications that one sees for such pattern-based calculations include communications [4], encoding and coding theory, file compression, data compression, machine learning, video compression, mathematics (including engineering and signal processing via such techniques as Fourier transforms), cryptography, NLP [5], speech recognition, image recognition, OCR, image analysis, search, sound cleaning (that is, noise reduction, such as Dolby) and gene sequence searching and alignment, among many others.

To better understand what is happening here and the commonalities, let’s look at the idea of compression. Data compression is valuable for transmitting any form of content in wired or wireless manners because we can transmit the same (or closely similar) message faster and with less bandwidth [6]. There are both lossless (no loss of information) and lossy compression methods. Lossless data compression algorithms usually exploit statistical redundancy — that is, a pattern match — to represent data more concisely without losing information. Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Lossless compression is possible because most real-world data has statistical redundancy. In lossy data compression, some loss of information is acceptable by dropping detail from the data to save storage space. These methods are guided by research that indicates, say, how certain frequencies may not be heard or seen by people and can be removed from the source data.
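A quick illustration of this point, using Python’s standard zlib module, shows how data with statistical redundancy compresses dramatically while (pseudo)random data barely compresses at all:

```python
# Redundant (patterned) data compresses dramatically; random data does not.
import os
import zlib

patterned = b"ABABABABAB" * 1000          # highly redundant message
random_ish = os.urandom(10_000)           # no exploitable pattern

print(len(zlib.compress(patterned)))      # a few dozen bytes
print(len(zlib.compress(random_ish)))     # roughly the original 10,000 bytes
```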

On a different level, there is a close connection between machine learning and compression: a system that predicts the posterior probabilities of a sequence given its entire history can be used for optimal data compression (by using arithmetic coding on the output distribution), while an optimal compressor can be used for prediction (by finding the symbol that compresses best, given the previous history). This equivalence has been used as justification for data compression as a benchmark for “general intelligence.” On a still different level, one major part of cryptography is the exact opposite of these objectives: constructing messages that pattern matching fails against or is extremely costly or time-consuming to analyze.
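
To make the compression-prediction connection concrete, here is a rough Python sketch of my own (a toy construction, not an optimal scheme such as arithmetic coding on a model's output distribution). It treats a general-purpose compressor as a crude predictor by asking which candidate continuation makes history-plus-candidate compress smallest; candidate phrases rather than single symbols are used only so the size differences show up at byte granularity:

```python
import zlib

def best_continuation(history, candidates):
    """Pick the candidate whose addition to the history compresses smallest."""
    def cost(candidate):
        return len(zlib.compress((history + candidate).encode("utf-8"), 9))
    return min(candidates, key=cost)

history = "the quick brown fox jumps over the lazy dog. " * 50
candidates = ["the quick brown fox",      # a phrase the history predicts
              "kx7!pv9@mz2#wq8$rt"]       # novel, pattern-free junk
print(best_continuation(history, candidates))   # the predictable phrase wins
```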

When we stand back from any observable phenomenon, be it natural or human communication, we can see that the “information” being conveyed often has patterns, recursion or other structure that enables it to be represented more simply and compactly in mathematical form. This brings me back to the two favorite protagonists of my recent writings: Claude Shannon and Charles S. Peirce.

Information is Structure

Claude Shannon’s seminal work in 1948 on information theory dealt with the amount of information that could be theoretically and predictably communicated between a sender and a receiver [7] [8]. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits”) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

(Image: A cellular automaton based on hexagonal cells instead of squares (rule 34/2); from Wikipedia)

The context of Shannon’s paper, and of the work by others preceding him, was to understand information losses in communication systems or networks. Much of the impetus came from issues in wartime communications, early ciphers and cryptography, and the emerging advent of digital computers. But the insights from Shannon’s paper also relate closely to the issues of data patterns and data compression.

A key measure of Shannon’s theory is what he referred to as information entropy, which is usually expressed by the average number of bits needed to store or communicate one symbol in a message. Entropy quantifies the uncertainty involved in predicting the value of a random variable. The Shannon entropy measure is actually a measure of the uncertainty based on the communication (transmittal) between a sender and a receiver; the actual information that gets transmitted and predictably received was formulated by Shannon as R, which can never be zero because all communication systems have losses.
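
For the intuition, here is a minimal Python sketch that estimates the Shannon entropy of a message from its observed symbol frequencies. This is an empirical, per-symbol estimate, not Shannon's full channel formulation, and the example strings are my own:

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(message):
    """Average number of bits per symbol, from observed frequencies."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy_bits_per_symbol("aaaaaaaa"))   # 0.0  (perfectly predictable)
print(entropy_bits_per_symbol("abababab"))   # 1.0  (one bit per symbol)
print(entropy_bits_per_symbol("abcdefgh"))   # 3.0  (eight equally likely symbols)
```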

A simple intuition can show how this formulation relates to patterns or data compression. Let’s take a message of completely random digits. In order to accurately communicate that message, all digits (bits) would have to be transmitted in their original state and form. Absolutely no compression of this message is possible. If, however, there are patterns within the message (which, of course, means the message is no longer truly random), these can be represented algorithmically in shortened form, so that we need communicate only the algorithm and not the full bits of the original message. If this “compression” algorithm can then be used to reconstruct the bit stream of the original message, the data compression method is deemed to be lossless. The algorithm so derived is also the expression of the pattern that enabled us to compress the message in the first place (such as a*2+1).
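
A hedged sketch of that intuition in Python: the patterned message below can be regenerated exactly from a two-number description plus the rule a*2+1, which is the sense in which it compresses losslessly. A truly random message admits no such shorter description; every bit must be sent.

```python
def generate(start, length):
    """Regenerate the message from its short description using the rule a*2+1."""
    values, a = [], start
    for _ in range(length):
        values.append(a)
        a = a * 2 + 1
    return values

original = [3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047]
description = (3, 10)                        # start value and length: two numbers
assert generate(*description) == original    # lossless reconstruction
```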

We can apply this same intuition to human language. To improve communication efficiency, the most common words (e.g., “a”, “the”, “I”) should be shorter than less common words (e.g., “disambiguation”, “semantics”, “ontology”), so that sentences do not become too long. And, indeed, they are. This is a principle equivalent to data compression. In fact, such repeats and patterns apply to the natural world as well.
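
To see the same principle in code, here is a small Huffman-coding sketch, my own illustration with made-up word frequencies: symbols that occur more often receive shorter codes, just as common words carry shorter spellings.

```python
import heapq
from collections import Counter

def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a symbol-frequency table."""
    # Each heap entry: (total frequency, tie-breaker, list of member symbols).
    heap = [(freq, i, [sym]) for i, (sym, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in frequencies}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for sym in group1 + group2:      # each merge adds one bit to these codes
            lengths[sym] += 1
        heapq.heappush(heap, (f1 + f2, tie, group1 + group2))
        tie += 1
    return lengths

word_counts = Counter({"the": 500, "a": 450, "i": 400, "semantics": 5, "ontology": 3})
print(huffman_code_lengths(word_counts))
# -> {'the': 1, 'a': 2, 'i': 3, 'semantics': 4, 'ontology': 4}
```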

Shannon’s idea of information entropy has come to inform the even broader subject of entropy in physics and the 2nd Law of Thermodynamics [10]. According to Koelman, “the entropy of a physical system is the minimum number of bits you need to fully describe the detailed state of the system.” Very random (uncertain) states have high entropy, patterned ones low entropy. As I noted recently, in open systems, structures (patterns) are a means to speed the tendency to equilibrate across energy gradients [8]. This observation helps provide insight into structure in natural systems, and why life and human communications tend toward less randomness. Structure will continue to emerge because it is adaptive: it speeds the reduction of these gradients. Structure thus provides the fundamental commonality between biological information (life) and human information.

In the words of Thomas Schneider [11], “Information is always a measure of the decrease of uncertainty at a receiver.” Of course, in Shannon’s context, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the terminology may not be accurate for discussing “information” in a contemporary sense. But it does show that “structure” (that is, the basis for shortening the length of a message while still retaining its accuracy) is information in the Shannon context. In this information there is order or patterns, often of a hierarchical, fractal or graph nature. Any emergent structure that reduces the energy gradient faster will be favored, according to the 2nd Law.

Still More Structure Makes “Information” Information

The data that constitutes “information” in the Shannon sense still lacks context and meaning. In communications terms, it is data; it has not yet met the threshold of actionable information. It is in this next step that we can look to Charles Sanders Peirce (1839 – 1914) for guidance [9].

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [12], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. Peirce’s triadic logic of signs is in fact a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an ultimate example of this process. The key aspect of signs for Peirce is the ongoing process of interpretation and reference to further signs.

Information is structure, and structure is information.

Ideograms lead to characters, which get combined into sentences and other syntax, and then embedded within contexts of shared meaning; these compounding structures lead to clearer understandings (that is, more accurate messages) in human languages. While the Shannon understanding of “information” lacked context and meaning, we can see how still higher-order structures may be imposed through these reifications of symbols and signs, improving the accuracy and efficiency of our messages. Though Peirce did not couch his theory of semiosis in terms of structure or information, we can see it as a natural extension of the same structural premises in Shannon’s formulation.

In fact, today, we see the “structure” in the semantic relationships of language through the graph structures of ontologies and linguistic databases such as WordNet. The understanding and explication of these structures are having a similarly beneficial effect on how still more refined and contextual messages can be composed, transmitted and received. Human-to-machine communication is (merely!) the challenge of codifying and making explicit the implicit structures in our languages.

The Peircean ideas of interpretation (context) and of compounding and reifying structure are a major intellectual breakthrough for extending Shannon’s “information” to information in the modern sense. These insights also occur within a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data (or even of nature, as some of Peirce’s earlier speculation asserts [13]).

According to this interpretation of Peirce, the nature of information is the process of communicating a form from the object to the interpretant through the sign [14]. The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable.

Structure is Information

Common to all of these perspectives (from patterns to nature, and on to life and then animal and human communications) is the observation that structure is information. Human artifacts and technology, though not “messages” in a conventional sense, also embody within their structures the information of how they are built [15]. We also see the interplay of patterns and information in many processes of the natural world [16], from complexity theory, to emergence, to autopoiesis, and on to autocatalysis, self-organization, stratification and cellular automata [17]. Structure in its many guises is ubiquitous.

(Image: Joanna Krupa, a Polish-American model and actress; from the article on Beauty in Wikipedia)

We, as beings who can symbolically record our perceptions, seem to innately recognize patterns. We see beauty in symmetry. Bilateral symmetry seems to be deeply ingrained in the inherent perception by humans of the likely health or fitness of other living creatures. We see beauty in the patterned, repeated variability of nature. We see beauty in the clever turn of phrase, or in music, or art, or the expressiveness of a mathematical formulation.

We also seem to recognize beauty in the simple. Seemingly complex bit streams that can be reduced to a short, algorithmic expression are viewed as more elegant than lengthier, more complex alternatives. The simple laws of motion and Newtonian physics fit this pattern, as does Einstein’s E = mc². This preference for the simple is a preference for the greater adaptiveness of the shorter, more universal pattern in messages, an insight consistent with Shannon’s information theory.

In the more prosaic terms of my vocation in the Web and information technology, these insights point to the importance of finding and deriving structured representations of information — including meaning (semantics) — that can be simply expressed and efficiently conveyed. Building upon the accretions of structure in human and computer languages, the semantic Web and semantic technologies offer just such a prospect. These insights provide a guidepost for how and where to look for the next structural innovations. We find them in the algorithms of nature and language, and in making connections that provide the basis for still more structure and patterned commonalities.

Ideas and algorithms around lossless compression, graph theory and network analysis are, I believe, the next fruitful hunting grounds for finding still higher orders of structure that can be simply expressed. The patterns of nature, which have emerged incrementally and randomly over the eons of cosmological time, look to be an excellent laboratory.

So, as we see across examples from nature and life to language and all manner of communications, information is structure and structure is information. And it is simply beautiful.


[1]  I discuss this advantage, among others, in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[2] The structured Web is object-level data within Internet documents and databases that can be extracted, converted from available forms, represented in standard ways, shared, re-purposed, combined, viewed, analyzed and qualified without respect to originating form or provenance. See further M. K. Bergman, 2007. “What is the Structured Web?,” AI3:::Adaptive Innovation blog, July 18, 2007. See http://www.mkbergman.com/390/what-is-the-structured-web/. Also, for a diagram of the evolution of the Web, see M. K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” AI3:::Adaptive Innovation blog, July 22, 2007. See http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/.
[3] Terrence W. Deacon, 1997. The Symbolic Species: The Co-Evolution of Language and the Brain, W. W. Norton & Company, July 1997 527 pp. (ISBN-10: 0393038386)
[4] Communications is a particularly rich domain with techniques such as the Viterbi algorithm, which has found universal application in decoding the convolutional codes used in both CDMA and GSM digital cellular, dial-up modems, satellite, deep-space communications, and 802.11 wireless LANs.
[5] Notable areas in natural language processing (NLP) that rely on pattern-based algorithms include classification, clustering, summarization, disambiguation, information extraction and machine translation.
[6] To see some of the associated compression algorithms, there is a massive list of “codecs” (compression/decompression) techniques available; fractal compression is one.
[7] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623-656, July, October, 1948. See http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
[8] I first raised the relation of Shannon’s paper to data patterns — but did not discuss it further awaiting this current article — in M. K. Bergman, 2012. “The Trouble with Memes,” AI3:::Adaptive Innovation blog, April 4, 2012. See http://www.mkbergman.com/1004/the-trouble-with-memes/.
[9] I first overviewed Peirce’s relation to information messaging in M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Innovation blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.
[10] For a very attainable lay description, see Johannes Koelman, 2012. “What Is Entropy?,” in Science 2.0 blog, May 5, 2012. See http://www.science20.com/hammock_physicist/what_entropy-89730.
[11] See Thomas D. Schneider, 2012. “Information Is Not Entropy, Information Is Not Uncertainty!,” Web page retrieved April 4, 2012; see http://www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html.
[12] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[13] It is somewhat amazing, more than a half century before Shannon, that Peirce himself considered “quasi-minds” such as crystals and bees to be sufficient as the interpreters of signs. See Commens Dictionary of Peirce’s Terms (CDPT), Peirce’s own definitions, and the entry on “quasi-minds”; see http://www.helsinki.fi/science/commens/terms/quasimind.html.
[14] João Queiroz, Claus Emmeche and Charbel Niño El-Hani, 2005. “Information and Semiosis in Living Systems: A Semiotic Approach,” in SEED 2005, Vol. 1. pp 60-90; see http://www.library.utoronto.ca/see/SEED/Vol5-1/Queiroz_Emmeche_El-Hani.pdf.
[15] Kevin Kelly has written most broadly about this in his book, What Technology Wants, Viking/Penguin, October 2010. For a brief introduction, see Kevin Kelly, 2006. “The Seventh Kingdom,” in The Technium blog, February 1, 2006. See http://www.kk.org/thetechnium/archives/2006/02/the_seventh_kin.php.
[16] For a broad overview, see John Cleveland, 1994. “Complexity Theory: Basic Concepts and Application to Systems Thinking,” Innovation Network for Communities, March 27, 1994. May be obtained at http://www.slideshare.net/johncleveland/complexity-theory-basic-concepts.
[17] For more on cellular automata, see Stephen Wolfram, 2002. A New Kind of Science, Wolfram Media, Inc., May 14, 2002, 1197 pp. ISBN 1-57955-008-8.

Posted by AI3's author, Mike Bergman Posted on May 28, 2012 at 10:24 pm in Adaptive Information, Structured Web | Comments (5)
The URI link reference to this post is: http://www.mkbergman.com/1011/what-is-structure/
The URI to trackback this post is: http://www.mkbergman.com/1011/what-is-structure/trackback/
Posted:May 18, 2012

Some Quick Investigations Point to Promise, Disappointments

It has been clear for some time that Google has been assembling a war chest of entities and attributes. It first began to appear as structured results in its standard results listings, a trend I commented upon more than three years ago in Massive Muscle on the ABox at Google. Its purchase of Metaweb and its Freebase service in July 2010 only affirmed that trend.

This week, perhaps a bit rushed due to attention to the Facebook IPO, Google announced its addition of the Knowledge Graph (GKG) to its search results. It has been releasing this capability in a staged manner. Since I was fortunately one of the first to be able to see these structured results (due to luck of the draw and no special “ins”), I have spent a bit of time deconstructing what I have found.

Apparent Coverage

What you get (see below) when you search on particular kinds of entities is in essence an “infobox”, similar in structure to what is found on Wikipedia. This infobox is a tabular presentation of key-value pairs, or attributes, for the kind of entity in the search. A ‘people’ search, for example, turns up birth and death dates and locations, some vital statistics, spouse or famous relations, pictures, and links to other relevant structured entities. The types of attributes shown vary by entity type. Here is an example for Stephen King, the writer (all links from here forward provide GKG results), which shows the infobox and its key-value pairs in the righthand column (screenshot: Google ‘Stephen King’ structured result).

Reportedly these results are drawn from Freebase, Wikipedia, the CIA World Factbook and other unidentified sources. Some of the results may indeed be coming from Freebase, but I saw none as such. Most entities I found were from Wikipedia, though it is important to know that Freebase, in its first major growth incarnation, springboarded from a Wikipedia baseline. These early results may have been what was carried forward (since other data contributed to Freebase is known to be of highly variable quality).

The entity coverage appears to be spotty and somewhat disappointing in this first release. Entity types that I observed were in these categories:

  • People
  • Chemical Compounds
  • Directors
  • Some Companies
  • Some National Parks
  • Places
  • Musicians/Musical Groups
  • Actors
  • Some Government Agencies
  • Many Local Businesses
  • Animals
  • Movies
  • Albums
  • Notable Landmarks
  • Who Knows What Else


Entity types that I expected to see, but did not find include:

  • Products
  • Most Companies
  • Songs
  • Most Government Agencies
  • Concepts
  • Non-government Organizations
  • Who Knows What Else

This is clearly not rigorous testing, but it would appear that entity types along the lines of those in schema.org are what should be expected over time.

I have no way to gauge the accuracy of Google’s claims that it has offered up structured data on some 500 million entities (and 3.5 billion facts). However, given the lack of coverage of key areas of Wikipedia (which itself has about 3 million entities in the English version), I suspect much of that number comes from the local businesses and restaurants and such that Google has been rapidly adding to its listings in recent years. Coverage of broadly interesting stuff still seems quite lacking.

The much-touted knowledge graph is also kind of disappointing. Related links are provided, but they are close and obvious. So, an actor will have links to films she was in, or a person may have links to famous spouses, but anything approaching a conceptual or knowledge relationship is missing. I expect, though, that such links, types and entity expansions will steadily creep in over time. Google certainly has the data basis for making these expansions. And constant, incremental improvement has been Google’s modus operandi.

Deconstructing the URL

For some time, and at various meetings I attend, I have made a point of asking Google representatives whether there is some unique, internal ID for entities within its databases. Sometimes the reps I have questioned just don’t know, and sometimes they are cagey.

But, clearly, anything like the structured data that Google has been working toward has to have some internal identifier. To see if some of this might now have surfaced with the Knowledge Graph, I did a bit of poking of the URLs shown in the searches and the affiliated entities in the infoboxes. Under most standard searches, the infobox appears directly if there is one for that object. But, by inspecting cross-referenced entities from the infoboxes themselves, it is possible to discern the internal key.

The first thing one can do in such inspections is to strip out the parameters that are local or session-specific, such as those related to one’s own use preferences or browser settings. Other tests can show other removals. So, using the Stephen King example above, here is the full URL, most of which can be eliminated:

https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827

This actually conformed to my intuition, because the ‘&stick’ aspect was a new parameter for me. (Typically, in these dynamic URLs, the various parameters are separated from one another by a designated delimiter character. In the case of Google, that is the ampersand &.)

By simply doing repeated searches that result in the same entity references, I was able to confirm that the &stick parameter is what invokes the unique ID and infobox for each entity. We can decompose the value further; the critical, entity-specific aspect seems to be the portions not shown in this skeleton: &stick=H4sIAAAAAAAAAONg . . [VuLQz9U3] . . AAAA. The bracketed portion varies less, and I suspect it may relate to the source rather than the entity.
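
For those who want to poke at the URLs themselves, here is a hedged Python sketch using the standard urllib module to isolate the query and &stick parameters from the Stephen King URL above. The parameter names come from that URL; whether Google continues to honor them in this form is, of course, not guaranteed.

```python
from urllib.parse import urlparse, parse_qs

url = ("https://www.google.com/search?num=100&hl=en&safe=off&q=stephen+king"
       "&stick=H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA"
       "&sa=X&ei=Q-S2T6e1HYOw2wXR2OXOCQ&ved=0CPMGEJsTKAA&biw=1280&bih=827")

# Split the query string into its individual parameters.
params = parse_qs(urlparse(url).query)
print(params["q"])       # ['stephen king']
print(params["stick"])   # ['H4sIAAAAAAAAAONgVuLQz9U3MCvIKwQAcT1q1AwAAAA']
```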

I started to do some investigation on types and possible sources, but ran out of time. Listed below are some &stick identifiers for various types of entities (each is a live link):

Type | Entity | Infobox Identifier
Place | Los Angeles | &stick=H4sIAAAAAAAAAONgVuLSz9U3MDYoTDIuAQD9ON7KDgAAAA
Place | Brentwood | &stick=H4sIAAAAAAAAAONgVuLUz9U3MKwsNEkDAN6nm-sNAAAA
Person | Arthur Miller | &stick=H4sIAAAAAAAAAONgVmLXz9U3qEwrAADRsxaaCwAAAA
Person | Joe DiMaggio | &stick=H4sIAAAAAAAAAONgVuLQz9U3SDFMqwAAAy8zdQwAAAA
Person | Marilyn Monroe | &stick=H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA
Movie | Shawshank Redemption | &stick=H4sIAAAAAAAAAONgVuLQz9U3MM_KKwEAPYgNDQwAAAA
Movie | The Green Mile | &stick=H4sIAAAAAAAAAONgVuLUz9U3MC62zC4AAGg8mEkNAAAA
Movie | Dr. Strangelove | &stick=H4sIAAAAAAAAAONgVuLQz9U3MEopzwIAfsFCUQwAAAA
Animal | Toucan | &stick=H4sIAAAAAAAAAONgVuLQz9U3MM8wNAcA4g1_3AwAAAA
Animal | Wolverine | &stick=H4sIAAAAAAAAAONgUeLQz9U3sDAryQIAhJ3RUwwAAAA
Musicians/Groups | The Beatles | &stick=H4sIAAAAAAAAAONgVuLQz9U3ME82yAIAC_7r3AwAAAA
Albums | The White Album | &stick=H4sIAAAAAAAAAONgVuLSz9U3MMxIN0nKAADnd5clDgAAAA


You can verify that this ‘&stick’ reference is what pulls in the infobox by looking at a modified query that substitutes Marilyn Monroe’s &stick in the Stephen King URL string (screenshot: Google ‘Stephen King’ + ‘Marilyn Monroe’ structured results). Note that the standard search results in the lefthand panel are the same as for Stephen King, but we have now fooled the Google engine into displaying Marilyn Monroe’s infobox.
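
And here is a similarly hedged sketch of that substitution: rebuild the search URL with the original query but another entity’s &stick value (the Marilyn Monroe identifier from the table above). Whether these particular identifiers still resolve is not guaranteed.

```python
from urllib.parse import urlencode

def build_search_url(query, stick):
    """Compose a search URL from a query string and a &stick identifier."""
    return "https://www.google.com/search?" + urlencode({"q": query, "stick": stick})

# Keep the Stephen King query, but swap in the Marilyn Monroe identifier.
monroe_stick = "H4sIAAAAAAAAAONgVuLQz9U3MCkvLAIAW7x_LwwAAAA"
print(build_search_url("stephen king", monroe_stick))
```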

I’m sure over time that others will deconstruct this framework to a very precise degree. What would really be great, of course, as noted on many recent mailing lists, is for Google to expose all of this via an API. The Google listing could become the de facto source for Webby entity identifiers.

Some Concluding Thoughts

Sort of like when schema.org was first announced, there have been complaints from some in the semantic Web community that Google released this stuff without once saying the word “semantic”, that much had been ripped off from the original researchers and community without attribution, that a gorilla commercial entity like Google could only be expected to milk this stuff for profit, etc., etc.

That all sounds like sour grapes to me.

What we have here is what we are seeing across the board: the inexorable integration of semantic technology approaches into many existing products. Siri did it with voice commands; Bing, and now Google, are doing it too with search.

We should welcome these continued adoptions. The fact is, semWeb community, that what we are seeing in all of these endeavors is the right and proper role for these technologies: in the background, enhancing our search and information experience, and not something front and center or rammed down our throats. These are the natural roles of semantic technologies, and they are happening at a breakneck pace.

Welcome to the semantic technology space, Google! I look forward to learning much from you.

Posted by AI3's author, Mike Bergman Posted on May 18, 2012 at 7:43 pm in Adaptive Information, Structured Web | Comments (3)
The URI link reference to this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/
The URI to trackback this post is: http://www.mkbergman.com/1009/deconstructing-the-google-knowledge-graph/trackback/