Posted: September 9, 2009

Another Quick 20 Papers Added to SWEETpedia

Thanks to all who responded to my last update post, More than 200 Semantic Web-related Papers Using Wikipedia, with suggestions for more papers to add to the updated SWEETpedia listing.

Those inputs resulted in another 20 added papers. This listing of semantic Web-related research papers based on Wikipedia contents and structure now numbers some 227 papers. The added entries since the major update last week are now marked as [NEWEST].

Thanks, again, to those who commented or emailed suggestions. I will, of course, continue to stockpile further suggestions for subsequent updates.

Posted: September 2, 2009

‘SuperTypes’ and Logical Segmentation of Instances

The Significant Advantages to a Logically Segmented TBox

The Message Understanding Conferences (MUC) were initiated in 1987 and financed by DARPA to encourage the development of new and better methods of information extraction (IE). It was a seminal series that resulted in basic measures of retrieval and semantic efficacy, recall (R) and precision (P) and the combined F-measure, and other core terminology and constructs used by IE today.

By the sixth version in the series (MUC-6), in 1995, the task of recognition of named entities and coreference was added. That initial slate of named entities included the basic building blocks of person (PER), location (LOC), and organization (ORG); to these were added the numeric building blocks of time, percentage or quantity. The very terminology of named entity was coined for this seminal meeting, as was the idea of inline markup [1].

What is a ‘Nameable Thing’?

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. As initially used, all “named entities” were distinct individuals. But, there also emerged the understanding that some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed.

The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

In a closed-world system it is easier to enforce clean distinctions. The Cyc knowledge base, for example, the basis for UMBEL (Upper Mapping and Binding Exchange Layer), makes clear the distinction between individuals and collections. In the semantic Web and RDF, this can become smeared a bit with the favored terminology shifting to instances and classes, and in pragmatic, real-world terms we (as humans) readily distinguish John Smith as distinct from Jane Doe but don’t generally (unless we’re entomologists!) make such distinctions for individual beetles, let alone entire genera or species of beetles.

Under precise conditions, these distinctions are important. The fact that Cyc, for example, is assiduous in its application of these distinctions is a major reason for the overall coherence of its knowledge base. But, for most circumstances, we think it is OK to accept a distinction between “nameable” things such as frogs and beetles, but also to accept that there may be nameable individuals at times in those groupings such as Kermit that are truly an individual in that more refined sense.

This digression sets the background for a natural progression from that first MUC-6 conference. If we could cluster persons or organizations, why not other categories of distinct and disjoint things such as frogs or beetles or rocks?

From the first six entity categories of MUC-6 we begin to see an expansion to broader coverage. Readers of this blog will recall that I have been a fan for quite some time of the expanded coverage of 64 classes of entities proposed by BBN or the 200 proposed by Sekine [2] (as discussed, for example in the April 2008 Subject Concepts and Named Entities article). Again, the intuition was that real things in the real world could be logically categorized into discrete and disjoint categories.

Thus, “named entities” inexorably moved to become a categorization system, where the degree of familiarity and distinction dictated whether it was the individual (with a unique name, such as Abraham Lincoln or Mt. Rushmore) or groupings such as animal or plant species and their common names (such as beetle or oak) that was the standard “handle” for assigning a name to the “nameable thing”.

While many can argue these individual <–> grouping distinctions and whether we are talking about true, unique, named individuals or names of convenience, I think (at least for this blog post and discussion) that doing so misses the real, fundamental point.

The real, fundamental point is that some “things” (whether individuals, instances or classes) are distinct from other “things”. Such disjoint distinctions are a powerful concept that should not be lost sight of by “angels dancing on the head of a pin” epistemological arguments. A frog is not a rock, even though neither is an “individual”. How can we take advantage of that reality?

What Works for Entities, Works for Concepts

Nearly from the outset of our work with UMBEL as a ‘TBox’ [3] — that is, as a set of 20,000 or so common “subject concepts” — the natural question was what the relation or correspondence was of these concepts to the underlying “things” (entities) that they organized. As we probed the disjoint categories within the Sekine 200 entity types, for example, we began to see significant parallels and overlap. Also gnawing at our sense of order was the rather artificial and arbitrary class of concepts in UMBEL that we termed “Abstract Concepts”.

We introduced Abstract Concepts in the first release of UMBEL. When introduced, we defined “Abstract concepts [as] representing abstract or ephemeral notions such as truth, beauty, evil or justice, or [as] thought constructs useful to organizing or categorizing things but are not readily seen in the experiential world.” In pragmatic terms, Abstract Concepts in UMBEL were often pivotal nodes in the UMBEL subject graph necessary to maintain a high degree of concept interconnectivity.

In any world view that attempts to be more-or-less comprehensive, there is a gradation of concepts from the concrete and observable to the abstract and ephemeral. The recognition that some of these concepts may be more abstract, then, was not the issue. The issue was that there was no definable basis for segregating a concrete Subject Concept from the more Abstract Concept. Where was the bright line? What was the actionable distinction?

Off and on we have probed this question for more than a year, and have looked at what might constitute a more natural and logical ordering and segmentation within UMBEL. After many tests and detailed analysis, we are now releasing the first results of our investigations.

For, like nameable entities or things, we can see a logical segmentation of (mostly) disjoint concepts within the UMBEL TBox. Here are the summary percentages of these high-level splits:

Disjoint Concepts	90%
Attributes	1%
Classifications	9%
TOTAL	100%

(Because the analysis is still being refined, exact counts and percentages for the 20,000 concepts in UMBEL are not provided.)

Why a Logical Segmentation?

As we dove deeper into these ideas, not only could we see the basis for a logical segmentation within UMBEL’s concepts, but manifest benefits from doing so as well. Remember that UMBEL’s concept structure performs two main roles. It: 1) provides a coherent framework for relating and “mapping” other external ontologies; and 2) provides conceptual binding points for organizing entities and instances [4]. Via logical segmentation, we get benefits for both roles.

Here are some of the broad areas of benefit from a logical UMBEL segmentation that we have identified:

Template-driven — as we discuss elsewhere, Structured Dynamics also uses its ontologies to “drive applications” and the user interfaces (UI) that support them. By proper segmentation of UMBEL concepts, we are able to determine to what “cluster” of things (which we call either dimensions or superTypes; see below) a given thing belongs. This identification means we can also determine how best to display information about that “thing”. This determination can include either the attributes or the display templates appropriate for that thing. For example, location-based things or time-based things might invoke map or calendar or timeline type displays. Moreover, because of the logical segmentation of concepts, we can also use the power of the concept graph to infer more generic display templates when specific matches are absent.
Computational Efficiency — as the percentages above indicate, once we identify the superType to which a given instance belongs, we can eliminate nearly all remaining UMBEL concepts from consideration. This logical winnowing leads to computational efficiencies at all levels in the system. The fastest computation is the one that is never run, and when large chunks of data are removed from consideration, many performance advantages accrue.
Disambiguation — via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of these logical segmentations, we also have multiple “clusters” (that is, either the concept, type, superType or dimension) upon which to do our disambiguation evaluations, either between concepts and entities or within the various concept clusters. We can do so via either multiple semantic vectors (for statistical-based methods) or multiple features (for machine learning methods). In other words, because of logical segmentation, we have increased the informational power of our concept graph.
Structure and Integrity Testing — the very mindset of looking for logical segmentation has led to much learning about the UMBEL structure and OpenCyc upon which it is based. In the process, missing nodes (concepts), erroneous assignments, and superfluous nodes are all being discovered. Further, many of these tests can be automated using basic logical and inference approaches. The net result is a constant improvement to the scope and completeness of the structure. Lastly, these same approaches can be applied when mapping external ontologies to UMBEL, providing similar consistency benefits.
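
As a rough illustration of the template-driven and efficiency benefits listed above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the concept names, the BROADER parent map, the template names and both helper functions are hypothetical and are not taken from UMBEL or from any Structured Dynamics code.

```python
# Hypothetical child -> parent links in a tiny concept graph (not real UMBEL data).
BROADER = {
    "Volcano": "Mountain",
    "Mountain": "Earthscape",
    "Conference": "Events",
    "Beetle": "Animals",
}

# Hypothetical display templates registered for a few concepts or superTypes.
TEMPLATES = {
    "Earthscape": "map",
    "Events": "timeline",
}

DEFAULT_TEMPLATE = "generic-record"


def pick_template(concept):
    """Walk up the graph until some ancestor has a registered template."""
    seen = set()
    node = concept
    while node is not None and node not in seen:
        if node in TEMPLATES:
            return TEMPLATES[node]
        seen.add(node)  # guard against accidental cycles
        node = BROADER.get(node)
    return DEFAULT_TEMPLATE


def candidates_for(super_type, concept_to_super_type):
    """Winnow: keep only the concepts assigned to the given superType."""
    return {c for c, st in concept_to_super_type.items() if st == super_type}
```

In this sketch a Volcano has no template of its own, so it inherits the map display from Earthscape, while a Beetle falls through to the generic default. The second function shows the winnowing idea: once the superType is known, only its own concepts remain as candidates.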

With these benefits in mind, we have undertaken concerted analysis of UMBEL to discern what this “logical segmentation” might be. This investigation has occurred over three concentrated periods over the past year. (Intervening priorities or other work prevented concentrating solely on this task.)

We have now completed our first full iteration of investigation. In this post, and then in the subsequent release of UMBEL version 0.80 in the coming weeks, the fruits of this effort should be evident. However, it should also be noted that we are still learning much from this new mindset and approach. Refinement of the UMBEL structure is likely to continue for some time to come.

UMBEL Analysis

Most things and concepts about them are based on real, observable, physical things in the real world. Because most of these things cannot occupy both the same moment in time and the same location in physical space, a useful criterion for looking at these things and concepts is disjointedness.

In a broad sense, then, we can split our concepts of the world between those ideas that are disjoint because they pertain to separable objects or ideas and those that are cross-cutting or organizational or classificatory. Attributes, such as color (pink, for example), are often cross-cutting in that they can be used to describe quite disparate things. Inherent classification schemes such as academic fields of study or library catalog systems — while useful ways to organize the world — are not themselves in-and-of the world or discrete from other ideas. Thus, classificatory or organizational concepts are inherently not disjoint.

With the criterion of disjointedness in hand, then, we began an evaluation process of the UMBEL subject concepts. We looked to organizational schema such as the entity types of Sekine or BBN for some starting guidance. We also wanted our categories to inform logical clusterings of possible data presentation, such as media types or locations or time.

For terminology, we adopted the term superType to denote the largest cluster designation upon which this disjointedness may occur. As a way to test the basic coherence of these superTypes, we also collected them into larger groups which we termed dimensions.

Our analysis process began with branch-by-branch testing of the UMBEL concept graph using automated scripts, attempting to find pivotal nodes where child instance members were disjoint from other superTypes. This we term the “top-down” method.
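
A “top-down” pass of this kind might look something like the following Python sketch. It is not the actual script used for UMBEL; the function names, the children map and the notion of a claimed_elsewhere set (concepts already assigned to other superTypes) are assumptions made purely for illustration.

```python
def descendants(node, children):
    """Return the node plus everything reachable below it."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        stack.extend(children.get(current, ()))
    return found


def pivotal_nodes(root, children, claimed_elsewhere):
    """Top-down walk returning the highest nodes whose whole branch is
    disjoint from the concepts claimed by other superTypes."""
    pivots = []
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if descendants(node, children).isdisjoint(claimed_elsewhere):
            pivots.append(node)  # clean branch: no need to look further down
        else:
            stack.extend(children.get(node, ()))  # mixed branch: test each child
    return sorted(pivots)


# Hypothetical toy graph, not real UMBEL content.
children = {
    "Thing": ["Organism", "Artifact"],
    "Organism": ["Animal", "Plant"],
    "Animal": ["Frog", "Beetle"],
    "Plant": ["Oak"],
    "Artifact": ["Car"],
}
claimed_elsewhere = {"Plant", "Oak", "Artifact", "Car"}
pivots = pivotal_nodes("Thing", children, claimed_elsewhere)
```

With this toy data, Thing and Organism are mixed branches, so the walk descends until it reaches Animal, the highest node whose members are all unclaimed.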

This automated analysis was then supplemented with a complete manual inspection of all unassigned and assigned concepts, with a “bottom up” assignment of concepts or corrections to the automated approach. This inspection then led to new insights and identification of missing concepts that needed to be added into UMBEL.

We are still converging between these two methods. Optimally, we should be able to tease out all UMBEL superTypes with relatively few union, intersection, or complement set operations. In its current form, we are close, but there are still some rough spots.
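
To make the set-operation idea concrete, here is a small Python sketch using the built-in set type. The concept sets are invented for illustration and are not actual UMBEL assignments.

```python
# Hypothetical concept sets; the names are illustrative only.
living_things = {"Frog", "Beetle", "Oak", "Fern", "Bacterium"}
plants = {"Oak", "Fern"}
prokaryotes = {"Bacterium"}

# Complement (set difference): living things that are neither plants nor prokaryotes.
animals = living_things - plants - prokaryotes

# Union: everything assigned to some superType so far.
assigned = animals | plants | prokaryotes

# Intersection: any overlap left between two candidate superTypes.
overlap = animals & plants

# A clean segmentation leaves nothing unassigned and nothing shared.
unassigned = living_things - assigned
```

Here a single difference operation is enough to carve out the animals, and the empty overlap and unassigned sets confirm that the three groups partition the starting branch.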

Nonetheless, this analysis method has led us to identify some 33 superTypes [5], clustered into 9 dimensions. Of these, 29 superTypes and 8 dimensions are mostly disjoint. The one dimension of Classificatory includes the four cross-cutting superTypes of attributes and organizational schema that can apply to any of the 29 disjoint superTypes.

UMBEL superTypes

Here is the schema, with the descriptions of each:

Dimension	superType	Description/Sub-types
Natural World	Natural Phenomena	This superType includes natural phenomena and natural processes such as weather, weathering, erosion, fires, lightning, earthquakes, tectonics, etc. Clouds and weather processes are specifically included. Also includes climate cycles, general natural events (such as hurricanes) that are not specifically named, and biochemical processes and pathways.
	Natural Substances	Notable inclusions are minerals, compounds, chemicals, or physical objects that are not the outcome of purposeful human effort, but are found naturally occurring. Other natural objects (such as rock, fossil, etc.) are also found under this superType.
	Earthscape	The Earthscape superType consists mostly of the collection of cartographic features that occur on the surface of the Earth. Positive examples include Mountain, Ocean, and Mesa. Artificial features such as canals are excluded. Most instances of these features have a fixed location in space. Underground and underwater features are also explicitly included. This superType is explicitly disjoint with Extraterrestrial (see below).
	Extraterrestrial	This superType includes all natural things not specifically terrestrial, including celestial bodies (planets, asteroids, stars, galaxies, etc., that can be located within a sky map)
Living Things	Prokaryotes	The Prokaryotes include all prokaryotic organisms, including the Monera, Archaebacteria, Bacteria, and blue-green algae. Also included in this superType are viruses and prions.
	Protists or Fungus	This is the remaining cluster of eukaryotic organisms, specifically including the fungus and the protista (protozoans and slime molds).
	Plants	This superType includes all plant types and flora, including flowering plants, algae, non-flowering plants, gymnosperms, cycads, and plant parts and body types. Note that all Plant Parts are also included.
	Animals	This large superType includes all animal types, including specific animal types and vertebrates, invertebrates, insects, crustaceans, fish, reptiles, amphibia, birds, mammals, and animal body parts. Animal parts are specifically included. Also, groupings of such animals are included. Humans, as an animal, are included (versus as an individual Person). Diseases are specifically excluded.
	Diseases	Diseases are atypical or unusual or unhealthy conditions for (mostly human) living things, generally known as conditions, disorders, infections, diseases or syndromes. Diseases only affect living things and sometimes are caused by living things. This superType also includes impairments, disease vectors, wounds and injuries, and poisoning
	Person Types	The appropriate superType for all named, individual human beings. This superType also includes the assignment of formal, honorific or cultural titles given to specific human individuals. It further includes names given to humans who conduct specific jobs or activities (the latter case is known as an avocation). Examples include steelworker, waitress, lawyer, plumber, artisan. Ethnic groups are specifically included.
Human Activities	Organizations	Organization is a broad superType and includes formal collections of humans, sometimes by legal means, charter, agreement or some mode of formal understanding. Examples include geopolitical entities such as nations, municipalities or countries; or companies, institutes, governments, universities, militaries, political parties, game groups, international organizations, trade associations, etc. All institutions, for example, are organizations. Also included are informal collections of humans. Informal or less defined groupings of humans may result from ethnicity or tribes or nationality or from shared interests (such as social networks or mailing lists) or expertise (“communities of practice”). This superType also includes the notion of identifiable human groups with set members at any given point in time. Examples include music groups, cast members of a play, directors on a corporate Board, TV show members, gangs, mobs, juries, generations, minorities, etc. Finally, Organizations contain the concepts of Industries and Programs and Communities.
	Finance & Economy	This superType pertains to all things financial and with respect to the economy, including chartable company performance, stock index entities, money, local currencies, taxes, incomes, accounts and accounting, mortgages and property.
	Culture, Issues, Beliefs	This category includes concepts related to political systems, laws, rules or cultural mores governing societal or community behavior, or doctrinal, faith or religious bases or entities (such as gods, angels, totems) governing spiritual human matters. Culture, Issues, beliefs and various activisms (most -isms) are included
	Activities	These are ongoing activities that result (mostly) from human effort, often conducted by organizations to assist other organizations or individuals (in which case they are known as services, such as medicine, law, printing, consulting or teaching) or individual or group efforts for leisure, fun, sports, games or personal interests (activities)
Human Works	Products	This is the largest superType and includes any instance offered for sale or performed as a commercial service. Often a physical object made by humans that is not a conceptual work or a facility, such as vehicles, cars, trains, aircraft, spaceships, ships, foods, beverages, clothes, drugs, weapons. Products also include the concept of ‘state’ (e.g., on/off).
	Food or Drink	This superType is any edible substance grown, made or harvested by humans. The category also specifically includes the concept of cuisines.
	Drugs	This superType covers any drug, medication or addictive substance.
	Facilities	Facilities are physical places or buildings constructed by humans, such as schools, public institutions, markets, museums, amusement parks, worship places, stations, airports, ports, carstops, lines, railroads, roads, waterways, tunnels, bridges, parks, sport facilities, monuments. All can be geospatially located. Facilities also include animal pens and enclosures and general human “activity” areas (golf course, archeology sites, etc.). Importantly, Facilities include infrastructure systems such as roadways and physical networks. Facilities also include the component parts that go into making them (such as foundations, doors, windows, roofs, etc.).
Information	Chemistry (n.o.c.)	This superType is a residual category (n.o.c., not otherwise categorized) for chemical bonds, chemical composition groupings, and the like. It covers what is neither a natural substance nor a living-thing (organic) substance.
	Audio Info	This superType is for any audio-only human work. Examples include live music performances, record albums, or radio shows or individual radio broadcasts
	Visual Info	This superType includes any still image or picture or streaming video human work, with or without audio. Examples include graphics, pictures, movies, TV shows, individual shows from a TV show, etc.
	Written Info	This superType includes any general material written by humans, including books, blogs, articles and manuscripts, as well as any other written information conveyed via text.
	Structured Info	This information superType is for all kinds of structured information and datasets, including computer programs, databases, files, Web pages and structured data that can be presented in tabular form
	Notations & References	Akin to conceptual works, these are codified means of human expression. Examples range from human languages themselves, to more domain-specific cases such as chemical symbols, genetic code (A-G-C-T), protocols, and computer languages, mathematical and set notations, etc. Identifiers (numeric or alphanumeric identifiers for objects, often in a highly patterned way, such as phone numbers, URLs, zip and postal codes, SKUs, product codes, etc.), Units (any of the various ways in which measurement, space, volume, weight, speed, intensity, temperature, calories, seismic intensity or other quantitative descriptions of phenomena can be made) and key reference types are also included in this superType.
	Numbers	This unique superType is for any abstract representation of numbers and numerics
Human Places	Geopolitical	Named places that have some informal or formal political (authorized) component. Important subcollections include Country, IndependentCountry, State_Geopolitical, City, and Province.
	Workplaces, etc.	These are various workplaces and areas of human activities, ranging from single person workstations to large aggregations of people (but which are not formal political entities)
Time-related	Events	These are nameable occasions, games, sports events, conferences, natural phenomena, natural disasters, wars, incidents, anniversaries, holidays, or notable moments or periods in time
	Time	This superType is for specific time or date or period (such as eras, or days, weeks, months type intervals) references in various formats
Descriptive	Attributes	This general superType category is for descriptive attributes of all kinds. Think of the specific attributes in Wikipedia “infoboxes” to understand the purpose and coverage of this superType. It includes colors, shapes, sizes, or other descriptive characteristics about an object
Classificatory	Abstract-level	This general superType category is largely composed of former AbstractConcepts, and represents some of the more abstract upper-level nodes for connecting the UMBEL structure together. This superType also includes theories or processes or methods for humans to do stuff or any human technology.
	Topics/Categories	This largely subject-oriented superType is a means for using controlled vocabularies and classification schemes for characterizing what content “is about”. The key constituents of this category are Types, Classifications, Concepts, Topics, and controlled vocabularies
	Markets & Industries	This superType is a specialized classificatory system for markets and industries. It could be combined with the superType above, but is kept separate in order to provide a separate, economy-oriented system.

These may undergo some further refinement prior to release of UMBEL v 0.80, and some of the definitions will be tightened up.

(Note: It should also be mentioned that some of these superTypes lend themselves to further splits and analysis. The Product superType, for example, is ripe for such treatment.)

Distribution of superTypes

The following diagram shows the distribution of these 20,000 UMBEL concepts across major areas. By far the largest superType is Products, even with further splits into Food and Drinks and Pharmaceuticals. The next largest categories are the Person, Places and Events superTypes, with Organizations and Animals not far behind:

Even in its generic state, UMBEL provides a very rich vocabulary for describing things or for tying in more detailed external ontologies. There are nearly 5,000 concepts across products of all types, for example.

Possible Overlaps (non-disjoint) between superTypes

You may recall that our analysis showed 29 of the superTypes to be “mostly disjoint.” This is because there are some concepts — say, MusicPerformingAgent — that can apply to either a person or a group (band or orchestra, for example). Thus, for this concept alone, we have a bit of overlap between the normally disjoint Person and Organization superTypes.

The following shows the resulting interaction matrix where there may be some overlap between superTypes:

This kind of interaction diagram is also useful for further analyzing the concept graph structure.
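
An interaction matrix of this sort can be derived mechanically from superType membership. The Python sketch below is only an illustration under stated assumptions: the membership sets are invented (apart from MusicPerformingAgent, the overlap example mentioned above), and the two function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def interaction_matrix(super_types):
    """Map each pair of superType names to the concepts they share."""
    overlaps = {}
    for name_a, name_b in combinations(sorted(super_types), 2):
        shared = super_types[name_a] & super_types[name_b]
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


def disjoint_share(super_types):
    """Fraction of all concepts that belong to exactly one superType."""
    counts = Counter(c for members in super_types.values() for c in members)
    if not counts:
        return 0.0
    exclusive = sum(1 for n in counts.values() if n == 1)
    return exclusive / len(counts)


# Hypothetical membership sets, for illustration only.
super_types = {
    "Person Types": {"Lawyer", "MusicPerformingAgent"},
    "Organizations": {"Company", "MusicPerformingAgent"},
    "Animals": {"Frog"},
}
matrix = interaction_matrix(super_types)
share = disjoint_share(super_types)
```

Only pairs with a non-empty intersection appear in the result, so a sparse matrix is itself evidence that the superTypes are mostly disjoint. The second function gives the kind of overall disjointness percentage discussed in this section.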

Even Where Overlaps Occur, They are Minor

Of the 29 “mostly” disjoint superTypes, only a relatively few show potential interactions, and then only in minor ways. We can illustrate this (drawn to scale) for the interaction between the Product, Food & Drink and Drug (Pharmaceuticals) superTypes, with the fully disjoint Organization superType thrown in for comparison:

Across all 20,000 concepts, then, fully 85% are disjoint from one another (5% is lost due to overlaps between “mostly” disjoint superTypes). This is a surprisingly high percentage, which makes it even more likely that the benefits previously noted can be delivered.

Interim Conclusions and Observations

These are exciting findings that bode well for UMBEL’s ongoing role and usefulness. Also, the very detailed analysis that has led to these interim findings very much reaffirms the wisdom of basing UMBEL on Cyc. Cyc showed itself to be admirably coherent and remarkably complete. (It also appears that the first versions of UMBEL were extracted well in terms of good coverage.)

This approach now gives us an understandable and defensible basis for logical segmentation of UMBEL. It also provides a much-desired alternative to the earlier Abstract Concepts, which will now be dropped entirely as a schema concept.

One area deserving further attention is in the Attribute superType. We are in the process, for example, of analyzing attributes across Wikipedia and need to look through a slightly different lens at this superType [6]. This area is further important in its strong interaction with the Instance Record Vocabulary that is accompanying this effort on the entity side.

Another lesson for us has been to back away from the terminology of named entity, introduced at MUC-6. The expansion of that idea into other “nameable” things has caused us to embrace the “instance” nomenclature, as evidenced by our emerging IRV.

It is rewarding to prepare this next iteration release of UMBEL with its new mindset of logical segmentation and disjointedness. But — what is also clear — there are many treasures left to mine still hidden in the inherent structure of UMBEL and its Cyc parent.

[1] The original labels were ENAMEX for entity named expression and NUMEX for numeric expression. The markup format specified was also SGML. For an interesting history of this MUC-6 watershed, see Ralph Grishman and Beth Sundheim, 1996. Message Understanding Conference – 6: A Brief History, in Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Copenhagen, 1996, 466–471.

[2] In a named entity, the word named applies to entities that have “rigid designators”, as defined by Kripke, for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.

Sekine’s extended hierarchy proposed in 2002 is made up of 200 subtypes, with 32 larger clusters within that. Here is the top level of the Sekine type system:

Name-Other	Title	Timex	Frequency
Person	Unit	Periodx	Rank
Organization	Vocation	Numex-Other	Age
Location	Disease	Money	School Age
Facility	God	Stock Index	Latitude Longitude
Product	ID Number	Point	Measurement
Event	Color	Percent	Countx
Natural Object	Time-Other	Multiplication	Ordinal Number

Though developed separately and for different purposes, the BBN categories, also proposed in 2002, consist of 29 types and 64 subtypes. Here are the BBN types (Note: BBN claims 29 types because there are double entries or considerations for the first five entries):

Person	Time	Animal
NORP (adjectival GPEs)	Percent	Substance
Facility	Money	Disease
Organization	Quantity	Work of Art
GPE (geopolitical places)	Ordinal	Law
Location	Cardinal	Language
Product	Events	Contact Info
Date	Plant	Game

Of course, other entity extraction systems have similar clusterings and approaches. Though less formal in the sense of a hierarchy or purported complete entity coverage, here for example is the listing of entity types within Calais:

Anniversary	FaxNumber	NaturalFeature	RadioProgram
City	Holiday	OperatingSystem	RadioStation
Company	IndustryTerm	Organization	Region
Continent	MarketIndex	Person	SportsEvent
Country	MedicalCondition	PhoneNumber	SportsGame
Currency	Movie	Position	SportsLeague
EmailAddress	MusicAlbum	Product	Technology
EntertainmentAwardEvent	MusicGroup	ProgrammingLanguage	TVShow
Facility	NaturalDisaster	ProvinceOrState	TVStation
		PublishedMedium	URL

See further the Wikipedia entry on named entity recognition.

[3] We use the reference to “TBox” in accordance with our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”

[4] UMBEL also provides a SKOS-based vocabulary extension for describing other domains and mappings between classes and instances. This purpose, however, is outside of the scope of this current article.

[5] As a reference roadmap, UMBEL was specifically designed not to include meronymous (part of) relationships (see further this reference). Thus, all “part of” type concepts were assigned to the whole superType category for which they are a part. Thus, “animal parts” are assigned to the superType Animal; “car parts” to the superType Product.

[6] For a general discussion of attributes and their relation to entities, see Satoshi Sekine, 2008. Extended Named Entity Ontology with Attribute Information, in Proceedings of the 6th edition of the Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco. See http://www.lrec-conf.org/proceedings/lrec2008/pdf/21_paper.pdf.

Posted:September 1, 2009

More than 200 Semantic Web-related Papers Using Wikipedia

The Third Major Update in the SWEETpedia Series

The need to undertake some recent research has provided the occasion for me to update AI3's listing, called SWEETpedia, of (largely) academic papers pertaining to the use of Wikipedia for semantic Web-related topics. These papers cover such things as information extraction, named entity recognition, word sense disambiguation, concept hierarchies, ontology and question/answer support, and so forth.

Please go here to see this alphabetized and updated SWEETpedia listing of 207 papers. It is really quite impressive, and represents 44 new papers since my last (and second) update nearly a year ago. While the pace of academic attention seems to be tailing off a bit (now that the usefulness of Wikipedia to these topics has become clear), the quality remains high.

As with past listings, I encourage any researchers that have had their paper inadvertently missed to comment on this blog post and I will make sure the oversight is corrected in the next listing.

Oh, by the way, there are many strategies I employ to find these papers, but here are a few tips you can apply on your own depending on your specific interests:

First peruse SWEETpedia, of course!
The listing of academic papers involving Wikipedia on Wikipedia itself
A search engine query such as wikipedia filetype:pdf “named entities” (this is a Google example; replace with your own specific topics and consider using Scholar search and restrictions by posting or publication dates), or
Custom searches within useful aggregation sites, such as the ACL Anthology.

Again, please provide any missing papers by commenting below. And, of course, enjoy the updated SWEETpedia listing!

Posted:August 23, 2009

Minor Disruptions

In the Future, All of us May be SysAdmins

OK, well, I just finished moving and upgrading some dozen Web sites and wikis, including this one (my main blog), over the weekend, from fixed stuff to the “clouds”. Believe you me, there were some pretty massive changes required.

For someone like me who is relatively clueless about such things, the process has been interesting (to say the least).

It seems like our modern era either involves moving digital things or converting digital things. As for moving, we all experience that laptop or hard drive dying, and then the move. (The Death of a Laptop actually happened to my wife this past week.) But it also is changing providers and venues — what caused me to move all of these Web sites.

Shedding the Snake Skin

So, the mainstream digital age has existed for what, now, some 40 years? How many data formats have we transitioned (ASCII, EBCDIC, UTF-8, an immense number)? And, how many systems and environments have we transitioned?

At the risk of dating myself, when I was in college we still used slide rules; truly the end of an era. Just a year or two later everyone transitioned to having TI or HP calculators, which some wore on their hips like some PDAs and cell phones today.

I won’t bore everyone with my own transition from my first computer (an HP 9100 with 4K RAM and program listings on cash register tapes) through many others including a DEC Rainbow PC with CP/M (a beauty!). For many years, as we moved into the PC era and IBM legitimized the shift, every computer I bought seemed to cost about $3000. Each one was more capable, etc., but they all cost the same.

And, then, about the late 1990s, that changed. In fact, my last capable desktop machine cost way south of $1000.

But, I digress.

What has been the real constant across these decades has been system and data migration. Granted, many of the docs and many of the systems in my own experience from 30 yrs ago have no relevance today (god, do I miss WordPerfect with its embedded, editable codes!), but actually an important minor portion do.

For these, I need to move both apps and data (with readable formats) for each generational transition.

I know that organizations, like the Library of Congress in its NDIIPP program, need to worry about digital preservation, potentially for millennia. These are worthwhile concerns.

But, from my own more prosaic standpoint, I see this issue with my own lens and own bas relief. I am constantly moving apps and data, each transition much like a snake shedding its skin.

It makes one wonder about the effort and process by which the entire meaningful cultural history of our species continues to adapt and transition forward.

Getting Back to Real

Hmmm. All of us have seen these transitions and the loss of productivity they bring in that shift. (Some might argue that the lack of productivity gains from computers until this decade was due to such transitions; at least now, with the Web, we see a more common migration framework.)

I think we have no choice but to transition to the next latest and greatest as it emerges. Automated means at acceptable cost for doing such transitions will also be attractive.

But the real point, I think, is that such transitions are inevitable. Faster apps: Check! Better apps: Check! Easier data exchange: Check!!

Living with transition thus becomes a clear constant for all of us as we move forward. And, part of that is accepting downtime to screw around moving the keepable old to the potentially useful new.

After this weekend, I’m now ready for a couple of days off before the real work week begins (yeah, right, keep dreaming).

Posted:August 17, 2009

Confronting Misconceptions with Adaptive Ontologies

Ontology Best Practices for Data-driven Applications: Part 4

The earlier portions of this occasional series have set the groundwork for the role of ontologies in data-driven applications. In this part, I address many of the current misconceptions of what ontologies do or do not do. For, as practiced by Structured Dynamics, our adaptive TBox-level ontologies [1] are definitely not your grandfather’s Oldsmobile.

To share the punch line early, these modern ontologies are fast to develop, easy to change, adaptive to new knowledge and perceptions, robust and flexible. Indeed, it is the structure and nature of these adaptive ontologies that is the heart and secret of data-driven applications.

Any knowledge worker can understand and refine the organization and relationship of information via these structures. And, most importantly, the resulting ontologies are sufficient to drive the generic applications that are based on them. Focusing on data and structure now becomes the emphasis. We can now remove prior bottlenecks arising from the need to customize applications, configure report writers, or wait for IT to generate SQL queries.

But, not all ontologies are created equal and not all practitioners explain or see them in the same way. The purpose of this Part 4 in our series is to present many of the misconceptions, offering a score of takeaway messages for how properly considered and constructed ontologies can achieve these benefits.

Misconception: No ‘Big Bang’ Needed

To be sure, there are many very large and comprehensive ontologies. Some are focused on specific applications or domains; some are general; and some are the result of large and well-funded projects [2]. I am not arguing that such efforts do not have their role and place. But when viewed as exemplars or notable cases, these complex and comprehensive ontologies can create a misconception that such a scope is an imperative of proper ontology design.

I believe quite the opposite to be true.

An incredible strength of RDF and OWL ontologies is that they can be built incrementally. So long as additions are coherent with some degree of self-consistency in terms of the world view in which they are represented, any of an ontology’s constituent concepts, predicates or entities and datasets can be added and enhanced as needed. This makes ontologies a very different cat from relational schema, which are notoriously brittle with expensive re-architecting required anytime that scope or schema change.
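To make the incremental point concrete, here is a minimal sketch in plain Python. It treats an ontology as nothing more than a growing set of (subject, predicate, object) statements. The `ex:` names and the `add` helper are hypothetical, invented for illustration; a real project would likely use an RDF library rather than bare tuples.

```python
# Hypothetical mini triple store: a set of (subject, predicate, object) tuples.
# All ex: names are invented for illustration only.
ontology = set()

def add(triples):
    """Add new statements; statements already present are left untouched."""
    ontology.update(triples)

# Start small: two classes are enough to begin.
add({
    ("ex:Customer", "rdf:type", "owl:Class"),
    ("ex:Product", "rdf:type", "owl:Class"),
})
before = set(ontology)

# Later: a new predicate and a subclass arrive as understanding grows.
add({
    ("ex:purchased", "rdf:type", "owl:ObjectProperty"),
    ("ex:PreferredCustomer", "rdfs:subClassOf", "ex:Customer"),
})

# Every earlier statement still holds after the extension.
assert before <= ontology
print(len(ontology))
```

Nothing existing had to be migrated or re-architected to accept the new statements, which is the contrast with a relational schema change that the paragraph above draws.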

Enterprise consultants that advocate “big” upfront ontology development efforts are doing their clients a massive disservice. They are also cynically playing on the experience with relational schema. As soon as the marketplace begins to realize that ontologies are incredibly plastic and malleable, this huge advantage of ontologies over the relational model for data federation will ring clear.

Takeaway Message #1: Ontologies can (and should!) start small.

Takeaway Message #2: Ontologies can (and should!) grow incrementally.

Misconception: No ‘One Ring to Rule Them All’

As a practitioner, two of the most boring arguments I hear are: Ontology X is better than other ontologies and here is why; and, Use of some reference or upper ontology reduces choice and freedom. Both arguments are somewhat grounded in the ‘one ring to rule them all’ mindset — though coming from opposing perspectives — that I think fundamentally misreads the role and purpose of ontologies.

Ontologies provide an organizing context for relating disparate information together and for making meaningful inferences. Without such a framework these purposes can not be achieved. But the framework itself is a function of the world view, context and domain scope at hand. As a result, there is only context, and not some single, universal “truth.” As they say, it all depends.

The trick, then, to properly designed ontologies is to maintain internal coherence and self-consistency [3]. When done, it is then possible to relate disparate information and data to other data and to make intelligent business inferences.

So, the use of an ontology does not limit freedom. It sets the context for making connections and setting relations. And, as long as it is coherent, the “correct” ontology is the one that best captures the scope and domain at hand. Arguing for one ontology versus another is wasted energy. Just get on with it.

Takeaway Message #3: There is no single “truth”, only coherence and relevant context.

Misconception: No Such Thing as an ‘Ontological Commitment’

One of the more pernicious ideas promoted by some practitioners or advocates is the idea of ‘ontological commitment.’ Though some definitions are relatively benign, such as the one offered by the Stanford Knowledge Systems Laboratory (KSL) [4], the unfortunate use of the term “commitment” implies permanence and immutability. (In fact, most definitions of this phrase affirm this interpretation.)

This is really unfortunate, as it again tends to reinforce the inaccurate analogies with brittle and inflexible relational schema.

A much better way to view ontologies is not as a “commitment,” but as a vehicle for developing a common world view within the enterprise. Under this viewpoint, ontology development is somewhat analogous to master data management (MDM) or corporate taxonomies [5]. In this broader sense, then, ontology development can become a means for developing and refining a common language within the enterprise through consensual or community processes.

For the reasons as noted above, as language or conceptual relationships or understandings change, so can the vocabulary or structural character of the ontology change. There is no “lock in”; there is no “commitment”. As long as it is coherent, the ontology can morph to reflect the scope and understandings of the current snapshot in time.

This flexibility results from the fact that the ontologies, properly constructed, can drive a generic set of tools and applications that express themselves based on the underlying structure and vocabulary within those ontologies. The ontologies can thus change at will without any adverse effects whatsoever on the applications based on them.

This data-driven aspect, as noted throughout this series, is quite different from any prior paradigm. So, under this view ontologies have considerably more focus and importance than even some of the strongest ontology advocates claim, yet paradoxically without the theoretical bloat or heaviness many purport. Like human languages, our language and concepts within ontologies change as our world and perceptions change.

Takeaway Message #4: There is no “lock-in” with ontologies; they may be modified and changed at will.

Takeaway Message #5: Like corporate taxonomies or MDM, ontologies provide a framework for enterprises to develop internally consistent common languages or vocabularies.

Takeaway Message #6: Unlike corporate taxonomies or MDM, ontologies can directly drive generic tools and applications.

Misconception: No Need for Completeness or Comprehensiveness

Ontology development is not some imperative for conceptual “truth”; rather, it is a very adaptable means for stating, testing and refining stuff. Like agile development for software, this refining approach can and should proceed incrementally. Too often ontology efforts get caught like deer in the headlights awaiting some “completeness” threshold before release.

One means to promote this approach is to tackle single datasets or data stores individually before moving on. Having a sense of the eventual scope is useful, of course. But it is also quite acceptable to only fill out those portions of the structure with data available at hand.

These observations reflect a prejudice to action and release, rather than theory. If mistakes are made, fine: simply correct them.

Takeaway Message #7: Understand the full scope, but only build out for the data in hand.

Misconception: No Need for Predicate Bloat

It is advisable to keep relationships (predicates) simple at first. As with human languages, keeping the verbs simple until fluency is gained is a best practice.

While all of us can see nuances and subtleties heading into a project, trying to accommodate those predicates (relationships) at the outset can introduce unnecessary complexity. This is in no way advocacy for inaccurate predicates, but rather for erring on the side of the general and broader at first.

For organizations familiar with taxonomies, the SKOS vocabulary is a good focus, and there are some other standard starting ontologies that provide a good starting base of predicates [6]. Then, as you work with your data and its requirements, you can later expand to more sophisticated relationships.

In taking this approach you will still see immediate benefits due to the value of connected data through the Linked Data Law [7]. But, at the same time, you will be embracing a simpler language to start and then gain fluency.

Takeaway Message #8: Use simple, well-defined and documented predicates (properties or attributes).

Takeaway Message #9: You are building a common language for the enterprise; do so purposefully.

Misconception: No Need for Expensive Up-front Engineering

All of these observations lead to the conclusion that upfront ontology development need not be expensive. Any consultant selling six-figure ontology development to businesses ought to be seriously challenged. Start small and focused. Frankly, a simple spreadsheet taxonomy or quick conversion of existing XML or metadata or vocabulary standards is A-OK to get started.

Takeaway Message #10: Start small with stakeholders to build acceptance and best practices.

Takeaway Message #11: Start immediately to organize and federate existing information.

Misconception: No Need to Reinvent the Wheel

While it is true that the usefulness of ontologies as advocated by Structured Dynamics is greater than other constructs, these ontologies still just represent a more capable representation of knowledge structures that have been around in various other forms for years. For decades enterprises have created schema, taxonomies, controlled vocabularies, standards, and other knowledge structures that represent untold time, dollars and effort. It would be a waste to not fully leverage these sunk investments.

Further, many ontologies and interoperable structures also exist external to the enterprise, many open source and freely available. And, even if not all are already in proper ontological form, like internal structures these other constructs can be relatively easily leveraged and turned into ontology-ready form.

So, what we are doing with adaptive ontologies is not creating new structures or new representations from scratch, but leveraging the expressions of our current world views. These have been hard-earned, codified over years of effort, and are legacy expressions of the enterprise’s knowledge base.

In this vein, then, there is already much richness available to any organization upon which to embark on their ontology efforts. Use them, and gain great leverage.

Takeaway Message #12: Aggressively mine and re-use existing knowledge and structure.

Takeaway Message #13: Leverage and re-use appropriate portions of the “best” existing, external ontologies.

Misconception: No Requirement to Displace Existing Assets

Continuing in this same spirit, it is a mistake to see adaptive ontologies and the associated systems advocated by Structured Dynamics as a replacement for existing data assets. Rather, the idea and advantage is to keep data records in situ as much as possible. These are already performing investments that can be left largely as is. The role of the adaptive ontologies is to act as a federation layer that bridges across these existing assets.

This leverage of existing data assets can occur via the architecture of the system (generally Web-oriented architecture [8]) and a design of the data system and structures providing proper allocation between the ABox and TBox [1].

All of this maintaining of existing assets is aided by the ability to convert in-place data to ontology-ready RDF form. This is a separate topic in its own right and one I discuss elsewhere [9]. There is also a need to make sure that the attributes of the underlying instance records (generally, the columns within a relational table) are also properly modeled within the adaptive ontology. This is part of the best practices guidelines.

Of course, how much of the existing assets can be leveraged “as is” and what degree of modification or conversion might be necessary needs to be evaluated on a case-by-case basis. Generally, however, these mappings can be pretty straightforward and leave in place all existing hardware, software and administration procedures.
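As a rough illustration of such a mapping, the sketch below expresses one relational row as triples while leaving the source record where it is. The column-to-predicate table, the `ex:` namespace, and the function name are all assumptions made up for this example; they are not the conversion tooling referred to above.

```python
# Hypothetical mapping from relational columns to ontology predicates.
# Columns without an entry are simply left in place in the source system.
COLUMN_TO_PREDICATE = {
    "name": "foaf:name",
    "email": "foaf:mbox",
}

def row_to_triples(table, key, row):
    """Express one relational row as (subject, predicate, object) triples."""
    subject = f"ex:{table}/{row[key]}"
    # The instance (ABox) is typed against a class defined in the TBox.
    triples = [(subject, "rdf:type", f"ex:{table.capitalize()}")]
    for column, value in row.items():
        predicate = COLUMN_TO_PREDICATE.get(column)
        if predicate is not None and value is not None:
            triples.append((subject, predicate, value))
    return triples

# An invented sample record, as it might come back from an existing database.
row = {"id": 42, "name": "Ada", "email": "ada@example.org", "internal_flag": 1}
triples = row_to_triples("customer", "id", row)
for triple in triples:
    print(triple)
```

Only the mapped columns surface in the federation layer; the database, its administration, and the unmapped columns stay exactly as they were.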

Takeaway Message #14: Leverage your existing databases as rich sources of instance records (“ABox”).

Takeaway Message #15: Explicitly design your TBox ontologies to be an interoperability layer over these existing record stores.

Takeaway Message #16: Reconcile the semantics across the enterprise’s data stores at this interoperable TBox layer.

Misconception: No Closed World Assumptions

A closed world assumption holds that any statement that is not known to be true is false. Most enterprise database and transaction systems are based on this premise. It works well where there is complete coverage of the entities within a knowledge base, such as the enumeration of all customers or all products of an enterprise.

Yet, in the real (“open”) world there is no guarantee or likelihood of complete coverage. Thus, under an open world assumption the absence of a given assertion or fact implies neither that the assertion is true nor that it is false: it simply is not known.

An open world assumption is one of the key factors for enabling adaptive ontologies to grow incrementally. It is also the basis for enabling linkage to external (and surely incomplete) datasets.

In fact, systems designed around the open world assumption can still achieve closed world reasoning where the circumstances and completeness of the knowledge base permit. But, rather than being a logical outcome of the framework, such completeness axioms need to be explicitly stated. Thus, open world systems can achieve the same ends as closed ones where applicable, but with greater flexibility and extensibility.
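The difference between the two assumptions can be sketched in a few lines of Python. This is a toy under stated assumptions, not a reasoner: the fact set, the function names, and the `complete_predicates` parameter (standing in for an explicitly stated completeness axiom) are all hypothetical.

```python
# Invented facts for illustration: (subject, predicate) pairs known to be true.
facts = {("acme", "isCustomer"), ("globex", "isCustomer")}

def closed_world(subject, predicate):
    """Closed world: anything not known to be true is taken as false."""
    return (subject, predicate) in facts

def open_world(subject, predicate, complete_predicates=frozenset()):
    """Open world: a missing fact is unknown (None) unless completeness
    has been explicitly declared for that predicate."""
    if (subject, predicate) in facts:
        return True
    if predicate in complete_predicates:
        return False  # the explicit completeness axiom permits a negative answer
    return None  # not known either way

print(closed_world("initech", "isCustomer"))
print(open_world("initech", "isCustomer"))
print(open_world("initech", "isCustomer", {"isCustomer"}))
```

The third call shows the point made above: an open world system can still give the closed world answer, but only where completeness has been stated rather than assumed.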

Takeaway Message #17: No enterprise is an island; design according to the open world assumption.

Misconception: No Restriction to a Dedicated Priesthood

Consultants make their money and academics their reputation often by making things more obscure and jargon-laden than they need be. Ontologies (heck, even the name itself) are no exception.

But what we have laid out as general guidelines herein and their reduction to practice does not require a priesthood. Sure, there are some things to learn and some practices to follow, but these are certainly easier to understand and master than, say, a programming or scripting language. Adaptive ontologies done right can be a participatory activity within most any organization.

Some guidance and mentoring would certainly be helpful. Make sure to pick the right individuals that truly embrace these perspectives.

Also helpful would be the assistance of groups skilled in team building and group participation [10].

Takeaway Message #18: Engage all knowledge stakeholders in ontology creation, review and refinement.

Takeaway Message #19: Use selected ontology engineers to help ensure consistency, but not necessarily structure.

Design for Data-driven Apps

The above addresses misconceptions related to how the market perceives current ontologies or how some advocates push the concept. But there are some unique perspectives that Structured Dynamics brings to ontology development specific to the purpose of data-driven applications. From a best practices standpoint, these considerations should also be included.

In order to properly “drive” applications and user interfaces and reports, specific design attention needs to be given to:

Linked data, and the use and accessibility of URIs as resource identifiers
Context- and instance-sensitive data display, including templates, and
Driving user interfaces via the inclusion of preferred and alternate labels in the ontology.

Of course, there are other considerations that come to bear. But these lend themselves to some rather simple checklist guidelines during ontology development and maintenance.
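The third item in the list above, labels driving the user interface, might look something like the following sketch. The label table and both helper functions are hypothetical; the `prefLabel` and `altLabels` keys are modeled loosely on the SKOS idea of preferred and alternate labels.

```python
# Hypothetical label annotations keyed by concept URI (invented for illustration).
LABELS = {
    "ex:Automobile": {"prefLabel": "Automobile", "altLabels": ["Car", "Motor vehicle"]},
    "ex:Widget": {"altLabels": ["Gadget"]},
}

def display_label(uri):
    """Pick the text a generic UI would show for a concept."""
    entry = LABELS.get(uri, {})
    if entry.get("prefLabel"):
        return entry["prefLabel"]
    if entry.get("altLabels"):
        return entry["altLabels"][0]
    # Fall back to the local part of the identifier when no label exists.
    return uri.split(":", 1)[-1]

def matches(uri, term):
    """Let a search box find a concept by any of its labels, case-insensitively."""
    entry = LABELS.get(uri, {})
    names = [entry.get("prefLabel", "")] + entry.get("altLabels", [])
    return term.lower() in (name.lower() for name in names)

print(display_label("ex:Automobile"))
print(display_label("ex:Widget"))
print(matches("ex:Automobile", "car"))
```

Because the application reads its display text from the ontology, renaming a concept or adding a synonym changes what users see without touching application code.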

Takeaway Message #20: Follow some relatively straightforward best practices to gain all of the advantages of adaptive ontologies.

This post is part of an occasional AI3 series on ontology best practices.

[1] We use the reference to “TBox” in accordance with our working definition for description logics:

"Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts."

[2] Chemicals, petroleum and pharmaceuticals are renowned for large-scale, vertical ontologies. Examples of general or upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc, BFO (Basic Formal Ontology) and UMBEL (Upper Mapping and Binding Exchange Layer). Many of the large exemplar ontology projects are funded under EU auspices; see write-ups for the 7th ICT (Information and Communications Technologies) program for the EU and prior ICT projects for more information.

[3] See, for example, my posting on When is Content Coherent? from about one year ago.

[4] See, for example, the Stanford KSL discussion on What is an Ontology? One part of that document explains ontological commitments as “agreements to use the shared vocabulary in a coherent and consistent manner,” which is benign enough. But other discussions and venues imply much more by the “commitment” term. This same Stanford source is also useful for general philosophical discussions of ontologies.

[5] With respect to corporate taxonomies, see for example, Trish O’Kane, “United by a Common Language: Developing a Corporate Taxonomy,” Information Management Journal. FindArticles.com. 15 Aug, 2009. http://findarticles.com/p/articles/mi_qa3937/is_200607/ai_n17176092/.

[6] Some of the standard starting vocabularies that Structured Dynamics recommends include many of the ones listed on this useful ontology table from Freebase, and specifically include Dublin Core, Friend-Of-A-Friend (FOAF), GeoNames, SIOC, SKOS, RDF Schema, XML Schema, OWL, UMBEL, and BIBO. These are typically supplemented with domain-specific ontologies appropriate to the scope at hand.

[7] The Linked Data Law states the value of a linked data network is proportional to the square of the number of links between data objects. It is a derivative of Metcalfe's law, which states that the value of a telecommunications network is proportional to the square of the number of users of the system (n²), where the linkages between users (nodes) exist by definition. For information bases, the data objects are the nodes. Linked data works to add the connections between the nodes. This concept was first presented in What is Linked Data? and then formalized in [9].

[8] In WOA, discrete functions are packaged into modular and shareable elements (services), then made available in a distributed and loosely coupled manner using Representational State Transfer. REST provides principles for how resources are defined and used with simple interfaces without additional messaging layers. REST is a foundation to the HTTP protocol and a key reason for the success and scalability of the Web.

[9] See further my posting, Structure the World.

[10] As a matter of full disclosure, Structured Dynamics has neither expertise nor strengths in these areas.

Author: Mike Bergman