Posted: October 8, 2020

CWPK #50: Querying External Sources

The Nearly Infinite Usefulness of SPARQL

We are now two-thirds of the way through our CWPK series. One reason we have emphasized 'roundtripping' in Cooking with Python and KBpedia is to accommodate the incorporation of information from external sources into KBpedia. From hierarchical relationships to annotations like definitions or labels, external sources can be essential. Of course, one can find flat files or spreadsheets or CSV files directly, but often we need specific information that can only come from querying the external source directly. Two of the sources we rely on most heavily, Wikidata and DBpedia, provide this access through SPARQL queries. We first introduced SPARQL in CWPK #25.

External SPARQL queries are the basis of getting instance data, values for instance attributes, missing fields like altLabels and skos.definition, existing crosswalks or mappings, longer descriptions, subsumption relations, related links, and interesting joins and intersections across external knowledge base content. Often, one is able to specify the format (serialization) of the desired results.

The outputs from these external queries can be manipulated as strings, and then written to flat files useful for ingest into the various build routines. Of course, it is important that the format and CSV-nature of the results be maintained in a form that the build routines expect. One may alter the build formats or the extract formats, but to work they need to match on both ends.
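As a minimal sketch of that flat-file step, the following flattens a SPARQL JSON result into CSV rows. The `bindings_to_csv` name, the sample result, and the chosen columns are illustrative assumptions and not part of cowpoke; the nested `results` → `bindings` → `value` layout is the standard SPARQL JSON results format.

```python
import csv
import io

def bindings_to_csv(results, columns, out):
    """Write selected variables from a SPARQL JSON result to CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for binding in results["results"]["bindings"]:
        # An unbound (OPTIONAL) variable is simply absent from the binding
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])

# Hypothetical sample shaped like the JSON a SPARQL endpoint returns
sample = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q5"},
         "itemLabel": {"type": "literal", "value": "human"}},
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
    ]},
}

buffer = io.StringIO()
bindings_to_csv(sample, ["item", "itemLabel"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an open file instead of the `StringIO` buffer gives the flat file directly; the column order passed in is what must match the build routine on the other end.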

So, what we provide in today’s installment are some guidelines and recipes for using SPARQL to obtain the information you need and to write it to flat files. Because of their importance, we emphasize Wikidata and DBpedia (also a stand-in for Wikipedia) in our examples. Once the files are populated, you may need to do some intermediate wrangling to get them into shape for direct import. We covered that topic in brief in CWPK #36, but do not address file wrangling further here. There are way too many varieties to cover the topic in a meaningful way, though we certainly have examples in today’s installment and across the entire CWPK series that should provide a useful foundation to your own efforts.

Choosing Access Method

There are not that many public SPARQL endpoints available, and some are not always up and running. But the endpoints that do exist, identified in the Query Resources section at the conclusion of today’s installment, are often comprehensive and of high value. The two we will be emphasizing today, Wikidata and DBpedia (and, by extension, the linked open data (LOD) cloud beyond that), are among the most valuable. (Of course, many endpoints, like ones specific to a particular organization, are private, and can be parts of valuable, distributed information ecosystems.) Another notable endpoint worthy of your attention is the LOD endpoint maintained by OpenLink Software.

It is possible to query many of these sources directly online with an HTML interface, often with a choice of the desired output format. In some of the examples below, I provide a Try it! link that takes you directly to the source site and uses its native SPARQL interface. (Also, inspect the URI links for these Try it! options, since they show how SPARQL gets communicated over the Web.) You may often find this is the fastest and cleanest way to get useful results, sometimes better formatted than what our home-brewed options below produce. Your mileage may vary. In any case, it is useful to learn how to run SPARQL queries directly from within cowpoke. For that reason, I emphasize our home-brewed examples below.
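To make the point about URIs concrete, here is a small sketch of how a query can be carried in a URL. The exact form of any given Try it! link is site-specific; what follows is the generic GET form of the SPARQL protocol, where the query text travels percent-encoded in a `query` parameter. The `sparql_get_url` helper and the sample query are illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def sparql_get_url(endpoint, query):
    """Build a GET URL that carries a SPARQL query in its 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

# Illustrative query text only; it is not sent anywhere by this sketch
query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 3"
url = sparql_get_url("https://query.wikidata.org/sparql", query)
print(url)

# Decoding the URL recovers the original query text
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```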

Setting Up This Installment

As we have been emphasizing of late, we begin today’s installment with our standard start-up instructions:

from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *
from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

#sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql = SPARQLWrapper('https://query.wikidata.org/sparql')
graph = world.as_rdflib_graph()

Of course, we already have a very capable query method for our own internal stores:

form_1 = list(graph.query_owlready("""
  PREFIX rc: <http://kbpedia.org/kko/rc/>
  PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
  SELECT DISTINCT ?x ?label
  WHERE
  {
    ?x rdfs:subClassOf rc:Eutheria.
    ?x skos:prefLabel  ?label. 
  }
"""))

print(form_1)

Wikidata Queries

For the following Wikidata queries, run these assignments first:

from SPARQLWrapper import SPARQLWrapper, JSON
from rdflib import Graph

sparql = SPARQLWrapper('https://query.wikidata.org/sparql', agent='cowpoke 0.1 (github.com/Cognonto/cowpoke)')

We need to assign an ‘agent=’ because of limits Wikidata occasionally puts on queries. If you do many requests, you may want to consider adding your own agent definition.

One of the techniques I use most heavily is the VALUES statement. This construct allows a listing of IDs to be passed to the query source. Depending on various endpoint limits, you may be able to list 1000 or more IDs in such a listing; experience with a given endpoint will dictate. If you use the VALUES construct, just make sure you are using the proper format and prefix (wd: in this instance for a Q item within Wikidata) in front of each value.
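When the ID list comes from a file rather than being typed by hand, the VALUES block can be generated. The sketch below is one way to do so, assuming hypothetical helper names (`values_clause`, `chunked`) and an arbitrary batch size; the right batch size depends on the endpoint.

```python
def values_clause(var, qids, prefix="wd:"):
    """Format IDs as a SPARQL VALUES clause, adding the prefix to each."""
    terms = " ".join(prefix + qid for qid in qids)
    return "VALUES ?%s { %s }" % (var, terms)

def chunked(items, size):
    """Yield successive slices so each query stays under an endpoint limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

qids = ["Q25297630", "Q537127", "Q16831714", "Q24398318", "Q11755880"]

# A batch size of 2 is only for demonstration
clauses = [values_clause("item", batch) for batch in chunked(qids, 2)]
for clause in clauses:
    print(clause)
```

Each generated clause can then be spliced into the query text in place of a hand-written VALUES listing.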

Parent Class from Q IDs

The first query obtains the parent class for a submitted listing of Q items. You may also Try it! directly from Wikidata:

sparql.setQuery("""
PREFIX schema: <http://schema.org/>
SELECT ?item ?itemLabel ?wikilink ?itemDescription ?subClass ?subClassLabel WHERE {
  VALUES ?item { wd:Q25297630
  wd:Q537127
  wd:Q16831714
  wd:Q24398318
  wd:Q11755880
  wd:Q681337
}
 ?item wdt:P910 ?subClass.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

#for result in results["results"]["bindings"]:
#    print(result["item"]["value"])

print(results)

Notice that once we set our SPARQL endpoint and user agent, we are able to cut-and-paste different SPARQL queries between the opening and ending triple quotes ("""). The bracketing statements around that can be used repeatedly for different queries.
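Since the bracketing statements repeat for every query, they can be folded into a small function. This is only a sketch: `run_query` is a hypothetical helper, not part of cowpoke, and `FakeEndpoint` is an offline stand-in that mimics just the three SPARQLWrapper calls used in this installment so the example runs without network access.

```python
def run_query(endpoint, query, return_format):
    """Send one query through a SPARQLWrapper-style object and return the converted result."""
    endpoint.setQuery(query)
    endpoint.setReturnFormat(return_format)
    return endpoint.query().convert()

class FakeEndpoint:
    """Hypothetical offline stand-in for a configured SPARQLWrapper object."""
    def setQuery(self, query):
        self.last_query = query
    def setReturnFormat(self, return_format):
        self.return_format = return_format
    def query(self):
        return self
    def convert(self):
        return {"results": {"bindings": []}}

fake = FakeEndpoint()
fake_results = run_query(fake, "SELECT * WHERE { ?s ?p ?o } LIMIT 1", "json")
print(fake_results)
```

With the real objects from the set-up cell, the call would be `run_query(sparql, query_text, JSON)`, where `query_text` holds whichever query you pasted between the triple quotes.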

Go ahead and toggle between the print statements above to see how we can start varying outputs. Chances are you will need to do some string manipulation before your flat files are ready for ingest, but we can vary these specifications to get the initial output closer to our requirements.

subClass and Instance listings for Q ID

Try it! as well.

sparql.setQuery("""
SELECT ?subclass ?subclassLabel ?instance ?instanceLabel
WHERE
{
 ?subclass wdt:P279 wd:Q183366.
 ?instance wdt:P31 wd:Q183366.
 SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY xsd:integer(SUBSTR(STR(?subclass),STRLEN("http://www.wikidata.org/entity/Q")+1))
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

#for result in results["results"]["bindings"]:
#    print(result["subclass"]["value"])

print(results)
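The ORDER BY clause in this query strips the entity prefix and casts the remainder to an integer, so that Q IDs sort numerically rather than alphabetically. The same ordering can be applied in Python to results already in hand. The sketch below assumes a hypothetical `q_number` helper and made-up sample URIs.

```python
def q_number(uri):
    """Return the integer after the final 'Q' of a Wikidata entity URI."""
    return int(uri.rsplit("Q", 1)[1])

# Sample URIs for illustration only
uris = [
    "http://www.wikidata.org/entity/Q1000",
    "http://www.wikidata.org/entity/Q25",
    "http://www.wikidata.org/entity/Q183366",
]

ordered = sorted(uris, key=q_number)   # numeric: Q25, Q1000, Q183366
alphabetic = sorted(uris)              # string order puts Q1000 first
print(ordered)
```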

Useful Q Item Attributes

Try it!

sparql.setQuery("""
PREFIX schema: <http://schema.org/>

SELECT ?item ?itemLabel ?class ?classLabel ?description ?article ?itemAltLabel WHERE {
  VALUES ?item { wd:Q1 wd:Q2 wd:Q3 wd:Q4 wd:Q5 }
  ?item wdt:P31 ?class;
        wdt:P5008 ?project.
#  ?article rdfs:comment ?description.
  
   OPTIONAL {
    ?article schema:about ?item.
    ?article schema:isPartOf <https://en.wikipedia.org/>.
  }
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

Get English Wikipedia Article Names from Q ID

Try it!

sparql.setQuery("""
SELECT DISTINCT ?lang ?item ?name WHERE {
 VALUES ?item { wd:Q1
wd:Q2
wd:Q3
wd:Q4
wd:Q5 
}
 ?article schema:about ?item; schema:inLanguage ?lang; schema:name ?name .
 FILTER(?lang in ('en')) .
 FILTER (!CONTAINS(?name, ':')) .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

Listing of Q IDs from Property

Try it!

sparql.setQuery("""
 SELECT
 ?item ?itemLabel
 ?value ?valueLabel
 # valueLabel is only useful for properties with item-datatype
 WHERE 
 {
 ?item wdt:P2167 ?value
 # change P2167 to desired property        
 SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
 }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

Missing Q Data from Wikidata

Try it!

sparql.setQuery("""
PREFIX schema: <http://schema.org/>
PREFIX w: <https://en.wikipedia.org/wiki/>
SELECT ?wikipedia ?item WHERE {
VALUES ?wikipedia { w:Tom_Hanks }
?wikipedia schema:about ?item .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

Q ID from Wikipedia ID

Try it!

sparql.setQuery("""
PREFIX schema: <http://schema.org/>
PREFIX w: <https://en.wikipedia.org/wiki/>
SELECT ?wikipedia ?item WHERE {
VALUES ?wikipedia { w:Euthanasia
w:Commercial_art_gallery
}
 ?wikipedia schema:about ?item .
 SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

schema.org ← → Wikidata Mapping

Try it!

sparql.setQuery("""
SELECT ?wd ?wdLabel ?type ?uri ?prefix ?localName WHERE {
 {
   { ?wd wdt:P1628 ?uri . BIND("equivalent property" AS ?type) } UNION
   { ?wd wdt:P1709 ?uri . BIND("equivalent class" AS ?type) } UNION
   { ?wd wdt:P2888 ?uri . BIND("exact match" AS ?type) } UNION
   { ?wd wdt:P2235 ?uri . BIND("superproperty" AS ?type) } UNION
   { ?wd wdt:P2236 ?uri . BIND("subproperty" AS ?type) } 
 }
 BIND( REPLACE(STR(?uri),'[^#/]+$','') AS ?prefix)
 BIND( REPLACE(STR(?uri),'^.*[#/]','') AS ?localName)
 # filter by ontology (otherwise timeout expected)
 FILTER(?prefix = "http://schema.org/")
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
} ORDER BY ?prefix ?localName
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)
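The two BIND lines in this query use regular expressions to split each URI at its last `#` or `/` into a namespace prefix and a local name. For results that have already been downloaded, the same split can be sketched in Python with the `re` module; `split_uri` is a hypothetical helper name.

```python
import re

def split_uri(uri):
    """Split a URI into (namespace prefix, local name) at the last '#' or '/'."""
    prefix = re.sub(r"[^#/]+$", "", uri)      # drop the trailing local name
    local_name = re.sub(r"^.*[#/]", "", uri)  # drop everything up to the last separator
    return prefix, local_name

schema_parts = split_uri("http://schema.org/Person")
skos_parts = split_uri("http://www.w3.org/2004/02/skos/core#prefLabel")
print(schema_parts)
print(skos_parts)
```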

Main Topic of Q ID

Try it!

sparql.setQuery("""
PREFIX schema: <http://schema.org/>
 SELECT ?item ?itemLabel ?mainTopic ?mainTopicLabel WHERE {
   VALUES ?item { wd:Q13307732
 wd:Q8953981
 wd:Q1458376
 wd:Q8953071
 }
  ?mainTopic wdt:P910 ?item.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
 }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

print(results)

DBpedia Queries

DBpedia is a bit more tricky to deal with.

Again, we set up our major call, to be followed by a series of SPARQL queries to DBpedia:

from SPARQLWrapper import SPARQLWrapper, JSON, RDFXML
from rdflib import Graph

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

Languages in DBpedia with schema.org Language Code

In this query, we are looking for items that have been already mapped or characterized in a second ontology (schema.org).

Try it!

sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX schema: <http://schema.org/>

    CONSTRUCT {
      ?lang a schema:Language ;
      schema:alternateName ?iso6391Code .
    }
    WHERE {
      ?lang a dbo:Language ;
      dbo:iso6391Code ?iso6391Code .
      FILTER (STRLEN(?iso6391Code)=2) # to filter out non-valid values
    }
""")

sparql.setReturnFormat(RDFXML)
results = sparql.query().convert()
print(results.serialize(format='xml'))

Missing Definitions

Try it!

sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX : <http://dbpedia.org/resource/>
SELECT ?item ?description WHERE {
  VALUES ?item { :Child_prostitution
  :Ice_Hockey_World_Championships
  :Major_League_Soccer
  :Tamil_language
  :Acne }
  ?item rdfs:comment ?description .

  FILTER ( LANG(?description) = "en" ) 
} 
""")

# A SELECT query returns a table of bindings, not an RDF graph, so request JSON
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)

Get URIs from Aliases

Try it!

sparql.setQuery("""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?x ?redirectsTo WHERE {
 VALUES ?wikipedia { "Abies"@en 
 "Abolitionists"@en}
 ?x rdfs:label ?wikipedia .
 ?x dbo:wikiPageRedirects ?redirectsTo
}
""")

# XML was never imported above; JSON is, and it suits a SELECT result table
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
print(results)
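The alias query matches on language-tagged string literals rather than prefixed IDs, so labels read from a file need quoting and a language tag before they go into the VALUES block. The following is a sketch under that assumption; `lang_literal` is a hypothetical helper, and the third label is invented to show the escaping.

```python
def lang_literal(text, lang="en"):
    """Quote a label as a SPARQL string literal with a language tag."""
    # Escape backslashes first, then double quotes
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"@%s' % (escaped, lang)

labels = ["Abies", "Abolitionists", 'The "Fab Four"']
values = "VALUES ?wikipedia { %s }" % " ".join(lang_literal(label) for label in labels)
print(values)
```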

Of course, SPARQL is a language unto itself, and it takes time to become fluent. The examples above are closer to baby-talk than Shakespearean speech. Nonetheless, one begins to gain a feel for the power of the language.

As we move forward, we will try to leverage SPARQL as the query language to our knowledge graph, since it provides the most powerful and flexible language for doing so. There will obviously be times when direct Python calls are more direct and shorter to implement. But the most flexible filters and intersections will come from our use of SPARQL.

Query Resources

A partial, but useful, list of public SPARQL endpoints is provided by:

An assessment of their current availability is provided by:

Here are the top 100 named graphs available with their triple counts:

Wikidata provides its own listing of 100 SPARQL endpoints:

There is an excellent (and growing) compilation of useful SPARQL queries to Wikidata available from:

Two smaller, but similarly useful, resources for DBpedia queries are available from:

The latter also provides some SPARQL construction tips.

OpenStreetMap SPARQL examples are available from:

NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site. The cowpoke Python code listing covering the series is also available from GitHub.
NOTE: This CWPK installment is available both as an online interactive file or as a direct download to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment — which is part of the fun of Python — and to notify me should you make improvements.
