
August 07, 2008

RDF as self-describing data

From time to time, someone gives me access to an RDF data set and asks me to 'have a look at' it. One of the advantages of RDF is that you can query a data set without knowing anything about it at the outset. A few simple queries are enough to get started and show how this works. As an example, let's check out dbpedia (query web page available at http://www.dbpedia.org/sparql). When I first learned about it, Orri Erling just gave me a link; he told me nothing about the data set.

The dbpedia page starts out with a simple sample query:

SELECT DISTINCT ?Concept WHERE {[] a ?Concept}

So let's start by running that. It is a bit of an advanced query, since the query graph includes a blank node; if you aren't comfortable with blank nodes in queries, think of

SELECT DISTINCT ?Concept WHERE {?x a ?Concept} 

instead.

This gives us all classes that have any members. There are a lot of these, maybe even too many. But we can get a feeling for the sort of thing that dbpedia talks about.

Another useful first query is

SELECT DISTINCT ?p WHERE {?s ?p ?o}

This gives you all the properties that are used in this data set.
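These starter queries can also be sent from a script rather than the web page. Here is a minimal sketch in Python, assuming the endpoint follows the standard SPARQL protocol (query passed as a `query` parameter, result format chosen with an `Accept` header). The helper names `with_limit` and `build_request` are made up for this illustration, and the `LIMIT` clause is my addition to keep the "maybe even too many" answers to a manageable page.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://www.dbpedia.org/sparql"  # endpoint named in the post


def with_limit(query, limit=50):
    """Append a LIMIT clause so an exploratory query returns a small page."""
    return f"{query.rstrip()} LIMIT {limit}"


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


classes_query = with_limit("SELECT DISTINCT ?Concept WHERE {?x a ?Concept}")
properties_query = with_limit("SELECT DISTINCT ?p WHERE {?s ?p ?o}")
request = build_request(classes_query)

# Sending it needs network access, so it is left commented out:
# from urllib.request import urlopen
# with urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Nothing here depends on knowing the data set in advance; only the query text changes from one exploratory step to the next.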

These two queries are starting points, but let's go a bit further. Suppose we had a class that we were interested in. For example, when I ran the default query, one of the answers on the first page was http://dbpedia.org/class/yago/Airline102690270. So perhaps we can learn about airlines.

Let's see which airlines dbpedia knows about. I make a new query, based on what I learned from the previous one:

SELECT ?air WHERE {?air a <http://dbpedia.org/class/yago/Airline102690270>}

I get a lot of answers, including http://dbpedia.org/resource/Delta_AirElite_Business_Jets

Well - now what does dbpedia know about this Delta subsidiary? We can find out with a query like this:

SELECT ?p ?o WHERE {<http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o}

Among the answers here, we get

?p                                          ?o
http://dbpedia.org/property/headquarters    http://dbpedia.org/resource/United_States
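When the endpoint is asked for JSON, answers like the row above arrive in the W3C SPARQL Query Results JSON format. The following sketch turns such a response into plain rows. The sample document is hand-written for this illustration and holds only the one row quoted above; it is not a captured dbpedia response, and `rows` is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL Query Results JSON format,
# containing only the single ?p / ?o row quoted in the post.
SAMPLE = """
{
  "head": {"vars": ["p", "o"]},
  "results": {"bindings": [
    {"p": {"type": "uri", "value": "http://dbpedia.org/property/headquarters"},
     "o": {"type": "uri", "value": "http://dbpedia.org/resource/United_States"}}
  ]}
}
"""


def rows(result_json):
    """Return one tuple per solution, ordered by the variables in the header.

    A variable that is unbound in a solution is simply absent from that
    binding object, so it is reported as None.
    """
    doc = json.loads(result_json)
    names = doc["head"]["vars"]
    return [
        tuple(binding[name]["value"] if name in binding else None for name in names)
        for binding in doc["results"]["bindings"]
    ]


table = rows(SAMPLE)
for p, o in table:
    print(p, o)
```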

So this airline has its headquarters in the United States. That is interesting - are there other airlines that have headquarters in the United States? Let's find out:

SELECT ?other
WHERE {?other a <http://dbpedia.org/class/yago/Airline102690270> .
?other <http://dbpedia.org/property/headquarters> <http://dbpedia.org/resource/United_States> .}

We get quite a list of airlines.
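The query above works because both triple patterns share the variable ?other, so an answer must satisfy both at once. Here is a toy sketch of that idea in Python over a handful of in-memory triples. It is not how dbpedia's store is implemented; the function names are invented for the illustration, the first two triples come from this post, and the `example.org` ones are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what 'a' abbreviates
AIRLINE = "http://dbpedia.org/class/yago/Airline102690270"
HQ = "http://dbpedia.org/property/headquarters"
US = "http://dbpedia.org/resource/United_States"
DELTA = "http://dbpedia.org/resource/Delta_AirElite_Business_Jets"

TRIPLES = [
    (DELTA, RDF_TYPE, AIRLINE),                      # from the post
    (DELTA, HQ, US),                                 # from the post
    ("http://example.org/HypotheticalAir", RDF_TYPE, AIRLINE),                 # hypothetical
    ("http://example.org/HypotheticalAir", HQ, "http://example.org/Elsewhere"),  # hypothetical
]


def extend(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was given by earlier patterns.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(patterns, triples):
    """Return every variable binding that satisfies all patterns together."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := extend(pattern, triple, binding)) is not None
        ]
    return bindings


solutions = solve([("?other", RDF_TYPE, AIRLINE), ("?other", HQ, US)], TRIPLES)
us_airlines = [solution["?other"] for solution in solutions]
```

With this toy data, only the Delta resource satisfies both patterns; the hypothetical airline matches the first pattern but not the second.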

We can continue in this way in a number of directions: find other places where certain airlines have headquarters, find other things that have US headquarters, etc.

What is special about RDF / SPARQL that allowed this to happen? There are a few things here - we were able to query the schema using the same query language as we did for the data. The pattern

{?x a ?Concept}
returns the set of (nonempty) classes in the data set - a schema-level result. If this were a relational database, this would be akin to querying to find out the tables in the database.

We can even mix schema and data in the same query. For instance, the pattern

{<http://dbpedia.org/resource/Delta_AirElite_Business_Jets> ?p ?o}
tells us all the properties that are used to describe Delta AirElite Business Jets, as well as the values of those properties. This is like querying for the columns in a table that are filled in for a particular row, along with the values in those cells.

This is a real sense in which an RDF store is 'self-describing' - there is no need to know about traditional metadata (schemas) before exploring a data set.
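As a toy illustration of that point, the sketch below answers a schema-level question and a data-level question with the very same matching function. It assumes a tiny in-memory list holding just the two triples quoted in this post; `match` is a made-up helper, with None playing the role of a query variable.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what 'a' abbreviates
AIRLINE = "http://dbpedia.org/class/yago/Airline102690270"
HQ = "http://dbpedia.org/property/headquarters"
US = "http://dbpedia.org/resource/United_States"
DELTA = "http://dbpedia.org/resource/Delta_AirElite_Business_Jets"

# Only the two triples quoted in the post; a real store holds far more.
TRIPLES = [
    (DELTA, RDF_TYPE, AIRLINE),
    (DELTA, HQ, US),
]


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples that agree with every position that is not None."""
    return [
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip((s, p, o), triple))
    ]


# Schema-level: which classes have members, and which properties are used?
classes = sorted({o for _, _, o in match(p=RDF_TYPE)})
properties = sorted({p for _, p, _ in match()})

# Data-level: what is said about one particular resource?
about_delta = [(p, o) for _, p, o in match(s=DELTA)]
```

There is no separate catalog to consult: the class and property names come out of the same triples as the facts about the airline.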
