Selling and overselling the Semantic Web

We got on the train together.  I had just finished a four-day training/consulting session with a company doing information integration for international security.  She was doing a master's degree, with a thesis about Ontologies.  Like a good grad student, she was a voracious reader.  She had read white papers, research papers, books, web pages, magazine articles, and anything else she could get her eyes on.  The more she read, the more confused she became.

It is hard to be surprised at this.  It seems that just about everyone is jumping on the Semantics bandwagon.  Is an Ontology a top-down way to organize all human knowledge?  Or just a glorified ERD?  Or a controlled vocabulary?  Will it take an Ontologist to make them?  Or will they be something that everyone can do, like a web page?  Will Ontologies make the web come alive as a sentient, intelligent being?  You can find someone who has seriously proposed variants of all of these, all using the name "Ontology".

So I just sorted it all out for her before we got to Elephant and Castle.

Well, not really.  There are just too many contradictions.  Is the Semantic Web about a top-down organization of everything?  Or a woolly free-for-all?  Are vocabularies controlled or not?  Is content authored or automatically generated?

But here's what I was able to offer.  One story, about a web of information.  A story about information sharing.  A story that builds on the success of things like Wikipedia and the World Wide Web.

In my story there is no need for natural language processing.  Inference plays a key role, but not an analytic one; it is  just a way to connect information together.  Upper ontologies are largely irrelevant, but reusable ontologies are not.

Will this technology story solve every problem?  No.  It will not diagnose diseases, it will not automatically index your library. It will not make your search engines obsolete.

But no technology story can do all that - the best we can hope for is a story that is coherent (it actually makes sense), feasible (it can be done with extant technology), and, perhaps most importantly, valuable (it provides some real business value).  I think I have such a story - and in that story, there happens to be no need for natural language processing, upper ontologies, or highly sophisticated inferencing. 

So, what is that story?  I did finish the story before we got to Elephant and Castle, but that's a 45-minute ride.  I can't fit the whole thing into a blog entry.  But I can fit it into a book.

Yes, this whole entry was a troll for the book.  Check it out.

Should every web page be semantic-backed?

A few months ago, when SPARQL made it up to Candidate Recommendation at the W3C, the announcement elicited considerable discussion on the techie forum Slashdot (/.).  A common sentiment expressed there was that RDF and SPARQL are just some weird fad, and that relational tables are here to stay.  In ten years, we'll still be coding in SQL, and nobody will remember what RDF stands for.

This may well be true; after all, we all know that technological superiority is not particularly well correlated with longevity.  But I can't help but notice something that I thought was a passing trend, but seems to be more prevalent than I thought.

That is the trend of database designers out-foxing their relational models, so that they can have a more flexible schema.

Here's how it starts.  You build a database schema.  You design it to your current specs as you know them, e.g., the important parameters of your product lines, your selection criteria for candidates, the parameters of the analytics your business needs, etc. 

Then something changes - new analytics, new product lines and features, more criteria become relevant, etc.  So you are faced with a puzzle - do I change my schema, and migrate?  Do I hack in some information into an under-utilized field?  Do I ignore the business changes?  None of these options are nice.

So, the forward-thinking database architect comes up with a novel idea.  Let's have a table that indexes all the parameters for a product/candidate/analytic etc.  Now, I can add new parameters just by adding rows to a table.  In the relational model, it is easy to add rows, difficult to add columns. 

The reaction to this by your database buddies is incredulity at first.  They think you are joking.  Then they ask how you are going to query it - and you have a complex, but workable answer to that.  Then they ask how efficient those queries are.  This is more problematic, but you have a plan.
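
Here is a minimal sketch of that parameter table and of why the queries get complex, using Python's built-in sqlite3 module.  The table and column names (product, product_param, and so on) are my own illustrative assumptions, not taken from any particular project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
    -- One row per (product, parameter) pair instead of one column per parameter.
    CREATE TABLE product_param (
        product_id INTEGER REFERENCES product(id),
        param      TEXT,
        value      TEXT
    );
    INSERT INTO product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO product_param VALUES
        (1, 'color', 'red'),  (1, 'weight_kg', '2.5'),
        (2, 'color', 'blue'), (2, 'weight_kg', '0.8');
""")

# A new parameter needs no ALTER TABLE, only another row.
conn.execute("INSERT INTO product_param VALUES (2, 'voltage', '12')")

# The cost: every parameter mentioned in a query needs its own self-join.
rows = conn.execute("""
    SELECT p.name, c.value, w.value
    FROM product p
    JOIN product_param c ON c.product_id = p.id AND c.param = 'color'
    JOIN product_param w ON w.product_id = p.id AND w.param = 'weight_kg'
    WHERE c.value = 'red'
""").fetchall()
print(rows)  # [('widget', 'red', '2.5')]
```

Note that every value ends up stored as text here, so ordinary column types and column indexes no longer help, which is where the efficiency question starts to bite.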

One DBA told me that they nicknamed this solution "Stealth Columns"; other projects were scrapped at this point.  I had one such DBA, when asked the "Can you efficiently query this?" question, turn to me and ask, "I'm hoping you can migrate this into a triple store and query it quickly.  Right??"

This last fellow had caught on.  He realized that he was not the first to come up with this solution, nor the last.  And that there was a reason why it had not caught on as standard practice - because it was not workable in the long term.  The tricks he had to play with his queries made his team's experience in indexing RDBs useless.  He basically had to re-train them, in a solution that nobody else had got to work well.

RDF is an elegant solution to this problem, and a standard one.  One for which someone else is doing the hard work of optimizing.  One for which it is possible to find people educated in how to use it (not as many as are currently educated in SQL, but a lot more than know how to manage any particular home-grown solution).  And RDF resolves all the issues that brought our DBA to this point; it is just as easy to add columns as rows.  Schemas are as flexible as data.
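
To make the "columns are as easy as rows" point concrete, here is a toy sketch in plain Python.  It is not an RDF library, just a list of (subject, predicate, object) tuples with names I made up for illustration; a real triple store adds URIs, datatypes, indexing and a query language on top of the same idea.

```python
# Facts as (subject, predicate, object) tuples; all names are illustrative.
triples = [
    ("product1", "color", "red"),
    ("product1", "weight_kg", 2.5),
    ("product2", "color", "blue"),
]

def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None is a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# A new "row" (another product) and a new "column" (a property nobody
# planned for) are the same operation: append a triple.
triples.append(("product3", "color", "green"))
triples.append(("product2", "voltage", 12))

print(match(p="voltage"))   # [('product2', 'voltage', 12)]
print(match(s="product2"))  # [('product2', 'color', 'blue'), ('product2', 'voltage', 12)]
```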

The governance of such a system poses new problems that a relational model does not; after all, now that you can extend the schema, there are parts of the system that were once sacrosanct that now can be modified.  But this problem was there for the home-grown solution, too; the only barrier there was that the team was so confused by the model that nobody could change it at all.  The governance issues are part of any solution to the flexible schema problem; if your schema is flexible, then you'll need some way to manage that flexibility.

Semantic Web at DAMA

Wilshire Conferences is putting on the DAMA International Symposium and MetaData Conference this week here in San Diego.  A lot of the issues that data management folks care about (master data management, data federation, identity management) are also dealt with by the Semantic Web.  And there is a pretty good showing of Semantic Web companies here - my own TopQuadrant is giving two talks and exhibiting.  Zepheira CTO Eric Miller gave a talk in the SOA track about the background of Semantic Web and why DAMA folks should be interested.

 

This is also the first public venue where the book that Jim Hendler and I have written, Semantic Web for the Working Ontologist, is featured in the Morgan Kaufmann booth.  It is nice being able to point to a written description that includes not only the basics of the motivation of the semantic web, but also the details of how the standards approach the goals and desiderata of the Semantic Web.  Yes, I have my elevator speech down pat, but now I can back it up with a document.  Hooray!

Many thanks to Henry Story for plugging the book on his blog!!

What will 2008 mean for the Semantic Web?

A journalist asked me this question for an end-of-year piece. I thought about it, and suggested discussing it with all my most educated, mature friends.  Fortunately for me, only my young friends had time for the discussion, so I got a fresh viewpoint.  Here are some of the things that we came up with.

First, let's look at some trends in how young people use the web. Previously esoteric features like RSS feeds are becoming understood by mainstream users.  Microformats for structuring information are starting to spread. We are seeing mashups of data from one web site appearing in another. All in all, there is an expectation that the web is no longer a set of individual places where you go to get things (pictures on Flickr, journal on Livejournal, friends on Myspace), but that all of these things are just aspects of a single integrated experience that is "the web".

In social networking, we are seeing pressure to move to more and more open environments. Just about any Livejournal page today is abuzz with two topics: 1) Does the recent Livejournal 'adult tagging' policy constitute patrolling Livejournal as a nanny state? and 2) Will the recent purchase of Livejournal by a foreign company compromise the privacy of information that has been confided to our journals?  Both of these topics result in a single technology interest - is there a way to extract my Livejournal into a neutral form that I can archive and, if necessary, migrate to some other system?

What do these trends mean for the semantic web? They mean that the time is right for a mass end-user application like Twine, which will allow users to create and manage their own information from hundreds of sites, in a customizable and personal repository of information.  They mean that a generation of users will expect that information can move smoothly from one site to the next.

Does this mean that 2008 will be the Year of the Semantic Web?  After all, Twine itself is based on the W3C Semantic Web standards - if Twine takes off, applications all over the web will scramble to be Twine-compliant (that is, RDF-compliant), and this will be a boon to Semantic Web applications everywhere. 

But as we all know, the best technology doesn't always win the day, so the future of Web 3.0 might not include Twine and the Semantic Web standards - it might be built out of a hodge-podge of technologies for mashups, RSS integration, social network exchange, tagging, etc., and leave the elegant idea of RDF triples behind.  Or maybe the technological difficulties with RDF (e.g., its problematic XML syntax) will make it qualify as an inferior technology standard that can win the day.

Let's turn our attention for a moment to the enterprise.  Things like Twine and social networking are usually thought of as being the domain of the web at large, not of the workplace.  But the workplace has its own integration challenges, based on issues around data federation, M&A, heterogeneous data streams and the like. What will 2008 mean for the Semantic Web in the enterprise?

My young friend pointed something important out to me.  In another entry, I have pointed out the relevance of the amazing phenomenon of Wikipedia to the enterprise.  There is a whole cadre of people out there who understand and appreciate the value of a shared information set, and the value to themselves of contributing to it.  Today, when I suggest to an enterprise that it might be possible to motivate knowledge workers to do some work themselves to organize their corpus of unstructured information, I am told with a sad shake of the head that "... you just don't understand.  Our engineers/researchers/analysts will never do that.  It just won't happen!"

My young friend suggested otherwise.  Sure, the generation of over-40 engineers might never do that (after all, 40 is the new 30, so by virtue of being over 40, none of us can be trusted!), but there is a new generation of people entering the workforce who have a different relationship with Wikipedia than their elders.  These are the people who used Wikipedia to cheat on their homework in high school.

Yes, the first Wikipedia high school generation is graduating from college and entering the workforce, and they will expect their information infrastructure to have the same interactive power that they have used in school.  They will expect that the notes they take can be posted to the ether, and that someone will correct and enhance them.  You can bet that this generation of engineers/researchers/analysts will appreciate the value of marking up their information.  Will this happen in 2008?  Probably not - but 2008 will be the year when it starts, when new expectations start pushing at the enterprise.

Of course, the most important thing that 2008 will bring to the Semantic Web will be the publication of Semantic Web for the Working Ontologist.  The book that I have been working on with Jim Hendler for the past two years is finally in copy editing, and will appear in 2008 (hopefully by May).  With this book, we can begin to train a whole new generation of information managers and data modelers in how the W3C standards can be used to model integrated information.

Discovery?

In a recent announcement, Insilico Discovery makes some pretty impressive claims about their data merging tools in RDF/SPARQL.  I was looking forward to the Insilico talk at this morning's RDF to RDB workshop, but it was scheduled for first thing, and didn't happen when it was scheduled.   :(

Late morning at RDF to RDB workshop

Ashok Malhotra from Oracle gave an interesting talk about his vision for connecting RDF (SPARQL) to relational databases. His approach is quite different from the most basic use of things like D2RQ and SquirrelRDF, which map the tables in an RDB to classes in OWL. In the simplest form of this, abstractions over the base tables can be described in OWL, leaving the problem of generating appropriate queries to the inference engine.

In contrast to this approach, Malhotra suggests that each class be mapped to the database by writing a special purpose query.  This is a knowledge-intensive approach, but the effort you spend here will be paid back in improved performance later on, since you can define abstractions in optimizable SQL.
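
To give a feel for what "one hand-written query per class" could look like, here is a rough sketch in Python with sqlite3.  This is my own illustration of the general idea, not Malhotra's actual design; the employee table, the ex: names and the CLASS_QUERIES mapping are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, reports INTEGER);
    INSERT INTO employee VALUES (1, 'Ann', 4), (2, 'Bob', 0), (3, 'Cy', 2);
""")

# Hypothetical mapping: each class is defined by its own special purpose
# SQL query, which the database is free to optimize like any other query.
CLASS_QUERIES = {
    "ex:Employee": "SELECT id FROM employee ORDER BY id",
    "ex:Manager":  "SELECT id FROM employee WHERE reports > 0 ORDER BY id",
}

def instances(class_name):
    """Yield (subject, 'rdf:type', class) triples for one mapped class."""
    for (row_id,) in conn.execute(CLASS_QUERIES[class_name]):
        yield (f"ex:employee/{row_id}", "rdf:type", class_name)

managers = list(instances("ex:Manager"))
print(managers)
```

In this sketch, the fact that every ex:Manager is also an ex:Employee lives only in the WHERE clause, not in anything an OWL reasoner could see.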

It isn't clear to me whether there is still a need for OWL inferencing at this point - if you can define your classes as abstractions specified in SQL, do you really need to describe the relations between those classes and other abstractions in OWL?  Malhotra agrees that this is an important issue, but it is still an open issue in his approach.

Another issue we haven't discussed is the decidability, provability and explanation capability of such a solution.  One of the benefits of using OWL (at least, OWL-DL) is that you have a decidable language and you can prove T-Box assertions about your model.  While I think this might be formally possible in SQL, it certainly isn't a normal thing to do, and it can't be done for queries that cross databases.  But how important is this?

At the moment, it seems to me that this approach is quite a bit at odds with the approaches taken by D2RQ and SquirrelRDF, but perhaps that is just my own narrow-minded view, and there is a synthesis of the two approaches.

Brodie does it again

Michael Brodie is giving the keynote talk at the W3C Workshop on RDF Access to Relational Databases.  His presentation begins with one of those very dramatic, music-backed fact videos.  The music is familiar, but the listed facts are a bit more focused than usual.  The basic message is about the amount of data stored in databases and on the web, and how quickly this is expanding.  As usual, the talk itself is interactive and entertaining (even if the audience keeps guessing correctly his 'surprise' numbers).  As usual, the talk includes information about all sorts of technology and trends: Second Life, gaming, mobile networks, etc.  It's great that there is someone whose job it is to keep on top of all this technology and make sense of it.

A particular comment that he makes early in this talk has to do with what computing is for - there is a quote about how computing in the 20th century was about what computers could do, while computing in the 21st century is about what people can do.  This sums up in a short sentence why I feel that anyone who says that the Semantic Web is just AI warmed over has missed the point of the Semantic Web.

MIN and MAX in SPARQL

It is a bit embarrassing when I teach SPARQL to someone with a background in SQL.  Once they figure out how it works, they start to appreciate it as a powerful language for representing just what information you want from a graph.  Then they start asking about some of the commonly used query operations from SQL - things like aggregators and grouping operations like COUNT, SUM, MAX, MIN and AVG.

The SPARQL specification at least has some guidance for how to do negation in SPARQL, though this idiom is a bit more difficult than simply being able to have NOT as a keyword.

As part of a recent exercise, I wondered whether it was possible to use a similar trick to define MAX and MIN in SPARQL.  I was surprised to find that it is possible.

Suppose we have data on the members of a prominent family, where we have the year of birth represented in triples of the form

:person1 :birth-year 1888 .
:person2 :birth-year 1890 .
:person3 :birth-year 1915 .

etc.

How would we find the name (rdfs:label) of the oldest known member of the family?  Here is a solution just using current SPARQL:

 

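# The default namespace below is a placeholder; substitute the one your data uses.
PREFIX :     <http://example.org/family#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?label ?by
WHERE {?kennedy a :Person .
       ?kennedy rdfs:label ?label .
       ?kennedy :birth-year ?by .
       # Look for anyone born before ?kennedy.
       OPTIONAL {?older a :Person .
                 ?older :birth-year ?oby .
                 FILTER (?oby < ?by)}
       # Keep ?kennedy only if no such person was found.
       FILTER (!bound(?older))
}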

How does this work?  The first three triples match any member of the family for which a birth year is known.

The pattern inside the OPTIONAL clause also matches a family member, and gets their birth year. We use the variable name ?older for this person because of the FILTER clause; we retain only the bindings of ?older that have an earlier birth year than ?kennedy.

Now here's how we get a max out of this (the maximum age, that is, which means the minimum birth year): what happens if we can't find anyone with an earlier birth year? Then all matches to the pattern inside the OPTIONAL braces will be filtered out, and no bindings will remain for ?older.

Back outside the OPTIONAL, we filter based on the binding of ?older; if ?older is not bound, then we didn't find anyone with an earlier birth year.

Who is the person for whom nobody else has an earlier birth year?  That's the oldest, of course.
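
For readers coming from SQL, the same trick can be written as a LEFT JOIN that keeps only the rows where the join found nothing.  The sketch below uses Python's sqlite3 with a made-up person table and made-up labels; SQL has MIN built in, of course, so this is only meant to show the logic of the idiom, not to suggest you would write it this way in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (label TEXT, birth_year INTEGER);
    INSERT INTO person VALUES
        ('Person One', 1888), ('Person Two', 1890), ('Person Three', 1915);
""")

# LEFT JOIN plays the role of OPTIONAL: try to pair each person with
# someone born earlier.  IS NULL plays the role of !bound(?older):
# keep only the people for whom no such pairing exists.
oldest = conn.execute("""
    SELECT p.label, p.birth_year
    FROM person p
    LEFT JOIN person older ON older.birth_year < p.birth_year
    WHERE older.label IS NULL
""").fetchall()
print(oldest)  # [('Person One', 1888)]
```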

Left as an exercise for the reader:

 

  1. Suppose we also have :death-year represented in the same way as :birth-year, but there is no triple in case the person is still living.  How do we modify this query to find the oldest living family member?  Or the youngest dead kennedy?
  2. What happens if the oldest (youngest) is not unique?  What would you expect to happen?  What does this query do?

Andy Warhol moment

I get more attention from other people's blogs than I do on my own - maybe if I were to blog more often, this wouldn't be the case. Today, Henry Story mentioned a blog entry by Nova Spivack, who mentioned a comment I made at an SD Forum Semantic Web SIG meeting a couple weeks back.  This is the most quoted I've been in ages!  Thanks Nova & Henry!

Flickr read my blog!

Well, I can pretend that they did.  At least, the main complaints that I outlined here have been addressed.  If you look at an RSS feed from Flickr, the geocode information is now available in a sensible form!

Excuse me, I have some cool demos to put together . . .
