Monday, August 25, 2008

RDF and Data 2.0 

I have been talking to my friend Bryan Thompson since he got back from giving his talk at OSCON, "Cloud Computing with bigdata", and it got me thinking again about RDF data storage, which seems to be getting increasing mindshare lately. Bryan's point is that his approach to RDF provides a next-generation data model that generalizes all the Google Bigtable clones, like Cassandra or HBase, that are starting to pop up everywhere. What follows are my thoughts on what it might take to get widespread acceptance for something like an RDF store in the enterprise.

Just as an RSS reader accumulates feed data, an RDF store could be set up to accumulate Data 2.0 feeds. The RDF would be schema-free and possibly untyped, and could be mashed up from a variety of sources. Just as RSS data is typically organized into a structure of folders, RDF could also be supplemented with metadata depending on the source. This metadata could include information on data types, sources, provenance, etc. There would be no worry about nulls, unexpected properties, or conflicting schema or type information.

A human user with an intuitive understanding of the semantics of the data could then navigate through the network of loose relationships or use SPARQL to do set-based queries. Interestingly, the first large-scale RDF startups, like Freebase from Metaweb Technologies Inc., are using RDF to provide a flexible database for human-facing “smart navigation” of mashed-up data; for example, see Parallax. Another example is Radar Networks' Twine.

The problem comes when you try to run automated processes against this data. My initial concerns about RDF were that it represented data at too granular a level of detail, and that the lack of top-down structure made it hard for a human user to write efficient queries against the data. I have now come to a different conclusion. I’m coming around to the idea that the traditional entity model (of relational tables) is really designed to match machine processing requirements, not just to present a simple mental model for human designers and system architects. An entity model of relational tables provides the schema needed to define a simple mechanical process that iterates over data without throwing an exception.

Consider the structure: a primary key; well-defined, one-to-one, strongly typed properties that are explicitly nullable or non-nullable; and foreign keys for one-to-many relationships. That is exactly the schema a reliable, iterative mechanical process across data requires. This is further supported by the closure property of SQL: processing tables produces tables, which can then be processed further. The entity model enforces the constraints needed to keep a mechanical process from crashing.
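To make that concrete, here is a rough Python sketch of the difference. Everything in it is made up for illustration (the `Product` entity, the sample rows, the function names); it only tries to show why a process that iterates over typed, explicitly nullable entities has nothing to trip over, while the same loop over loose data fails on the first surprise.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical loosely structured records, as they might arrive from a mash-up.
loose_rows = [
    {"id": 1, "name": "Widget", "price": "9.50"},
    {"id": 2, "name": "Gadget"},                     # price is missing
    {"id": 3, "name": "Gizmo", "price": "unknown"},  # price is not numeric
]

def total_loose(rows):
    """Naive mechanical process: raises on the first unexpected row."""
    return sum(float(row["price"]) for row in rows)

@dataclass(frozen=True)
class Product:
    """Entity with a primary key and typed, explicitly nullable properties."""
    id: int
    name: str
    price: Optional[float]  # nullability is declared, so the loop can plan for it

def to_entity(row):
    """Admit a row into the entity model, or return None if it violates it."""
    try:
        price = float(row["price"]) if "price" in row else None
        return Product(id=int(row["id"]), name=str(row["name"]), price=price)
    except (KeyError, ValueError):
        return None

def total_entities(products):
    """The same process over entities never meets an unexpected shape."""
    return sum(p.price for p in products if p.price is not None)

# Rows that do not fit the entity model are filtered out before processing.
entities = [e for e in map(to_entity, loose_rows) if e is not None]
```

Calling `total_loose(loose_rows)` raises on the second row, while `total_entities(entities)` runs to completion because the bad row was rejected at the boundary and the nullable price was declared up front.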

That is why DBAs are so fanatical about defining good relational structure, and why designing the entity model for a database with many diverse processing requirements can turn into such a difficult political negotiation, often with a lot of heated debate over entity-relationship structure. Unfortunately, once the social contracts of these political truces are written into the structure of the database schema, the understanding of the forces and compromises that went into its structure is often lost. This is the double-edged sword of database structure. It protects the existing mechanical processes, but can be brittle to change. So while RDF supports a flexible, loosely typed mash-up of diverse data streams, a table-based entity model supports the constraints on data required for simple and reliable iterative processes.

For reliable high-volume processing of RDF data, you would need a way to test the data against a more traditional, schema-hardened entity model. Schema violations could then be added to the RDF data as structured metadata. This keeps a nice clean separation between the source data and potentially numerous alternative processing schemas. These entity models could be virtual views or physical copies of the original RDF data.
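As a minimal sketch of that idea, assume the RDF is just a list of (subject, predicate, object) tuples and a processing schema is a dictionary of required predicates with value checks. The `ex:` and `meta:` names, the schema format, and the use of a fourth "graph" element to record which schema was tested are all my own assumptions, not any particular store's API.

```python
# Source data as (subject, predicate, object) triples; all names are made up.
triples = [
    ("ex:p1", "ex:name", "Widget"),
    ("ex:p1", "ex:price", "9.50"),
    ("ex:p2", "ex:name", "Gadget"),
    ("ex:p3", "ex:name", "Gizmo"),
    ("ex:p3", "ex:price", "unknown"),
    ("ex:p3", "ex:price", "12.00"),
]

def is_decimal(value):
    """True if the literal can be read as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

# One hypothetical processing schema: predicate -> (required, value check).
product_schema = {
    "ex:name": (True, lambda v: len(v) > 0),
    "ex:price": (True, is_decimal),
}

def check(triples, schema, graph):
    """Return violations as quads; the source triples are left untouched."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, []).append(o)
    violations = []
    for s, props in sorted(by_subject.items()):
        for p, (required, valid) in schema.items():
            values = props.get(p, [])
            if required and not values:
                violations.append((s, "meta:missing", p, graph))
            if len(values) > 1:  # an entity column holds at most one value
                violations.append((s, "meta:multiValued", p, graph))
            if any(not valid(v) for v in values):
                violations.append((s, "meta:badValue", p, graph))
    return violations

violations = check(triples, product_schema, "meta:productSchema")
```

Because the violations live in their own graph, a second schema with different rules could be checked against the same source triples without the two sets of results interfering.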

Similar virtualization to provide more flexible structuring of data is already becoming increasingly common through the use of tools like Hibernate and many other open source ORM tools (in the Java space) and LINQ, Entity Framework and NHibernate, among others (in the .NET space). The use of sparse columns in SQL Server is another approach to providing some of the flexibility of RDF in a more traditional database context.

This suggests the possibility of directly accessing virtualized, table-structured entity views from an RDF store. You would need to write drivers that translate the virtual entity structure into SPARQL queries instead of SQL queries. Java and .NET programs could then access the data just as they would relational tables. Interestingly, SPARQL produces table-structured data more naturally than it does RDF relations.
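Here is a rough sketch of the translation step such a driver would need, written in Python for brevity. The function name, the shape of the entity definition, and the example IRIs are all hypothetical; the only real thing is the SPARQL it emits, where each column becomes a triple pattern and nullable columns are wrapped in OPTIONAL.

```python
def entity_to_sparql(type_iri, columns, nullable=()):
    """Build a SPARQL SELECT that returns one row per entity instance.

    columns maps a column name to the predicate IRI that supplies it.
    Columns listed in nullable are wrapped in OPTIONAL, so a missing
    property yields an unbound (NULL-like) cell instead of dropping the row.
    """
    variables = " ".join(f"?{name}" for name in ["id", *columns])
    patterns = [f"?id a <{type_iri}> ."]
    for name, predicate in columns.items():
        triple = f"?id <{predicate}> ?{name} ."
        patterns.append(f"OPTIONAL {{ {triple} }}" if name in nullable else triple)
    body = "\n  ".join(patterns)
    return f"SELECT {variables}\nWHERE {{\n  {body}\n}}"

# A made-up Product entity with a required name and a nullable price.
query = entity_to_sparql(
    "http://example.org/Product",
    {"name": "http://example.org/name", "price": "http://example.org/price"},
    nullable={"price"},
)
print(query)
```

One caveat with this sketch: a property that has several values in the RDF would come back as several rows for the same `?id`, so a real driver would have to decide how to enforce the one-to-one assumption of the entity model.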

Virtuoso has a very interesting solution to this problem, where they provide RDF views of SQL data; see Virtuoso's Meta Schema Language for Declaratively generating RDF Views of SQL Data. I think Oracle is also trying to integrate RDF into its SQL framework. This seems like it has a lot of immediate potential. Using the Linked Data Web metaphor immediately opens up a lot of data for mashing on the Web, and is consistent with the first movers being human data browsers who want a smarter, more semantic interface. I'm wondering, however, if ultimately providing a SQL view of RDF-optimized data isn't a better approach than trying to provide an RDF view of SQL data.

A more workflow-based ETL (extract, transform and load) approach would also be possible, one that exports RDF to traditional database schema storage. This would require traditional database ETL tools like Informatica to support RDF datastores as sources and targets.
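Just to illustrate the shape of such a workflow, here is a toy version in Python using an in-memory SQLite database. It is not how a real ETL tool would do it; the triples, the `product` table, and the column mapping are all invented for the example.

```python
import sqlite3

# Made-up source triples standing in for an RDF store.
triples = [
    ("ex:p1", "ex:name", "Widget"),
    ("ex:p1", "ex:price", "9.50"),
    ("ex:p2", "ex:name", "Gadget"),
]

# Hypothetical mapping from predicates to relational column names.
columns = {"ex:name": "name", "ex:price": "price"}

def extract(triples, columns):
    """Pivot triples into one dict per subject, keeping only mapped predicates."""
    rows = {}
    for s, p, o in triples:
        if p in columns:
            rows.setdefault(s, {})[columns[p]] = o
    return rows

def transform(rows):
    """Convert each pivoted dict into a typed (id, name, price) tuple."""
    result = []
    for subject, row in sorted(rows.items()):
        price = float(row["price"]) if "price" in row else None
        result.append((subject, row.get("name"), price))
    return result

def load(records):
    """Load the records into a table with a conventional hardened schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product (id TEXT PRIMARY KEY, name TEXT NOT NULL, price REAL)"
    )
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)", records)
    return conn

conn = load(transform(extract(triples, columns)))
loaded = conn.execute("SELECT id, name, price FROM product ORDER BY id").fetchall()
```

Once the data is in the table, the NOT NULL and type constraints do the protecting, and any ordinary SQL tool can work with it.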

Anyways, just more food for thought.

Topics: RDF | Data2.0
