Wednesday, October 26, 2005
Managing Semi-Structured Data
Traditional tools require the data schema to be defined before any data is created. Unfortunately, the schema sometimes emerges only after the software is already in use, and it often changes as the information grows. A typical example is the information contained in the item descriptions on eBay. It seems impossible for the eBay developers to define an a priori schema for these descriptions, because their content is known only once new items are entered into the eBay database. Today, all of this information is stored as raw text and searched using only keywords, which significantly limits its usability. eBay has some standard entities (e.g., buyer, date, ask, bid), but the meat of the information, the item descriptions, has a rich and evolving structure that isn’t captured.
Traditional software design methodology does not work in such cases. One cannot rigidly follow the usual steps in sequence:
- Gather knowledge about the data to be manipulated by the software components being designed.
- Design a schema to model this information.
- Populate the schema with data.
We need software and methodologies that allow a more flexible process in which the steps are interleaved freely, while at the same time allowing us to process this information automatically.
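To make the idea of interleaving these steps concrete, here is a minimal Python sketch, not a description of how eBay or any particular tool works. It assumes each item carries a few fixed fields plus a free-form dictionary of attributes. The sample records and the names `emergent_schema` and `query` are hypothetical, introduced only for illustration. The schema is derived from the data after it has been entered, yet the data can still be queried by attribute and not just by keyword.

```python
from collections import Counter

# Hypothetical item records: a few fields are fixed up front ("seller", "ask");
# everything else lives in a free-form "attrs" dictionary that grows with the data.
items = [
    {"seller": "alice", "ask": 120.0, "attrs": {"brand": "Nikon", "megapixels": 6}},
    {"seller": "bob", "ask": 15.5, "attrs": {"author": "Knuth", "binding": "hardcover"}},
    {"seller": "carol", "ask": 300.0,
     "attrs": {"brand": "Canon", "megapixels": 8, "lens": "18-55mm"}},
]


def emergent_schema(records):
    """Summarize which attribute names occur, how often, and with which value types."""
    counts = Counter()
    types = {}
    for record in records:
        for name, value in record["attrs"].items():
            counts[name] += 1
            types.setdefault(name, set()).add(type(value).__name__)
    return {name: {"count": counts[name], "types": sorted(types[name])}
            for name in counts}


def query(records, **conditions):
    """Return the records whose attributes match every name=value condition."""
    return [r for r in records
            if all(r["attrs"].get(k) == v for k, v in conditions.items())]


# The schema is computed from whatever data exists at the moment.
schema = emergent_schema(items)

# Structured queries work without any schema declared in advance.
canon_sellers = [r["seller"] for r in query(items, brand="Canon")]

# A new item may introduce attributes nobody anticipated; the schema simply grows.
items.append({"seller": "dave", "ask": 40.0, "attrs": {"brand": "Seiko", "strap": "leather"}})
updated_schema = emergent_schema(items)
```

The design choice in this sketch is to treat the schema as a summary computed from the data instead of a contract the data must satisfy, so gathering knowledge, modeling, and populating can happen in any order.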