Wednesday, August 31, 2005

Identity Semantics in XBRL and SDMX 

I have recently been working with two XML languages. One is SDMX, which is used for representing statistical data; the other is XBRL, which I have been tangentially involved with, for representing business and accounting data. Although these two standards started out taking very different approaches to two different problems, they are increasingly seen as competitors, particularly in the international banking industry. This competition has increased since SDMX applied for ISO certification. Apparently ISO is a little nervous about having two overlapping standards, and would like them to resolve their differences in some coordinated way.

But what struck me was the parallels between XBRL and topic maps in their global-identifier-based approach to subject (business fact) identity. Although SDMX initially had much more modest representational goals, it took a fundamentally different, combinatorial or multidimensional, approach to the identification of statistical observations. Now, in its 2.0 version, it is considering how to add concepts for a more flexible structure of relationships between the characteristics of statistical series, like what can be done in XBRL.

When you are first exposed to XBRL, it can be a very jarring experience. The approach seems very complicated and heavily dependent on XLink linkbases, which are not widely used or understood. For example, one paper (2) gave this description:

A counter-intuitive approach? A well constructed XML schema can be a complex document, but in general schemas are comprehensible to an appropriately knowledgeable reader. The same frequently cannot be said of an XBRL taxonomy. The removal of all structure from the schema and its placement in [XLink] linkbases makes for a schema that gives little idea of the sense of the data it is designed to constrain. The linkbases themselves are equally hard to read, precisely because the concepts they structure are in the schema! In short, without software designed to process it, XBRL is very difficult to work with. So why on earth would anyone want to use a format like this?
Requirements for flexibility. The answer to this question lies in the stringent requirements that I mentioned earlier. The "X" in XBRL is more than just a moniker stolen from its parent technology - XBRL strives to provide an unusual degree of flexibility. At the heart of this lies an interesting approach to a very general problem of data modelling in XML (and indeed data modelling more generally). What is the best way to define a format such that it captures data with the maximum possible precision, but at the same time can be extended/rearranged with the minimum impact on backwards compatibility?
The article goes on to describe how XBRL approaches identity:
Everything's global. A corollary of the limited [use of] hierarchy is the requirement that all concepts defined in an XBRL Taxonomy Schema must be defined globally. This results in a very different style of schema to that which most people are used to - an enormously long list of global element definitions, which often have long names reflecting their very precise level of applicability. As an example, the IFRS (International Financial Reporting Standards) taxonomy contains concepts with names like:
Linkbases are then used to structure a graph of relationships between these global identifiers.
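
To make the idea concrete, here is a minimal sketch of a flat list of global concepts with the structure held separately as a graph of arcs. It is not XBRL or XLink syntax: the concept names, the `ex_` prefix, the arc tuples and both helper functions are invented for illustration, and are not taken from the IFRS taxonomy.

```python
from collections import defaultdict

# Flat, global list of concepts: no nesting, each name stands on its own.
# All names are invented placeholders, not real taxonomy elements.
concepts = {"ex_Assets", "ex_CurrentAssets", "ex_NonCurrentAssets", "ex_Cash"}

# A stand-in for a linkbase: (parent, child, role) arcs kept apart
# from the concept list itself.
presentation_arcs = [
    ("ex_Assets", "ex_CurrentAssets", "parent-child"),
    ("ex_Assets", "ex_NonCurrentAssets", "parent-child"),
    ("ex_CurrentAssets", "ex_Cash", "parent-child"),
]


def build_graph(arcs, known):
    """Turn arcs into an adjacency list, rejecting arcs to unknown concepts."""
    graph = defaultdict(list)
    for parent, child, _role in arcs:
        if parent not in known or child not in known:
            raise KeyError(f"arc refers to an undefined concept: {parent} -> {child}")
        graph[parent].append(child)
    return graph


def descendants(graph, root):
    """Return every concept below root, in depth-first (pre-order) order."""
    found = []
    stack = list(reversed(graph.get(root, [])))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(reversed(graph.get(node, [])))
    return found


graph = build_graph(presentation_arcs, concepts)
print(descendants(graph, "ex_Assets"))
```

The point of the sketch is that rearranging the hierarchy only means swapping the list of arcs; the set of global identifiers does not change.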

SDMX, on the other hand, takes a very hierarchical approach to structuring XML, with Message/Group/Dataset/Series/Observation forming a strict hierarchy. Identity, however, is defined by a fixed number of dimensions (property sets, as in OLAP) whose members are drawn from code lists and organized into a key family (1), or multidimensional space.

We now have a more sophisticated understanding of what a key family does: it specifies a set of concepts which describe and identify a set of data. It tells us which concepts are dimensions (identification and description), and which are attributes (just description), and it gives us an attachment level for each of these concepts, based on the packaging structure (Data Set, Group, Series, Observation). It also tells us which code lists provide possible values for the dimensions, and gives us the possible values for the attributes, either as code lists or as numeric or free text fields.
The interesting thing about key family structures is that they can be used as metadata to automatically generate web pages for navigational search in a very general, yet very efficient and intuitive way.
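
As a rough sketch of what that means in practice, a key family can be modelled as an ordered list of concepts, each marked as a dimension or an attribute and given an attachment level and an optional code list. Everything below is an assumption made for illustration: the `Concept` class, the concept names and codes, the dot-joined key and the `navigation_menu` helper are not taken from the SDMX specification.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Concept:
    name: str
    is_dimension: bool  # True: identifies and describes; False: attribute, describes only
    attachment: str     # "DataSet", "Group", "Series" or "Observation"
    codes: Optional[Tuple[str, ...]] = None  # None: numeric or free-text value


# Hypothetical key family; names and codes are illustrative only.
KEY_FAMILY = (
    Concept("FREQ", True, "Series", ("A", "M", "Q")),
    Concept("REF_AREA", True, "Series", ("DE", "FR", "US")),
    Concept("INDICATOR", True, "Series", ("CPI", "GDP")),
    Concept("UNIT", False, "Series", ("EUR", "USD")),
    Concept("OBS_STATUS", False, "Observation", ("E", "P")),
    Concept("TITLE", False, "Group", None),
)


def series_key(values):
    """Build a series identifier from the dimension values, in key family order."""
    parts = []
    for concept in KEY_FAMILY:
        if not concept.is_dimension:
            continue  # attributes describe a series but do not identify it
        value = values[concept.name]
        if value not in concept.codes:
            raise ValueError(f"{value!r} is not in the code list for {concept.name}")
        parts.append(value)
    return ".".join(parts)


def navigation_menu():
    """One menu per dimension, with its code list as the choices."""
    return {c.name: list(c.codes) for c in KEY_FAMILY if c.is_dimension}


print(series_key({"FREQ": "M", "REF_AREA": "DE", "INDICATOR": "CPI"}))
print(navigation_menu())
```

The menu function is the part that corresponds to generating navigation pages from metadata: it needs nothing beyond the key family definition.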

While the Federal Reserve started out with a very elaborate ontology of global identifiers, it is now trying to shift to a more dimensional approach to statistical time series identification, so that non-expert users can find statistical data.

However, this still leaves the question of how a novice user finds the right key family. This is where things like the folksonomy debate, the beta “Google Select” (3) and del.icio.us become interesting. Anyway, not to make this message too long: I’m working on a system that will use free text search on a loose collection of word vectors to point a user toward a key-family-based menu system, from which they select a set of global identifiers for downloading statistical data.
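
A toy version of the first step, matching a free text query against word vectors to suggest a key family, might look like the sketch below. It is only an illustration of the idea, not the system itself: the key family names, their descriptive text, and the choice of plain word counts with cosine similarity are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical key families, each with a loose bag of descriptive words.
KEY_FAMILY_TEXT = {
    "EXCHANGE_RATES": "daily monthly exchange rates currency euro dollar",
    "CONSUMER_PRICES": "consumer price index inflation monthly prices",
    "BANK_LENDING": "bank loans lending credit interest rates",
}


def vectorize(text):
    """Turn text into a word-count vector (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_key_families(query):
    """Return key family names that share words with the query, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), name) for name, text in KEY_FAMILY_TEXT.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


print(rank_key_families("inflation and consumer prices"))
```

The best-ranked key family would then drive the dimension menus, and the menu choices in turn resolve to the identifiers used for the download.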

Interesting stuff, with a lot of strong parallels to topic map ideas.

Topics: Identity | Meaning | TopicMaps | SDMX | XBRL | Google
