Wednesday, August 31, 2005

The meaning of Entropy 

Jean-Bernard Brissaud
Clarifying the meaning of entropy led us to distinguish two points of view: the external one, that of the observer of the studied system, and the internal one, that of the system itself.

The external point of view leads to widely accepted associations: entropy as a lack of information, or indetermination, about the microscopic state of the studied system.

The internal point of view, the one we would have if we were the studied system, leads to interpretations that are more rarely seen and yet useful. Entropy is seen as a measure of information or of freedom of choice.

These two analogies fit well together, and are tied by the duality of their common unit: the bit. A bit of information represents one possibility out of two, a bit of freedom represents one choice out of two.
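The duality of the bit can be made concrete with a short numeric sketch (this is our illustration, not the paper's):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: the observer's indetermination about
    the system's microstate (external view), equal to the number of
    binary choices available to the system (internal view)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin: one bit of missing information for the observer,
# one binary choice of freedom for the system.
print(entropy_bits([0.5, 0.5]))   # 1.0
# Four equally likely microstates: two bits.
print(entropy_bits([0.25] * 4))   # 2.0
```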

. . .

Entropy is often equated with disorder, and this conception seems to us inappropriate. Temperature, instead, is a good measure of disorder, since it measures molecular agitation: the part of the motion that does not contribute to a possible global motion.

Equating entropy with disorder also leads to another, unwise, definition of order as the absence of freedom, since entropy measures freedom.

Topics: Entropy | Information | Meaning


Identity Semantics in XBRL and SDMX 

I have recently been working with two XML languages. One is SDMX, which is used for representing statistical data; the other is XBRL, with which I have been tangentially involved, used for representing business and accounting data. Although these two standards originally started out taking very different approaches to solving two different problems, they are increasingly being seen as competitors, particularly in the international banking industry. This competition has intensified since SDMX applied for ISO certification. Apparently ISO is a little nervous about having two overlapping standards, and would like them to resolve their differences in some coordinated way.

But what struck me were the parallels between XBRL and topic maps, with their shared global-identifier-based approach to subject (business fact) identity. Although SDMX initially had much more modest representational goals, it took a fundamentally different combinatorial, or multidimensional, approach to the identification of statistical observations. Now, in its 2.0 version, it is considering how to add concepts for a more flexible structure of relationships between the characteristics of statistical series, like what can be done in XBRL.

When you are first exposed to XBRL, it can be a very jarring experience. The approach seems very complicated and heavily dependent on XLink linkbases, which are not widely used or understood. For example, one paper (2) gave this description:

A counter-intuitive approach? A well constructed XML schema can be a complex document, but in general schemas are comprehensible to an appropriately knowledgeable reader. The same frequently cannot be said of an XBRL taxonomy. The removal of all structure from the schema and its placement in [XLink] linkbases makes for a schema that gives little idea of the sense of the data it is designed to constrain. The linkbases themselves are equally hard to read, precisely because the concepts they structure are in the schema! In short, without software designed to process it, XBRL is very difficult to work with. So why on earth would anyone want to use a format like this?
Requirements for flexibility. The answer to this question lies in the stringent requirements that I mentioned earlier. The "X" in XBRL is more than just a moniker stolen from its parent technology - XBRL strives to provide an unusual degree of flexibility. At the heart of this lies an interesting approach to a very general problem of data modelling in XML (and indeed data modelling more generally). What is the best way to define a format such that it captures data with the maximum possible precision, but at the same time can be extended/rearranged with the minimum impact on backwards compatibility?
The article goes on to describe how it approaches identity:
Everything's global. A corollary of the limited [use of] hierarchy is the requirement that all concepts defined in an XBRL Taxonomy Schema must be defined globally. This results in a very different style of schema to that which most people are used to - an enormously long list of global element definitions, which often have long names reflecting their very precise level of applicability. As an example, the IFRS (International Financial Reporting Standards) taxonomy contains concepts with names like:
Linkbases are then used to structure a graph of relationships between these global identifiers.
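The pattern the article describes can be sketched abstractly: a flat set of global concept names, with all structure carried by separate relationship arcs. The concept names below are hypothetical, not drawn from the actual IFRS taxonomy:

```python
# Hypothetical XBRL-style pattern: concepts form a flat global list
# (the taxonomy schema), while all structure lives in separate
# parent-child arcs (the linkbases).
concepts = {
    "Assets": {"type": "monetary"},
    "AssetsCurrent": {"type": "monetary"},
    "CashAndCashEquivalents": {"type": "monetary"},
}

# A "presentation linkbase": arcs between global concept names.
presentation_arcs = [
    ("Assets", "AssetsCurrent"),
    ("AssetsCurrent", "CashAndCashEquivalents"),
]

def children(parent):
    """Recover hierarchy from the arc graph, not from the schema."""
    return [c for p, c in presentation_arcs if p == parent]

print(children("Assets"))  # ['AssetsCurrent']
```

The point of the split is extensibility: a new taxonomy can reuse the same global concepts and supply different arcs, rearranging the structure without touching the schema.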

SDMX, on the other hand, takes a very hierarchical approach to structuring XML, with Message/Group/Dataset/Series/Observation forming a strict hierarchy. Identity, in contrast, is defined by a fixed number of dimensions (property sets, in OLAP terms) whose members are drawn from code lists and organized into a key family (1), or multidimensional space.

We now have a more sophisticated understanding of what a key family does: it specifies a set of concepts which describe and identify a set of data. It tells us which concepts are dimensions (identification and description), and which are attributes (just description), and it gives us an attachment level for each of these concepts, based on the packaging structure (Data Set, Group, Series, Observation). It also tells us which code lists provide possible values for the dimensions, and gives us the possible values for the attributes, either as code lists or as numeric or free text fields.
The interesting thing about key family structures is that they can be used as metadata to automatically generate web pages for navigational search in a very general, yet very efficient and intuitive way.
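A key family's combinatorial identification can be sketched as follows; the dimension names and code lists are invented for illustration, not taken from any real SDMX key family:

```python
# Hypothetical SDMX-style key family: a fixed set of dimensions,
# each drawing its members from a code list.
key_family = {
    "FREQ":      ["A", "Q", "M"],    # annual, quarterly, monthly
    "REF_AREA":  ["US", "DE", "JP"],
    "INDICATOR": ["CPI", "GDP"],
}

def series_key(**dims):
    """Identify a series combinatorially: one code per dimension."""
    for dim, code in dims.items():
        if code not in key_family.get(dim, []):
            raise ValueError(f"{code!r} is not in the code list for {dim}")
    return tuple(dims[d] for d in key_family)  # fixed dimension order

print(series_key(FREQ="M", REF_AREA="US", INDICATOR="CPI"))
# ('M', 'US', 'CPI')
```

Because every dimension has an enumerated code list, the key family itself is enough metadata to generate a navigational menu: one drop-down per dimension, which is what makes the web-page generation mentioned above so mechanical.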

While the Federal Reserve started out with a very elaborate ontology of global identifiers, they are now trying to shift to a more dimensional approach to statistical time series identification so that non-expert users can find statistical data.

However, this still leaves the question of how a novice user finds the right key family in the first place. This is where things like the folksonomy debate, the beta "Google Select" (3), and del.icio.us become interesting. Anyway, not to make this message too long: I'm working on a system that will use free-text search over a loose collection of word vectors to point a user towards a key-family-based menu system, for selecting a set of global identifiers and downloading statistical data.
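A minimal sketch of that first step, assuming bag-of-words vectors and cosine similarity (the key families and their descriptions are invented):

```python
import math
from collections import Counter

# Hypothetical key families described by loose bags of words.
key_family_docs = {
    "EXR": "exchange rates currency daily spot",
    "CPI": "consumer price index inflation monthly",
}

def vec(text):
    """Bag-of-words vector as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query):
    """Rank key families by similarity to a free-text query."""
    q = vec(query)
    scores = {k: cosine(q, vec(d)) for k, d in key_family_docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank("monthly inflation"))  # ['CPI', 'EXR']
```

The top-ranked key family would then drive the dimension-by-dimension menu described above.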

Interesting stuff, with a lot of strong parallels to topic map ideas.

Topics: Identity | Meaning | TopicMaps | SDMX | XBRL | Google


Tuesday, August 09, 2005

Microformats vs. XML vs. RDF 

Dare Obasanjo
Tantek wrote

The sad thing is that while namespaces theoretically addressed one of the problems I pointed out (calling different things by the same name), it actually WORSENED the other problem: calling the same thing by different names. XML Namespaces encouraged document/data silos, with little or no reuse, probably because every person/political body defining their elements wanted "control" over the definition of any particular thing in their documents. The <svg:a> tag is the perfect example of needless duplication.

And if something was theoretically supposed to have solved something but effectively hasn't 6-7 years later, then in our internet-time-frame, it has failed.

This is a valid problem in the real world. For example, for all intents and purposes, an <atom:entry> element in an Atom feed is semantically equivalent to an <item> element in an RSS feed to every feed reader that supports both. However, we have two names for what is effectively the same thing as far as an aggregator developer or end user is concerned.

The XML solution to this problem has been that it is OK to have myriad formats as long as we have technologies for performing syntactic translations between XML vocabularies, such as XSLT. The RDF solution is for us to agree on the semantics of the data in the format (i.e., a canonical data model for that problem space), in which case alternative syntaxes are fine and we perform translations using RDF-based mapping technologies like DAML+OIL or OWL. The microformat solution which Tantek espouses is that we all agree on both a canonical data model and a canonical syntax (typically some subset of [X]HTML).
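The "syntactic translation" option amounts to a mechanical rename. A hedged sketch of the idea (the feed snippet and the rename rule are hypothetical, and a real aggregator would do far more):

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

# A tiny Atom feed; in practice this would be fetched from a URL.
feed = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>Hello</title></entry>'
    '</feed>'
)

# Syntactic translation: rename <atom:entry> to RSS-style <item>,
# treating the two names as the same thing.
for el in feed.iter():
    if el.tag == f"{{{ATOM}}}entry":
        el.tag = "item"

print(ET.tostring(feed, encoding="unicode"))
```

XSLT does the same job declaratively; either way the burden of agreeing on names is shifted from format designers to translation authors.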

So far the approach that has gotten the most traction in the real world is XML. From my perspective, the reason for this is obvious: it doesn't require everyone to agree on a single data model or a single format for the problem space.

Microformats don't solve the problem of different entities coming up with different names for the same concept. Instead, their proponents ignore the reasons the problem exists in the first place and then offer microformats as a panacea when they are not.

Topics: Microformats | RDF


Biological Metaphors of Internet Serendipity 

Jon Udell

Increasingly I think about this stuff in biological terms. I'm a cell; the blog is my cell membrane; the items I post here extrude that membrane out into the intercellular environment, forming a complex surface area with which other cells interact. The other day, a piece of Brenda's extruded surface touched a piece of mine. I know that because my surface is instrumented with a variety of sensors: my referral log, del.icio.us, Feedster, PubSub, Technorati. So when this pseudopod of Brenda's touched this pseudopod of mine, I noticed.

. . .

By subscribing to Brenda's feed, and then posting this item, I reinforce the connection, and it's cool that things work that way. But the initial discovery is the most amazing thing. It looks like serendipity, and in a way it is, but it's manufactured serendipity.

Topics: Biology | Serendipity

