
Wednesday, May 04, 2005

Long anticipated, the arrival of radically restructured database architectures is now finally at hand. 

Jim Gray
Application developers still have the three-tier and n-tier design options available to them, but now two-tier is an option again. For many applications, the simplicity of the client/server approach is understandably attractive. Still, security concerns (databases offer vast attack surfaces) will likely lead many designers to opt for three-tier architectures that place only Web servers in the demilitarized zone, with the database systems tucked safely away behind those Web servers on private networks.

. . .

All major database systems now incorporate queuing mechanisms that make it easy to define queues, to queue and de-queue messages, to attach triggers to queues, and to dispatch the tasks that the queues are responsible for driving.
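
The excerpt names no particular product, so here is a minimal sketch of that pattern in PostgreSQL-flavored SQL; the table and payload are illustrative. SKIP LOCKED lets several workers de-queue concurrently without blocking one another.

    -- A minimal work queue: each row is one pending message.
    CREATE TABLE task_queue (
        id       bigserial   PRIMARY KEY,
        payload  text        NOT NULL,
        enqueued timestamptz NOT NULL DEFAULT now()
    );

    -- Enqueue a message.
    INSERT INTO task_queue (payload) VALUES ('reindex customer 42');

    -- De-queue: claim and remove the oldest message atomically.
    DELETE FROM task_queue
    WHERE id = (SELECT id FROM task_queue
                ORDER BY enqueued
                LIMIT 1
                FOR UPDATE SKIP LOCKED)
    RETURNING payload;

An AFTER INSERT trigger on task_queue would then be the natural place to hang the dispatching logic the excerpt mentions.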

. . .

the tendency is to implement publish/subscribe and workflow systems on top of the basic queuing system. Ideas about how best to handle workflows and notifications are still controversial—and the focus of ongoing experimentation.

. . .

The key question facing researchers is how to structure workflows. Frankly, a general solution to this problem has eluded us for several decades. Because of the current immediacy of the problem, however, we can expect to see plenty of solutions in the near future. Out of all that, some clear design patterns are sure to emerge, which should then lead us to the research challenge: characterizing those design patterns.

. . .

Now the time has come to build on those earliest mining efforts. Already, we’ve discovered how to embrace and extend machine learning through clustering, Bayes nets, neural nets, time-series analysis, and the like. Our next step is to create a learning table (labeled T for this discussion). The system can be instructed to learn columns x, y, and z from attributes a, b, and c—or, alternatively, to cluster attributes a, b, and c or perhaps even treat a as a time stamp for b. Then, with the addition of training data into learning table T, some data-mining algorithm builds a decision tree or Bayes net or time-series model for our dataset.
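
Gray gives no concrete syntax for this in the excerpt, so the LEARN clause below is hypothetical, invented purely to make the idea tangible; everything else is ordinary SQL.

    -- Hypothetical declaration: T learns x, y, z from a, b, c.
    CREATE TABLE T (
        a float, b float, c float,  -- observed attributes
        x float, y float, z float   -- columns the system should learn
    )
    LEARN x, y, z FROM a, b, c;     -- not real SQL; a sketch only

    -- Ordinary inserts supply the training data; behind the scenes
    -- a mining algorithm fits a decision tree, Bayes net, or
    -- time-series model.
    INSERT INTO T (a, b, c, x, y, z)
    VALUES (1.0, 2.0, 3.0, 10.0, 20.0, 30.0);

    -- Probing with attributes alone would return the model's
    -- probability-weighted estimates for x, y, and z.
    SELECT x, y, z FROM T WHERE a = 1.5 AND b = 2.5 AND c = 3.5;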

. . .

Increasingly, one runs across tables with thousands of columns, typically because some object in the table has thousands of measured attributes. Many of the values in these tables prove to be null. For example, an LDAP object requires only seven attributes but allows roughly 1,000 optional ones.

Although it can be quite convenient to think of each object as a row in a table, actually representing objects that way would be highly inefficient, both in space and in bandwidth. Classic relational systems generally represent each row as a vector of values, even when most of those values are null. Sparse tables stored in this row-oriented way tend to be quite large yet only sparsely populated with information.

One approach to storing sparse data is to represent it as triples: key, attribute, value. This representation compresses extraordinarily well, often as a bitmap, which can reduce query times by orders of magnitude and opens up a wealth of new optimization possibilities. Although these ideas first appeared in Adabas and Model 204 in the early 1970s, they're currently enjoying a rebirth.
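
A sketch of that triple representation in plain SQL may help; the LDAP-flavored names are illustrative. Each non-null value becomes one row, so the thousand absent attributes cost nothing.

    -- One row per non-null value, instead of one column per attribute.
    CREATE TABLE ldap_triples (
        object_key bigint NOT NULL,  -- which object
        attribute  text   NOT NULL,  -- which of its ~1,000 attributes
        value      text   NOT NULL,  -- the value actually present
        PRIMARY KEY (object_key, attribute)
    );

    INSERT INTO ldap_triples VALUES
        (42, 'cn',   'Jim Gray'),
        (42, 'mail', 'jim@example.com');

    -- Reassembling a conventional "row" is a pivot over the triples.
    SELECT MAX(CASE WHEN attribute = 'cn'   THEN value END) AS cn,
           MAX(CASE WHEN attribute = 'mail' THEN value END) AS mail
    FROM ldap_triples
    WHERE object_key = 42;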

. . .

all of these data types, most particularly text retrieval, require that the database be able to deal with approximate answers and employ probabilistic reasoning. For most traditional relational database systems, this represents quite a stretch. It's also fair to say that much research remains before we can integrate textual, temporal, and spatial data types seamlessly into our database frameworks. We don't yet have a clear algebra for approximate reasoning, which we'll need not only to support these complex data types but also to enable more sophisticated data-mining techniques. The same issue came up earlier in the discussion of data mining: mining algorithms return ranked, probability-weighted results. Several forces, then, are pushing data management systems toward accommodating approximate reasoning.
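
For a feel of what ranked, approximate answers look like in a relational setting, here is a sketch using PostgreSQL's built-in full-text search (one concrete system; the excerpt names none). Results come back scored and ordered by relevance rather than as exact matches.

    CREATE TABLE docs (
        id   serial PRIMARY KEY,
        body text   NOT NULL
    );

    -- Ranked retrieval: ts_rank scores each match, and the result
    -- set is ordered by relevance, not filtered by exact equality.
    SELECT id,
           ts_rank(to_tsvector('english', body),
                   to_tsquery('english', 'database & workflow')) AS score
    FROM docs
    WHERE to_tsvector('english', body)
          @@ to_tsquery('english', 'database & workflow')
    ORDER BY score DESC
    LIMIT 10;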

. . .

The emergence of enterprise data warehouses has spawned a wholesale/retail data model whereby subsets of vast corporate data archives are published to various data marts within the enterprise, each of which has been established to serve the needs of some particular special interest group. This bulk publish/distribute/subscribe model, which is already quite widespread, employs just about every replication scheme you can imagine.
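
In its simplest form, that wholesale/retail flow can be approximated with a materialized view: the warehouse publishes a subset, and the mart refreshes it on its own schedule. PostgreSQL syntax; the schema and names are illustrative.

    -- The "wholesale" archive.
    CREATE TABLE warehouse_sales (
        region  text,
        product text,
        amount  numeric
    );

    -- A "retail" mart subscribing to one slice of the warehouse.
    CREATE MATERIALIZED VIEW emea_sales_mart AS
    SELECT region, product, SUM(amount) AS total
    FROM warehouse_sales
    WHERE region = 'EMEA'
    GROUP BY region, product;

    -- Periodic re-publication from the warehouse copy.
    REFRESH MATERIALIZED VIEW emea_sales_mart;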

. . .

It turns out that publish/subscribe systems and stream-processing systems are actually quite similar in structure. In each case, millions of standing queries are compiled into a dataflow graph, which is then incrementally evaluated to determine which subscriptions are affected by a change and must therefore be notified. In effect, each update ends up triggering notifications to every subscriber that has registered an interest in that particular information.
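
A toy version of that incremental evaluation in SQL: a table of standing predicates stands in for the compiled dataflow graph, and a trigger matches each update only against the subscriptions it could affect. The stock-price schema is illustrative, and the trigger syntax is modern PostgreSQL.

    CREATE TABLE prices (
        symbol text PRIMARY KEY,
        price  numeric NOT NULL
    );

    -- Standing queries, one predicate per subscriber.
    CREATE TABLE subscriptions (
        subscriber text    NOT NULL,
        symbol     text    NOT NULL,  -- "tell me about this symbol..."
        threshold  numeric NOT NULL   -- "...when its price exceeds this"
    );

    CREATE TABLE notifications (
        subscriber text,
        symbol     text,
        price      numeric
    );

    -- Each update is evaluated only against the affected subscriptions.
    CREATE FUNCTION notify_subscribers() RETURNS trigger AS $$
    BEGIN
        INSERT INTO notifications (subscriber, symbol, price)
        SELECT s.subscriber, NEW.symbol, NEW.price
        FROM subscriptions s
        WHERE s.symbol = NEW.symbol
          AND NEW.price > s.threshold;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER on_price_change
    AFTER INSERT OR UPDATE ON prices
    FOR EACH ROW EXECUTE FUNCTION notify_subscribers();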

. . .

Indeed, if every file system, every disk, every phone, every TV, every camera, and every piece of smart dust is to have a database inside, then those database systems will need to be self-managing, self-organizing, and self-healing. The database community is justly proud of the advances it has already made in automating system design and operation. The result is that database systems are now ubiquitous: your e-mail system is a simple database, as is your file system, and so too are many of the other familiar applications you use regularly.

. . .

algorithms and data are being unified by integrating familiar, portable programming languages into database systems, such that all those design rules you were taught about separating code from data simply won't apply any longer. Instead, you'll work with extensible object-relational database systems where nonprocedural relational operators can be used to manipulate object sets. Coupled with that, database systems are well on their way to becoming Web services, and this will have huge implications for how we structure applications. Within this new mind-set, DBMSs become object containers, with queues being the first objects to add. It's on these queues that future transaction processing and workflow applications will be built.
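
As a small illustration of nonprocedural relational operators over object sets, here is a user-defined type queried with ordinary SQL (PostgreSQL composite-type syntax; the names are illustrative).

    -- An object type stored in, and queried from, relational tables.
    CREATE TYPE point3d AS (x float8, y float8, z float8);

    CREATE TABLE sensors (
        id       serial  PRIMARY KEY,
        location point3d
    );

    INSERT INTO sensors (location)
    VALUES (ROW(1.0, 2.0, 3.0)::point3d);

    -- A nonprocedural relational operator manipulating an object set:
    SELECT id FROM sensors WHERE (location).z > 2.5;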

Topics: SQL | Workflow


