Musings on relations - or: WinFS is not enough

Tuesday, December 20, 2005

Pile

Have you had a look at WinFS? No? You should. It´s cool. Or maybe I should say: It could be even cooler, if it didn´t stop too early.

The basic idea behind WinFS is: set your data free! Unlock the gems hidden in large databases! Microsoft´s fundamental insight here is that the future is about relationships between data, and that you can only set up relationships between separately addressable and accessible units of data.

Within a single relational database you can set up relations between rows in tables. The relationship between smaller informational units (columns, fields) is implicit in arranging them within a single table. But what about relationships between data in different databases? Although that´s possible, it´s not really what you want to do in your day-to-day business. Databases thus draw a boundary around data. From a security point of view this is of course beneficial; but from a data reuse point of view, it is counterproductive.

Wouldn´t it be nice, though, not to be concerned about the database, the bucket an address is stored in? Wouldn´t it be nice to be able to relate an existing address to a new piece of information you want to store, which does not necessarily come to rest in the same database as the address? Think about linking your latest digital pictures to contacts you already have in Outlook. Think about adding tasks in Outlook to an order workflow process in your ERP program.

All those scenarios are either not possible today or very hard to achieve. That´s why WinFS is needed. With WinFS there is no more Outlook .pst file hiding all those precious contacts, tasks, emails and appointments behind a wall. Instead these items are set free to float around as separately addressable and accessible data units in the file system space. And you can do the same with your own data.

Once there are comparatively small data items floating around rather than being penned up in an ever increasing number of (incompatible) databases, you can start to relate those data items to each other. That´s so cool! It lets you reuse information in different contexts/programs instead of reentering (or importing/exporting) it over and over again.

However, although WinFS breaks up the database barriers around the information nuggets, it remains in the world of relational databases. No, not just because SQL Server 2005 is the foundation on which WinFS is built. Rather because relationships are still an afterthought and a second class citizen in data modelling.

From data to associations

Filesystems, XML, RDBMS, ODBMS and also WinFS are all data centric. Data is the main concern. Storing data is the most important task of an RDBMS. Databases are about recording data, making it persistent. Well, that sounds reasonable, doesn´t it?

The following picture depicts the current thinking: Data is arranged in fields stuffed into a row. Rows can point at each other. The data items (fields) are related implicitly and explicitly on two levels: implicitly by putting them next to each other in a row and storing all rows of the same kind in a table, explicitly by foreign keys.
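The arrangement described above can be sketched in a few lines. This is only an illustration using Python´s built-in sqlite3 module; the table names, column names and sample values (contact, appointment, contact_id, "Ann") are made up for the example and do not come from WinFS or Outlook.

```python
import sqlite3

# In-memory database; all names and values below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Implicit relation: name and zip belong together because they share a row.
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
    -- Explicit relation: a foreign key pointing at a row in another table.
    CREATE TABLE appointment (
        id INTEGER PRIMARY KEY,
        day TEXT,
        contact_id INTEGER REFERENCES contact(id)
    );
    INSERT INTO contact VALUES (1, 'Ann', '20095');
    INSERT INTO appointment VALUES (10, '2005-12-20', 1);
""")

# Following the explicit relation requires a join over the key columns.
rows = conn.execute(
    "SELECT a.day, c.name FROM appointment a "
    "JOIN contact c ON c.id = a.contact_id"
).fetchall()
print(rows)
```

Note that the only things that can be linked here are whole rows; a single field such as the zip code has no identity of its own to point at.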

The relational calculus is good at describing sets. But it´s bad at describing relations between data in different sets. Explicit identities (primary keys) need to be introduced and normalization is needed to avoid update inconsistencies due to duplication of data.

To say it somewhat bluntly: The problem with the relational calculus and RDBMS etc. is the focus on data. It seems to be so important to store the data that connecting the data moves into the background.

That might be close to how we store filled in paper forms. But it´s so unlike how the mind works.

There is no data stored in your brain. If you look at the fridge in your kitchen, there is no tiny fridge created in your brain so you can take the memory of your fridge with you, when you leave your kitchen.

Instead the fridge is left where it is, right there in your kitchen. However, what is stored in your brain are associations of all kinds. In fact, your brain can only store "immaterial" associations. (Let´s neglect for the moment, that those immaterial associations need to manifest themselves somehow, e.g. electrical signals, chemical substances, or cell growth.)

The fridge causes the brain to set up internally an unknown number of associations. Thus, the brain works just with relations/associations and not with data or "the real things". The brain has its own representations for the data. There is no data in the brain; rather the data itself stays outside the brain.

So "the real thing", the fridge, is not in the brain, but instead some kind of, hm, "token" or handle. Or maybe there is not even a "token" for a whole fridge in the brain, but a large number of handles for parts of a fridge? Or what seems to be even more likely: the brain knows nothing about fridges and fridge parts, but just about very, very simple visual structures like points, edges, colors. So the mental representation of a fridge is a set of relations between such basic structures/concepts. Then the brain does not need "tokens" for real world entities, but just for basic structures/concepts to relate them to each other.

Ok, why am I telling you all this? What does this have to do with WinFS? Well, it´s about a completely different way to deal with data (or things). Mapping what the brain does to the software world means removing the data from the "system", leaving only associations:

Within the "system" there are just associations and associations between associations. The data is outside the "system". Compared to our traditional thinking this kind of "system" is homogeneous. There are only associations. That´s it. There is no distinction between associations and data or between different kinds of associations (implicit vs explicit). Associations or relations are first class citizens in this kind of "system".

And since there are no different kinds of data and no more "data buckets" like tables or columns, any association can be associated with any other association.
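A minimal sketch of such a homogeneous "system", assuming that an association is simply an ordered pair of handles and that every association gets a handle of its own. The class and method names (AssociationBase, new_root, associate, parts) are hypothetical and only serve the illustration.

```python
class AssociationBase:
    """Hypothetical association-only store: everything inside is a handle."""

    def __init__(self):
        self._next_handle = 0
        self._handle_of = {}   # (left, right) -> handle of that association
        self._parts_of = {}    # handle -> (left, right); roots have no entry

    def _new_handle(self):
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def new_root(self):
        """Create a handle that stands for something outside the system."""
        return self._new_handle()

    def associate(self, left, right):
        """Relate two handles; the association itself gets a handle."""
        key = (left, right)
        if key not in self._handle_of:
            handle = self._new_handle()
            self._handle_of[key] = handle
            self._parts_of[handle] = key
        return self._handle_of[key]

    def parts(self, handle):
        """Return (left, right) for an association, or None for a root."""
        return self._parts_of.get(handle)


ab = AssociationBase()
point, edge = ab.new_root(), ab.new_root()
corner = ab.associate(point, edge)      # association between two roots
shape = ab.associate(corner, corner)    # association between associations
mixed = ab.associate(shape, point)      # any handle can be related to any other
```

There is no schema deciding which kinds of handles may be combined; `associate` accepts any two handles, which is the point made above.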

When you define an RDBMS schema you explicitly set up which kind of data (rows) can be connected to which other kind of data. You try to foresee what could possibly make sense in terms of associating data. Well, that´s what the Outlook team did in the past. They said: Well, we think users want to associate a contact with an appointment or an email with a task. So we stuff everything in a nice little database.

But then, users thought differently. All of a sudden, they wanted to associate an Outlook contact with an invoice - without success, because the Outlook developers had thought they could foresee the future usage of certain data.

This dawned on Microsoft and they now come up with WinFS. Great! Or not?

No, not so great, although still technologically cool. Because WinFS still requires you to think in pretty large bins of data (e.g. a contact, an appointment). Although you can set up relations between those smaller bins, WinFS still is about data first - and only then come associations between data. It´s a heterogeneous system.

Your brain, on the other hand, is homogeneous: the brain knows only about associations. Because that´s the only way to deal with an unpredictable world where you cannot foresee how "things" might look and behave and how you might want to associate fine grained basic concepts like points or coarse grained concepts like fridges with each other. The brain knows about causality/time, points, edges, space, that´s probably pretty much it. Those concepts/structures are its roots. All else is just associations between those roots and other associations. Billions, trillions of them. And it works :-)

So why stop where WinFS stops? Why not take WinFS to the max? Why not radically change our view of the database world? How about association bases or connection bases instead of data bases?

A world of associations

The gain of a new view on how to deal with data would be an explosion of possible associations. When you look at your fridge, you immediately can see it in different contexts: there is the context of "kitchen" where the fridge is one of many appliances, then there is the fridge as a manufactured product pointing to a history of industrial production, then there is the context of "food" which the fridge keeps, then there is the context of "information" because you put Post-it notes on the fridge´s door, and so on...

The fridge is at the origin of a multi-dimensional space of contexts. Many different contexts intersect in a fridge. That´s so natural to all of us... so why not treat data the same?

Switching to a new view on dealing with data is thus a switch from one context to multiple contexts. In an associative system a data unit (external to the system) can exist in any number of contexts, just depending on the associations between it and other data units or other associations.

So if associations are the real value of data, because they put them "in perspective" aka into different contexts, then how to get more out of an associative system? Well, by forming as many associations as possible (or as makes sense for a certain observer).

Since the number of possible associations is determined by the number of data units, it´s best to maximize their number first. And that´s exactly where WinFS falls short.

Although WinFS promotes disassembling databases into their rows (objects, e.g. contacts, tasks), the resulting data units not only stay within the system, but are also still fairly coarse grained. A whole contact can be associated with a whole appointment.

But why stop there? Why not disassemble the data further in order to be able to generate even more associations? Who´s able to foresee that associating a whole task with a whole invoice is all that users ever need?

Maybe I want to navigate (by traversing the maze of associations) from a single date in an appointment to contacts with this date as a birthdate? Why not reuse names from contacts in the context of appointments? And I mean just names.
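Such a navigation can be sketched as a reverse lookup: for any handle, ask which associations it takes part in. The following is an assumption about how this could look, not a description of WinFS or of any existing product; AssociationBase, new_handle, associate and contexts are hypothetical names.

```python
from collections import defaultdict


class AssociationBase:
    """Hypothetical store that remembers where each handle is used."""

    def __init__(self):
        self._next_handle = 0
        self._handle_of = {}               # (left, right) -> association handle
        self._used_in = defaultdict(set)   # handle -> associations containing it

    def new_handle(self):
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def associate(self, left, right):
        key = (left, right)
        if key not in self._handle_of:
            handle = self.new_handle()
            self._handle_of[key] = handle
            self._used_in[left].add(handle)
            self._used_in[right].add(handle)
        return self._handle_of[key]

    def contexts(self, handle):
        """All associations in which the given handle appears."""
        return set(self._used_in.get(handle, ()))


ab = AssociationBase()
date = ab.new_handle()          # stands for one external date value
contact = ab.new_handle()       # stands for a contact
appointment = ab.new_handle()   # stands for an appointment

birthday = ab.associate(contact, date)      # the date as a birthdate
meeting = ab.associate(appointment, date)   # the same date in an appointment

print(ab.contexts(date))  # both contexts are reachable from the single date
```

Because the date exists only once, starting from it leads to the contact and to the appointment alike, without either "table" having been designed for that.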

What this would mean is blowing up those WinFS data units (objects) into very small pieces, data atoms. Each atom being some data unit which cannot be split into smaller pieces.

Single letters come to mind as candidates for a data atom. (The bit values 1 and 0 would be the true data atoms, but even though it would be possible to build a "system" on them, since letters are just associations between 1s and 0s, I find this low level a bit unwieldy.) Pictures might be larger data atoms because their individual bytes might indeed make no sense in other associations - but who knows.

In the end, an associative base system should be data atom agnostic. It might know that data atoms are streams of bytes and might offer to store them as is. But then... why should it know about data atoms? They are of no use within (!) the system. So an associative system should provide just one operation concerning data atoms: create a handle for a data atom, if you ask it to.

The associative system then looks like this:

Whatever is outside the system, the system does not care about. However, in order to set up associations with the outside data atoms, the system has to have some kind of internal representation. That´s why the system needs to be able to generate - ex nihilo so to speak - handles for external data atoms (or terminal values). What those handles mean, which terminal values they stand for, whether it´s a single letter or a multi-megabyte picture, the associative system does not know.
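The single operation concerning data atoms can be sketched as follows. The names AssociativeSystem and create_handle are hypothetical, and plain integers are assumed as handles; the important part is that the mapping from handle to terminal value is kept by the caller, outside the system.

```python
import itertools


class AssociativeSystem:
    """Hypothetical system that hands out opaque handles and nothing more."""

    def __init__(self):
        self._handles = itertools.count()

    def create_handle(self):
        # The system never sees the value the handle will stand for.
        return next(self._handles)


system = AssociativeSystem()

# Kept OUTSIDE the system: which terminal value each handle stands for.
terminal_values = {}
for value in ("a", "b", b"\x89PNG fake picture bytes"):
    terminal_values[system.create_handle()] = value

print(terminal_values)
```

A single letter and a (here faked) picture are treated exactly alike: each is just a handle as far as the system is concerned.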

Conclusion

Now, think about the implications for a while...

Such an associative base, an AB instead of a DB if you want, would not store data, but rather would generate data from data atoms as needed.

Take a text like the Bible: If you defined the 256 possible byte values (8-bit extended ASCII characters) as the atoms, then there would be no Bible text data, but just some 800,000 associations between those 256 terminal values and other associations. (I know this figure, because I´ve implemented such a system in C# and loaded the 4.5 MB King James Bible into the AB.)

Still, though, I can losslessly generate the complete Bible text upon request from those associations. It´s just a matter of recursive descent in a binary tree. But what´s more important is that no combination of letters would need to be stored twice in such an AB. Each association could be unique. No more duplication of data.
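The C# implementation is not shown here, so the following Python sketch is only one possible reading of the idea. It assumes a very simple scheme: neighbouring handles are paired level by level until a single root remains, and each distinct pair is stored only once. TextBase and its methods are hypothetical names, and for brevity the terminal values are kept in the same object although they conceptually live outside the system.

```python
class TextBase:
    """Hypothetical AB for text: unique pair associations over character atoms."""

    def __init__(self):
        self._next_handle = 0
        self.atom_handle = {}   # terminal value -> handle
        self.atom_value = {}    # handle -> terminal value (conceptually external)
        self.pair_handle = {}   # (left, right) -> handle, each pair stored once
        self.pair_parts = {}    # handle -> (left, right)

    def _new_handle(self):
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def atom(self, ch):
        if ch not in self.atom_handle:
            handle = self._new_handle()
            self.atom_handle[ch] = handle
            self.atom_value[handle] = ch
        return self.atom_handle[ch]

    def associate(self, left, right):
        key = (left, right)
        if key not in self.pair_handle:
            handle = self._new_handle()
            self.pair_handle[key] = handle
            self.pair_parts[handle] = key
        return self.pair_handle[key]

    def store(self, text):
        """Pair neighbours level by level; return the root handle of the text."""
        if not text:
            raise ValueError("cannot store an empty text")
        level = [self.atom(ch) for ch in text]
        while len(level) > 1:
            paired = [self.associate(level[i], level[i + 1])
                      for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:          # odd element is carried up unchanged
                paired.append(level[-1])
            level = paired
        return level[0]

    def generate(self, handle):
        """Recursive descent: rebuild the text from the associations."""
        if handle in self.atom_value:
            return self.atom_value[handle]
        left, right = self.pair_parts[handle]
        return self.generate(left) + self.generate(right)


tb = TextBase()
root = tb.store("to be or not to be")
print(tb.generate(root))
```

With this naive pairing, repeated letter combinations are shared only when they happen to fall on the same pairing boundaries; a real engine would need a smarter strategy to make every combination unique.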

This, though, not only leads to maybe saving some disk space, but it means, when looking for the pattern "Enoch" I immediately get all contexts in which Enoch appears in the Old Testament. Starting to look for patterns from the handles for their terminal values immediately leads to all associations which connect to those patterns.

But this is only a simple example and you might say, hey, this is what full text database searches are for. And you´re right! However, a full text database stores the data twice: once as the data, and once as all the major words in the index. Also a full text database usually limits you to searching for words. If you want to look for arbitrary patterns, e.g. "o b" in the text of Hamlet, then you´re lost. A full text search engine would not return "to be". For an AB engine, though, this would make no difference. And that´s important, for example, in searching for gene sequences in the field of bioinformatics.
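The limitation of a word index is easy to demonstrate. The sketch below builds a toy word index with Python´s re module and compares it with a plain character-level scan; the scan merely stands in for "searching for arbitrary patterns" and is not how an AB engine would do it.

```python
import re
from collections import defaultdict

text = "to be or not to be"

# Toy word index, as a full text engine might keep: word -> start positions.
word_index = defaultdict(list)
for match in re.finditer(r"\w+", text):
    word_index[match.group()].append(match.start())


def find_all(pattern, text):
    """Start positions of every (possibly overlapping) occurrence of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


print(word_index.get("o b"))   # the word index has no entry for a non-word
print(find_all("o b", text))   # the character-level search finds both "to be"
```

The pattern "o b" spans a word boundary, so it never becomes a key in the word index, while a search working on the level of single characters finds it at positions 1 and 14.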

I can understand, though, if you find it difficult to switch your thinking from data centric to associations only. It took me 2-3 weeks and I´m still working on it. But the potential of this switch seems to be huge! Each day I learn something new. It almost feels as if I´m in love :-) I´m almost blocked from doing other work, because my mind reels with the possibilities and implications. That´s the reason, why I needed to write this blog entry. I needed to get this out of my head to move on.

Just yesterday I talked to a developer of an ODBMS about all this. Fortunately I was able to depict all this to him on the phone - and he immediately grasped the idea. He even corrected me when I thought about maybe defining whole data fields (e.g. a name, a birthdate, a zip-code) as data atoms to gain performance from having a "regular" database engine index them. He said, no, that´s not necessary, because all those values (consisting of characters) can be indexed using associations within (!) the AB. And he´s right! I felt so relieved: Such an index would be just another context in which terminal values appear.
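One way to picture such an index is a tree in which every step from a prefix to the next character is itself an association. This is only a sketch under that assumption; AssociationIndex, add and lookup are hypothetical names, and record handles are plain integers.

```python
from collections import defaultdict


class AssociationIndex:
    """Hypothetical index built purely from (prefix, character) associations."""

    def __init__(self):
        self.root = 0                      # handle for the empty prefix
        self._next_handle = 1
        self._step = {}                    # (prefix handle, char) -> handle
        self._records = defaultdict(set)   # value handle -> record handles

    def _walk(self, value, create):
        node = self.root
        for ch in value:
            key = (node, ch)
            if key not in self._step:
                if not create:
                    return None
                self._step[key] = self._next_handle
                self._next_handle += 1
            node = self._step[key]
        return node

    def add(self, value, record):
        """Associate a whole value (e.g. a name) with a record handle."""
        self._records[self._walk(value, create=True)].add(record)

    def lookup(self, value):
        """Record handles associated with exactly this value."""
        node = self._walk(value, create=False)
        if node is None:
            return set()
        return set(self._records.get(node, ()))


index = AssociationIndex()
index.add("Smith", 1)
index.add("Smith", 2)
index.add("Smyth", 3)
print(index.lookup("Smith"))
```

"Smith" and "Smyth" share the associations for the prefix "Sm", so the index itself contains no duplicated character sequences either.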

The beauty of an association only system is very striking, I think. So while WinFS is a cool idea compared to today´s situation, WinFS is but a small step towards really setting data free to be associated in a million ways like in our brains.

I think that the performance of everyday queries in an associative database would be much lower than in an RDBMS.

lexp - Tuesday, December 20, 2005 9:51:00 PM

Aren't you describing Topic Maps?

Jason Foster - Tuesday, December 20, 2005 10:57:00 PM

@lexp: The question of query performance sure is an important one. But as first results show: searching for patterns is very fast if not faster compared to usual fulltext databases (especially if you´re looking for non-word patterns).

Plus, as I explained, when you think about structured data you can keep an index (as tree of associations) within (!) your AB next to the associations connecting your data atom handles. This would be the same as an ordinary index in an RDBMS, I´d say, so there should not be much performance loss.

But even if there is some performance loss, I´d argue that in many cases it is compensated by much higher flexibility of the whole AB.

Ralf - Tuesday, December 20, 2005 11:14:00 PM

Dude, better read up on your relational foundations first. Indexes are not a relational concept. They are implementation level concepts. You can have relational databases without any conventional indexes at all and still perform (the TransRelational technology, the Nucleus database implementation approach, etc.).

The design of a relational database is just a means of specifying the constraints you want on your data. If you want an "everything can be associated with everything else" design, and throw most if not all integrity out the window, then it's your decision, but being relational doesn't stop you from doing so.

Open up your eyes and go beyond what MS, Oracle and the rest of the mainstream market peddles and claims to be relational.

omen - Wednesday, December 21, 2005 7:17:00 AM

Associative Model of Data.

And I think it's splitting hairs to construct a framework under which TopicMaps doesn't fit. If it looks like a duck, and walks like a duck....

Guy Murphy - Wednesday, December 21, 2005 10:33:00 AM

Ralf,

any chance you could expand on how you implemented your prototype AB? (Maybe even make the source available for download.) I've followed your latest posts about this subject with great interest, and would like to see how you solved some issues in code.

You can also email me on my gmail address.(firstname.lastname)

Geert Baeyaert

Geert Baeyaert - Wednesday, December 21, 2005 1:12:00 PM

I'd love to hear what Fabian Pascal etc. has to say on this pure unadulterated BS. It wouldn't be pretty. It's a rather poor show to make comments regarding relational calculus when you clearly haven't got the slightest clue about it.

Baxter Basics - Wednesday, December 21, 2005 3:38:00 PM

@Baxter: Well, I´d love to hear what Fabian Pascal has to say.

And I like to see emotions flying high as a result of my posting. Where "BS" is thrown at me and "open your eyes" there must be some nerve touched. Because if not, why bother and post such comments at all?

I just can say: I don´t have all the answers. I just know that RDBMS - as much as I like them; I use them every day - have their limits. Decades-old alternatives like ODBMS or OLAP products are proof of that.

So I guess, trying to go further is not a bad thing. We´ll see where thinking beyond RDBMS leads. Whatever I write here is just an invitation to join me on an intellectually interesting path.

Ralf - Wednesday, December 21, 2005 3:47:00 PM

re: "However, I disagree with you opposing "everything can be associated with everything else" and integrity. Integrity is dependent on a certain schema. So if my "schema" defines a network of associations, then this does not violate integrity. Enforcing integrity just will look different from RDBMS."

All that relational database design is, is taking advantage of relational facilities to enforce (declare) the integrity you want.

There are already shorthands for certain common integrity concerns (domain, entity, referential, etc.). Putting together certain attributes into a single relation is also an integrity decision. The final "general purpose" integrity that an RDBMS should support is database-wide constraints. Your approach seems to take an "anything-connected-to-anything" design with all integrity concerns left to database-level constraints.

I would predict that after doing this design process quite a number of times you would recognize certain repeating integrity concerns and before you know it, you would be repeating all the relational facilities but "hand-made" and error-prone.

You can design an "associative" schema using a fully relational database with full domain support, if you want. Why you would want that, and only that, every time is beyond me, however. What you want is already a subset of being relational.

omen - Wednesday, December 21, 2005 4:43:00 PM

@omen: Putting attributes into relations helps enforce integrity - but limits how you can connect information. I cannot link just a name/zip pair to something else, if it´s just part of a relation and not a relation itself (in an RDBMS).

You´re right that an AB must also enforce integrity constraints. But it gives you more freedom. Informational units do not get penned up in cages (tables/rows). Which does not mean such concepts (e.g. a set (table) of like structured sets (rows)) don´t make sense. They are useful - but they are limited.

With a relational database you can traverse all contacts in a table. With an AB you can do the same (if you decided to model tables and rows) - but you can also traverse a name across different tables, e.g. contact, appointment, invoice.

So I´d argue: an associative system is more general than a traditional relational database.

Ralf - Wednesday, December 21, 2005 4:57:00 PM

LOL - talk about trying to model the WORLD in a diagram!

Don X - Friday, December 23, 2005 4:17:00 AM

re: "if vendors fail to position their solutions in a way so anyone can see how general they are, how fundamental their approach is... well, i can´t help it"

Well, that's the problem with relying mainly (or solely) on vendors and products...and on "marketing", for your background.

"I hope very much that computing science at large will become more mature, as I am annoyed by two phenomena that both strike me as symptoms of immaturity.

The one is the widespread sensitivity to fads and fashions, and the wholesale adoption of buzzwords and even buzz notes. Write a paper promising salvation, make it a "structured" something or a "virtual" something, or "abstract", "distributed" or "higher-order" or "applicative" and you can almost be certain of having started a new cult.

The other one is the sensitivity to the market place, the unchallenged assumption that industrial products, just because they are there, become by their mere existence a topic worthy of scientific attention, no matter how grave the mistakes they embody."

-Edsger W. Dijkstra

omen - Friday, December 23, 2005 3:51:00 PM

@omen: I agree with you (or Dijkstra) that "computing science" (or software development) needs to become more mature. Since 1979, when Dijkstra made his statement, much has happened - but much has still to happen. And if you follow my blog you´ll see some areas for which I see great need for more maturity.

But it´s one thing to diagnose immaturity, and another thing to say, whatever carries the label "abstract" or "distributed" (or you might add "service oriented") is just hype or cult - and therefore another immature fad.

That would be to throw out the baby with the bathwater. Because it´s a sign of arrogance or senility not to take something seriously just because of some buzzword.

The scientific method is or at least should be buzzword agnostic. Whoever claims something must be judged impartially on the grounds of his statements and in comparison to an established body of knowledge.

Ralf - Friday, December 23, 2005 4:55:00 PM

@omen: Thx for your elaborate reply. I guess we´ve exchanged positions. Unfortunately though, you have not substantiated your criticism by picking any number of concrete claims of Pile and contrasting them with any concrete existing statements.

To give you an example of what I mean: You could have said "The notion of a relation being itself a subject of relations is old hat. Read XYZ and you´ll find it has been proposed before." or "The notion of not storing data items (e.g. names, zip codes) anymore but just relations between much, much smaller informational units has already been proposed by XYZ." or "Associations between relations in the relational calculus are no second class citizens even though keys have to be introduced into the calculus, because of XYZ and you can read this here (link to ABC)."

Please respect that I´m writing about Pile as I do. It reflects my best understanding of Pile and the relational calculus. If you find my knowledge or Pile´s claims lacking in any way, I´d be happy to hear from you the exact (!) locations in my text (or theirs) where you´re dissatisfied - including hints, as exact as possible, to more substantial claims.

I´ve to admit I find it tiresome to hear "read up on XYZ". Even though I admit I don´t know everything about the relational calculus or set theory or algorithmic complexity and what not and am very (!) willing to learn - I´m reluctant to follow such sweeping criticism.

To point out errors or deficits in my writing is perfectly fine with me. But then, please, be as accurate as you ask me to be.

That said, I´m looking forward to specific criticism from you or anyone else.

And by the way: I prefer to know my critics by name. So please divulge your identity - otherwise I have to assume you have a commercial interest in derating Pile. So, who is "omen"?

Ralf - Sunday, December 25, 2005 5:40:00 PM
