NSimpleDB - Use Amazon´s SimpleDB data model in your applications now - Part 2
Amazon´s SimpleDB is an exciting new player in the database world. It´s free, it´s online, it´s not relational. SimpleDB is a dynamic database implementing a tuple space. Currently (as of Jan 08) SimpleDB is in beta - but not everyone can get their hands on it. You have to apply and line up for one of the limited test accounts.
Nevertheless it´s worthwhile to take a closer look at SimpleDB. It´s a brave step forward by Amazon to offer an online database (accessible via a web service) that´s deviating from the mainstream data model of RDBMS.
In part 1 of this series of postings I described this data model: You store tuples (aka items) consisting of name-value pairs (aka attributes) in a SimpleDB "data space" without the need for any configuration. No schema design is necessary. No tuple needs to look like another. Only so-called domains serve as a structuring concept to group tuples. But nowhere is it written that you have to use more than one domain. Even different kinds of items don´t force you to distribute them across domains. Domains are thus more of a concern regarding scalability and the quantitative constraints Amazon puts on them.
A simple SimpleDB API
The data model of SimpleDB is simple, and so is its API. It´s not based on a query language (although it provides set selection, see below), but rather follows the tuple space concept in that it defines just a small number of methods to read items from and write items to the "data space".
In the following I´ll use pseudo code to describe the API. I think it will be pretty self-explanatory. In reality Amazon offers a web service to work with SimpleDB, so you´ll use some kind of proxy class in your code. Amazon even published a .NET binding - but it hasn´t gotten rave reviews so far. There is much room for improvement.
Attributes as smallest data units
The smallest piece of data with SimpleDB is an attribute. An attribute is a name value pair like "Name"("Peter") or "Amount due"("000000300.00") or "DOB"("2000-05-12") or "Marked for deletion"("1").
As you can see, values are just strings. It´s like with XML. Attribute names are also strings - and they can contain white space. This makes them easier to read and use as labels in frontends.
In addition - and in stark deviation from the relational data model - attributes can have multiple values, e.g. "Phone numbers"("05195-7234", "040-413 823 090", "0170-233 4439").
Amazon suggests that you don´t try to store large pieces of data in attributes, e.g. a multi-MB image. Rather you should put such byte-blobs into some other store - e.g. a file on an FTP-server or Amazon´s S3 - and use the attribute value as a reference.
Items as containers for attributes
Attributes belong to items. In principle items can contain any number of attributes, but Amazon put some limitations on them. Currently only 256 attributes are allowed in each item.
Items can be written as tuples and are identified by an explicit id you have to provide, e.g. "123"["Name"("Peter"), "City"("Berlin")]. The id is called "item name" and again is a string.
As you can see, attributes are tuples with unnamed elements, but items are tuples whose elements are named.
Domains as containers for items
Items are stored in domains. Like them, domains have an id, the domain name. No schema needs to be defined for them. Just pour items of any structure into them as you like, e.g. "contacts"{"123"["Name"("Peter"), "Addresses"("a", "b")], "a"["City"("London"), "Country"("GB")], "b"["City"("Hamburg"), "Country"("Germany")]}.
As you can see, domains are tuples, too. Their elements are named tuples, the items.
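To make the nesting more tangible, here is a minimal sketch of the same "contacts" example as plain Python dictionaries. This is only my own in-memory picture of the data model, not Amazon´s API; the variable names are made up for illustration.

```python
# Hypothetical in-memory picture of the data model, not Amazon's API:
# domain name -> item name -> attribute name -> set of string values
contacts = {
    "123": {"Name": {"Peter"}, "Addresses": {"a", "b"}},
    "a": {"City": {"London"}, "Country": {"GB"}},
    "b": {"City": {"Hamburg"}, "Country": {"Germany"}},
}
data_space = {"contacts": contacts}

# Follow the address references stored in item "123" to the address items.
cities = sorted(
    city
    for ref in data_space["contacts"]["123"]["Addresses"]
    for city in data_space["contacts"][ref]["City"]
)
print(cities)
```

Note that the references between items are just attribute values; nothing in the data model itself knows that "a" and "b" are item names.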
Writing data
Roughly you can say, domains are like tables, items are like records in a table, attributes are table columns. So storing data with SimpleDB means: write items with their attributes to a domain. That´s like writing records with their column data to a table.
SimpleDB provides a single operation for writing data: PutAttributes(). Identify where you want to put the attributes (into which item in which domain), hand in the attributes - and you´re done.
This command would write a single attribute to the item with name "123" in domain "contacts":
PutAttributes("contacts", "123", ["Name"("Peter")])
But now watch! If you then issue this command
PutAttributes("contacts", "123", ["Addresses"("a")])
you don´t overwrite what´s been stored in the item, but add to it! The same is true for this command:
PutAttributes("contacts", "123", ["Addresses"("b")])
Remember that attributes can have several values. Item "123" now looks like this: "123"["Name"("Peter"), "Addresses"("a", "b")]. So you had better also write the referenced addresses to the domain:
PutAttributes("contacts", "a", ["City"("London"), "Country"("GB")])
PutAttributes("contacts", "b", ["City"("Hamburg"), "Country"("Germany")])
But how then can you overwrite data, e.g. change the name of this contact? If you just issue a PutAttributes() with the new name, the name will be added as a second value to the existing attribute. To overwrite you need to add a replace-flag to an attribute (I´ll denote it with a "!" after the attribute name):
PutAttributes("contacts", "123", ["Name"!("Paul")])
Replacing an attribute like this deletes all (!) existing attribute values and replaces them with the new value.
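The add-versus-replace behaviour can be sketched with a small in-memory stand-in. The function put_attributes() and its replace parameter are my own hypothetical names, chosen to mirror the pseudo code above; this is not the signature of Amazon´s web service.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in: domain -> item -> attribute -> set of values
space = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))

def put_attributes(domain, item, attributes, replace=()):
    """Add values to an item; attribute names listed in `replace` are overwritten."""
    for name, values in attributes.items():
        if name in replace:
            # Replace flag set: drop all existing values first.
            space[domain][item][name] = set(values)
        else:
            # Default: add to whatever is already stored.
            space[domain][item][name].update(values)

put_attributes("contacts", "123", {"Name": ["Peter"]})
put_attributes("contacts", "123", {"Addresses": ["a"]})
put_attributes("contacts", "123", {"Addresses": ["b"]})
put_attributes("contacts", "123", {"Name": ["Paul"]}, replace={"Name"})

print(dict(space["contacts"]["123"]))
```

After these four calls the item has a single name, "Paul", while both address references are still there.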
A word of caution: Amazon´s SimpleDB is supposed to scale. That´s why they distribute it across many servers and need to replicate data all the time. That in turn means it will take some time until changes you made with PutAttributes() and the other operations ripple through to all relevant servers. So don´t expect to see changes right after you applied them! If you issue a PutAttributes() followed right away by a GetAttributes() for the same data - which could even run on a different thread - you might be in for a surprise.
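One way to cope with this in client code is to re-read until the expected value shows up. The following is only a sketch under my own assumptions: read_until() is a hypothetical helper, and the lagging replica is faked with a list of canned answers rather than a real SimpleDB call.

```python
import time

def read_until(read, expected, attempts=5, delay=0.01):
    """Call read() until it returns `expected` or the attempts run out."""
    for _ in range(attempts):
        value = read()
        if value == expected:
            return value
        time.sleep(delay)  # give the replicas some time to catch up
    return None  # the change never became visible within the given attempts

# Hypothetical replica that still serves the old name for the first two reads.
answers = iter([{"Peter"}, {"Peter"}, {"Paul"}])
result = read_until(lambda: next(answers), {"Paul"})
print(result)
```

Whether polling like this is appropriate depends on the application; often it is simpler to design the code so that it doesn´t need to read its own writes immediately.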
Reading data
Reading items back from the SimpleDB "data space" is even easier than writing them. Just send the GetAttributes() command addressing an item in a domain and pass the names of the attributes to retrieve:
GetAttributes("contacts", "123", "Name")
will return ["Name"("Paul")]. Of course you can specify more attributes to be retrieved. And since you only state their name, they´ll be returned with all their values.
Item data can only be retrieved like this! Queries (see below) just return item names, but no attributes. Think of them as SQL statements like this:
select attributeName1, attributeName2, ... from domainName where itemName="..."
Looking up data is thus always a two-step process: 1. Issue a query and receive a list of matching item names, 2. retrieve each item´s attributes using its item name from the query result.
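The two steps might look like this against a plain dictionary. Both query() and get_attributes() are hypothetical stand-ins I wrote for illustration; the real operations go through Amazon´s web service and the query takes a query string rather than an attribute/value pair.

```python
# Hypothetical in-memory domain: item name -> attribute name -> set of values
contacts = {
    "a": {"City": {"London"}, "Country": {"GB"}},
    "b": {"City": {"Hamburg"}, "Country": {"Germany"}},
}

def query(domain, attribute, value):
    """Step 1: return only the names of the matching items."""
    return sorted(name for name, attrs in domain.items()
                  if value in attrs.get(attribute, set()))

def get_attributes(domain, item_name, *names):
    """Step 2: fetch the requested attributes of a single item."""
    item = domain.get(item_name, {})
    return {n: item[n] for n in names if n in item}

item_names = query(contacts, "City", "Hamburg")
rows = [get_attributes(contacts, n, "City", "Country") for n in item_names]
print(item_names, rows)
```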
Deleting data
You can´t delete items explicitly. You can only delete attributes from them - and if none are left in the item, the item is deleted automatically.
DeleteAttributes("contacts", "123", "Addresses")
only deletes the references to the other items, but the contact item remains in the domain. You also need to delete its name attribute, plus, of course, the now orphaned addresses:
DeleteAttributes("contacts", "123", "Name")
DeleteAttributes("contacts", "a", "City", "Country")
DeleteAttributes("contacts", "b", "City", "Country")
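The rule "an item disappears once its last attribute is gone" can be sketched like this. Again, delete_attributes() is a hypothetical in-memory stand-in of my own, not Amazon´s implementation.

```python
# Hypothetical in-memory domain: item name -> attribute name -> set of values
contacts = {"123": {"Name": {"Paul"}, "Addresses": {"a", "b"}}}

def delete_attributes(domain, item_name, *names):
    """Remove the named attributes; drop the item once it has none left."""
    item = domain.get(item_name)
    if item is None:
        return
    for name in names:
        item.pop(name, None)  # ignore attributes that don't exist
    if not item:
        del domain[item_name]

delete_attributes(contacts, "123", "Addresses")
still_there = "123" in contacts   # the "Name" attribute keeps the item alive
delete_attributes(contacts, "123", "Name")
print(still_there, "123" in contacts)
```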
Creating a domain
Working with domains as the containers for items is easy. You can create a domain at any time. Just call
CreateDomain("contacts")
and that´s it. Just pass in a unique domain name. From then on, you can use this domain name in item-operations.
Deleting a domain
Deleting a domain is as easy as creating it:
DeleteDomain("contacts")
The items and attributes in that domain will be gone then. But this might take up to 10 seconds, Amazon says, due to the distributed nature of SimpleDB.
Querying domains
If you want to get an overview of the domains in your SimpleDB "data space", just call ListDomains():
ListDomains(10, &nextToken)
It returns a list of domain names. This resultset is paged, though. The first parameter to ListDomains() specifies the size of these pages, e.g. 10 domain names per page, the second parameter is a token you can use to retrieve the next page.
Passing in a token to ListDomains() returns that page´s domain names and sets the token to the next page, if there is any.
nextToken = ""
domainNames = ListDomains(10, &nextToken)
// process first page of domain names
domainNames = ListDomains(10, &nextToken)
// process second page of domain names
...
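Written as a loop, the paging could look like the following sketch. Python has no by-reference parameters, so my hypothetical list_domains() returns the next token together with the page; the fixed list of domain names and the token format are made up for the example.

```python
# Made-up domain names for the example.
ALL_DOMAINS = ["contacts", "invoices", "orders", "products", "users"]

def list_domains(page_size, next_token=""):
    """Hypothetical stand-in: returns (names, next_token); "" means no more pages."""
    start = int(next_token) if next_token else 0
    end = start + page_size
    token = str(end) if end < len(ALL_DOMAINS) else ""
    return ALL_DOMAINS[start:end], token

def all_domain_names(page_size):
    """Collect the domain names of all pages by following the token."""
    names, token = [], ""
    while True:
        page, token = list_domains(page_size, token)
        names.extend(page)
        if not token:
            return names

collected = all_domain_names(2)
print(collected)
```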
Querying data
Finally, there is also a way to query items. SimpleDB sports a simple query language. You can think of the queries as the where-clause of a SQL select statement, e.g.
select itemName from domainName where simpleDB-query
Queries are limited to a single domain and return just item names as paged resultsets like ListDomains().
The building blocks of queries are predicates. A predicate is a logical expression made up of attribute comparisons, e.g.
['City' = 'Hamburg' OR 'City' = 'London']
Both attribute name and attribute value need to be put in single quotes. SimpleDB sports the usual comparison operators like =, != etc. and a STARTS-WITH which resembles the SQL like, e.g. like 'A%'.
['Name' STARTS-WITH 'A']
Remember, all comparisons are alphanumeric, since SimpleDB only stores texts.
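This is why the "Amount due" value from the attribute examples carries leading zeros. A minimal sketch of such an encoding, assuming non-negative amounts and a fixed width I picked to match that example (encode_amount() is my own helper name):

```python
def encode_amount(value, width=12):
    """Zero-pad a non-negative amount so string order matches numeric order."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative amounts")
    return f"{value:0{width}.2f}"

# Unpadded strings sort character by character, which is numerically wrong.
raw = sorted(["300.00", "25.50", "1000.00"])
# Padded strings of equal length sort in numeric order.
padded = sorted(encode_amount(v) for v in [300.0, 25.5, 1000.0])
print(raw)
print(padded)
```

Negative numbers and dates need their own encodings; dates in the form "2000-05-12" as shown earlier already sort correctly as text.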
The logical operators within predicates are AND, OR and NOT.
You may only query for a single attribute name with one predicate! ['City'='Hamburg' OR 'City'='London'] is ok, but not ['Name'='Peter' AND 'City'='London']!
To state queries on attributes with different names, you need a separate predicate for each:
['Name'='Peter'] INTERSECTION ['City'='London']
The set-operations to combine the resultsets of each predicate into one are INTERSECTION, UNION and NOT. INTERSECTION calculates the common set of item names of two predicates, UNION merges the item name sets of two predicates. INTERSECTION thus works like the logical AND operator, UNION like the OR.
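Seen from the client side, each predicate yields a set of item names, and the set-operations combine those sets. Here is a small sketch with Python sets; matching() is a hypothetical helper that evaluates one single-attribute predicate against an in-memory domain, and says nothing about how SimpleDB executes queries internally.

```python
# Hypothetical in-memory domain: item name -> attribute name -> set of values
contacts = {
    "1": {"Name": {"Peter"}, "City": {"London"}},
    "2": {"Name": {"Peter"}, "City": {"Hamburg"}},
    "3": {"Name": {"Anna"}, "City": {"London"}},
}

def matching(domain, attribute, *values):
    """Item names matching one single-attribute predicate (values are OR-ed)."""
    return {name for name, attrs in domain.items()
            if attrs.get(attribute, set()) & set(values)}

peters = matching(contacts, "Name", "Peter")
londoners = matching(contacts, "City", "London")

both = peters & londoners     # set intersection: Peter AND London
either = peters | londoners   # set union: Peter OR London
print(sorted(both), sorted(either))
```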
Why does Amazon deviate like this from the well established SQL way of defining queries? The reason probably lies with the internal structure of the SimpleDB "data space". Grouping the constraints on attributes with the same name probably makes query execution faster. Maybe SimpleDB is based on a column store?
EBNF SimpleDB query syntax
Query ::= ItemSetTerm { "UNION" ItemSetTerm }.
ItemSetTerm ::= ItemSetFactor { "INTERSECTION" ItemSetFactor }.
ItemSetFactor ::= [ "NOT" ] "[" PredicateExpression "]".
PredicateExpression ::= PredicateTerm { "OR" PredicateTerm }.
PredicateTerm ::= PredicateFactor { "AND" PredicateFactor }.
PredicateFactor ::= [ "NOT" ] PredicateComparison.
PredicateComparison ::= AttributeName ComparisonOperator AttributeValue.
AttributeName ::= Chars enclosed in single quotes, e.g. 'Name'.
All AttributeNames in a PredicateExpression need to be the same.
All quotes in AttributeName need to be properly escaped.
AttributeValue ::= Chars enclosed in single quotes, e.g. '003.14'.
All quotes in AttributeValue need to be properly escaped.
ComparisonOperator ::= "=" | "!=" | ">" | ">=" | "<" | "<=" | "STARTS-WITH".
What´s missing?
SimpleDB´s API is simple. That´s the beauty of it. A simple, dynamic data model plus a simple API sounds like a powerful combination for today´s fast moving software business.
But this simplicity comes at a price. Common operations, like looking up data, are more cumbersome than with SQL. It´s a two-step process due to SimpleDB´s queries returning just item names. Also, transactions are currently missing completely.
Another aspect to get used to is the "eventual consistency" model: changes take time to ripple through to all replicas of your data. Thus after a change there might be a short time where different clients see the "data space" in different states.
But overall, Amazon´s effort is very exciting nevertheless.
What´s next?
I find SimpleDB so exciting that I wanted to be able to use it now and on my desktop. But there is no desktop/local version of SimpleDB and I don´t know when Amazon will grant me a test account for SimpleDB.
That´s why I sat down and developed my own Open Source version of SimpleDB: the .NET SimpleDB or NSimpleDB for short. I believe in the growing importance of tuple spaces in general and thus am also working with the University Vienna on bringing this paradigm into the hands of .NET developers. We call the basic technology "XVSM" for "eXtensible Virtual Shared Memory", and it´s somewhat like SimpleDB. But on top we place more elaborate data structures, so our space is not just partitioned into domains but into collections and other high-level data structures. We envision them to allow for true "Space Based Collaboration" (SBC), which is in our view the foundation for "serverless real-time online collaboration". But I digress.
Back to SimpleDB: In my next posting I´ll show you how you can use SimpleDB, or rather the C# implementation of the SimpleDB API, in your applications today without relying on Amazon.