Lessons learned after 2 years of working with Cosmos DB

I have now been working extensively with Cosmos DB for about 2 years, and here I would like to share my current viewpoints about this database. First, a brief introduction to Cosmos DB.

Cosmos DB is Microsoft's NoSQL database, offered as PaaS (Platform as a Service). The database used to be called DocumentDB, but after a while Microsoft re-branded it to Cosmos DB. In some places, such as documentation, exception details or source code, you can still find footprints of DocumentDB.

Cosmos DB is a relatively expensive database. Its pricing model is based on the throughput (Request Units per second) you provision for the database or collections. This post is not about teaching Cosmos DB; fortunately, there is a lot of good material and documentation available on the internet that can be found in a few seconds by googling. This post is more about sharing experiences, and you may need at least a bit of hands-on experience to follow the concepts.

It used to be possible to provision throughput only at the collection (also called container) level. So even if there were only a few documents inside a collection, or nothing at all, we needed to pay at least about $24 a month just for that collection.

If the load on the database is more than what is provisioned, Cosmos DB starts rejecting requests with a 429 HTTP status code and returns a parameter containing an estimate of when to retry (in some cases you may receive part of the query result back and need to retry to get the rest!). So if you try to query or write data in a collection and receive a 429 status code, you need to retry until the request succeeds. If the throughput you have provisioned is far less than it should be, the retries may never succeed. If you use the official Microsoft SDK for Cosmos DB, these retries happen several times behind the scenes, which is why I recommend always using the official SDKs. As far as I know, this behavior is unique to Cosmos DB, and people coming from other ecosystems with experience of other (even NoSQL) databases get quite annoyed by it (I have witnessed this many times). I have seen many developers change their mind about using Cosmos DB for this reason alone!
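
As a side note, the official .NET SDK (Microsoft.Azure.Cosmos, v3) lets you tune how patiently it retries on 429 before giving up. A minimal sketch, with a placeholder endpoint/key and arbitrary retry values; the later snippets in this post reuse this client object:

using System;
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(
    "https://my-account.documents.azure.com:443/",   // placeholder endpoint
    "<primary-key>",                                 // placeholder key
    new CosmosClientOptions
    {
        // How many 429 responses the SDK silently retries before surfacing the error.
        MaxRetryAttemptsOnRateLimitedRequests = 9,
        // Upper bound on the total time spent waiting between those retries.
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
    });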

Collections are logical concepts and there is no limit to how big they can grow. The pricing model back then pushed developers to use the same collection to keep different types of documents. There was nothing wrong with that approach, but for several reasons I was not a big fan of it.

After a while, Microsoft offered a new feature that made it possible to provision throughput at the database level, so the throughput is shared between all collections in the database. That was a huge improvement in pricing, I believe!
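
For illustration, a minimal sketch of provisioning shared throughput at the database level with the .NET SDK, reusing the client from the snippet above (the database name and the 400 RU/s figure are placeholders):

// Containers created in this database without their own dedicated throughput
// draw their RU/s from this shared pool.
Database database = await client.CreateDatabaseIfNotExistsAsync(
    id: "telemetry-db",
    throughput: 400);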

Cosmos DB shines in two respects. First, there is very good integration and out-of-the-box support between Cosmos DB and all the other data-related services in Azure, such as Azure Search, Azure Functions, big data services and so on. For instance, in just a few minutes we can connect Cosmos DB and Azure Search together and implement our own super-fast, customized search engine.

Second is the fact that Cosmos DB scales out (horizontally) virtually without limit, seamlessly, behind the scenes if required, and the hard work is all hidden from developers and users, which is very nice. If you enable global data replication, changes in one database are replicated across Microsoft data centers around the world within about a second (as promised by Microsoft).

Therefore, no matter how big the load of data you want to persist is, Cosmos DB can do the job under two conditions:

  1. Enough throughput is provisioned for the database/collection.
  2. The right partitioning strategy is in place.

How much throughput is enough?

Like almost any other resource in Azure, Cosmos DB has a Metrics section. If the load on the database is predictable or steady, you can simply look at the throughput consumption in Metrics and decide on the right amount of throughput.

But imagine a scenario where there is no load on the database at all for most of the day, and only a couple of times a day you receive a big load of data that needs to be stored. Unfortunately, as of writing this post, there is no good solution for this requirement. What we need here is to increase the throughput as much as needed and lower it again when it is no longer needed. Dynamic throughput provisioning is not yet offered by Cosmos DB. There are a couple of workarounds, but they are not efficient for all use cases. For instance, if you know exactly when the data arrives, you can call an API to bump up the provisioned throughput and lower it again afterwards, as sketched below.
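
A rough sketch of that workaround with the .NET SDK; the container name and the RU/s values are placeholders:

Container container = client.GetContainer("telemetry-db", "telemetry");

// Just before the expected bulk load arrives: scale up.
await container.ReplaceThroughputAsync(10000);

// ... ingest the big batch of data ...

// Once the load has been processed: scale back down to keep the bill small.
await container.ReplaceThroughputAsync(400);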

What is it about partitioning?

If you are doing something serious with Cosmos DB, partitioning is the most important topic you need to focus on. Earlier I mentioned that Cosmos DB scales out automatically and seamlessly, but that will not be the case if you use the wrong partitioning strategy.

If there is going to be only one takeaway for you from this article, it should be the following section, so pay attention to it!

When creating a collection in Cosmos DB, you need to assign a PartitionKey. The PartitionKey must be a property of the documents we store in the database. Needless to say, documents in Cosmos DB are just plain JSON objects.

After you create a collection (again, also referred to as a container), it is NOT possible to change the PartitionKey. Imagine you receive IoT telemetry data from 4 different countries: Germany, France, the United Kingdom and Italy. You may be tempted to use the country name as the PartitionKey, as follows:

{
        "id": "123456",
        "country": "Germany",
        "city": "Hamburg",
        "deviceType": "smart-lock",
        "timestamp": 1569147916
}
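
For reference, this is roughly how that choice gets baked in at creation time with the .NET SDK (reusing the database object from earlier; names are placeholders). The partition key is declared as a JSON path into the document:

// Once the container exists with /country as its partition key path,
// that path can never be changed.
Container byCountry = await database.CreateContainerIfNotExistsAsync(
    id: "telemetry",
    partitionKeyPath: "/country");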

Microsoft guarantees that all the documents with the same PartitionKey are stored in the same logical partition and, ultimately, on a single physical machine. That means that if we later query data within the same partition (e.g. telemetry data only from Germany), the query runs over a dataset on a single physical machine and therefore does not have to travel over the network and collect data from different storage nodes! That gives us maximum speed.
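
A minimal sketch of such a single-partition query with the .NET SDK: pinning the PartitionKey in the request options keeps the query inside Germany's logical partition (the query text is just an example, and the byCountry container comes from the earlier sketch):

var query = new QueryDefinition("SELECT * FROM c WHERE c.deviceType = @type")
    .WithParameter("@type", "smart-lock");

FeedIterator<dynamic> iterator = byCountry.GetItemQueryIterator<dynamic>(
    query,
    requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("Germany") });

while (iterator.HasMoreResults)
{
    foreach (var item in await iterator.ReadNextAsync())
    {
        Console.WriteLine(item);   // each item is a telemetry document from Germany
    }
}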

So far so good, but that is not the whole story! Storing all the documents with the same PartitionKey on a single machine brings a size limit with it! As of writing this post, the maximum size of a logical partition (and consequently a physical partition) cannot exceed 20 GB. If it is reached, writes to that partition start failing and there is no remedy for that! Therefore, the first very important factor to take into account when choosing a PartitionKey is the size of the data over time.

The second factor when choosing a PartitionKey is how data is written into the database. Cosmos DB works best when it can write data evenly across different partitions. Let's get back to our example.

If the data coming from Germany, France, the UK and Italy is evenly split, I mean 25% from each country, then the data can be evenly distributed and persisted in the database. But what if 60% of the data comes from Germany and 40% from the others?! Then Germany's partition gets hit the most, which results in what is called a hot spot or hot partition! The interesting fact is that even increasing the throughput will not help much!

Well, then we may think it's better to choose the city name as the PartitionKey: Hamburg, Munich, Berlin, London, Paris, Rome! Does it help? Only you and your data pattern know. It is worth recapping that after creating a collection, it is not possible to change the PartitionKey. So we need to create a new collection with the right PartitionKey and possibly migrate the data from the old collection to the new one! Sounds painful?! Yes, it is. What makes it worse is that there is no backup/restore functionality to move the data in an atomic fashion. So you either need to write your own tool for data migration (a naive version is sketched below) or use ETL tools like Azure Data Factory to orchestrate the migration from the old collection to the new one!
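
Such a hand-rolled migration could look roughly like this with the .NET SDK; a naive sketch with no batching, checkpointing or error handling, and all names are placeholders:

Container oldContainer = client.GetContainer("telemetry-db", "telemetry-by-country");
Container newContainer = client.GetContainer("telemetry-db", "telemetry-by-city");

// Read everything from the old container (a full cross-partition scan) and
// re-insert it into the new container, which has a different partition key.
FeedIterator<dynamic> feed = oldContainer.GetItemQueryIterator<dynamic>("SELECT * FROM c");
while (feed.HasMoreResults)
{
    foreach (var doc in await feed.ReadNextAsync())
    {
        // The new partition key value is resolved from the document itself.
        await newContainer.CreateItemAsync(doc);
    }
}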

Personally, I have found it quite useful to have a property in the document called "partitionKey" whose value is whatever is appropriate. It can result in a bit of data duplication, but it also brings the flexibility of changing its value in the future if needed.

{
        "id": "123456",
        "country": "Germany",
        "city": "Hamburg",
        "deviceType": "smart-lock",
        "timestamp": 1569147916,
        "partitionKey": "Hamburg"
}

If you are receiving a huge load of data, you can use time-based partition keys, for instance a combination of day, hour, minute, etc. Then the data for each minute is written to a different partition.

{
        "id": "123456",
        "country": "Germany",
        "city": "Hamburg",
        "deviceType": "smart-lock",
        "timestamp": 1569147916,
        "partitionKey": "Hamburg_20190922_1400_03"
}
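
A tiny, hypothetical helper to illustrate composing such a synthetic key; the exact format and granularity (per minute here) depend entirely on your data pattern:

// Produces keys like "Hamburg_20190922_1400"; finer or coarser buckets
// (seconds, hours, an extra random suffix, ...) work the same way.
static string BuildPartitionKey(string city, DateTimeOffset timestamp)
{
    return $"{city}_{timestamp.UtcDateTime:yyyyMMdd}_{timestamp.UtcDateTime:HHmm}";
}

The returned value is then written into the document's partitionKey property before the document is created.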

If the load is still big enough to leave you stuck with hot partitions, you may even use a random unique value for each document, letting Cosmos DB persist the data however and wherever it wishes. That helps a lot with persisting the data, but what about querying/reading the data afterwards?! What if the data you need to query exists in more than one partition? I am afraid you then need to issue a cross-partition query, and Cosmos DB has to ask for the data in all partitions, possibly on different physical machines!
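
With the .NET SDK, such a fan-out query looks like the single-partition sketch above, except that no PartitionKey is set in the options; MaxConcurrency controls how many partitions are queried in parallel (-1 lets the SDK decide). A sketch, reusing the container handle from earlier:

FeedIterator<dynamic> fanOut = container.GetItemQueryIterator<dynamic>(
    new QueryDefinition("SELECT * FROM c WHERE c.deviceType = 'smart-lock'"),
    requestOptions: new QueryRequestOptions { MaxConcurrency = -1 });

while (fanOut.HasMoreResults)
{
    // Each page may have been served by a different physical partition.
    FeedResponse<dynamic> page = await fanOut.ReadNextAsync();
    Console.WriteLine($"Fetched {page.Count} documents for {page.RequestCharge} RU");
}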

So there is a paradox here: optimizing partitions for the write scenario results in poor performance in read scenarios. Unfortunately, that is a big issue at the moment. The solution is to have two collections, one optimized for writing and the other for reading. Cosmos DB offers a feature called the change feed. Using this feature, Cosmos DB generates events whenever a document is inserted or updated in the database (deletes are not captured by the change feed). We can write code that receives the changes and populates the second collection, which is optimized for read queries, as sketched below.
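
A sketch of that wiring with the .NET SDK's change feed processor; the container names, lease container and processor/instance names are placeholders, and a real implementation would also reshape the documents for the read model:

Container writeContainer = client.GetContainer("telemetry-db", "telemetry-write");
Container readContainer  = client.GetContainer("telemetry-db", "telemetry-read");
Container leaseContainer = client.GetContainer("telemetry-db", "leases");

ChangeFeedProcessor processor = writeContainer
    .GetChangeFeedProcessorBuilder<dynamic>(
        processorName: "sync-to-read-model",
        onChangesDelegate: async (changes, cancellationToken) =>
        {
            // Push every changed document into the read-optimized container.
            foreach (var doc in changes)
            {
                await readContainer.UpsertItemAsync(doc, cancellationToken: cancellationToken);
            }
        })
    .WithInstanceName("worker-1")
    .WithLeaseContainer(leaseContainer)
    .Build();

await processor.StartAsync();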

I wish Microsoft would one day provide a built-in way of doing this, I mean having two collections that are mirrored/synced in terms of data but with different PartitionKeys.

Long story short: when choosing a PartitionKey, try to consider the size of the data, the write load, and finally the read queries.

7 Comments

  • Great write up, Morteza. I always love to learn about real-case experiences on cloud services.

  • Very good points here, especially with partitioning. I would suggest that there is even MORE to consider when choosing a partition. For example, if you want to use aggregate functions in your query, they only work against a single partition key. Something to think about. On the other hand, I've found that if I want to improve query performance, it's actually better to spread the data across multiple partitions and allow the built in parallelization to do the work. Of course, this spikes your RUs but if you need performance, then using multiple partitions in each query makes the biggest impact. Just make sure to set enableCrossPartition to true on each query.

  • Excellent write up!!

  • Thanks for sharing real-time cases.

  • I have a stored procedure which takes a document array as a parameter. I am getting into situations where I send too many documents from node.js at once and get a "Request size is too large" exception. I could easily avoid this by just calling the stored procedure multiple times with smaller sets of data until all of it is done. Is there any different solution?

    Please let me know if there is a different way to solve this in node.js.

  • Excellent blog! I referenced your blog in the Microsoft forum and got answers to some of the pain points highlighted in it.

    https://docs.microsoft.com/en-us/answers/questions/745640/choosing-cosmos-db-for-enterprise-use-underlying-p.html#answer-745790

  • Brilliant wrap up! Thanks:)
