Microsoft has generated a lot of buzz since the launch of CosmosDB. It is basically a rebranding of Azure DocumentDB with some cool new features. Let’s go a little deeper and explore its strategy, its documentation, what developers have been saying about it, and how it compares with Couchbase Server.
One Database to rule them all?
In simple words, Microsoft claims that CosmosDB is a NoSQL database able to do literally everything: It is a Document database, Columnar storage, a Key-Value Store and a Graph Database. All achieved thanks to an abstraction of the data format called atom-record-sequence (ARS).
A good sign of Microsoft’s work is that data is organized differently for each model. First, you have to choose the API you would like to use (SQL, MongoDB API, Microsoft Azure Table, Cassandra or Gremlin) and stick with it, as it can’t be changed later. Currently, you can still try to access some models through the DocumentDB API, which is what gave me some hints that CosmosDB internally uses a decorated JSON format to store its data.
It looks like Microsoft wants to compete with most of the NoSQL databases out there, which is a really risky strategy, as we might have passed the golden era of a single database solution for everything. There are huge benefits in choosing specialized storage, and this is the path most applications are following right now with the rise of polyglot persistence. An all-in-one solution like CosmosDB might be good for low-demand applications, but all those abstractions come at a cost and will ultimately hurt simplicity and performance and limit features.
Couchbase vs CosmosDB – Comparing Apples with “Apples”
I will limit my comparison to the scenarios where it makes sense to compare both technologies. The table below shows some of the differences side by side:
|Scalability||Highly Scalable||Highly Scalable|
|Backup & Restore|
|Data Center Replication||
||Sharding is automatically done under the covers|
I think this is the very first article comparing CosmosDB with another database. It took me a good amount of time to go through a lot of documentation, developer feedback, and some webinars.
My feeling, in general, is that CosmosDB has a great vision but is still immature in some aspects. Documentation and backups, for instance, are not among its strengths, which is a natural consequence of building something that focuses on multiple fields at once. Microsoft’s database also brings a lot of innovations. One of the most prominent is the set of new levels of eventual consistency: Bounded-staleness, Session, Consistent Prefix and Eventually Consistent.
The fact that Session is set as the default consistency says a lot about the recommended way to use CosmosDB. It also hints that it might not be the best solution if you need strong data consistency.
I could not find any mention of caching mechanisms in CosmosDB, so I am assuming that caching is not a major part of the database. The problem is that caching is crucial for good performance in strongly consistent databases. Being memory-first is one of the reasons why Couchbase Server is blazing fast.
CosmosDB does not provide memory-optimized indexes and, by default, all fields are indexed in their Global Secondary Indexes (GSI). That sounds like overkill to me, as I still think it is easier to specify which fields I want indexed than to specify which fields I don’t. Of course, you don’t necessarily need to remove those fields from the index, but don’t forget you are getting charged for it.
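To make the opt-in alternative concrete, here is a minimal sketch of an indexing policy that indexes only two fields and excludes everything else. It assumes the JSON shape of a CosmosDB indexing policy (`includedPaths` and `excludedPaths`). The helper function and the field names `customerId` and `orderDate` are hypothetical, so check the current documentation before relying on it.

```python
import json


def opt_in_indexing_policy(indexed_paths):
    """Build a policy that indexes only the listed paths (hypothetical helper).

    Everything else is excluded through the catch-all "/*" path.
    """
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": path} for path in indexed_paths],
        "excludedPaths": [{"path": "/*"}],
    }


# Hypothetical field names, used only for illustration.
policy = opt_in_indexing_policy(["/customerId/?", "/orderDate/?"])
print(json.dumps(policy, indent=2))
```

The default behavior described above corresponds to the opposite setup: the catch-all path is included and individual paths are excluded one by one.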
Sharding seems to be one of the trickiest things in CosmosDB right now. Partitions are moved automatically among nodes, but you still have to specify a partition key. The drawback of this approach is that each partition is indivisible, with a max size of 10 GB. If you pick a bad partition key, a lot of frequently accessed documents might end up in the same partition, which limits the throughput of your reads/writes to the capacity of the node where the partition is stored.
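The effect of a bad partition key can be illustrated with a small simulation. This is only a sketch: the MD5-based hash, the count of 8 partitions, and the sample keys are assumptions chosen for illustration and do not reflect how CosmosDB actually places data.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a partition-key value to a partition with a stable hash (illustrative)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def load_per_partition(keys, partitions: int) -> Counter:
    """Count how many requests land on each partition."""
    return Counter(partition_for(key, partitions) for key in keys)


# 10,000 requests keyed by a low-cardinality field: 90% share one value.
skewed_keys = ["US"] * 9000 + ["BR"] * 500 + ["DE"] * 500
# The same number of requests keyed by a high-cardinality field.
even_keys = [f"user-{i}" for i in range(10000)]

skewed = load_per_partition(skewed_keys, 8)
even = load_per_partition(even_keys, 8)

hottest_skewed = max(skewed.values())
hottest_even = max(even.values())
print("hottest partition, country key:", hottest_skewed)
print("hottest partition, user id key:", hottest_even)
```

With the low-cardinality key, at least 9,000 of the 10,000 requests hit a single partition no matter how many partitions exist, so that one node becomes the throughput ceiling.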
The partition key is also immutable, so in order to change it, you have to copy all your data to another collection. In Couchbase, we transparently distribute your documents evenly between vBuckets to avoid this problem and to increase your read/write performance.
Currently, performance tuning is done only by increasing Request Units (RUs), which is common for fully managed databases (on DynamoDB, for instance, it is done by increasing Read/Write capacity units). The challenge with this approach is that RUs are not a very good predictor of query performance, and it is even harder to boost just one specific behavior, like increasing only the write capacity.
Microsoft has put a lot of effort into making RU provisioning easy to understand, but I have found many comments from developers who underestimated their RUs (like here or here) and ended up with a bill much higher than expected. In general, the provisioning pattern I have seen in CosmosDB is mostly based on trial and error. On Couchbase, tuning is very flexible: it can be done by vertical/horizontal scaling, running specific services according to the node hardware, keeping indexes in memory, etc.
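To show why an underestimated RU figure translates directly into a higher bill, here is a rough sketch of the arithmetic. It assumes provisioned throughput is billed hourly in blocks of 100 RU/s. The rate below is a placeholder and not a quoted price, and the 1,000 and 5,000 RU/s figures are made up for illustration.

```python
import math


def monthly_throughput_cost(ru_per_sec, rate_per_100_ru_hour, hours_per_month=730):
    """Estimate a monthly bill for provisioned throughput.

    Assumes billing per hour in blocks of 100 RU/s, rounded up.
    """
    blocks = math.ceil(ru_per_sec / 100)
    return blocks * rate_per_100_ru_hour * hours_per_month


# Placeholder rate for illustration only, NOT a real price.
HYPOTHETICAL_RATE = 0.01

planned = monthly_throughput_cost(1_000, HYPOTHETICAL_RATE)
needed = monthly_throughput_cost(5_000, HYPOTHETICAL_RATE)
print(f"planned: {planned:.2f}, needed: {needed:.2f}, ratio: {needed / planned:.1f}x")
```

Under this model the bill scales linearly with provisioned RU/s, so a workload that needs five times the planned throughput costs five times as much, regardless of the actual rate.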
Microsoft is also clearly trying to convince MongoDB’s users to migrate to CosmosDB. They even provide a fairly compatible connector to make the migration easier. The problem is that the main reason some users are willing to leave MongoDB is its scalability and performance issues. We know this very well because many of those users end up migrating to Couchbase Server, and CosmosDB performance does not seem to be a big plus, at least not for a reasonable cost.
Microsoft does provide a limited local version for development, but so far it runs only on Windows machines.
CosmosDB also provides a cool push-button global data distribution that makes it really simple to replicate data to multiple locations around the world. It is not, however, a feature used so often that it requires such simplicity. The same result can be achieved in a matter of minutes in Couchbase Server, without the limitation of running in a single cloud.
In summary, I agree with CosmosDB’s point of view that eventual consistency is too broad a definition. Their new consistency models let developers choose the level of consistency their application tolerates.
The reasons to use it are nearly the same as the ones mentioned in my article about DynamoDB. The main difference, of course, is that CosmosDB is much more flexible than DynamoDB. It is right now an average multi-purpose database for applications demanding average performance with strong consistency. It also easily integrates with some features of Azure Functions.
CosmosDB still lacks famous use cases/clients, but it has the potential to stand out in applications with eventual consistency, as it seems to be their main focus. But when it comes to strongly-consistent medium/high demanding applications, Couchbase Server is by far a better choice, both from the price and performance point-of-view.
It is hard to come up with a fair benchmark between those two databases, as it’s unclear, for instance, how many servers are running when you provision 30,000 RUs in CosmosDB. The easiest way to predict their expected performance is therefore through their architecture and features.
Pretty much like DynamoDB, CosmosDB pricing is attractive if you have a small database with few reads/writes per second. But anything above that will cost you a good amount of money: 200,000 documents of 45 KB, with 40 writes/sec and 40 reads/sec, will cost at least US$ 2,500.
Their calculator does not consider the consistency model you are going to use, so you have to add a few extra dollars to this number for strong consistency. In this setup, the CosmosDB cost is at least double what you would spend to run Couchbase EE on Amazon Web Services with our recommended architecture (which is capable of handling more than that).
As I mentioned at the beginning of the article, there are a lot of advantages in choosing specialized storages for each specific purpose, and Couchbase Server really excels in delivering high performance with strong-consistency.
If you have any questions, feel free to tweet me at @deniswsrosa
Really inaccurate on the pricing front for Cosmos DB. 40 reads/sec and 40 writes/sec at 45kb docs requires nearly 2500 RUs (request units) – they are not dollars! They are a throughput measure. Cost for that workload is around $130 / month.
The key advantage, as with every PaaS service, is that you don’t have to manage a bunch of VMs yourself and set every aspect of backups/replication/tuning/patching, etc. Also, you’re not hit with a license cost as well as a VM cost for your cluster. You only pay for the throughput you use. It depends on whether you want a service that requires minimal management or you want to tweak everything yourself.