Schema changes in Couchbase?
Hi Guys,
I am trying to wrap my head around the concept of a NoSQL database, and to see where we could use it for our systems. The first thing we are going to do is get rid of our old caching system and replace with with a combination of caching in memory and persistent storage using Couchbase and the Memcached clients (since I have that already). Next I will likely move our session data into Couchbase, and then perhaps stuff that may work well such as customer shopping cart data and get that out of our MySQL database.
But as this stuff progresses, it seems that there is movement afoot to be able to do everything needed from a database engine withing a NoSQL database, like Couchbase. We already just went through a server upgrade for our MySQL server to handle additional load, and eventually we can't scale up any more and have to think about scaling out, which MySQL just does not do well. So I am looking at this with the intention of learning how something like Couchbase may be able to one day get rid of MySQL from our servers altogether, and store the data differently.
But first I need to wrap my head around how schema migration would work with Couchbase, and other JSON based document stores. One of the supposed benefits (other than scaling out well) is that schema changes are much easier since you just stuff what you want into the database and don't have to define in advance what the schema is. But, at some point you still have to deal with schema versioning, and I can't yet figure out how this would work with a NoSQL document database. Or maybe I should call it 'document versioning', since that is what it really is.
Just say I decide to stick my customer shopping cart details into a document, and store it in Couchbase. So I go ahead and define the shopping cart to consist of an optional customer ID (if they are logged in), along with a list of product details for the items in their shopping cart. Pretty simple, so we define our objects for the shopping cart and we stuff them into the database (C# in case you care):
public class ShoppingCartProduct
{
public int ProductID { get; set; }
public int Qty { get; set; }
}
public class ShoppingCart
{
public int? CustomerID { get; set; }
public List<ShoppingCartProduct> Products { get; set; }
}So this works great. But then I suddenly decide that I really want to keep track of whether this shopping cart item is also in the customers wish list (if they are logged in), so we add a new field to the ShoppingCartProduct to keep track of this:
public class ShoppingCartProductV2
{
public int ProductID { get; set; }
public int Qty { get; set; }
public int? WishListProductID { get; set; }
}So my question is, how do you handle migration of old data in this kind of environment? We modify our code and want to deploy it, but all the existing documents for customer shopping carts don't have the WishListProductID field at all in the record, so how do we go about 'upgrading' all those old documents? What happens when you request a document in the 'old' format? Does it create an object of the new format with the 'new' fields set to the default values?
What happens if a field is removed from an old document, but I need to migrate the data from an old field across to a new field or something? How is that handled, since at the C# level if I removed a field from a document when it is pulled out of the document and deserialized into an object, it won't match the object any more if the JSON data did not match the new class definition?
I can see that I don't have to define the data to the DATABASE itself, since it doesn't care. But my application layer still needs to care about this, and provide some way of migrating the old data to new data. But I can't find anything in the documentation that discusses this, and the best practices to handle this kind of thing since it would seem this has moved from the database layer (and database schema migration/upgrade tools) to the application layer?
So it would seem that there would still need to be some kind of migration/upgrade logic written to make sure that old documents are iterated over and converted over to 'new' documents and any data migration taken care of? If that code is written in C#, it would need to know about both the old and new class definitions, so it could load the objects in the old format and translate them into the new format?
Any suggestions to additional reading, blog posts or whatever would be greatly appreciated if this has already been discussed somewhere.
Hi Kendall,
Sorry for the delay with a response. How to handle schema versioning is certainly a common question that developers face when starting to adopt non-relational, schema-less databases. It is true that the serialization and deserialization of objects into JSON documents (or binary formats) does introduce some complexity into the process. But I think that's always been the case when you map database schemas to object models.
In traditional RDBMS systems, we tend to have mapping layers that hide schema from our applications. Whether using an ORM or a data access layer that reads a row and maps a column value to a field, we still have to update that mapping code (or XML files) to handle schema changes. So in that sense, we haven't introduced a new problem with NoSQL, we just need a different approach to handle it.
With Couchbase (and other document stores), your actual migration needs will depend on your application and its tolerance to schema change. So in your shopping cart example, if you could "expire" all shopping carts prior to the new schema version being released, you could simply ignore those documents with older schemas (or remove them). Of course that's hardly always going to be the case, but it demonstrates a difference between non-relational and relational databases. It's OK to have shopping cart documents that don't look like other shopping cart documents.
Another reason you might not need to worry about the schema change depends on whether you're working with strongly typed documents. In other words, are you expecting the client library to give you back a ShoppingCart instance or a dictionary structure. If you use the Python client for example, you'll get a dict back and whatever prop/val pairs were in he document are mapped to dict key/values. In that case, you're application could check the schema at runtime. Again, that's not always the best answer, but it's worth noting because it could work in simple cases.
In the .NET case, it's generally going to be more complicated because you are most likely going to be working with strongly typed documents. In that case, schema migration is clearly more important. If you're using Couchbase Server 1.8 and the Couchbase .NET Client Library 1.0, you're handling any JSON serialization on your own as it's not done by the client (that's in the works for 1.2). So how tolerant your application is to schema change depends on your JSON serializer. But generally speaking, you're more likely to see deserialization errors if your document has a property that's been removed from your class and the deserializer tries to grab that PropertyInfo instance.
In terms of actual document migration, if you're changes don't break your JSON serialization, then you could simply retrieve the document, set its new properties and resave it. Whether you do this in batch or at runtime is somewhat up to you. Document updates are in place (the entire document is replaced), so you will automatically get the new schema when the document is saved. Of course, iterating over all of your documents and performing a batch update will save your application from some of the schema concerns. If deserialization will break, you'll want to work with the raw JSON or use the multi-get option in the .NET client, which returns a Dictionary.
Finally, you should also consider including meta data in your document to help facilitate schema migrations. Most importantly would be some sort of type identifier. I prefer to use a "type" property that will take the name of the class being serialized. You can see an example of this at http://www.couchbase.com/develop/net/current. You can then use that property to create a quick view (Couchbase Server 2.0) to retrieve all shopping cart documents and update as you iterate over the list.
So really, there's no single best way to migrate your schema. Some applications could live with a mixed schema. Some applications need a batch update and reformatting of the documents. Some can defer the decision to runtime gets and re-saves. I suspect you'll find the batch most agreeable.
Please let me know if that helped.
-- John