Most of us could churn out a first stab at a relational database model while sleeping.
Once you’ve chosen to work with a document database, though, you’ll need to think a little differently.
It’s no more difficult, it’s just that you’re optimising for different things.
So, what are the basics to getting it right?
Well, there are three principles that can help guide your thinking:
- Answer the questions you know you’ll ask.
- Embed data for convenience, refer for integrity.
- Name your keys predictably and semantically.
Today, let’s look at the first of those in detail.
Questions, not answers
When we split our data into the tables, columns and rows of the relational model, we’re optimising for queryability. We’re building a source of almost unbound answers and deferring our decisions about what questions to ask.
Let’s take a simple example: a stock management system that lets us track Couchbase-branded swag.
In this system, we have t-shirts, USB sticks, pens and that sort of thing. From time to time we get orders to send them out to meet-up groups, conferences and individuals.
Most likely, that’d give us the following relational tables:
At first glance, Order details might seem non-obvious. However, it allows us to store references to all the items in each order without breaking the first normal form. Otherwise, we’d have to serialise the line items from each order into a string and store that in a TEXT column.
Unless we make a mistake, each product, customer and order will appear only once in our database. That gives us a guarantee that updates to records are universal and it makes it supremely easy to query the data in whatever way we choose.
Trade-offs, there are always trade-offs
That way, we get to store the data for our swag management system in its purest form and then query it in whatever way suits us later on.
So, what’s the problem?
Well, there are a couple of trade-offs:
- SQL queries are expensive: for each join there’s a disk seek, there’s CPU overhead, there’s a user waiting for their page to render.
- Relational databases are hard to scale out across a cluster.
Let’s focus on that first trade-off for now: if you’re being unkind, you could say that relational data modelling is an exercise in stealing CPU cycles from your future self.
We know most of our query patterns early on and yet we so often spend time carefully stripping our data of that context. We split the data once and then spend the rest of our applications’ lifetime asking the database server to piece it all back together.
Of course, that has its place but it’s almost the very opposite of the most efficient path with a document database.
Don’t make people wait for answers
The first thing we do when modelling for a document database is ask, “what questions do I want to ask of my data?”
Then when our system state changes, we compute the answers to those questions and store them pre-canned in the database.
Rather than re-piecing the answers each time we make a query, we pull the answer fully formed from the database in a single look-up.
What does that mean in practice?
Let’s go back to our swag management example. One of our questions would be:
What do I need to do in order to fulfill a particular order?
With a relational database, we’d write a SQL query that would find the order, use a join to find the items in the order, then another join to find the detail of what each item is and a further join to find the customer details.
With Couchbase, it would look something more like this:
- User selects their items and makes their order.
- Our system writes the order into the database as a single document.
- When we need to pull out order details, it’s one read to grab the lot.
The resultant order document might look like this:
“name”: “Matthew Revell”,
“address”: “11-21 Paul Street”,
“itemName”: “Red Couchbase t-shirt”,
“itemName”: “Black 8GB USB stick with red Couchbase logo”,
This example is pretty heavily denormalised. Our swag management system would have details of each customer and product in separate documents already but we’re repeating what we know about them by embedding their details here.
We’ll look at the trade-offs involved with embedding data versus referring to data in a future post.
Production is more complex
In a production system, we’d most likely generate several of those ready made answers. Without really thinking about it, we can probably come up with a few of them for our swag management system:
- Customer order history: customers want to look back at everything they’ve ordered, so we’d record this order against that person.
- Live order status: similarly, our customers will want to see the status of their order, so we can generate that information now and then update whenever anything changes.
- Dispatch instructions: we’ll need to tell the people in our warehouse what they need to send where.
The important thing is that we don’t have to generate all of these while a human being is waiting for a response.
While we might want to update the customer’s live order status immediately, so that they can verify their order was recorded, it’s fine to process the dispatch instruction asynchronously.
This way, when a human views something in our system, the data is already there waiting.
In summary: pre-compute your answers
The first step to efficient document database modelling is to think in terms of pre-computing your answers.
We know up-front what questions we want to ask and so we can cater to those questions in the way that we write our data.
While there’s a lot of intellectual appeal in the idea of a pure, normalised, mathematically sound representation of our data, in practice it can make it harder to serve multiple concurrent users and to scale our operation as demand changes.
By computing our various answers at write time, we:
- remove lag from our user experience
- can more easily distribute the data across a cluster
- get the added bonus of less mismatch between object state and what’s in the database.
Next time I’ll look at when to embed data and when to refer data.
I\’m hoping that you explore at least a token non-trivial case into this series, such as realizing after the fact that you have a need to see how much of each item to order (incremental map-reduce perhaps).
How would updating a customer info work well if you have it stored in 8 million different places
If you reference the same data in eight million different places, and they should never go out of sync, then you\’re almost certainly better off keeping a canonical record of that data and referring to it from each of the eight million places in which you need it.
If you did have a large number of copies of that data that had to remain in separate documents, and also needed to remain consistent, then the easiest route would be a N1QL query, I suspect.
so, is there a Part II ?
Part 2 is here: https://www.couchbase.com/data-modelling-when-embed-or-refer/