There are multiple ways of getting data in and out of Couchbase. Notice that I did not say querying, I said in and out…on purpose. Not all ways of getting data in and out of Couchbase are querying like in other databases. Couchbase offers multiple ways that provide different capabilities/functionality and performance characteristics that you can mix and match to meet your application needs. Let’s list the different Couchbase data access patterns and then dive into the practical application of each.
- Read/Write data by object ID (key)
- Read data from a View
- Read/Write/Update data using N1QL
- Full Text Search via Solr or Elastic
When and Where to Use Each Method
With the introduction of N1QL (pronounced “nickel”) and the Query and Index Services in Couchbase 4.0, Couchbase gains a new level of functionality. N1QL brings nearly full ANSI SQL-92 compliance. (I say nearly because N1QL leaves out features that are useful in a relational database but not in a document database. Conversely, it has features that a document database needs but that are not appropriate for a relational one. In other words, it is both a subset and a superset of ANSI SQL-92.) Let’s be clear though: N1QL is NOT meant to, and should NOT, replace the other means of reading and writing data that Couchbase had prior to 4.0. It simply offers yet another way to get at the data.
With the introduction of N1QL, the need for Solr and Elastic has diminished, because Couchbase now supports the general querying that most people used these two tools for. They are still required if your application needs full text search. Both platforms have excellent integration with Couchbase to provide this functionality.
Each of the four means of accessing data is a tool in your tool box and each is for a different purpose. Each tool should be used as the functional and performance needs of that use case dictate. Remember, these tools are not an either/or situation. You can mix and match these tools to your advantage. For example, you can query a View that emits object IDs and then use those IDs with a parallelized BulkGet using the Couchbase SDK to read all of those objects, or simply do create/read/update/delete (CRUD) on all of those objects, which will be very fast. So together these tools provide you with standard and scalable ways that anyone can use to get data in and out of Couchbase with ease and flexibility.
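To make the mix-and-match idea concrete, here is a minimal sketch of the "query for IDs, then bulk get by ID" flow. It does not use the Couchbase SDK: the `documents` dict, `query_view_for_ids` and `get_by_id` are hypothetical stand-ins for a bucket, a view query and a key/value read, so only the shape of the pattern should be taken from it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a bucket: object ID -> JSON-like document.
documents = {
    "user::1": {"platform": "ios", "country": "US"},
    "user::2": {"platform": "android", "country": "DE"},
    "user::3": {"platform": "ios", "country": "FR"},
}

def query_view_for_ids(platform):
    """Stand-in for a view query that emits only the matching object IDs."""
    return sorted(k for k, doc in documents.items() if doc["platform"] == platform)

def get_by_id(key):
    """Stand-in for a single key/value read by object ID."""
    return documents[key]

def bulk_get(keys, max_workers=8):
    """Fetch many objects by ID in parallel and return {key: document}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of `keys`, so zip pairs them correctly.
        return dict(zip(keys, pool.map(get_by_id, keys)))

ios_ids = query_view_for_ids("ios")   # step 1: ask the question once
ios_docs = bulk_get(ios_ids)          # step 2: read every answer by ID
```

The design point is that the query only has to return small IDs, and the heavier document reads go through the fast key/value path.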
Let’s dive into the details though…
Read/Write Data by Object ID (key)
At its core, Couchbase is an amazing key/value database and always has been. Access via object ID is also one of the most difficult concepts in Couchbase for people to grasp quickly and use wisely (thus the length of this section). Once they do grasp its power, they see what it can provide and how they might apply this tool. Access via object ID is the misunderstood, powerful beast in the corner. So let’s better understand this beast and how we can harness it for our benefit.
Let me be clear before we start: data access via object ID (key) will always be faster than querying. It is the difference between already knowing where your data is and having to ask a question (query) to find it. Let’s say you walk into a library and need a specific book. If you know the ID of the book, you go to floor one, row three, shelf four, third book from the right. You just go there, grab it, check out and leave. If you do not have the ID for the book, you might ask the librarian or computer, give them the information you have (author, title, etc.) and they direct you to the location of the book, or worse yet, you look at every single book until you eventually find it. When you know the object ID, there is no need for an indexed lookup of your data; you just go get the data you need right from the Couchbase managed cache. It is extremely fast, with very consistent access times and very low latency. So make sure you do not compare its performance with the other access methods, as each serves different functional needs.
Now you may say to key/value access, “meh, I need querying!” Maybe and maybe not. In Couchbase, accessing data with the object ID can be very powerful: the object ID can be up to 250 bytes long and, depending on how you use it, it could enable you to avoid querying. The real power of the object ID comes when you use a standardized pattern for each object ID, one that your application can construct to go after the exact data it needs, when it’s needed. Think of the object ID as an extension of your overall object modeling. Combine all that with Couchbase’s architecture, and you’ve made sure your application gets the data it needs as fast as possible from the built-in managed cache. A moment of caution though. By default Couchbase keeps the object ID of every object in the managed cache for performance reasons. So do not go wild with large keys. For example, 250 million objects multiplied by 250 bytes per key is around 58 GB of RAM needed across the cluster, just for the keys. So just because you can have a 250 byte key does not mean you should. At serious scale this could become an issue, so I would recommend keeping keys below 100 bytes.
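The key-size arithmetic is easy to check for your own numbers. The sketch below counts only the key bytes themselves (any per-object metadata the server also keeps in memory is left out), and `key_ram_gib` is just an illustrative helper name.

```python
def key_ram_gib(num_objects: int, key_bytes: int) -> float:
    """Return the RAM taken by the keys alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_objects * key_bytes / 2**30

full_size = key_ram_gib(250_000_000, 250)  # every key at the 250-byte maximum
trimmed = key_ram_gib(250_000_000, 100)    # keys held to 100 bytes

print(f"250-byte keys: {full_size:.1f} GiB")
print(f"100-byte keys: {trimmed:.1f} GiB")
```

With 250 million objects, trimming keys from 250 bytes to 100 bytes brings the figure from roughly 58 GiB down to roughly 23 GiB across the cluster.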
Couchbase’s architecture with the combined caching and persistence tiers excels at data access patterns that might be crippling to other databases, especially relational. Reading multiple objects right from the managed cache is considerably faster than traditional databases. And with other databases, you need to restrict the round trips to the database. With Couchbase, you can read a document, grab data from it, and then read more objects based on that, all in the same overall time or even less than it takes for other databases to do just that one query and return results. The penalty for multiple trips to Couchbase is dramatically lower and actually encouraged.
I’ll give you a few examples of what is meant by standardized object ID patterns:
This object stores the login information for a unique username of hernandez94 in a user profile store. So when you need to authenticate this user, you grab just this JSON document that only contains their login info.
In this same user profile store example, this object would store a JSON document of that user’s three security questions. When they forget their password, all your app has to do is get this object. The other nice thing is since security questions are not accessed often, they might fall out of the managed cache for objects that are used often and that is ok.
You can see that with standard object ID patterns like these, your application could construct the IDs from the information it has available and then interact with these objects directly in Couchbase. No full database querying needed. We could get deeper into object modeling strategies, but that is outside the scope of this article. For more on this, read “Performance Oriented Architecture” by Chris Anderson.
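As a rough sketch of how an application might build such IDs, the snippet below uses a `::` separator and the prefixes `login-info` and `sec-questions`. Those names, the helper functions, the sample documents and the dict standing in for a bucket are all assumptions made for illustration, not a Couchbase convention or API.

```python
import json

SEPARATOR = "::"  # hypothetical delimiter between the prefix and the username

def login_key(username: str) -> str:
    """Object ID of the document holding only this user's login info."""
    return f"login-info{SEPARATOR}{username}"

def security_questions_key(username: str) -> str:
    """Object ID of the document holding this user's security questions."""
    return f"sec-questions{SEPARATOR}{username}"

# A plain dict stands in for the bucket; values are JSON strings.
bucket = {
    login_key("hernandez94"): json.dumps({"password_hash": "made-up-hash"}),
    security_questions_key("hernandez94"): json.dumps(
        {"questions": ["made-up question 1", "made-up question 2", "made-up question 3"]}
    ),
}

def get_login(username: str) -> dict:
    """Authenticate path: one direct read by a key the app can construct."""
    return json.loads(bucket[login_key(username)])

def get_security_questions(username: str) -> dict:
    """Forgot-password path: again a single read, no query involved."""
    return json.loads(bucket[security_questions_key(username)])
```

Because the application knows the username, it can compute both IDs on its own and never has to search for either document.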
For some more specific examples of how you might use a standardized object ID pattern in your application, please see this and this blog post I wrote. Even if the specific use cases in the blogs are not applicable to yours, they might help you understand better the use of object ID patterns and how they might apply to your own use case.
Pros
- Very fast access, and if your cluster is sized correctly, the object is already in the Couchbase managed cache.
- Flexibility to find data via object ID
- Data is strongly consistent, i.e., you always read your own writes.
- Scales out linearly with even distribution of data across nodes
Cons
- The application requires more intelligence to access the objects it needs
- More advanced data modeling
- More in-depth understanding of your application’s data access patterns before you write your application
Reading Data from a View
Up until Couchbase 4.0, View indexes were the only way to query Couchbase if you did not know the object ID. Now that we have the Query and Index Services driving N1QL, let’s re-visit what Couchbase Views are, what they are best at, and why they are definitely still relevant.
For example, management needs to know on a semi-regular basis how many iOS users we have, what version of the app they use, and what country they are from, and the answer has to be computed across 30,000,000 user documents in the database. Views solve this very well, as that information is computed as the data is inserted or updated in the database. So when you need to query that pre-computed view, the query is relatively cheap.
One thing to note though: Views are eventually consistent by default. When querying them, you can use “stale=false” to force the view index to update before results are returned, which makes the results much more likely to reflect the latest writes. You will pay a performance penalty for that though. The penalty depends on how frequently data is changing in your database and how your view is designed. The flow of satisfying a view query with stale=false is: your app calls the cluster nodes, they update the view index on the nodes of the cluster, and then they return the results to the application. Now imagine this with a very high insert/update rate and a high query rate and you see where you might get into trouble. Just be aware.
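The idea of paying for the aggregation at write time can be sketched with a toy class. This is not how the view engine is implemented (real views are defined as map/reduce functions and updated in the background), and `CountingView`, its methods and the document fields are all hypothetical; it only shows why the later read is cheap.

```python
from collections import Counter

class CountingView:
    """Toy stand-in for a view that counts iOS users by (app version, country)."""

    def __init__(self):
        self._emitted = {}       # doc ID -> key this doc currently contributes to
        self.counts = Counter()  # (app_version, country) -> number of users

    def on_upsert(self, doc_id, doc):
        """Called for every insert or update; keeps the counts current."""
        old_key = self._emitted.pop(doc_id, None)
        if old_key is not None:
            self.counts[old_key] -= 1          # retract the doc's old contribution
        if doc.get("platform") == "ios":       # the "map" step: emit only iOS users
            key = (doc["app_version"], doc["country"])
            self._emitted[doc_id] = key
            self.counts[key] += 1              # the "reduce" step: a running count

    def query(self, app_version, country):
        """Reading is a lookup of a pre-computed number, not a scan of all docs."""
        return self.counts[(app_version, country)]

view = CountingView()
view.on_upsert("user::1", {"platform": "ios", "app_version": "2.1", "country": "US"})
view.on_upsert("user::2", {"platform": "ios", "app_version": "2.1", "country": "US"})
view.on_upsert("user::3", {"platform": "android", "app_version": "2.1", "country": "US"})
view.on_upsert("user::2", {"platform": "ios", "app_version": "2.2", "country": "US"})  # user upgrades
```

After these four writes, asking for iOS users on version 2.1 in the US touches one counter rather than every user document.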
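For reference, `stale` is just a parameter on the view query. The sketch below builds the REST URL for a view, assuming the default view port 8092; the host, bucket, design document and view names are made up, and in practice the SDK would build this request for you.

```python
from urllib.parse import urlencode

def view_url(host, bucket, design_doc, view, stale="update_after"):
    """Build a view query URL; `stale` is one of false, ok, or update_after."""
    params = urlencode({"stale": stale})
    return f"http://{host}:8092/{bucket}/_design/{design_doc}/_view/{view}?{params}"

# Default behaviour: return what the index has now, then update it afterwards.
fast_url = view_url("localhost", "users", "reports", "ios_by_version")

# Force the index to catch up before answering, at the cost of latency.
fresh_url = view_url("localhost", "users", "reports", "ios_by_version", stale="false")
```

Keeping the fresher-but-slower variant as an explicit opt-in makes it easier to reserve it for the few queries that really need it.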
Pros
- Provides easy queryability over larger amounts of data
- Once created, a view examines every object for inclusion as it is inserted or updated
- Each Data Service node only processes its portion of the total data in the cluster. For example, in a four node cluster, each node has 25% of the active data and so only indexes that 25%.
Cons
- View indexes are spread out over the Data Service nodes, not the Index Service. The more nodes you have as part of the Data Service, the more nodes the view engine has to gather data from to get an answer back to the application.
- Eventually consistent by default. You can query with stale=false, but you take a performance hit.
Reading and Writing Data with N1QL
With N1QL, we now get into traditional querying of data. SELECT this FROM that WHERE this = ‘stuff we know’ JOIN with that other thing. If access via object ID is the beast in the room, then this is a powerful Wizard.
N1QL, along with the Query and Index Services that drive it, gives you the most flexibility in getting at your data in Couchbase while being performant and scalable through independently managed services. If you need to do analytics, complex queries, compare data, etc., then N1QL is what you are looking for. In the analogy of being in a library and needing a book, querying with N1QL is the librarian getting the books that satisfy the data you have. “Please get me everything in the library by the author Neil Gaiman that was written between 1998 and 2014 and are books or graphic novels.” This changes the kinds of application functionality Couchbase can be used for quite significantly.
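The librarian request might translate into N1QL along these lines. This is only a sketch: the `library` bucket and the `author`, `published_year` and `item_type` fields are invented for the example, and the statement is held in a Python string with named parameters that would be sent alongside it to the Query service.

```python
import re

# Hypothetical bucket and field names; $name marks a named query parameter.
STATEMENT = """
SELECT b.title, b.published_year, b.item_type
FROM `library` AS b
WHERE b.author = $author
  AND b.published_year BETWEEN $from_year AND $to_year
  AND b.item_type IN ["book", "graphic novel"]
ORDER BY b.published_year
"""

PARAMS = {"author": "Neil Gaiman", "from_year": 1998, "to_year": 2014}

# Sanity check before sending: every placeholder must have a value and vice versa.
placeholders = set(re.findall(r"\$(\w+)", STATEMENT))
missing = placeholders - set(PARAMS)
unused = set(PARAMS) - placeholders
```

Passing the author and years as parameters instead of pasting them into the string keeps the statement reusable and avoids quoting problems.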
The great thing with the Couchbase client SDKs is that you just use a different method to query with and the SDK takes care of communication with the appropriate services. Other than that, you do not have to do anything extra. In your app, the first call can be a complex N1QL query with joins and the next can use the results of that query to call a map/reduce view and grab a pre-calculated aggregation. This is another example of how using the right tool for the right job makes sense and gives you options.
Now that we have established the power and flexibility, there are other tools you can incorporate with the Query service. Couchbase has teamed with Simba Technologies to create ODBC and JDBC drivers for data access as well. This allows you to utilize Excel or more complex BI tools like Pentaho, Informatica, etc.
Pros
- Very flexible way to query the database and get the answers you need.
- The indexes are located only on the nodes running the Index Service and are not spread out across the cluster.
- Can use Multi-Dimensional Scaling (MDS) to scale out just the services you need to get the required performance for your application.
- Developers that know SQL can easily transition to writing N1QL
- ODBC and JDBC drivers for integration with BI tools
Cons
- Queries will never be as performant as accessing data via object ID, for reasons I already went over.
- Eventually consistent by default. You can ask a query to wait for the index to catch up (N1QL uses the scan_consistency setting, such as request_plus, rather than the stale=false parameter that views use), but you take a performance hit. For most workloads though, the index is being updated as fast as possible in the background and consistency should be fine.
Full Text Search with Solr or Elastic
Couchbase integrates with Solr and Elastic (Search) via plugins to provide Full Text Search capabilities that Couchbase lacks, for the moment. If you have a functional requirement for this, each has a plugin that enables every write/update operation to be streamed to Solr and/or Elastic. By default, these search servers do not save the entire JSON document, but merely create the internal components (indexes, etc.) needed to allow full text searching. Once the documents are searchable, your application would refer to those tools when you need that functionality, get the results from the search and if needed, read the full document(s) from Couchbase. This enables you to use each tool for what they are best at and get the best performance from each.
Pros
- Provides full text search capabilities
- Application does not need to do dual writes. It writes to Couchbase, and from there the write is automatically replicated to Elastic/Solr for indexing.
- Each system can be scaled to handle the job it needs to do
Cons
- Must maintain separate infrastructure for Solr or Elastic.
- Eventually consistent
- Not fully integrated into the Couchbase cluster
- Not built into the Couchbase SDK, so your application will have to talk to one of these and to Couchbase
You might ask, “Why not just use Solr or Elastic and skip Couchbase altogether?” The reason is simple: while both are great for what they do, neither Solr nor Elastic is a database and neither has the performance or other powerful capabilities of Couchbase. Test it yourself and you will find that both can be at least 2-3x slower than getting the same data from Couchbase.
Depending on how you need to access data in Couchbase and on your application’s functional and performance needs, you can mix and match the tools I have outlined to get the results you need. If you need raw speed and/or full consistency, go with access via object ID and a standardized key pattern. If you need to ask questions of the data, go with views, N1QL or full text search. If you need to get a bunch of documents and then update them quickly, mix views with access via object ID.
The other great thing is that access methods #1-3 are all built into the Couchbase SDKs, with the “how each one does its magic” hidden from your application. This enables you not only to combine all of these access patterns, but also to develop with agility against a single node of Couchbase on your laptop and then run that same code at any scale on a cluster. So whether you have 3 nodes or 80 nodes of Couchbase in your cluster, your application is ready to scale with zero code changes.