What Is Database Sharding?

Learn all about sharding and how Couchbase’s NoSQL cloud database service can help

Database sharding overview

This page covers:

Database sharding is a powerful tool for optimizing the performance and scalability of a database. It allows for faster access to data and enables a database to handle larger workloads by distributing data and processing power across multiple servers. Because NoSQL databases are designed with distributed computing and automatic sharding in mind, they’re often the databases most associated with sharding. With enough effort, though, sharding can be achieved with any database technology.

How does database sharding work?

Database sharding divides the entire dataset into multiple groups known as shards. Once divided, each shard can be stored independently, usually on multiple servers, which are often referred to as a cluster. Each shard can be accessed independently, which means you can access data faster, and you have more resources available for processing, computing, and storage.

 

 

Where does sharding take place?

If a database has sharding features built in, then the development team requires less work to achieve sharding. If sharding is an optional feature or requires configuration, then you’ll need to plan carefully, but sharding shouldn’t require any significant codebase changes or additions. If the underlying database cannot do sharding (as is the case with many relational databases), then you may be required to make major changes to the codebase, and developers will have to build sharding into the persistence layer of an application.

 

Do I need to shard my data?

Whether or not you should use sharding depends on many factors. Those factors include the size of your dataset, the number of system users, the number of operations performed, and your infrastructure constraints.

 

If your application experiences a noticeable decrease in performance due to more users or more operations, then horizontal scaling (which often uses sharding) is one way to increase compute resources available to your database. But poor performance may also indicate suboptimal code, lack of proper indexes, the need for data modeling changes, or other problems. Sharding shouldn’t always be the first choice for improving performance, but for some database technologies it can be a low-friction means to achieve your performance goals.

Advantages of sharding

  • Faster performance: There are more servers available to handle input/output
  • Horizontal scaling: You can quickly add additional servers to a cluster
  • Costs: Horizontal scaling can often be less expensive than vertical scaling (i.e., upgrading one server to another more powerful server)
  • Distribution/uptime: A horizontally scaled distributed database can achieve better uptime than a traditional single server

Disadvantages of sharding

  • Complexity: Depending on the database system, sharding complexity can vary. Some databases are designed with distribution, horizontal scale, and sharding included. Others require a more hands-on DIY approach.
  • Rebalancing: When adding additional machines to a cluster, the shards will likely need to be rebalanced to distribute data evenly. (For example, if you have 1,000 documents evenly distributed across three shards, that’s roughly 333 documents per shard. If you add a fourth shard, even distribution would be 250 documents per shard). If a database doesn’t have sharding features built in, rebalancing is guaranteed to be a complex manual DIY process.

Types of sharding

There are multiple approaches to sharding. Some database systems have sharding functionality built in, while others do not directly support sharding (and require a lot of custom coding or DIY processes). The goal of each approach is to divide data into shards consistently so that data can be looked up on or written to the same shard each time.

 

Range-based sharding

Range-based sharding involves selecting data values and assigning them to a shard based on whether or not they fall within a specific range. For instance, if you have user data that contains age, one shard could store users between the ages of 0-10, another shard would store users between the ages 11-20, and so on.

 

This approach can be problematic because one shard could end up storing many more users than the other. And shards storing a disproportionately high amount of data can become hot spots that impact performance.

 

Key-based sharding

Key-based sharding takes more of an independent approach. A value in the data (usually the document ID in a NoSQL document database) is run through a hash, and that hash determines which shard the data should be stored in.

 

This approach can be problematic if it’s not directly supported by the database, because any application accessing the database must be able to construct the hash. Also, this approach requires the data value used for the hash to be immutable. This is usually not an issue, but it can be for rare edge cases.

 

Couchbase uses automatic key-based sharding to distribute data evenly in a cluster, and also provides automatic rebalancing and automatic replication. These automations can simplify critical processes and free up valuable time for your development team.

 

Directory-based sharding

Directory-based sharding is an approach whereby some value of the data is mapped to a particular shard based on its value in a lookup table or lookup configuration. It’s similar to the range approach, but may involve a simple lookup. For instance, a user with an address in Ohio would be stored in the “Ohio” shard, a user in California would go to the “California” shard, and so on.

 

This approach can be problematic because a lookup table or configuration can become unavailable, go down, or become corrupted. In such cases, the application can no longer perform reads or writes.

 

Geo sharding

A geo shard can be combined with, or used instead of, the other sharding options. The idea behind geo sharding is to store data physically closer to where it will most often be accessed. For instance, a user with an Ohio address would be stored on a server in Ohio, and a user with a California address would be stored on a server in California.

 

This approach can provide faster access, but can also lead to hotspots and underutilized servers. Geo sharding may also fail to meet legal requirements for specific applications or jurisdictions.

 

In addition to providing automatic sharding, Couchbase can support geo sharding through cross data center replication (XDCR).

 

Entity-based sharding

Entity-based sharding means that separate, but closely related, data is stored together on the same shard. For instance, a user may be considered an entity within an application’s logic, but a user’s shopping history may be stored separately in a different shard. By storing the related data in the same shard, you can reduce the amount of computing work required to retrieve it together at the same time.

 

The drawback to this approach is complexity. Configuring which data goes where can be a complex process, especially if some data is used by multiple entities.

Alternatives to database sharding

Horizontal scaling will always involve sharding at some level, but there are many options for how to shard. One way is architectural sharding, like a microservices architecture or a physical disk sharding that’s opaque to the user. Sharding can even be hidden and abstracted behind a cloud database.

 

When considering a database system, it’s crucial to understand how sharding will be accomplished. It can be wholly abstracted and hidden, it can be automatic, it can be supported with potentially complex configuration options, or it can be unsupported and require a DIY approach.

How Couchbase Capella helps with database sharding

Couchbase Capella is the cloud database platform for digital enterprises.

 

Capella uses the same sharding system as Couchbase Server, which is key-based automatic sharding. From a user or developer’s point of view, sharding with Couchbase requires no additional configuration or maintenance. By using a CRC32 algorithm in conjunction with vBuckets, Capella ensures that your system will not have any hotspots.

 

Capella automates replication. You just select the number of replicas you want, and Capella handles the rest.

 

Capella also automates rebalancing. When you add servers or remove them from a cluster, Capella automatically rebalances them without causing any downtime.

 

Finally, Capella can achieve geo sharding via the XDCR feature. XDCR replicates data between data centers in real time. XDCR replications can include or exclude data based on user-defined filters in order to improve local latency or to meet data location requirements.

Conclusion

Sharding is an important concept to understand if you’re scaling a database to handle more operations. And NoSQL databases are especially good at sharding because they eliminate many of the constraints imposed by relational databases. That said, Couchbase Capella provides some of the best features of a relational database (SQL syntax and a full SQL implementation that includes JOINs and ACID transactions) to a distributed database with automatic sharding.

 

To learn more about sharding in Couchbase, check out:

 

Other important resources for considering how to scale a database:

Ready to get started?