Data replication is the process of copying one or more records from one place to another. The source and destination can be closely related (for example, copying records within the same database) or more distinct (for example, copying data from one database to another). The term data replication typically implies keeping the destination up to date with the source, but how fast and how automated the replication is affects how consistent the two copies are at any moment.
Data replication terms
Here’s a list of common terms to help you understand data replication:
Unidirectional versus bidirectional: A unidirectional relationship means data flows only from a source to a destination. A bidirectional relationship means data flows both ways.
Active-active versus active-passive: In an active-active cluster, every node serves traffic and the load is distributed across all of them. In contrast, an active-passive cluster has a standby node that takes over only if the active node fails or becomes unavailable.
Synchronous versus asynchronous: Synchronous replication writes data to the primary node and the replica as part of the same operation, so the write is acknowledged only once both have it. Asynchronous replication writes data to the primary node first, acknowledges the write, and then copies it to the replica.
Batch versus real-time processing: Batch processing collects and processes data in groups or batches at scheduled intervals, and it is typically suited for handling large volumes of data. Real-time processing handles data as it is generated or received, making it suitable for time-sensitive applications.
Incremental versus full: Incremental data replication means you replicate only the elements of a record that were updated. Full data replication means you replicate the entire record whenever any of its elements change.
Filtered: Data from a source can be filtered so that only a specific subset or selection of the data is replicated to a destination.
Transformed: Data transformation is the process of converting data from one format or structure to another to put it in the correct format and structure for analysis, reporting, or storage at its destination.
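To make the incremental, full, filtered, and transformed terms above concrete, here is a minimal Python sketch. It assumes records are plain dictionaries; the function names (`full_payload`, `incremental_payload`, `replicate`) and the sample fields are hypothetical and do not come from any particular replication product. A real incremental payload would also carry the record's key and would need a way to signal removed fields, which this sketch leaves out.

```python
from typing import Any, Callable, Iterable

Record = dict[str, Any]
_MISSING = object()  # sentinel so a new field set to None still counts as a change


def full_payload(old: Record, new: Record) -> Record:
    """Full replication: send the entire record whenever anything changed."""
    return dict(new)


def incremental_payload(old: Record, new: Record) -> Record:
    """Incremental replication: send only the fields whose values changed."""
    return {k: v for k, v in new.items() if old.get(k, _MISSING) != v}


def replicate(
    records: Iterable[Record],
    keep: Callable[[Record], bool],
    transform: Callable[[Record], Record],
) -> list[Record]:
    """Filtered and transformed replication: select a subset, then reshape it."""
    return [transform(r) for r in records if keep(r)]


old = {"id": 1, "name": "Ada", "city": "London"}
new = {"id": 1, "name": "Ada", "city": "Paris"}

print(full_payload(old, new))         # {'id': 1, 'name': 'Ada', 'city': 'Paris'}
print(incremental_payload(old, new))  # {'city': 'Paris'}

customers = [
    {"id": 1, "name": "Ada", "region": "EU"},
    {"id": 2, "name": "Grace", "region": "US"},
]
eu_only = replicate(
    customers,
    keep=lambda r: r["region"] == "EU",  # filter: only EU customers
    transform=lambda r: {"customer_id": r["id"], "full_name": r["name"].upper()},
)
print(eu_only)  # [{'customer_id': 1, 'full_name': 'ADA'}]
```

The incremental payload is smaller on the wire, but the source has to compare the old and new versions to produce it, which is extra work compared with sending the whole record.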
Benefits of data replication
Data replication has many uses and benefits. These include:
High availability (HA): Maintaining up-to-date copies of data in multiple locations keeps the data accessible and limits data loss when a failure occurs. For HA, replication is typically real-time and unidirectional, and it takes place between a source and one or more replicas. If the source becomes unavailable, one of the replicas takes over, often automatically.
Disaster recovery (DR): Closely related to HA, disaster recovery ensures that copies of your data are available in the event of disaster.
Scaling throughput: This process uses multiple copies of data to increase the capacity of a system to handle requests. It is typically used for read traffic and less commonly for write traffic.
Secondary access: Also known as indexing, this involves replicating the data to another system to access it differently. The second system can either be within the same database (in the case of indexing) or can be external. Depending on the technologies in use, an intermediary like Kafka is sometimes required to transfer the data between a source and an external system.
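The high availability and read scaling ideas above can be sketched in a few lines of Python. This is an illustrative toy, not any database's actual failover logic: the `Node` and `Cluster` classes are hypothetical, replication is a simple in-process copy, reads are spread round-robin over live replicas, and failover promotes the first live replica when the source is down.

```python
class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, object] = {}
        self.alive = True


class Cluster:
    def __init__(self, source: Node, replicas: list[Node]) -> None:
        self.source = source
        self.replicas = list(replicas)
        self._reads = 0  # counter used for round-robin read routing

    def failover(self) -> None:
        """Promote the first live replica to be the new source."""
        for i, replica in enumerate(self.replicas):
            if replica.alive:
                self.source = self.replicas.pop(i)
                return
        raise RuntimeError("no live replica to promote")

    def write(self, key: str, value: object) -> None:
        """Unidirectional replication: writes go to the source, then to replicas."""
        if not self.source.alive:
            self.failover()
        self.source.data[key] = value
        for replica in self.replicas:
            if replica.alive:
                replica.data[key] = value

    def read(self, key: str) -> tuple[str, object]:
        """Scale reads by rotating across live replicas (source as a fallback)."""
        live = [r for r in self.replicas if r.alive] or [self.source]
        node = live[self._reads % len(live)]
        self._reads += 1
        return node.name, node.data.get(key)


cluster = Cluster(Node("a"), [Node("b"), Node("c")])
cluster.write("k", 1)
print(cluster.read("k"))    # ('b', 1)
print(cluster.read("k"))    # ('c', 1)

cluster.source.alive = False  # simulate the source failing
cluster.write("k", 2)         # triggers failover before the write
print(cluster.source.name)    # b
```

A production system would also need failure detection, protection against two nodes both acting as the source, and a way to resynchronize the old source when it returns; none of that is modeled here.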
Note: We’ve intentionally excluded “backup” as a benefit of data replication because you don’t update a backup with changes. A key point to understand is that data replication carries application-level corruption or deletion of data over to the replicas, whereas a backup taken before the problem occurred is unaffected. Backups should not be considered a replacement for HA or DR, nor should HA or DR replace backups.
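The difference between a replica and a backup can be shown with a small Python sketch. It assumes in-memory dictionaries stand in for the source, the replica, and a point-in-time backup; `apply_and_replicate` is a hypothetical helper, and the accidental delete stands in for an application-level bug.

```python
import copy

source = {"order:1": {"total": 40}, "order:2": {"total": 15}}
replica = copy.deepcopy(source)  # kept up to date with every change
backup = copy.deepcopy(source)   # point-in-time snapshot, never updated


def apply_and_replicate(op: str, key: str, value: object = None) -> None:
    """Apply a change to the source and mirror it to the replica."""
    for store in (source, replica):
        if op == "delete":
            store.pop(key, None)
        else:
            store[key] = copy.deepcopy(value)


# An application bug deletes a record; replication faithfully copies the delete.
apply_and_replicate("delete", "order:1")

print("order:1" in replica)  # False
print("order:1" in backup)   # True

# Recovery has to come from the backup, not from the replica.
apply_and_replicate("put", "order:1", backup["order:1"])
print(source["order:1"])     # {'total': 40}
```

The replica protects against losing the node, while the backup protects against losing the data itself, which is why neither replaces the other.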
Challenges of data replication
With any data replication strategy, you will be forced to make trade-offs between:
- Consistency, availability, and partition tolerance (the CAP theorem)
- Resource usage and cost (RAM, disk, CPU, network)
For example:
- Maintaining multiple replicas increases the safety of your data in the event of outages but results in higher resource usage and cost. The same goes for scaling reads with multiple replicas.
- Synchronous replication can cause the source writes to be slower (or fail altogether). However, asynchronous replication may result in higher resource usage if you write records faster than you can replicate them, because unreplicated changes queue up. Asynchronous replication also introduces the potential for discrepancies between source and destination data no matter how fast or reliable your replication technology is.
- Incremental, filtered, and transformed replication may result in lower network resource usage, but these types of replication tend to run more slowly and use more CPU on the source, which has to compute the differences, apply the filters, or perform the transformations.
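The synchronous versus asynchronous trade-off described above can be sketched as follows. This is a simplified Python model, not a real replication protocol: `Primary`, `Replica`, and `ReplicaUnavailable` are hypothetical names, and an in-memory queue stands in for the backlog of changes waiting to be replicated.

```python
from collections import deque
from typing import Optional


class ReplicaUnavailable(Exception):
    pass


class Replica:
    def __init__(self) -> None:
        self.data: dict[str, object] = {}
        self.available = True

    def apply(self, key: str, value: object) -> None:
        if not self.available:
            raise ReplicaUnavailable("replica is down")
        self.data[key] = value


class Primary:
    def __init__(self, replica: Replica) -> None:
        self.data: dict[str, object] = {}
        self.replica = replica
        self.backlog: deque = deque()  # changes not yet replicated

    def write_sync(self, key: str, value: object) -> None:
        """Succeeds only if the replica also accepts the write."""
        self.replica.apply(key, value)  # may raise, leaving the primary unchanged
        self.data[key] = value

    def write_async(self, key: str, value: object) -> None:
        """Returns as soon as the primary has the write; the replica lags."""
        self.data[key] = value
        self.backlog.append((key, value))

    def drain(self, max_items: Optional[int] = None) -> int:
        """Replicate queued changes in order; returns how many were sent."""
        sent = 0
        while self.backlog and (max_items is None or sent < max_items):
            key, value = self.backlog[0]
            self.replica.apply(key, value)
            self.backlog.popleft()
            sent += 1
        return sent


replica = Replica()
primary = Primary(replica)

for i in range(3):
    primary.write_async(f"k{i}", i)
print(len(primary.backlog), "k0" in replica.data)  # 3 False

primary.drain(max_items=2)
print(len(primary.backlog))                        # 1

replica.available = False
try:
    primary.write_sync("b", 1)
except ReplicaUnavailable:
    print("sync write failed; primary has 'b':", "b" in primary.data)
    # sync write failed; primary has 'b': False
```

The backlog is where the extra resource usage of asynchronous replication shows up: if writes arrive faster than `drain` can send them, the queue grows, and the replica falls further behind the primary.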
The best practices for data replication vary greatly depending on your use case requirements and the capabilities of the technologies you are using.
Data replication in RDBMS
Every database technology provides different capabilities and options for data replication. Historically, relational database management systems (RDBMS) replicate data from one database instance to another through log shipping, which involves sending the transaction log from one database instance to another after the changes are written to disk. Newer databases, such as the Couchbase NoSQL database, replicate data directly from RAM, which can significantly increase speed and reliability.
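As a rough illustration of the log shipping idea, the Python sketch below keeps an append-only change log on the source and replays it on a standby. It is a toy model under stated assumptions, not how any specific RDBMS implements log shipping: `SourceDB`, `StandbyDB`, and `ship_log` are hypothetical names, the in-memory list stands in for a transaction log on disk, and the sequence number (`lsn`) is a simple counter starting at 1.

```python
class SourceDB:
    def __init__(self) -> None:
        self.data: dict[str, object] = {}
        self.log: list[tuple[int, str, object]] = []  # (lsn, key, value)

    def write(self, key: str, value: object) -> None:
        """Apply the change and record it in the log with the next sequence number."""
        self.data[key] = value
        self.log.append((len(self.log) + 1, key, value))


class StandbyDB:
    def __init__(self) -> None:
        self.data: dict[str, object] = {}
        self.applied_lsn = 0  # highest sequence number replayed so far

    def replay(self, entries: list[tuple[int, str, object]]) -> None:
        """Replay log entries in order, skipping any already applied."""
        for lsn, key, value in entries:
            if lsn <= self.applied_lsn:
                continue
            self.data[key] = value
            self.applied_lsn = lsn


def ship_log(source: SourceDB, standby: StandbyDB) -> int:
    """Send every log entry the standby has not applied yet; return the count."""
    pending = source.log[standby.applied_lsn:]
    standby.replay(pending)
    return len(pending)


source = SourceDB()
standby = StandbyDB()

source.write("a", 1)
source.write("b", 2)
print(ship_log(source, standby))  # 2

source.write("a", 3)
print(standby.data["a"])          # 1  (standby lags until the next shipment)
print(ship_log(source, standby))  # 1
print(standby.data == source.data)  # True
```

Between shipments the standby is behind the source, which is the lag that shipping changes after they are logged implies in this model.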
Data replication is a core concept underpinning many different database capabilities, and replication comes in many different forms with different challenges and advantages. The best data replication choice for your organization depends on what you need to achieve and what technologies you’re using.
Use these resources to learn more about data replication and Couchbase’s data replication capabilities:
Guide to Cloud Data Replication
Data Replication: Advantages & Disadvantages
Data Replication and Synchronization in Couchbase
Cross Data Center Replication (Couchbase Capella™)
Cross Data Center Replication (Couchbase Server)
Capella App Services (BaaS)
Check out our database hub to learn about other key concepts of data management.