Questions around server/p2p replication with 1M+ documents

I’m testing CBL as the data storage layer for a .NET desktop application that will be used on multiple machines at a business location for handling customer data. The data must also be available on a remote server for access by remote machines. CBL appears to be a perfect fit: each local client keeps its own copy of the database, which replicates to the remote server for access by other clients. In addition, because internet access is not guaranteed to be available 100% of the time at the business location, replication between the desktops would allow them to keep working until internet access is restored.

So, the architecture would be:

  • P2P replication (pull-only) between desktop clients
  • Client to server replication (push/pull) for the remote server (CouchDB)
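
For context, here is roughly how I plan to wire this up. This is only a sketch against the CBL .NET 1.x API as I understand it; the database name, peer addresses, ports and URLs are placeholders, and the listener setup is my assumption of how the P2P ‘server’ side gets exposed on each desktop:

    using System;
    using Couchbase.Lite;
    using Couchbase.Lite.Listener.Tcp;

    static class ReplicationSetup
    {
        public static void Start()
        {
            var manager = Manager.SharedInstance;
            var db = manager.GetDatabase("customers");   // placeholder name

            // Expose this desktop's copy to the other desktops (P2P 'server' side).
            var listener = new CouchbaseLiteTcpListener(manager, 59840);
            listener.Start();

            // Continuous pull-only replication from each peer desktop (placeholder addresses).
            foreach (var peer in new[] { "192.168.1.10", "192.168.1.11" })
            {
                var p2pPull = db.CreatePullReplication(new Uri($"http://{peer}:59840/customers"));
                p2pPull.Continuous = true;
                p2pPull.Start();
            }

            // Continuous push/pull replication with the remote CouchDB server (placeholder URL).
            var remote = new Uri("http://remote.example.com:5984/customers");
            var push = db.CreatePushReplication(remote);
            var pull = db.CreatePullReplication(remote);
            push.Continuous = true;
            pull.Continuous = true;
            push.Start();
            pull.Start();
        }
    }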

I’ve been testing with 1.7 million documents to simulate about 10 years’ worth of customer data (an extreme example, but if it works at this scale, I know it will work under more realistic volumes), and have run into some issues:

  1. After the initial pull replication from the server, unless a document is saved before the application is restarted, push replication to the server will not have its LastSequence set and will therefore attempt to push all 1.7 million revisions back up to CouchDB.

  2. WebSocket P2P Pull replication will crash with an OutOfMemory exception because the socket connection to /_changes/?feed=websocket attempts to return too many documents.

  3. Polling P2P Pull replication will crash with an OutOfMemory exception because the initial request tries to retrieve all documents from revision 0, which returns too many documents.

Issue #1 can be worked around, because it’s fairly simple to perform a document save after the initial retrieval. Issues #2 and #3 are more difficult, because it appears that each replicator maintains a separate LastSequence and therefore does not take into account the revisions that have already been retrieved by a different replicator.
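
For reference, that workaround would look something like the sketch below (event and property names per the CBL .NET 1.x replication API as I understand it; the marker document ID and property are arbitrary, the point is just that a local save happens after the initial pull completes):

    using System;
    using System.Collections.Generic;
    using Couchbase.Lite;

    static class CheckpointWorkaround
    {
        // 'db' and 'pull' are the database and the initial pull replication
        // from the setup sketch in my earlier description.
        public static void ForceSaveAfterInitialPull(Database db, Replication pull)
        {
            pull.Changed += (sender, e) =>
            {
                var r = e.Source;
                var finished = r.Status == ReplicationStatus.Idle &&
                               r.ChangesCount > 0 &&
                               r.CompletedChangesCount == r.ChangesCount;
                if (!finished)
                    return;

                // Any document save will do; this one just records a timestamp.
                db.GetDocument("sync-marker").Update(rev =>
                {
                    rev.SetUserProperties(new Dictionary<string, object>
                    {
                        { "lastInitialPull", DateTime.UtcNow.ToString("o") }
                    });
                    return true;
                });
            };
        }
    }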

Am I using the wrong tool/is there a better way to configure CBL, or can these issues be resolved?

Before I attempt to come up with a response to this, I’d like to clear a few things up. There is no such thing as WebSocket P2P replication (the P2P ‘server’ side does not support it). Are you perhaps talking about continuous replication vs. one-shot? I test with a large data set: a DB containing about 1,000,000 revisions at a size of about 700 MB, and I’ve never observed out-of-memory issues. However, that is for operations against Sync Gateway, which leads me to my next question: which side is the one crashing? The ‘client’ side of the P2P (the receiver of the pull) or the ‘server’ side (the sender of the pull)?

Also, you are correct that continuous and non-continuous replication store their checkpoints differently. However, it’s not as bad as actually pushing up revisions that aren’t needed: part of the replication algorithm is agreeing on which revisions are needed before any are sent, and that negotiation is a lot faster than the actual upload or download.

One suggestion would be to preseed the DBs on the client. That means you would bundle a snapshot of your data along with the app, and replication can start from that point instead of from zero (I assume the ’10 years of data’ is not going to change anymore?). That will save you a lot of time too, since you won’t have to spend time going through the motions of safely syncing data you already know you need from the beginning.
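
Concretely, pre-seeding just means getting a snapshot of the database files into place before the database is opened for the first time. Very roughly, something like the sketch below (the bundled snapshot location, the ‘customers.cblite2’ folder name, and the assumption that databases live directly under the Manager’s directory are all guesses about your packaging and CBL version):

    using System.IO;
    using Couchbase.Lite;

    static class Preseed
    {
        // Copy a bundled snapshot into place before the database is first opened.
        // The 'customers.cblite2' folder name and the overall layout are
        // assumptions; adjust for the actual storage format in use.
        public static void InstallSnapshotIfNeeded(string bundledSnapshotDir)
        {
            var target = Path.Combine(Manager.SharedInstance.Directory, "customers.cblite2");
            if (Directory.Exists(target))
                return; // already installed (or already synced)

            Directory.CreateDirectory(target);
            foreach (var file in Directory.GetFiles(bundledSnapshotDir))
            {
                File.Copy(file, Path.Combine(target, Path.GetFileName(file)));
            }
        }
    }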

Thanks for the fast response. For the P2P replication, yes, I’m talking about continuous pull replication (sorry, I should have made that clear in the original post). In my tests, the ‘server’ side of the pull replication is crashing when it attempts to JSON-encode the entire list of changes to send back to the ‘client’. The ‘client’ side of a pull replication can definitely handle this volume of documents, because pulling 1.7M documents from CouchDB works perfectly fine. It’s just the P2P replication that crashes.

Pre-seeding the database won’t help me for two reasons:

  1. I am already syncing the database from CouchDB with no issues - the “remote server” replication that I described in my first post. This uses a standard pull replicator and downloads the entire document set (1.7M documents, 900MB+) in 10-15 mins.

  2. The P2P replicators will still try to pull all revisions because their unique ID will be different for every installation, so LastSequence will be 0.

To solve the P2P replication problem, I think I need some way to initialize the P2P replicators so that they do not try to pull the entire revision list from each other, but only changes that have not been retrieved from CouchDB.

And if I have a way to do this, then I can probably also solve the first issue with push replication, because that replicator could be initialized to the current sequence so it won’t try to send all the revisions back to CouchDB.

Yeah, I see the problem, but there is no easy solution to this other than the obvious one of fixing the P2P server-side crash. The replication process has no way of knowing that these two replications are the same (the client side does, but the server side will force the reset to zero as a safety measure). I’ll file a ticket and try to reproduce this. This is my last working day before my winter break, but if I can reproduce and fix this soon, it will make it into our 1.4 release.

I just gave P2P replication a shot on my OS X machine and it had no problem retrieving 300k changes (that’s the largest set I have at the moment, but I have a program that can generate arbitrarily large random sets). Perhaps the data set needs to be bigger, but could you visit this ticket and fill in some details about the crash (and possibly the amount of memory your system has)? I already have an idea of where the problem is, but it would be good to have confirmation.
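
For reference, the generator is nothing sophisticated; it’s roughly along these lines (a sketch only; the document count, property names and batch size are arbitrary):

    using System;
    using System.Collections.Generic;
    using Couchbase.Lite;

    static class TestDataGenerator
    {
        // Create a large number of small random documents, committing in batches
        // so each transaction stays a reasonable size.
        public static void Generate(Database db, int totalDocs = 300000, int batchSize = 1000)
        {
            var rng = new Random();
            for (int start = 0; start < totalDocs; start += batchSize)
            {
                int count = Math.Min(batchSize, totalDocs - start);
                db.RunInTransaction(() =>
                {
                    for (int i = 0; i < count; i++)
                    {
                        var doc = db.CreateDocument();
                        doc.PutProperties(new Dictionary<string, object>
                        {
                            { "type", "test" },
                            { "index", start + i },
                            { "payload", rng.Next().ToString("x8") }
                        });
                    }
                    return true; // commit the batch
                });
            }
        }
    }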

As requested, I have added the information about the crash to the ticket in GitHub.

Fixing the OOM error will be great, but the P2P replicators will still pull the entire change history from each desktop in the network, which for 10 desktops is quite a lot of traffic. Is there no way for the pull replicator to fetch just the last 100 or so changes and check whether they already exist locally, before doing a pull for all documents?

It won’t pull the documents themselves if it already has them; it only pulls their IDs and revision IDs.

Just checking the latest changes won’t work because there is no consistent ordering of changes in a distributed system like this.