Details
Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.8.1-release-candidate
Fix Version/s: 1.8.1
Component/s: couchbase-bucket, ns_server
Security Level: Public
Labels: None
Description
While fixing MB-5052, its logs revealed that one of the vbuckets we had previously built using the new code did not actually have a closed checkpoint, so the replica was not in fact built.
After discussing this with Chiyoung, I found that using 'backfill_completed' to determine when a replica is mostly up to date is not correct. This stat becomes true when backfill is done, but a further message follows that opens the next checkpoint, and that is what we should be waiting for. We also found that there are no producer-side stats we can use for this.
So we decided that I will additionally have to poll the destinations for actually closed checkpoints before I stop replica building.
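To illustrate the polling idea, here is a rough Python sketch (not the actual ns_server code, which is Erlang). Everything named here is hypothetical: `wait_for_closed_checkpoints`, the `read_closed_checkpoint_id` callback, and the assumption that a closed checkpoint id greater than zero means the destination has a closed checkpoint for the vbucket. How that id would really be read from a destination node is deliberately left to the caller.

```python
import time
from typing import Callable, Iterable


def wait_for_closed_checkpoints(
    destinations: Iterable[str],
    read_closed_checkpoint_id: Callable[[str], int],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll each destination until all report a closed checkpoint.

    read_closed_checkpoint_id is a hypothetical callback that returns the
    id of the last closed checkpoint on a destination. This sketch ASSUMES
    that a value of 0 (or less) means "no closed checkpoint yet".

    Returns True once every destination has a closed checkpoint, or False
    if the timeout expires first.
    """
    pending = set(destinations)
    deadline = clock() + timeout_s
    while True:
        # Keep only the destinations that still lack a closed checkpoint.
        pending = {d for d in pending if read_closed_checkpoint_id(d) <= 0}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Only when this returns True would the caller stop replica building; on False it would presumably keep the replication running or report an error.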
I've found one potential cause of the issue occurring in practice inside ep-engine and uploaded a patch: http://review.couchbase.org/15119
I think that with that patch we should not see this problem in practice.
Hopefully for 1.8.2 we'll have some native ep-engine support for replica building, such as a new tap option similar to tap_dump.
I've stashed my ugly attempt to work around this issue on the ns_server side, hoping we won't need it, but testing is needed to confirm that.