When replicating documents from our CB to ES servers (all hosted on AWS) using the transport plugin, quite a few errors show up in CB’s goxdcr.log file. Most of them return very few results on Google, so I was hoping someone here might know a bit more. Some examples of the messages in the log are:
CheckpointManager 2017-05-09T02:06:28.105Z [ERROR] MassValidateVBUUID failed, err=Can't find 'mismatched' in response
ReplicationSpecService 2017-05-09T02:06:31.531Z [INFO] result of remote bucket call: remote_connStr=<ES server address>, targetBucketUUID=, err_target=Failed on calling host=<ES server address>, path=/pools/default/b/<ES index>, err=<nil>, statusCode=404
ReplicationSpecService 2017-05-09T02:06:31.532Z [INFO] Received error Failed on calling host=<ES server address>, path=/pools/default/b/<ES index>, err=<nil>, statusCode=404 when validating target bucket <ES index> for spec 256a621647f13ea7afd69a31cb183e10/<bucket>/<ES index>. Skipping target bucket validation. remote_connStr=<ES server address>, remote_userName=<user>
CapiNozzle 2017-05-09T02:09:51.728Z [ERROR] capi_256a621647f13ea7afd69a31cb183e10/<CB bucket>/<ES index>_<ES server address>_0 Error reading response. vb=169, err=unexpected EOF
CapiNozzle 2017-05-09T02:09:51.728Z [ERROR] capi_256a621647f13ea7afd69a31cb183e10/<CB bucket>/<ES index>_<ES server address>_0 batchUpdateDocs for vb 169 failed with err Error reading response. vb=169, err=unexpected EOF
CapiNozzle 2017-05-08T03:50:40.756Z [ERROR] Received error when writing boby part. err=write tcp <CB server address>-><ES server address>: write: broken pipe
CapiNozzle 2017-05-08T03:50:40.757Z [ERROR] capi_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index>_<ES server address>_1 batchUpdateDocs for vb 315 failed with err write tcp <CB server address>-><ES server address>: write: broken pipe.
CapiNozzle 2017-05-08T03:50:40.757Z [ERROR] capi_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index>_<ES server address>_1 error updating docs on target. err=batch update docs failed for vb 315 after 6 retries
CapiNozzle 2017-05-08T03:50:40.758Z [ERROR] capi_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index>_<ES server address>_1 raise error condition batch update docs failed for vb 315 after 6 retries
GenericSupervisor 2017-05-08T03:50:40.758Z [ERROR] Received error report : map[capi_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index>_<ES server address>_1:batch update docs failed for vb 315 after 6 retries]
ReplicationManager 2017-05-08T03:50:40.758Z [INFO] Supervisor PipelineSupervisor_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index> of type *supervisor.GenericSupervisor reported errors map[capi_3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index>_<ES server address>_1:batch update docs failed for vb 315 after 6 retries]
PipelineManager 2017-05-08T03:50:42.365Z [INFO] Pipeline updater 3b057ef39c773498bc80e763d779824c/<CB bucket>/<ES index> is lauched with retry_interval=10
xdcr_errors.log remains empty and no errors show up in the logs on the ES side.
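In case it helps anyone compare notes, here is a rough sketch of how the [ERROR] lines in goxdcr.log could be tallied per component. It assumes every line follows the `<Component> <timestamp> [<LEVEL>] <message>` shape seen in the samples above; `LINE_RE` and `tally_errors` are names I made up for illustration, not part of any Couchbase tooling.

```python
import re
import sys
from collections import Counter

# Assumed line shape, inferred only from the samples above:
#   <Component> <ISO timestamp> [<LEVEL>] <message>
LINE_RE = re.compile(
    r"^(?P<component>\S+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\S+)\s+"
    r"\[(?P<level>[A-Z]+)\]\s+"
    r"(?P<msg>.*)$"
)


def tally_errors(lines):
    """Count [ERROR] lines per component; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            counts[match.group("component")] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally_goxdcr.py /path/to/goxdcr.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for component, count in tally_errors(log_file).most_common():
            print(f"{component}: {count}")
```

Run against the excerpt above, this would attribute most of the errors to CapiNozzle, which is the part that talks to the ES side.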
These errors have been happening for a while now on CB 4.5 and ES 2.4.3. I’ve since upgraded to CB 4.6 and ES 5.3.0 to see if that would help, but it doesn’t seem to have made any difference. I’ve tried most of the related fixes I could find: increasing threadpool.bulk.queue_size (the rejected count stays at 0 in all queues), increasing http.max_content_size, setting index.refresh_interval to -1, setting index.translog.durability to async, and decreasing the XDCR nozzle count. None of them had any noticeable impact. The document replication rate starts at 3000/s and then drops to around 500/s when the errors start happening.
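For reference, the two index-level settings mentioned above (index.refresh_interval and index.translog.durability) go through the update index settings API (PUT on `/<index>/_settings`). Below is a minimal Python sketch of the kind of request I mean. It only builds the request and does not send anything; the function names are hypothetical, and port 9200 is assumed because it is the default ES HTTP port.

```python
import json


def index_settings_body(refresh_interval="-1", durability="async"):
    """Build the JSON body for the two per-index settings discussed above."""
    return {
        "index": {
            "refresh_interval": refresh_interval,  # "-1" disables periodic refresh
            "translog": {"durability": durability},  # "async" relaxes per-request fsync
        }
    }


def settings_request(host, index, body, port=9200):
    """Return (method, url, payload) for the update index settings call.

    Assumption: plain HTTP on the default port 9200; adjust for your cluster.
    """
    url = f"http://{host}:{port}/{index}/_settings"
    return "PUT", url, json.dumps(body)


if __name__ == "__main__":
    # Placeholder host and index names, not real values from my setup.
    method, url, payload = settings_request(
        "es.example.internal", "my_index", index_settings_body()
    )
    print(method, url)
    print(payload)
```

The other two settings (threadpool.bulk.queue_size and http.max_content_size) are node-level and were changed in the ES configuration instead, so they are not part of this sketch.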
Any clues on why this is happening would be appreciated!