[MB-6763] XDCR Error Logging Created: 27/Sep/12  Updated: 10/Jan/13  Resolved: 10/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
* Placeholder for adding error logging on the Source XDCR cluster.
- Will add more as we come across more use-cases/scenarios.

Replication failures due to the following reasons should be logged. This will help support and end users troubleshoot errors.
* Today most replication failures are debugged using ns_server logs; we should display these errors on the UI as well.

Errors
-----------------
- Source replication cluster reference cannot be deleted if a replication is set up (another bug tracks this: MB-6843)
- Source replication is deleted - issue user-visible log message saying "Replication has been deleted" on source cluster
- Source bucket is deleted - (1) Issue a warning to the user on bucket delete that replication is in progress. (2) Issue a user-visible log message saying "Bucket has been deleted, replication to remote bucket has stopped" on the source cluster. (3) Show an error message on the XDCR page for the impacted replication saying "Bucket has been deleted, XDCR has stopped".
- Source bucket is flushed - Cannot be done.


For errors on the destination node and for replication timeouts: raise the error on the source node.




 Comments   
Comment by Abhinav Dangeti [ 28/Sep/12 ]
fyi, .. http://www.couchbase.com/issues/browse/MB-5611
Comment by Dipti Borkar [ 02/Oct/12 ]
Pasting useful messages from MB-5611 that we need to provide better errors for:

Some common XDCR failure reasons:

1 db_not_found error: when a node is unresponsive, e.g.:
"could not open http://Administrator:*****@10.3.3.28:8092/default%2f120%3b093c0a978eb59342ea52d87eae424bb3/"

2 badmatch,{error,corrupted_data}, Erlang-related corruption
    [{couch_compress,decompress,1},
     {couch_doc,with_uncompressed_body,1},
     {couch_doc,to_json_base64,1},
     {xdc_vbucket_rep_worker,maybe_flush_docs,3},
     {lists,foldl,3},
     {xdc_vbucket_rep_worker,local_process_batch,5},
     {xdc_vbucket_rep_worker,queue_fetch_loop,4}]

3 checkpoint_commit_failure
     {bad_return_value,
       {checkpoint_commit_failure,
           <<"Failure on target commit: {error,<<\"not_found\">>}">>}}

4 http_request_failed
xdc_replicator:handle_info:282] Worker <0.11173.72> died with reason: {http_request_failed,"POST",
                                       "http://10.3.121.33:8092/default%2F684/_bulk_docs",
                                       {error,{code,500}}}

  Replicator: couldn't write document
xdc_replicator_worker:flush_docs:111] Replicator: couldn't
write document ``, revision ``,
to target database `http://10.3.121.33:8092/default%2F683/`. Error: ``, reason: ``.

5 replicator_died
{replicator_died, {'EXIT',<15849.2212.0>, {badmatch,{error,closed}}}}

6 bulk_set_vbucket_state_failed
General error seen when rebalance fails, possibly because the vbucket map is not ready;
this may cause replication to fail.
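
To illustrate how the raw failure reasons above could be turned into clearer user-facing errors, here is a minimal Python sketch. It is not ns_server or XDCR code: the function name `describe_error`, the substring matching, and the message wording are all assumptions made for illustration; only the category names come from the list above.

```python
# Hypothetical sketch: map raw XDCR error text to a friendlier message.
# Category markers are taken from this ticket; the wording is assumed.
FRIENDLY_MESSAGES = [
    ("db_not_found", "Remote database could not be opened; the node may be unresponsive."),
    ("corrupted_data", "A document could not be decompressed (corrupted data)."),
    ("checkpoint_commit_failure", "Checkpoint could not be committed on the target cluster."),
    ("http_request_failed", "An HTTP request to the target cluster failed."),
    ("couldn't write document", "A document could not be written to the target database."),
    ("replicator_died", "A replicator process exited unexpectedly."),
    ("bulk_set_vbucket_state_failed", "vBucket state could not be set, possibly during a failed rebalance."),
]


def describe_error(raw: str) -> str:
    """Return the first friendly message whose marker appears in the raw text."""
    for marker, message in FRIENDLY_MESSAGES:
        if marker in raw:
            return message
    # Fall back to the raw text so that no information is lost.
    return "Unknown replication error: " + raw
```

For example, `describe_error("{replicator_died, {'EXIT',<15849.2212.0>, {badmatch,{error,closed}}}}")` would return the "replicator process exited" message, while unrecognized text is passed through unchanged behind a generic prefix.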
Comment by Junyi Xie (Inactive) [ 08/Oct/12 ]
Hi Dipti,

Thanks for organizing the meeting. XDCR already has an API to expose errors to ns_server. The API is in the XDCR replication manager (xdc_rep_manager:latest_errors()); when called, it returns the last 10 errors for each bucket that is actively replicating. Alk will expose these errors (or at least some of them) on the UI. Alk will also determine where to expose these messages.
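
The "last 10 errors per bucket" behaviour can be pictured as a bounded history per bucket. The Python sketch below is only an illustration of that idea, not the actual xdc_rep_manager implementation (which is Erlang); the class name `ErrorHistory` and the `record` method are hypothetical.

```python
from collections import defaultdict, deque

MAX_ERRORS_PER_BUCKET = 10  # "last 10 errors" as described in the comment above


class ErrorHistory:
    """Hypothetical per-bucket error log that keeps only the newest entries."""

    def __init__(self, limit: int = MAX_ERRORS_PER_BUCKET):
        # A deque with maxlen silently drops the oldest item when it is full.
        self._errors = defaultdict(lambda: deque(maxlen=limit))

    def record(self, bucket: str, error: str) -> None:
        self._errors[bucket].append(error)

    def latest_errors(self) -> dict:
        # Return plain lists, oldest first, one entry per bucket with errors.
        return {bucket: list(errors) for bucket, errors in self._errors.items()}
```

With this sketch, recording twelve errors for one bucket leaves only the ten most recent in `latest_errors()`, which is the kind of result a UI could render directly.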

If users find these error messages hard to understand, we can change them later to make them more user-friendly. For now, we just need to ask Alk to expose them on the UI. Let me know if you have any other questions.


Thanks!

Junyi
Comment by Junyi Xie (Inactive) [ 08/Oct/12 ]
Nothing to do within XDCR. All ns_server work.
Comment by Aleksey Kondratenko [ 10/Oct/12 ]
Fixed & approved but still sits in gerrit: http://review.couchbase.org/#/c/21459/

it is naive, but so are 'errors' from xdcr
Generated at Thu Oct 23 00:42:12 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.