[MB-6934] Displaying XDCR Replication error messages/warnings. Created: 16/Oct/12  Updated: 10/Jan/13  Resolved: 23/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: cross-datacenter-replication, UI
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ketaki Gangal Assignee: Junyi Xie (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 2.0-1856

Attachments: PNG File Screen Shot 2012-10-16 at 4.24.53 PM.png     PNG File Screen Shot 2012-10-16 at 4.24.58 PM.png    

 Description   

Hi,

With the new error logging code, we now display the "recent 10 errors". I have added a screenshot at the end of this email.

At any point, the last 10 errors are displayed on the replication. These 10 errors may or may not still be valid, depending on the current time.

This issue needs to be addressed at two levels -
1. Level of error logging - Currently too much information is displayed, which also gives a misleading idea of the state of replication.
2. Classification of errors vs. warnings.

Having lower-level information in the ns_logs can help with troubleshooting, but having all of that information on the web console might just confuse and overwhelm the end user, IMO.


XDCR can have an error at any of the following levels:
- xdc vbucket replicators - timing out, checkpoint failures, db_not_found
- xdc replication manager
- ns_server level - where it is unable to talk to the other remote cluster and so on.

With some recent trials on the new code, we see a lot of errors at the level of the vbucket replicators, say vbucket XXX commit_checkpoint_failure.
But the replication is continuing as expected. Replication has not failed; it carries on despite the checkpoint failure above.

It might be nicer to classify these as errors vs. warnings.

Errors - When XDCR has finally stopped working and no more data is being sent over to the destination.
Should replication be attempted X number of times and then finally given up?

Warnings - When there are timeouts, but the situation is recoverable.
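
The split proposed above could be sketched roughly as follows. This is only an illustration in Python, not XDCR code: the `ReplicationEvent` type, the `classify` function and the `max_attempts` parameter (standing in for the unspecified "X number of times") are all hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"


@dataclass(frozen=True)
class ReplicationEvent:
    """Hypothetical record of one replicator problem (not an XDCR type)."""
    kind: str      # e.g. "timeout" or "commit_checkpoint_failure"
    attempts: int  # how many times the operation has been tried so far


def classify(event, max_attempts):
    """Treat an event as an ERROR only once retries are exhausted.

    max_attempts is an assumed, configurable limit; the ticket leaves
    the actual number of attempts open.
    """
    if event.attempts >= max_attempts:
        return Severity.ERROR
    # Still retrying, so the situation is considered recoverable.
    return Severity.WARNING
```

With a rule like this, a checkpoint failure on its first attempt would surface as a warning, and only a replication that has used up all of its attempts would be shown as an error.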

-Ketaki

Screenshot


 Comments   
Comment by Ketaki Gangal [ 16/Oct/12 ]
Comments from Product Mgmt
Hi Junyi,

Is there a log level for the XDCR error messages?
Are the last 10 errors the only errors tracked?
Do these include info and warning messages or only errors in this list?
Do we clean up this error log periodically? (there is no way for ns_server to know if the error is relevant any more)

Aliaksey, as we discussed, at a minimum we need to change the "10 errors" link that appears the first time this message buffer gets populated. It should be a link in aqua blue (like the IP address in the cluster reference) and should say "Recent XDCR log messages".

Junyi, if you can provide more visibility from the replicator side about warnings vs. errors vs. info messages, we can do something better, if not in 2.0 then sometime in the future. But this basic level of error handling doesn't give users enough visibility into what is going on.
Comment by Junyi Xie (Inactive) [ 16/Oct/12 ]
Alk,

The change within XDCR is at

http://review.couchbase.org/#/c/21694/

Now the error returned to ns_server is a pair {Time, ErrorString} instead of a string.

Please go ahead and modify UI code accordingly. Thanks.
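
For illustration only, a buffer of the last 10 errors stored as (time, message) pairs might look like the Python sketch below. The `RecentErrors` class and its methods are hypothetical and are not taken from the XDCR or ns_server code; the size of 10 mirrors the "recent 10 errors" mentioned in the description.

```python
import time
from collections import deque

MAX_RECENT = 10  # the ticket describes showing the 10 most recent errors


class RecentErrors:
    """Hypothetical bounded buffer of (timestamp, message) pairs."""

    def __init__(self, maxlen=MAX_RECENT):
        # A deque with maxlen drops the oldest entry once it is full.
        self._entries = deque(maxlen=maxlen)

    def record(self, message, now=None):
        # Store the time alongside the message, rather than a bare string.
        timestamp = time.time() if now is None else now
        self._entries.append((timestamp, message))

    def entries(self):
        # Oldest first, newest last.
        return list(self._entries)
```

Keeping the timestamp next to each message is what lets a consumer such as the UI decide later whether an entry is still worth showing.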
Comment by Aleksey Kondratenko [ 23/Oct/12 ]
The commit to filter out errors that are too old is in Gerrit. I've also implemented Dipti's proposal to display the errors link in normal color rather than red.
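
As a rough illustration of filtering by age (not the actual commit), the Python sketch below assumes errors arrive as (time, message) pairs, with time in seconds. The `filter_recent` function and the `max_age_seconds` cutoff are hypothetical; the ticket does not state what age counts as too old.

```python
def filter_recent(entries, now, max_age_seconds):
    """Keep only (timestamp, message) pairs newer than the cutoff.

    entries: iterable of (timestamp_in_seconds, message) pairs
    now: current time in seconds
    max_age_seconds: assumed age limit; the real threshold is not given
    """
    cutoff = now - max_age_seconds
    return [(ts, msg) for ts, msg in entries if ts >= cutoff]
```

For example, `filter_recent([(100.0, "a"), (950.0, "b")], now=1000.0, max_age_seconds=60)` keeps only the second pair.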
Comment by Junyi Xie (Inactive) [ 23/Oct/12 ]
All fixes are on gerrit

http://review.couchbase.org/#/c/21694/

http://review.couchbase.org/#/c/21903/

http://review.couchbase.org/#/c/21904/2

Comment by Junyi Xie (Inactive) [ 23/Oct/12 ]
fixes on gerrit
Generated at Tue Sep 23 06:24:17 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.