[MB-6934] Displaying XDCR Replication error messages/warnings. Created: 16/Oct/12 Updated: 10/Jan/13 Resolved: 23/Oct/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | cross-datacenter-replication, UI |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Ketaki Gangal | Assignee: | Junyi Xie |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | 2.0-1856 | ||
| Attachments: |
|
| Description |
|
Hi,

With the new error logging code, we now display the "recent 10 errors". I have added a screenshot at the end of this email.

At any point, the last 10 errors are displayed on the replication. These 10 errors may or may not still be valid, depending on the current time.

This issue needs to be addressed at two levels:
1. Level of error logging. Currently too much information is displayed, which also gives a misleading idea of the state of replication.
2. Classification of errors vs. warnings.

Having lower-level information in the ns_logs can help troubleshooting, but having all of that information on the web console might just confuse and overwhelm the end user, IMO.

XDCR can have an error at any of the following levels:
- xdc vbucket replicators - timing out, checkpoint failures, db_not_found
- xdc replication manager - ns_server level - where it is unable to talk to the other remote cluster, and so on.

In some recent trials on the new code, we see a lot of errors at the level of the vbucket replicators, say vbucket XXX commit_checkpoint_failure. But the replication is continuing as expected. Replication has not failed; it continues despite the checkpoint failure.

It might be nicer to classify errors vs. warnings:
Errors - When XDCR has finally stopped working. No more data is being sent over to the destination. Replication will be attempted X number of times, and is finally given up?
Warnings - When there are timeouts, but it is a recoverable situation.

-Ketaki

Screenshot |
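The error/warning split proposed in the description could be sketched roughly as follows. This is only an illustration in Python, not the XDCR implementation: `ReplicationEvent`, `failed_attempts`, and `max_attempts` are hypothetical names, and the rule "give up after X attempts" is taken from the open question in the report rather than from confirmed behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"  # recoverable: replication is still making progress
    ERROR = "error"      # replication has been given up


@dataclass
class ReplicationEvent:
    message: str          # e.g. "vbucket XXX commit_checkpoint_failure"
    failed_attempts: int  # hypothetical: consecutive failed attempts so far


def classify(event: ReplicationEvent, max_attempts: int) -> Severity:
    """Report ERROR only once the retry budget is exhausted, else WARNING."""
    if event.failed_attempts >= max_attempts:
        return Severity.ERROR
    return Severity.WARNING
```

Under this assumption, a single checkpoint failure or timeout would surface as a warning, and only a replication that has exhausted its retries would be shown as an error.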
| Comments |
| Comment by Ketaki Gangal [ 16/Oct/12 ] |
|
Comments from Product Mgmt
Hi Junyi,

Is there a log level for the XDCR error messages? Are the last 10 errors the only errors tracked? Does this list include info and warning messages, or only errors? Do we clean up this error log periodically? (There is no way for ns_server to know whether an error is still relevant.)

Aliaksey, as we discussed, at a minimum we need to change the "10 errors" link that appears the first time this message buffer gets populated. It should be a link in aqua blue (like the IP address in the cluster reference) and should say "Recent XDCR log messages".

Junyi, if you can provide more visibility from the replicator side about warnings vs. errors vs. info messages, we can do something better, if not in 2.0 then sometime in the future. But this basic level of error handling doesn't give users enough visibility into what is going on. |
| Comment by Junyi Xie [ 16/Oct/12 ] |
|
Alk,
The change within XDCR is at http://review.couchbase.org/#/c/21694/
The error returned to ns_server is now a pair {Time, ErrorString} instead of a string. Please go ahead and modify the UI code accordingly. Thanks. |
| Comment by Aleksey Kondratenko [ 23/Oct/12 ] |
| The commit to filter out errors that are too old is in gerrit. I've also implemented Dipti's proposal to display the errors link in the normal color rather than red. |
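As a rough illustration of how timestamped errors make this kind of filtering possible, the sketch below drops entries older than a cutoff and keeps the most recent ones. It is written in Python for readability and is not the committed code: the function name `recent_errors`, the `max_age_seconds` parameter, and the use of Unix timestamps for the time half of the pair are all assumptions.

```python
import time
from typing import List, Optional, Tuple

# Mirrors the {Time, ErrorString} pair; a Unix timestamp is assumed here.
ErrorEntry = Tuple[float, str]


def recent_errors(
    errors: List[ErrorEntry],
    max_age_seconds: float,
    limit: int = 10,
    now: Optional[float] = None,
) -> List[ErrorEntry]:
    """Return up to `limit` errors no older than `max_age_seconds`, newest first."""
    if now is None:
        now = time.time()
    # Keep only entries that are still within the allowed age.
    fresh = [entry for entry in errors if now - entry[0] <= max_age_seconds]
    # Newest entries first, so the cut to `limit` keeps the latest ones.
    fresh.sort(key=lambda entry: entry[0], reverse=True)
    return fresh[:limit]
```

For example, with `now=1000.0` and `max_age_seconds=60`, an entry stamped `100.0` would be dropped while entries stamped `990.0` and `995.0` would be kept.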
| Comment by Junyi Xie [ 23/Oct/12 ] |
|
All fixes are on gerrit
http://review.couchbase.org/#/c/21694/
http://review.couchbase.org/#/c/21903/
http://review.couchbase.org/#/c/21904/2 |
| Comment by Junyi Xie [ 23/Oct/12 ] |
| fixes on gerrit |