Details
-
Type:
Bug
-
Status:
Open
-
Priority:
Major
-
Resolution: Unresolved
-
Affects Version/s: 1.8.1
-
Fix Version/s: None
-
Component/s: ns_server
-
Security Level: Public
-
Labels:None
Description
We are running 1.8.1-937 Coucbase server on 10 Linux nodes with about 200M items bucket.
About 1.5 month ago we've upgraded servers from 1.7 to 1.8 and started to see following error messages in the log:
killing myself due to unexpected upstream sender exit with reason: {{badmatch,
{error,
closed}},
[{ebucketmigrator_srv,
upstream_sender_loop,
1}]} ebucketmigrator_srv000 ns_1@172.19.4.41 03:52:33 - Thu Sep 20, 2012
In many cases this error comes from same node, but sometimes from different ones.
Since the upgrade we've experienced a failure of the node, which became unresponsive after spitting multiple errors of this kind. But in other cases (with single error appearing once a day or so), there seems to be no impact.
Google search gave me only following references for this error:
http://www.couchbase.com/issues/browse/MB-5171
http://www.couchbase.com/issues/browse/MB-5467
http://www.couchbase.com/issues/secure/ReleaseNote.jspa?projectId=10010&version=10295
It looks like according to the last reference, this error supposed to be fixed in 1.8.1 version.
I also can't find clear information as to whether this is a 'bad' error or not and how to fix and/or prevent it.
About 1.5 month ago we've upgraded servers from 1.7 to 1.8 and started to see following error messages in the log:
killing myself due to unexpected upstream sender exit with reason: {{badmatch,
{error,
closed}},
[{ebucketmigrator_srv,
upstream_sender_loop,
1}]} ebucketmigrator_srv000 ns_1@172.19.4.41 03:52:33 - Thu Sep 20, 2012
In many cases this error comes from same node, but sometimes from different ones.
Since the upgrade we've experienced a failure of the node, which became unresponsive after spitting multiple errors of this kind. But in other cases (with single error appearing once a day or so), there seems to be no impact.
Google search gave me only following references for this error:
http://www.couchbase.com/issues/browse/MB-5171
http://www.couchbase.com/issues/browse/MB-5467
http://www.couchbase.com/issues/secure/ReleaseNote.jspa?projectId=10010&version=10295
It looks like according to the last reference, this error supposed to be fixed in 1.8.1 version.
I also can't find clear information as to whether this is a 'bad' error or not and how to fix and/or prevent it.
One such moment is around Sep 19 18:30 and onwards. There are lots of ebucketmigrator restarts caused mostly by source node (.41) closing it's tap connection. They are rapidly restarted by their supervisor, but fail rapidly again (or sometimes rapidly again), which causes supervisor to give up and fail. Parent supervisor restarts replication supervisor and janitor sets up replications again in about 10 seconds, only to start this 'loop' again.
I see that apparently eventually replicas fall back far enough behind so that tap cannot reuse checkpoints anymore. It seems to start at around this:
[rebalance:info] [2012-09-19 20:56:53] [ns_1@172.19.4.41:<0.26205.227>:ebucketmigrator_srv:init:252] Starting tap stream:
[{vbuckets,[17,320,321,322,323,932,933,934,935,936,937,938]},
{checkpoints,[{17,32722},
{320,32508},
{321,32523},
{322,32566},
{323,32393},
{932,32455},
{933,32368},
{934,32570},
{935,32578},
{936,32505},
{937,32580},
{938,32547}]},
{name,"replication_ns_1@172.19.4.44"},
{takeover,false}]
{{"172.19.4.41",11209},
{"172.19.4.44",11209},
[{username,"default"},
{password,[]},
{vbuckets,[17,320,321,322,323,932,933,934,935,936,937,938]},
{takeover,false},
{suffix,"ns_1@172.19.4.44"}]}
[rebalance:info] [2012-09-19 20:56:53] [ns_1@172.19.4.41:<0.26205.227>:ebucketmigrator_srv:process_upstream:465] Initial stream for vbucket 320
And soon after that we see backfills for many more vbuckets for this and other different tap connections too.
I see last reached_max_restart_intensity at 22:33 Oct 19. After that I'm seeing period of massive stats archiver timeouts. Perhaps indicating there's lots of disk contention.
At around 3 AM next night (Oct 20) box was apparently restarted (or just couchbase server, doesn't matter).