Details
-
Type:
Bug
-
Status:
Open
-
Priority:
Critical
-
Resolution: Unresolved
-
Affects Version/s: 2.0.2
-
Fix Version/s: 2.1
-
Component/s: couchbase-bucket
-
Security Level: Public
-
Labels:
-
Environment:SLES11 SP2, Xen virtual machines with a single processor and 1.5GB RAM
Description
Servers M1 and M2 are each running couchbase (C1 and C2 respectively, both added to the same cluster via the web interface) and an application (A1 and A2 respectively). A1 is configured to primarily access C2 and to fail over to C1 if C2 becomes unavailable. A2 is similarly configured, only accessing C1 primarily and failing over to C2.
When accessing A1, Couchbase reports that the value is stored on C2 and has replicated to C1, howerver while accessing A2, Couchbase reports that the value is stored on C1 and has not replicated to C2. Additionally, when attempting to manually rebalance the cluster, the rebalance fails (the log output is the attached image). After consulting the IRC channel, user ingenthr suggested I file this as an issue.
I've also attached the output of cbcollect_info in the hopes that it might contain something useful. If you need more information, do not hesitate to ask.
When accessing A1, Couchbase reports that the value is stored on C2 and has replicated to C1, howerver while accessing A2, Couchbase reports that the value is stored on C1 and has not replicated to C2. Additionally, when attempting to manually rebalance the cluster, the rebalance fails (the log output is the attached image). After consulting the IRC channel, user ingenthr suggested I file this as an issue.
I've also attached the output of cbcollect_info in the hopes that it might contain something useful. If you need more information, do not hesitate to ask.
-
Hide
- debug.zip
- 19/Mar/13 3:57 PM
- 9.70 MB
- Frojoe
-
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/couchbase.log 263 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.xdcr.log 12 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.couchdb.log 961 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/stats.log 1.60 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ini.log 0.3 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.error.log 10.38 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.views.log 1.76 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.info.log 46.50 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.xdcr_errors.log 0.2 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.mapreduce_errors.log 0.2 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.debug.log 86.54 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/memcached.log 0.4 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-202548/ns_server.stats.log 44.10 MB
-
Hide
- debug2.zip
- 19/Mar/13 3:57 PM
- 2.51 MB
- Frojoe
-
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/couchbase.log 813 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.xdcr.log 17 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.couchdb.log 279 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/stats.log 1.43 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ini.log 0.3 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.error.log 7.41 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.views.log 845 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.info.log 14.97 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.xdcr_errors.log 0.2 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.mapreduce_errors.log 0.2 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/syslog.tar.gz 0.0 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.debug.log 27.51 MB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/memcached.log 0.4 kB
- cbcollect_info_ns_1@127.0.0.1_20130319-204054/ns_server.stats.log 537 kB
-
- memcached.log.8.txt
- 25/Mar/13 4:02 PM
- 584 kB
- Frojoe
-
- couchbase-log1.PNG
- 38 kB
- 19/Mar/13 3:57 PM
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Show
Mike Wiederhold
added a comment - Frojoe,
Any chance you can attach the memcached logs too?
Hide
Mike Wiederhold
added a comment -
There's no good information in the logs that points out what the error might be. I will add some better logging around disconnects in memcached.
Show
Mike Wiederhold
added a comment - There's no good information in the logs that points out what the error might be. I will add some better logging around disconnects in memcached.
Show
Maria McDuff
added a comment - per mike: candidate for deferral.
Hide
Chiyoung Seo
added a comment -
It seems to me that we're opening too many connections on the memcached admin port (11209) in a short time, which *might* result in failing to create a connection for building the replica vbucket 53. Garbage collection of the memcached connections are performed after the ep-engine cleans up the corresponding invalid tap producer or consumer resources through the background thread and notify it to the memcached layer.
In their cluster, cleaning up hundreds of invalid tap consumer resources through the background thread (i.e., NON-IO dispatcher thread) sometime takes up to 5 - 10 sec.
If the memcached reaches to the max connection limit, it prints out the INFO level log unfortunately, which is not included in the memcached log by default. This should be definitely set to a warning level.
For more accurate analysis, we need to set the memcached log level to "INFO" to see if the memcached reaches to the max connection limit in their scenario.
In their cluster, cleaning up hundreds of invalid tap consumer resources through the background thread (i.e., NON-IO dispatcher thread) sometime takes up to 5 - 10 sec.
If the memcached reaches to the max connection limit, it prints out the INFO level log unfortunately, which is not included in the memcached log by default. This should be definitely set to a warning level.
For more accurate analysis, we need to set the memcached log level to "INFO" to see if the memcached reaches to the max connection limit in their scenario.
Show
Chiyoung Seo
added a comment - It seems to me that we're opening too many connections on the memcached admin port (11209) in a short time, which *might* result in failing to create a connection for building the replica vbucket 53. Garbage collection of the memcached connections are performed after the ep-engine cleans up the corresponding invalid tap producer or consumer resources through the background thread and notify it to the memcached layer.
In their cluster, cleaning up hundreds of invalid tap consumer resources through the background thread (i.e., NON-IO dispatcher thread) sometime takes up to 5 - 10 sec.
If the memcached reaches to the max connection limit, it prints out the INFO level log unfortunately, which is not included in the memcached log by default. This should be definitely set to a warning level.
For more accurate analysis, we need to set the memcached log level to "INFO" to see if the memcached reaches to the max connection limit in their scenario.
Hide
Mike Wiederhold
added a comment -
Chiyoung,
It seems like it would make sense to change this log message message to warning level for 2.0.2. What do you think?
It seems like it would make sense to change this log message message to warning level for 2.0.2. What do you think?
Show
Mike Wiederhold
added a comment - Chiyoung,
It seems like it would make sense to change this log message message to warning level for 2.0.2. What do you think?
Show
Chiyoung Seo
added a comment - Yes, it should be changed for 2.0.2.
[error_logger:error,2013-03-19T16:24:00.686,ns_1@192.168.3.87:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <13375.24034.0> terminating
** Last message in was {tcp_closed,#Port<13375.22320>}
** When Server state == {state,#Port<13375.22319>,#Port<13375.22317>,
#Port<13375.22320>,#Port<13375.22318>,
<13375.24035.0>,<<>>,<<>>,
{set,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],"5",[],[],[],[],[],[],
[],[]}}},
-1,false,false,0,
{1363,724640,44293},
not_started,undefined,
<<"replication_building_53_'ns_1@192.168.3.88'">>,
<13375.24034.0>,
{had_backfill,undefined,undefined,
[{<0.6160.1>,#Ref<0.0.14.227400>}]},
undefined,false}
** Reason for termination ==
** downstream_closed