[MB-4394] memcached crash while rebalancing 15 nodes with 30M items (FATAL: Object returned from mccouch with CAS == 0) Created: 31/Oct/11  Updated: 18/Jun/13  Resolved: 07/Nov/11

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Critical
Reporter: Thuan Nguyen Assignee: Dustin Sallings
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 5.4 64 bit on ecs

Attachments: core.memcached.1420.log, core.memcached.21045.log, core.memcached.2406.log, core.memcached.2408.log, core.memcached.2422.log

Create a cluster of 10 nodes running Couchbase Server 2.0.0r-177.
Load 30 million items into the cluster so that it reaches about 85% resident.
Keep the load running and add 5 more nodes running 2.0.0r-177.
Rebalance the cluster. The rebalance fails.

Comment by Farshid Ghods [ 31/Oct/11 ]
Port server memcached on node 'ns_1@' exited with status 134. Restarting. Messages: Preloaded 5898290 keys (with metadata)
tcmalloc: large alloc 4294938624 bytes == 0x49c5e000 @
FATAL: Object returned from mccouch with CAS == 0
Comment by Dustin Sallings [ 01/Nov/11 ]
Found myself commenting on a bug in email. Need to be careful about that.

The error is misleading. It appears to be one of these two things:

    static bool decodeMeta(const uint8_t *dta, uint32_t &seqno, uint64_t &cas,
                           uint32_t &length, uint32_t &flags) {
        if (*dta != 0x01) {
            // Unsupported meta tag
            return false;
        }
        if (*dta != 20) {
            // Unsupported size
            return false;
        }

Considering the prior allocation was 4GB, I'm guessing that an earlier read went wrong and we're just misaligned by this point.
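
A quick check of the allocation size from the tcmalloc message supports the "4GB" reading. The interpretation in the comments is an assumption: the numbers are consistent with a garbled or wrapped 32-bit length, but they do not prove it.

```python
ALLOC = 4294938624             # from the tcmalloc "large alloc" message

gap = 2**32 - ALLOC            # how far below the 32-bit limit the size is
as_signed = ALLOC - 2**32      # the same bits read as a signed 32-bit integer

print(hex(ALLOC))              # fits in 32 bits
print(gap, as_signed)
print(ALLOC % 4096 == 0)       # multiple of 4096, so possibly page-rounded
```

The size sits only 28672 bytes below 2^32, which is what a small negative number looks like when it is treated as an unsigned 32-bit length.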

Any chance this is a small database I can play with?
Comment by Dustin Sallings [ 01/Nov/11 ]
I didn't mean to assign this to myself, but I'm going to pass a baton briefly to Farshid for some reproduction data.

I'd *really* like an attachment shard that does this so I can try to do the same thing in isolation.
Comment by Dustin Sallings [ 01/Nov/11 ]
Although this is also very interesting from one of the attached stacks. How many different things are going wrong here?

Thread 1 (Thread 0x7f8e0c97e700 (LWP 22692)):
#0 0x0000000000000000 in ?? ()
#1 0x00007f8e0d294b1c in Task::maxExpectedDuration (this=0x4b2c3000) at dispatcher.hh:152
#2 0x00007f8e0d2945a2 in Dispatcher::run (this=0xf5b9000) at dispatcher.cc:136
#3 0x00007f8e0d29479c in launch_dispatcher_thread (arg=0xf5b9000) at dispatcher.cc:28
#4 0x00007f8e11afc7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f8e11863ead in clone () from /lib64/libc.so.6
Comment by Farshid Ghods [ 01/Nov/11 ]
Can't access those VMs anymore. They seem to have been terminated by now.

Can you please provide the ratio of set/get/delete/expire that you run the mixloader with? And possibly copy-paste the part of the mix-loader that loops over all the keys and runs those memcached commands.
Comment by Thuan Nguyen [ 07/Nov/11 ]
I just got a crash again today and have attached 3 new cores here.
I run 15 Python threads to load 30 million items. The script does only sets.
python scripts/mixload-allset.py -i manual2.0.ini -p prefix=key_01,size=655,count=2000000 &
Command to run memcachetest:
./memcachetest -h -i 100000 -c 50000 -m 128 -t 2 -l 
After loading the 30 million items finishes, I do 70% set/get, 30% delete and 30% set again.
    counter_10 = 0
    all_set = False
    while i < count:
        key = "{0}-{1}".format(prefix, i)
        if counter_10 >= 7:
            if all_set == True:
            mc.set(key, 0, 0, payload)
            if counter_10 == 10:
                counter_10 = 0
            mc.set(key, 0, 0, payload)
        counter_10 += 1
        i += 1
        if i == int(count):
            all_set = True
            i = 0
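
The pasted loop appears to have lost some lines (the `if all_set == True:` branch has no body), so the following is only one plausible reading, written to match the "30% delete and 30% set again" description. The `mc.delete` call is an assumption, `FakeClient` is a hypothetical in-memory stand-in for the real client, and the `i % 10` test and the fixed number of passes replace the original counter and endless loop so that the sketch terminates.

```python
class FakeClient:
    """Hypothetical in-memory stand-in for the memcached client `mc`."""

    def __init__(self):
        self.store = {}
        self.ops = {"set": 0, "delete": 0}

    def set(self, key, exptime, flags, value):
        self.store[key] = value
        self.ops["set"] += 1

    def delete(self, key):
        self.store.pop(key, None)
        self.ops["delete"] += 1


def run_load(mc, prefix, count, payload, passes=2):
    """First pass sets every key; later passes delete and re-set 3 keys in 10."""
    for pass_no in range(passes):
        all_set = pass_no > 0      # true once every key has been written once
        for i in range(count):
            key = "{0}-{1}".format(prefix, i)
            if all_set and i % 10 >= 7:
                mc.delete(key)     # assumed: not present in the pasted excerpt
            mc.set(key, 0, 0, payload)


mc = FakeClient()
run_load(mc, "key_01", count=100, payload=b"x" * 655)
print(mc.ops, len(mc.store))
```

The prefix and the 655-byte payload mirror the `-p prefix=key_01,size=655` parameters of the load command; the count is reduced so the example runs instantly.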
Comment by Farshid Ghods [ 07/Nov/11 ]
Looked at the more recent core logs.
It's a dupe of http://www.couchbase.org/issues/browse/MB-4412
Comment by Farshid Ghods [ 07/Nov/11 ]
Generated at Thu Apr 17 12:00:44 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.