[MB-4394] memcached crash while rebalancing 15 nodes with 30M items (FATAL: Object returned from mccouch with CAS == 0) Created: 31/Oct/11 Updated: 10/Jan/13 Resolved: 07/Nov/11 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | bucket-engine |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Critical |
| Reporter: | Thuan Nguyen | Assignee: | Dustin Sallings |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 5.4 64 bit on ecs | ||
| Attachments: |
|
| Description |
|
Create a cluster of 10 nodes running Couchbase Server 2.0.0r-177.
Load 30 million items into the cluster so that it reaches about 85% resident. Keep the load running, add 5 more 2.0.0r-177 nodes, and rebalance the cluster. The rebalance failed. |
| Comments |
| Comment by Farshid Ghods [ 31/Oct/11 ] |
|
Port server memcached on node 'ns_1@10.34.149.233' exited with status 134. Restarting. Messages: Preloaded 5898290 keys (with metadata)
tcmalloc: large alloc 4294938624 bytes == 0x49c5e000 @ FATAL: Object returned from mccouch with CAS == 0 |
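A quick sanity check on the two numbers in that message. This is only a sketch: reading status 134 as 128 + signal number follows the usual Unix convention, and treating the allocation size as a wrapped 32-bit value is an assumption, not something established in this ticket.

```python
import signal

EXIT_STATUS = 134            # from the port server message
ALLOC_BYTES = 4294938624     # from the tcmalloc message

# Usual Unix convention: a status above 128 means "killed by signal (status - 128)".
signal_number = EXIT_STATUS - 128
is_abort = signal_number == signal.SIGABRT

# ASSUMPTION: the requested size is a 32-bit quantity that wrapped around.
shortfall = 2**32 - ALLOC_BYTES      # distance below 4 GiB
as_signed32 = ALLOC_BYTES - 2**32    # the same bits read as a signed 32-bit value

print(signal_number, is_abort)
print(hex(ALLOC_BYTES), shortfall, as_signed32)
```

Under that reading, the process died from an abort, and the requested size sits just below 4 GiB, which is what a small negative length looks like once it is treated as unsigned.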
| Comment by Dustin Sallings [ 01/Nov/11 ] |
|
Found myself commenting on a bug in email. Need to be careful about that.
The error is misleading. It appears to be one of these two things:

    static bool decodeMeta(const uint8_t *dta, uint32_t &seqno, uint64_t &cas,
                           uint32_t &length, uint32_t &flags) {
        if (*dta != 0x01) {
            // Unsupported meta tag
            return false;
        }
        ++dta;
        if (*dta != 20) {
            // Unsupported size
            return false;
        }

Considering the prior allocation was 4GB, I'm guessing that something read something incorrectly and we're just off by this point. Any chance this is a small database I can play with? |
| Comment by Dustin Sallings [ 01/Nov/11 ] |
|
I didn't mean to assign this to myself, but I'm going to pass a baton briefly to Farshid for some reproduction data. I'd *really* like an attachment shard that does this so I can try to do the same thing in isolation. |
| Comment by Dustin Sallings [ 01/Nov/11 ] |
|
Although this is also very interesting from one of the attached stacks. How many different things are going wrong here?
Thread 1 (Thread 0x7f8e0c97e700 (LWP 22692)):
#0 0x0000000000000000 in ?? ()
#1 0x00007f8e0d294b1c in Task::maxExpectedDuration (this=0x4b2c3000) at dispatcher.hh:152
#2 0x00007f8e0d2945a2 in Dispatcher::run (this=0xf5b9000) at dispatcher.cc:136
#3 0x00007f8e0d29479c in launch_dispatcher_thread (arg=0xf5b9000) at dispatcher.cc:28
#4 0x00007f8e11afc7e1 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f8e11863ead in clone () from /lib64/libc.so.6 |
| Comment by Farshid Ghods [ 01/Nov/11 ] |
|
I can't access those VMs anymore; they seem to have been terminated by now. Tony, can you please provide the ratio of set/get/delete/expire that you run the mixloader with? And, if possible, copy-paste the part of the mixloader that loops over all the keys and runs those memcached commands. |
| Comment by Thuan Nguyen [ 07/Nov/11 ] |
|
I got a crash again today and have attached 3 new cores here.
I run 15 Python threads to load 30 million items. The script does all sets.

    python scripts/mixload-allset.py -i manual2.0.ini -p prefix=key_01,size=655,count=2000000 &

Command to run memcachetest:

    ./memcachetest -h 184.72.85.127:11211 -i 100000 -c 50000 -m 128 -t 2 -l

After loading the 30 million items finishes, I do 70% set and get, and 30% delete and set again.

    counter_10 = 0
    all_set = False
    while i < count:
        try:
            key = "{0}-{1}".format(prefix, i)
            if counter_10 >= 7:
                if all_set == True:
                    mc.delete(key)
                mc.set(key, 0, 0, payload)
                if counter_10 == 10:
                    counter_10 = 0
            else:
                mc.set(key, 0, 0, payload)
                mc.get(key)
            counter_10 += 1
            i += 1
            if i == int(count):
                all_set = True
                i = 0
 |
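For anyone trying to reproduce the load in isolation, the loop can be run against a stub client to see the operation mix it actually produces. This is a sketch under assumptions: `CountingClient` and `run_mix` are hypothetical names, the client only tallies calls instead of talking to memcached, the indentation of the loop is inferred from the described set/get versus delete/set split, and the endless loop is replaced by a fixed number of passes.

```python
from collections import Counter


class CountingClient:
    """Hypothetical stand-in for the memcached client; it only tallies calls."""

    def __init__(self):
        self.ops = Counter()
        self.store = {}

    def set(self, key, exp, flags, value):
        self.ops["set"] += 1
        self.store[key] = value

    def get(self, key):
        self.ops["get"] += 1
        return self.store.get(key)

    def delete(self, key):
        self.ops["delete"] += 1
        self.store.pop(key, None)


def run_mix(mc, prefix, count, payload, passes):
    """Replay the posted loop for a bounded number of passes over the keys."""
    counter_10 = 0
    all_set = False
    for _ in range(passes):
        for i in range(count):
            key = "{0}-{1}".format(prefix, i)
            if counter_10 >= 7:
                if all_set:
                    mc.delete(key)      # deletes only start after the first pass
                mc.set(key, 0, 0, payload)
                if counter_10 == 10:
                    counter_10 = 0
            else:
                mc.set(key, 0, 0, payload)
                mc.get(key)
            counter_10 += 1
        all_set = True
    return mc.ops


ops = run_mix(CountingClient(), "key_01", 21, "x" * 655, passes=2)
print(dict(ops))
```

One thing this reading makes visible: counter_10 is reset to 0 and then incremented in the same iteration, so after the first cycle each cycle is 10 iterations long with 6 set/get steps and 4 delete/set steps. If the indentation assumed here is right, the steady-state mix is nearer 60/40 than 70/30.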
| Comment by Farshid Ghods [ 07/Nov/11 ] |
|
Looked at the more recent core logs.
It's a dupe of http://www.couchbase.org/issues/browse/MB-4412 |
| Comment by Farshid Ghods [ 07/Nov/11 ] |
| http://www.couchbase.org/issues/browse/MB-4412 |