[MB-7140] [system test] tcmalloc segfault Created: 09/Nov/12  Updated: 11/Apr/13  Resolved: 18/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0
Fix Version/s: 2.1.0
Security Level: Public

Type: Bug
Priority: Major
Reporter: Thuan Nguyen
Assignee: Thuan Nguyen
Resolution: Cannot Reproduce
Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: CentOS 6.2 64-bit, build 2.0.0-1945


 Description   
Cluster: 7 nodes
10.6.2.37
10.6.2.38
10.6.2.39
10.6.2.40
10.6.2.42
10.6.2.43
10.6.2.44

Node will be added:
10.6.2.45

Build # 2.0.0-1945
Environment: 8 nodes with 390GB SSD drive, 32GB RAM
Bucket: 1 default bucket (1 replica), replica index disabled.
Number of clients: 1

Load 40 million items into the default bucket, which pushes the resident ratio down to around 62%.
Maintain a load of about 600 ops and 600 queries per second.
Create 1 design doc with 8 views. Let the initial indexing complete.
Then add node 45 to the cluster and rebalance.
Rebalance was done at 01:35:40 on Friday, Nov 9, 2012.
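
For a rough sense of scale, the resident ratio above can be turned into item counts. This is only a back-of-the-envelope sketch: it assumes the 62% figure applies to the 40 million loaded (active) items, and the variable names are illustrative, not taken from any Couchbase tool.

```python
# Figures from the report; "around 62%" means the results are approximate.
total_items = 40_000_000
resident_percent = 62  # assumed to apply to the loaded (active) items

# Integer arithmetic avoids floating-point rounding noise.
resident_items = total_items * resident_percent // 100
non_resident_items = total_items - resident_items

print(f"items with values in RAM: ~{resident_items:,}")
print(f"items with values on disk only: ~{non_resident_items:,}")
```

Under that assumption, roughly 15 million items have to be fetched from disk when accessed, so the workload exercises both memory and disk paths.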

Then at 07:19:06 on Fri, Nov 9, 2012, the control connection to memcached on node 42 was disconnected.
Generating a memcached backtrace on node 42 showed the following error:

Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fb8600f3d03 in tcmalloc::CentralFreeList::FetchFromSpans() ()
from /opt/couchbase/lib/libtcmalloc_minimal.so.4
Missing separate debuginfos, use: debuginfo-install couchbase-server-2.0.0-1945.x86_64
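
When comparing this crash against other reports, it can help to pull the signal and the top frame out of the gdb text. The following is a minimal sketch, not part of any Couchbase tooling: `summarize_crash` and the regular expressions are hypothetical helpers written against the output format shown above, and they may need adjusting for other gdb versions.

```python
import re

# Sample text copied from the gdb output in this report.
GDB_OUTPUT = """\
Core was generated by `/opt/couchbase/bin/memcached -X /opt/couchbase/lib/memcached/stdin_term_handler'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fb8600f3d03 in tcmalloc::CentralFreeList::FetchFromSpans() ()
from /opt/couchbase/lib/libtcmalloc_minimal.so.4
"""

# Patterns are assumptions based on the lines above.
SIGNAL_RE = re.compile(r"Program terminated with signal (\d+), (.+?)\.$", re.MULTILINE)
FRAME_RE = re.compile(r"^#0\s+(0x[0-9a-fA-F]+) in (\S+?)\(", re.MULTILINE)
LIB_RE = re.compile(r"^\s*from (\S+)", re.MULTILINE)


def summarize_crash(text):
    """Extract the signal, top frame, and library from gdb output (hypothetical helper)."""
    sig = SIGNAL_RE.search(text)
    frame = FRAME_RE.search(text)
    # Look for the "from <library>" line only after the top frame.
    lib = LIB_RE.search(text, frame.end()) if frame else None
    return {
        "signal": int(sig.group(1)) if sig else None,
        "signal_name": sig.group(2) if sig else None,
        "address": frame.group(1) if frame else None,
        "function": frame.group(2) if frame else None,
        "library": lib.group(1) if lib else None,
    }


summary = summarize_crash(GDB_OUTPUT)
print(summary)
```

Two reports that share the same signal, function, and library are good candidates for being the same underlying issue, although a full backtrace is still needed to confirm that.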

This bug looks similar to bug MB-5179

Link to manifest file: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1945-rel.rpm.manifest.xml

Link to memcached stack trace: https://friendpaste.com/mhLCkxtmbXmmO0t3ZtQkV

Link to collect info of all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201211/8nodes-ci-1945-tcmalloc-segfault-20121109-120057.tgz


 Comments   
Comment by Chiyoung Seo [ 09/Nov/12 ]
I don't see anything suspicious in ep-engine. For now, I am moving this to 2.0.1.

I have been keeping track of the tcmalloc bug reports, and some users have recently reported crashes as well. I will create a separate bug for patching in the latest tcmalloc version in 2.0.1.
Comment by Chiyoung Seo [ 21/Nov/12 ]
Moving this to 2.0.2 as it happens very rarely. As mentioned above, we need to patch in the latest tcmalloc in 2.0.1, as it has fixes for some crash issues.
Comment by Chiyoung Seo [ 18/Jan/13 ]
I was not able to reproduce this issue, but we opened the bug to patch the latest tcmalloc.
Comment by Maria McDuff (Inactive) [ 05/Apr/13 ]
Hi Tony, please check whether the current 2.0.2 build is segfaulting; if not, please close. Thanks.
Comment by Thuan Nguyen [ 11/Apr/13 ]
Re-tested on 4 nodes running CentOS 5.8 64-bit with build 2.0.2-760. I don't see any memcached crashes with a segfault. I will close this bug.
Generated at Tue Jul 22 23:10:31 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.