[MB-6550] [longevity] Rebalance hang after failover and remove node because of the memory leak on a couple of nodes Created: 06/Sep/12 Updated: 09/Jan/13 Resolved: 07/Sep/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | couchbase-bucket |
| Affects Version/s: | 2.0-beta |
| Fix Version/s: | 2.0-beta |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Thuan Nguyen | Assignee: | Chiyoung Seo |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | system-test | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6.2 64bit | ||
| Attachments: |
|
| Description |
|
Cluster information:
- 11 centos 6.2 64bit servers with 4-core CPUs
- Each server has 10 GB RAM and 150 GB disk.
- 8 GB RAM for couchbase server at each node (80% of total system memory)
- Disk format ext3 on both data and root
- Each server has its own drive, no disk sharing with other servers.
- Load 9 million items to both buckets
- Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
- Add one more doc d2 with 2 views to default bucket

* Start cluster with 10 nodes installed couchbase server 2.0.0-1663
10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23
* Data path /data
* View path /data
* In the last run, I did a swap rebalance, removing node 13 and adding node 26.
* Then node 26 failed due to physical failure. I failed over node 26 and rebalanced.
* Rebalance failed with known issue
* Node 22 went down because it ran out of disk space. Failed over node 22.
* Removed node 13. Rebalance started at 19:26:35 - Wed Sep 5, 2012

Bucket "default" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@10.3.121.14 19:26:35 - Wed Sep 5, 2012

Rebalance has been hanging until now: Thu Sep 6 19:25:29 PDT 2012

CPU and beam stats

10.3.121.15
Vm: 2796m Rm: 613m CPU: 13.7 beam.smp
Vm: 6091m Rm: 4.2g CPU: 9.8 memcached
10.3.121.13
Vm: 1845m Rm: 338m CPU: 9.9 beam.smp
Vm: 1230m Rm: 1.0g CPU: 2.0 memcached
10.3.121.23
Vm: 2443m Rm: 652m CPU: 9.8 beam.smp
Vm: 4969m Rm: 3.4g CPU: 7.9 memcached
10.3.121.24
Vm: 3304m Rm: 907m CPU: 19.4 beam.smp
Vm: 5440m Rm: 4.0g CPU: 3.9 memcached
10.3.121.14
Vm: 3462m Rm: 665m CPU: 30.7 beam.smp
Vm: 6329m Rm: 4.1g CPU: 5.1 memcached
10.3.121.16
Vm: 2702m Rm: 642m CPU: 13.2 beam.smp
Vm: 4845m Rm: 3.5g CPU: 5.0 memcached
10.3.121.17
Vm: 4498m Rm: 1.4g CPU: 91.2 beam.smp
Vm: 5359m Rm: 3.6g CPU: 1.7 memcached
10.3.121.20
Vm: 3793m Rm: 1.0g CPU: 11.7 beam.smp
Vm: 5356m Rm: 3.7g CPU: 1.7 memcached

Swap stats in MB

                    Total  Used  Free
10.3.121.15  Swap:  5199   1815  3384
10.3.121.13  Swap:  5199   10    5189
10.3.121.22  Swap:  5199   15    5184
10.3.121.14  Swap:  5199   2503  2696
10.3.121.23  Swap:  5199   1037  4162
10.3.121.24  Swap:  5199   1543  3656
10.3.121.17  Swap:  5199   2156  3043
10.3.121.16  Swap:  5199   1156  4043
10.3.121.20  Swap:  5199   1949  3250

Link to diags of all nodes
https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1663-reb-hang-20120906.tgz |
| Comments |
| Comment by Chiyoung Seo [ 07/Sep/12 ] |
|
The memory usage on 10.3.121.14 and 10.3.121.15 is above 90% of their bucket quota even after most of the active and replica items were ejected. This is the reason why rebalance got stuck:

Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 raw memory
 ep_kv_size: 2436606624
 ep_max_data_size: 3145728000
 ep_mem_high_wat: 2359296000
 ep_mem_low_wat: 1887436800
 ep_mem_tracker_enabled: true
 ep_oom_errors: 0
 ep_overhead: 221345920
 ep_tmp_oom_errors: 0
 ep_value_size: 2214922031
 mem_used: 2831961568
 tcmalloc_current_thread_cache_bytes: 2281472
 tcmalloc_max_thread_cache_bytes: 4194304
 tcmalloc_unmapped_bytes: 7356416
 total_allocated_bytes: 5440249488
 total_fragmentation_bytes: 919716208
 total_free_bytes: 2457600
 total_heap_bytes: 6362423296

Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 all | grep resident
 ep_num_non_resident: 2427780
 vb_active_num_non_resident: 1005950
 vb_active_perc_mem_resident: 0
 vb_pending_num_non_resident: 0
 vb_pending_perc_mem_resident: 0
 vb_replica_num_non_resident: 1421830
 vb_replica_perc_mem_resident: 0

It seems to me that there is a serious memory leak on 14 and 15. In particular, ep_value_size (2214922031) means that most of the Blob value instances are not freed even after we ejected them. Those blob values are referenced in many places (hash table, flusher, tap replicator, etc.) |
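The arithmetic behind the "above 90% of their bucket quota" observation can be reproduced from the stats quoted above. The following is a minimal sketch, not part of ep-engine or cbstats: the helper names `parse_cbstats` and `memory_pressure` are hypothetical, and the input is assumed to be the plain `name: value` text that cbstats printed in this ticket.

```python
# Hypothetical helpers for illustration only; not part of cbstats or ep-engine.

# Subset of the "raw memory" output quoted in this ticket for 10.3.121.14.
SAMPLE = """
ep_kv_size: 2436606624
ep_max_data_size: 3145728000
ep_mem_high_wat: 2359296000
ep_mem_low_wat: 1887436800
ep_mem_tracker_enabled: true
ep_value_size: 2214922031
mem_used: 2831961568
"""


def parse_cbstats(text):
    """Parse 'name: value' lines into a dict, keeping only integer-valued stats."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip blank lines, shell prompts, and non-numeric stats such as "true".
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def memory_pressure(stats):
    """Relate mem_used to the bucket quota and the high watermark."""
    mem_used = stats["mem_used"]
    return {
        # Fraction of the bucket quota (ep_max_data_size) currently in use.
        "quota_fraction": mem_used / stats["ep_max_data_size"],
        # Whether memory usage is still above the high watermark.
        "above_high_wat": mem_used > stats["ep_mem_high_wat"],
        # Share of mem_used attributed to item values (ep_value_size).
        "value_fraction": stats["ep_value_size"] / mem_used,
    }


if __name__ == "__main__":
    report = memory_pressure(parse_cbstats(SAMPLE))
    for key, val in report.items():
        print(key, val)
```

With the numbers from node 14, mem_used is slightly over 90% of ep_max_data_size and well above ep_mem_high_wat, while ep_value_size accounts for most of mem_used even though the resident percentages are reported as 0.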
| Comment by Chiyoung Seo [ 07/Sep/12 ] |
| http://review.couchbase.org/#/c/20632/ |
| Comment by Thuan Nguyen [ 08/Sep/12 ] |
|
Integrated in github-ep-engine-2-0 #426 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/426/]) Result = SUCCESS Chiyoung Seo : Files : * src/tapconnmap.cc |
| Comment by Farshid Ghods [ 12/Sep/12 ] |
| Is this a system test blocker? If so, please add the sblocker label. |
| Comment by Karen Zeller [ 17/Sep/12 ] |
|
Beta RN: Fixed rebalance failure. Rebalance had stalled after a failover and node removal because of a memory leak on cluster nodes. |