Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0-beta
Fix Version/s: 2.0-beta
Component/s: couchbase-bucket
Security Level: Public
Labels:
Environment: centos 6.2 64bit
Description
Cluster information:
- 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 10 GB RAM and 150 GB disk.
- 8 GB RAM for Couchbase Server on each node (80% of total system memory)
- Disk format ext3 on both data and root
- Each server has its own drive, no disk sharing with other server.
- Load 9 million items to both buckets
- Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
- Each bucket has one doc and 2 views for each doc (default d1 and saslbucket d11)
- Add one more doc d2 with 2 views to default bucket
* Start a cluster of 10 nodes with Couchbase Server 2.0.0-1663 installed:
10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23
* Data path /data
* View path /data
* In the last run, I did a swap rebalance, removing node 13 and adding node 26.
* Then node 26 failed due to a physical failure. I failed over node 26 and rebalanced.
* Rebalance failed with known issue MB-6497 at the end of rebalancing saslbucket.
* Node 22 went down because it ran out of disk space. Failed over node 22.
* Removed node 13. Started rebalance at 19:26:35 - Wed Sep 5, 2012
Bucket "default" rebalance does not seem to be swap rebalance ns_vbucket_mover000 ns_1@10.3.121.14 19:26:35 - Wed Sep 5, 2012
Rebalance has been hung since then, as of Thu Sep 6 19:25:29 PDT 2012
CPU and beam stats
10.3.121.15
Vm: 2796m Rm: 613m CPU: 13.7 beam.smp
Vm: 6091m Rm: 4.2g CPU: 9.8 memcached
10.3.121.13
Vm: 1845m Rm: 338m CPU: 9.9 beam.smp
Vm: 1230m Rm: 1.0g CPU: 2.0 memcached
10.3.121.23
Vm: 2443m Rm: 652m CPU: 9.8 beam.smp
Vm: 4969m Rm: 3.4g CPU: 7.9 memcached
10.3.121.24
Vm: 3304m Rm: 907m CPU: 19.4 beam.smp
Vm: 5440m Rm: 4.0g CPU: 3.9 memcached
10.3.121.14
Vm: 3462m Rm: 665m CPU: 30.7 beam.smp
Vm: 6329m Rm: 4.1g CPU: 5.1 memcached
10.3.121.16
Vm: 2702m Rm: 642m CPU: 13.2 beam.smp
Vm: 4845m Rm: 3.5g CPU: 5.0 memcached
10.3.121.17
Vm: 4498m Rm: 1.4g CPU: 91.2 beam.smp
Vm: 5359m Rm: 3.6g CPU: 1.7 memcached
10.3.121.20
Vm: 3793m Rm: 1.0g CPU: 11.7 beam.smp
Vm: 5356m Rm: 3.7g CPU: 1.7 memcached
Swap stats in MB
Total Used Free
10.3.121.15
Swap: 5199 1815 3384
10.3.121.13
Swap: 5199 10 5189
10.3.121.22
Swap: 5199 15 5184
10.3.121.14
Swap: 5199 2503 2696
10.3.121.23
Swap: 5199 1037 4162
10.3.121.24
Swap: 5199 1543 3656
10.3.121.17
Swap: 5199 2156 3043
10.3.121.16
Swap: 5199 1156 4043
10.3.121.20
Swap: 5199 1949 3250
Link to diags of all nodes
https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1663-reb-hang-20120906.tgz
Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 raw memory
ep_kv_size: 2436606624
ep_max_data_size: 3145728000
ep_mem_high_wat: 2359296000
ep_mem_low_wat: 1887436800
ep_mem_tracker_enabled: true
ep_oom_errors: 0
ep_overhead: 221345920
ep_tmp_oom_errors: 0
ep_value_size: 2214922031
mem_used: 2831961568
tcmalloc_current_thread_cache_bytes: 2281472
tcmalloc_max_thread_cache_bytes: 4194304
tcmalloc_unmapped_bytes: 7356416
total_allocated_bytes: 5440249488
total_fragmentation_bytes: 919716208
total_free_bytes: 2457600
total_heap_bytes: 6362423296
Chiyoung-MacBook:ep-engine chiyoung$ ./management/cbstats 10.3.121.14:11210 all | grep resident
ep_num_non_resident: 2427780
vb_active_num_non_resident: 1005950
vb_active_perc_mem_resident: 0
vb_pending_num_non_resident: 0
vb_pending_perc_mem_resident: 0
vb_replica_num_non_resident: 1421830
vb_replica_perc_mem_resident: 0
It seems to me that there is a serious memory leak on nodes 14 and 15. In particular, ep_value_size (2214922031) indicates that most Blob value instances are not freed even after we ejected them. Those Blob values are referenced in many places (hash table, flusher, TAP replicator, etc.)
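
The reasoning above can be checked roughly against the numbers in the cbstats output. The following is a minimal sketch, not part of cbstats or ep-engine: the helper names `parse_stats` and `summarize` are hypothetical, and the sample contains only the stats pasted earlier in this ticket for node 14. It compares mem_used with the high watermark and shows how much of mem_used is accounted for by ep_value_size.

```python
# Sketch only: parse_stats and summarize are hypothetical helpers,
# not part of cbstats. SAMPLE is copied from the node 14 output above.
SAMPLE = """\
ep_kv_size: 2436606624
ep_max_data_size: 3145728000
ep_mem_high_wat: 2359296000
ep_mem_low_wat: 1887436800
ep_value_size: 2214922031
mem_used: 2831961568
total_fragmentation_bytes: 919716208
total_heap_bytes: 6362423296
"""


def parse_stats(text):
    """Turn 'name: value' lines into a dict, converting integers where possible."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'name: value'
        key, value = key.strip(), value.strip()
        try:
            stats[key] = int(value)
        except ValueError:
            stats[key] = value  # keep non-numeric values (e.g. 'true') as text
    return stats


def summarize(stats):
    """Compute a few ratios that make the memory picture easier to read."""
    mem_used = stats["mem_used"]
    return {
        # True when memory use is above the high watermark
        "above_high_watermark": mem_used > stats["ep_mem_high_wat"],
        # Fraction of the bucket quota (ep_max_data_size) in use
        "quota_used": mem_used / stats["ep_max_data_size"],
        # Fraction of mem_used accounted for by value blobs
        "value_share": stats["ep_value_size"] / mem_used,
        # Fragmentation relative to the total heap
        "heap_fragmentation": stats["total_fragmentation_bytes"]
        / stats["total_heap_bytes"],
    }


if __name__ == "__main__":
    for name, value in summarize(parse_stats(SAMPLE)).items():
        print(name, value)
```

With the node 14 numbers, mem_used is above ep_mem_high_wat and value blobs account for most of mem_used, even though the resident percentages for active and replica vbuckets are reported as 0.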