[MB-6595] [RN 2.0.1] [longevity] Something unknown is causing severe timeouts in ns_server, particularly under view building and/or compaction, which causes rebalance to fail and other types of badness. Created: 10/Sep/12  Updated: 13/Feb/13  Resolved: 13/Feb/13

Status: Closed
Project: Couchbase Server
Component/s: documentation
Affects Version/s: 2.0-beta
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: kzeller
Resolution: Fixed Votes: 0
Labels: 2.0-beta-release-notes, 2.0.1-release-notes, system-test
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified
Environment: centos 6.2 64bit build 2.0.0-1697

Attachments: GZip Archive erl_healthy-node24-crash.dump.gz     GZip Archive erl-over-1gb-node14_crash.dump.gz     File ns-diag-20121031094231.txt.xz     PDF File report_atop_10.6.2.37_default_simple-view_test (6).pdf    
Issue Links:
Dependency
blocks MB-7234 Rebalance exited with reason bad_repl... Closed
blocks MB-7261 [system test] Rebalance stopped by ja... Resolved
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MB-7120 rebalance-in fails on ec2: IP address... Technical task Closed Aleksey Kondratenko  
MB-7111 [system test] rebalance failed with e... Technical task Resolved Ketaki Gangal  
MB-7166 [system test] Resetting rebalance sta... Technical task Resolved Aleksey Kondratenko  
MB-7231 bad map function may cause timeouts i... Technical task Resolved Aleksey Kondratenko  
MB-7212 Rebalance stopped by janitor when xdc... Technical task Closed Andrei Baranouski  
MB-7225 [system test] rebalance failed with e... Technical task Resolved Jin Lim  
MB-7261 [system test] Rebalance stopped by ja... Technical task Resolved Aleksey Kondratenko  
Flagged:
Release Note

 Description   
Cluster information:
- 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 10 GB RAM and a 150 GB disk.
- 8 GB RAM allocated to Couchbase Server on each node (80% of total system memory)
- ext3 filesystem on both the data and root partitions
- Each server has its own drive; no disk sharing with other servers.
- Load 9 million items into both buckets
- Initial indexing is in progress, so CPU load is somewhat heavy
- Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
- Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket)

* Create a cluster of 10 nodes with Couchbase Server 2.0.0-1697 installed:

10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23
* Data path /data
* View path /data

* Do a swap rebalance: add node 26 and remove node 25 (a CLI sketch of this step follows this list)
* Rebalance failed as in bug MB-6573
* Then rebalance again. Rebalance failed again, with the error below in the log page pointing to node 14
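For reference, a sketch of how the swap-rebalance step above might be issued via couchbase-cli; the flags and credentials are assumptions, not taken from this run, and the step may equally have been driven from the UI:

# Hypothetical CLI form of the swap rebalance (add .26, remove .25);
# flags and credentials are placeholders, not recorded in this ticket.
couchbase-cli rebalance -c 10.3.121.13:8091 \
  -u Administrator -p password \
  --server-add=10.3.121.26:8091 \
  --server-add-username=Administrator \
  --server-add-password=password \
  --server-remove=10.3.121.25:8091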

Rebalance exited with reason {bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.13',
{'EXIT',
{{{timeout,
{gen_server,call,
[ns_config,
{eval,
#Fun<cluster_compat_mode.0.45438860>}]}},
{gen_server,call,
['tap_replication_manager-default',
{change_vbucket_replication,726,
undefined},
infinity]}},
{gen_server,call,
[{'janitor_agent-default',
'ns_1@10.3.121.13'},
{if_rebalance,<0.16591.2>,
{update_vbucket_state,726,replica,
undefined,undefined}},
infinity]}}}}]}
ns_orchestrator002 ns_1@10.3.121.14 01:18:02 - Sat Sep 8, 2012

<0.19004.2> exited with {bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.13',
{'EXIT',
{{{timeout,
{gen_server,call,
[ns_config,
{eval,
#Fun<cluster_compat_mode.0.45438860>}]}},
{gen_server,call,
['tap_replication_manager-default',
{change_vbucket_replication,726,undefined},
infinity]}},
{gen_server,call,
[{'janitor_agent-default','ns_1@10.3.121.13'},
{if_rebalance,<0.16591.2>,
{update_vbucket_state,726,replica,undefined,
undefined}},
infinity]}}}}]} ns_vbucket_mover000 ns_1@10.3.121.14 01:18:01 - Sat Sep 8, 2012
Server error during processing: ["web request failed",


* On node 14, I see many tap_replication_manager-default crashes right before the rebalance failed at 01:18:01 - Sat Sep 8, 2012

[error_logger:error,2012-09-08T1:18:01.836,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.19330.2>
    registered_name: []
    exception exit: {exited,
                     {'EXIT',<0.16591.2>,
                      {bulk_set_vbucket_state_failed,
                       [{'ns_1@10.3.121.13',
                         {'EXIT',
                          {{{timeout,
                             {gen_server,call,
                              [ns_config,
                               {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                            {gen_server,call,
                             ['tap_replication_manager-default',
                              {change_vbucket_replication,726,undefined},
                              infinity]}},
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.3.121.13'},
                             {if_rebalance,<0.16591.2>,
                              {update_vbucket_state,726,replica,undefined,
                               undefined}},
                             infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.16591.2>,<0.22331.1>]
    messages: [{'EXIT',<0.16591.2>,
                   {bulk_set_vbucket_state_failed,
                    [{'ns_1@10.3.121.13',
                      {'EXIT',
                       {{{timeout,
                          {gen_server,call,
                           [ns_config,
                            {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                         {gen_server,call,
                          ['tap_replication_manager-default',
                           {change_vbucket_replication,726,undefined},
                           infinity]}},
                        {gen_server,call,
                         [{'janitor_agent-default','ns_1@10.3.121.13'},
                          {if_rebalance,<0.16591.2>,
                           {update_vbucket_state,726,replica,undefined,
                            undefined}},
                          infinity]}}}}]}}]
    links: [<0.16591.2>]
    dictionary: [{cleanup_list,[<0.19392.2>]}]
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 4550
  neighbours:

[ns_server:info,2012-09-08T1:18:01.835,ns_1@10.3.121.14:<0.19487.2>:ns_replicas_builder:build_replicas_main:94]Got exit not from child ebucketmigrator. Assuming it's our parent: {'EXIT',
                                                                    <0.19393.2>,
                                                                    shutdown}
[ns_server:info,2012-09-08T1:18:01.880,ns_1@10.3.121.14:ns_config_rep:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.121.13'

[ns_server:info,2012-09-08T1:18:01.885,ns_1@10.3.121.14:<0.19328.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.22': [<<"replication_building_704_'ns_1@10.3.121.26'">>,
                                                       <<"replication_building_704_'ns_1@10.3.121.24'">>,
                                                       <<"replication_building_704_'ns_1@10.3.121.23'">>]
[error_logger:error,2012-09-08T1:18:01.895,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.19276.2>
    registered_name: []
    exception exit: {exited,
                     {'EXIT',<0.16591.2>,
                      {bulk_set_vbucket_state_failed,
                       [{'ns_1@10.3.121.13',
                         {'EXIT',
                          {{{timeout,
                             {gen_server,call,
                              [ns_config,
                               {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                            {gen_server,call,
                             ['tap_replication_manager-default',
                              {change_vbucket_replication,726,undefined},
                              infinity]}},
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.3.121.13'},
                             {if_rebalance,<0.16591.2>,
                              {update_vbucket_state,726,replica,undefined,
                               undefined}},
                             infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.16591.2>,<0.22331.1>]
    messages: [{'EXIT',<0.16591.2>,
                   {bulk_set_vbucket_state_failed,
                    [{'ns_1@10.3.121.13',
                      {'EXIT',
                       {{{timeout,
                          {gen_server,call,
                           [ns_config,
                            {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                         {gen_server,call,
                          ['tap_replication_manager-default',
                           {change_vbucket_replication,726,undefined},
                           infinity]}},
                        {gen_server,call,
                         [{'janitor_agent-default','ns_1@10.3.121.13'},
                          {if_rebalance,<0.16591.2>,
                           {update_vbucket_state,726,replica,undefined,
                            undefined}},
                          infinity]}}}}]}}]
    links: [<0.16591.2>]
    dictionary: [{cleanup_list,[<0.19328.2>]}]
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 4434
  neighbours:

[ns_server:info,2012-09-08T1:18:01.930,ns_1@10.3.121.14:<0.19487.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.16': [<<"replication_building_399_'ns_1@10.3.121.26'">>,
                                                       <<"replication_building_399_'ns_1@10.3.121.24'">>,
                                                       <<"replication_building_399_'ns_1@10.3.121.14'">>]
[error_logger:error,2012-09-08T1:18:01.937,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.19393.2>
    registered_name: []
    exception exit: {exited,
                     {'EXIT',<0.16591.2>,
                      {bulk_set_vbucket_state_failed,
                       [{'ns_1@10.3.121.13',
                         {'EXIT',
                          {{{timeout,
                             {gen_server,call,
                              [ns_config,
                               {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                            {gen_server,call,
                             ['tap_replication_manager-default',
                              {change_vbucket_replication,726,undefined},
                              infinity]}},
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.3.121.13'},
                             {if_rebalance,<0.16591.2>,
                              {update_vbucket_state,726,replica,undefined,
                               undefined}},
                             infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.16591.2>,<0.22331.1>]
    messages: [{'EXIT',<0.16591.2>,
                   {bulk_set_vbucket_state_failed,
                    [{'ns_1@10.3.121.13',
                      {'EXIT',
                       {{{timeout,
                          {gen_server,call,
                           [ns_config,
                            {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                         {gen_server,call,
                          ['tap_replication_manager-default',
                           {change_vbucket_replication,726,undefined},
                           infinity]}},
                        {gen_server,call,
                         [{'janitor_agent-default','ns_1@10.3.121.13'},
                          {if_rebalance,<0.16591.2>,
                           {update_vbucket_state,726,replica,undefined,
                            undefined}},
                          infinity]}}}}]}}]
    links: [<0.16591.2>]
    dictionary: [{cleanup_list,[<0.19487.2>]}]
    trap_exit: true
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 4435
  neighbours:

[couchdb:info,2012-09-08T1:18:01.977,ns_1@10.3.121.14:<0.15832.2>:couch_log:info:39]10.3.121.22 - - POST /_view_merge/?stale=ok&limit=10 200
[ns_server:error,2012-09-08T1:18:02.072,ns_1@10.3.121.14:<0.5850.0>:ns_memcached:verify_report_long_call:274]call topkeys took too long: 836560 us
[rebalance:debug,2012-09-08T1:18:02.075,ns_1@10.3.121.14:<0.19493.2>:ns_single_vbucket_mover:mover_inner_old_style:195]child replicas builder for vbucket 138 is <0.19520.2>
[ns_server:info,2012-09-08T1:18:02.077,ns_1@10.3.121.14:<0.19493.2>:ns_single_vbucket_mover:mover_inner_old_style:199]Got exit message (parent is <0.16591.2>). Exiting...
{'EXIT',<0.16591.2>,
    {bulk_set_vbucket_state_failed,
        [{'ns_1@10.3.121.13',
             {'EXIT',
                 {{{timeout,
                       {gen_server,call,
                           [ns_config,
                            {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                   {gen_server,call,
                       ['tap_replication_manager-default',
                        {change_vbucket_replication,726,undefined},
                        infinity]}},
                  {gen_server,call,
                      [{'janitor_agent-default','ns_1@10.3.121.13'},
                       {if_rebalance,<0.16591.2>,
                           {update_vbucket_state,726,replica,undefined,
                               undefined}},
                       infinity]}}}}]}}
[ns_server:debug,2012-09-08T1:18:02.115,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.26' is <20326.5386.1>
[ns_server:info,2012-09-08T1:18:02.125,ns_1@10.3.121.14:ns_port_memcached:ns_port_server:log:169]memcached<0.2005.0>: Sat Sep 8 08:18:01.920865 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.16 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0

[ns_server:debug,2012-09-08T1:18:02.142,ns_1@10.3.121.14:<0.3277.0>:mc_connection:do_delete_vbucket:118]Notifying mc_couch_events of vbucket deletion: default/137
[views:info,2012-09-08T1:18:02.146,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:459]
Calling couch_set_view:set_partition_states([<<"default">>,<<"_design/d1">>,
                                             [103,104,105,106,107,108,109,110,
                                              111,112,113,114,115,116,117,118,
                                              119,120,121,122,123,124,125,126,
                                              127,128,129,130,131,132,138,139,
                                              140,141,142,143,144,145,146,147,
                                              148,149,150,151,152,153,154,155,
                                              156,157,158,159,160,161,162,163,
                                              164,165,166,167,168,169,170,171,
                                              172,173,174,175,176,177,178,179,
                                              180,181,182,183,184,185,186,187,
                                              188,189,190,191,192,193,194,195,
                                              196,197,198,199,200,201,202,203,
                                              204,205],
                                             [],
                                             [0,1,2,3,4,5,6,7,8,9,10,11,12,13,
                                              14,15,16,17,18,19,20,21,22,23,
                                              24,25,26,27,28,29,30,31,32,33,
                                              34,35,36,37,38,39,40,41,42,43,
                                              44,45,46,47,48,49,50,51,52,53,
                                              54,55,56,57,58,59,60,61,62,63,
                                              64,65,66,67,68,69,70,71,72,73,
                                              74,75,76,77,78,79,80,81,82,83,
                                              84,85,86,87,88,89,90,91,92,93,
                                              94,95,96,97,98,99,100,101,102,
                                              133,134,135,136,137,206,207,208,
                                              209,210,211,212,213,214,215,216,
                                              217,218,219,220,221,222,223,224,
                                              225,226,227,228,229,230,231,232,
                                              233,234,235,236,237,238,239,240,
                                              241,242,243,244,245,246,247,248,
                                              249,250,251,252,253,254,255,256,
                                              257,258,259,260,261,262,263,264,
                                              265,266,267,268,269,270,271,272,
                                              273,274,275,276,277,278,279,280,
                                              281,282,283,284,285,286,287,288,
                                              289,290,291,292,293,294,295,296,
                                              297,298,299,300,301,302,303,304,
                                              305,306,307,308,309,310,311,312,
                                              313,314,315,316,317,318,319,320,
                                              321,322,323,324,325,326,327,328,
                                              329,330,331,332,333,334,335,336,
                                              337,338,339,340,341,342,343,344,
                                              345,346,347,348,349,350,351,352,
                                              353,354,355,356,357,358,359,360,
                                              361,362,363,364,365,366,367,368,
                                              369,370,371,372,373,374,375,376,
                                              377,378,379,380,381,382,383,384,
                                              385,386,387,388,389,390,391,392,
                                              393,394,395,396,397,398,399,400,
                                              401,402,403,404,405,406,407,408,
                                              409,410,411,412,413,414,415,416,
                                              417,418,419,420,421,422,423,424,
                                              425,426,427,428,429,430,431,432,
                                              433,434,435,436,437,438,439,440,
                                              441,442,443,444,445,446,447,448,
                                              449,450,451,452,453,454,455,456,
                                              457,458,459,460,461,462,463,464,
                                              465,466,467,468,469,470,471,472,
                                              473,474,475,476,477,478,479,480,
                                              481,482,483,484,485,486,487,488,
                                              489,490,491,492,493,494,495,496,
                                              497,498,499,500,501,502,503,504,
                                              505,506,507,508,509,510,511,512,
                                              513,514,515,516,517,518,519,520,
                                              521,522,523,524,525,526,527,528,
                                              529,530,531,532,533,534,535,536,
                                              537,538,539,540,541,542,543,544,
                                              545,546,547,548,549,550,551,552,
                                              553,554,555,556,557,558,559,560,
                                              561,562,563,564,565,566,567,568,
                                              569,570,571,572,573,574,575,576,
                                              577,578,579,580,581,582,583,584,
                                              585,586,587,588,589,590,591,592,
                                              593,594,595,596,597,598,599,600,
                                              601,602,603,604,605,606,607,608,
                                              609,610,611,612,613,614,615,616,
                                              617,618,619,620,621,622,623,624,
                                              625,626,627,628,629,630,631,632,
                                              633,634,635,636,637,638,639,640,
                                              641,642,643,644,645,646,647,648,
                                              649,650,651,652,653,654,655,656,
                                              657,658,659,660,661,662,663,664,
                                              665,666,667,668,669,670,671,672,
                                              673,674,675,676,677,678,679,680,
                                              681,682,683,684,685,686,687,688,
                                              689,690,691,692,693,694,695,696,
                                              697,698,699,700,701,702,703,704,
                                              705,706,707,708,709,710,711,712,
                                              713,714,715,716,717,718,719,720,
                                              721,722,723,724,725,726,727,728,
                                              729,730,731,732,733,734,735,736,
                                              737,738,739,740,741,742,743,744,
                                              745,746,747,748,749,750,751,752,
                                              753,754,755,756,757,758,759,760,
                                              761,762,763,764,765,766,767,768,
                                              769,770,771,772,773,774,775,776,
                                              777,778,779,780,781,782,783,784,
                                              785,786,787,788,789,790,791,792,
                                              793,794,795,796,797,798,799,800,
                                              801,802,803,804,805,806,807,808,
                                              809,810,811,812,813,814,815,816,
                                              817,818,819,820,821,822,823,824,
                                              825,826,827,828,829,830,831,832,
                                              833,834,835,836,837,838,839,840,
                                              841,842,843,844,845,846,847,848,
                                              849,850,851,852,853,854,855,856,
                                              857,858,859,860,861,862,863,864,
                                              865,866,867,868,869,870,871,872,
                                              873,874,875,876,877,878,879,880,
                                              881,882,883,884,885,886,887,888,
                                              889,890,891,892,893,894,895,896,
                                              897,898,899,900,901,902,903,904,
                                              905,906,907,908,909,910,911,912,
                                              913,914,915,916,917,918,919,920,
                                              921,922,923,924,925,926,927,928,
                                              929,930,931,932,933,934,935,936,
                                              937,938,939,940,941,942,943,944,
                                              945,946,947,948,949,950,951,952,
                                              953,954,955,956,957,958,959,960,
                                              961,962,963,964,965,966,967,968,
                                              969,970,971,972,973,974,975,976,
                                              977,978,979,980,981,982,983,984,
                                              985,986,987,988,989,990,991,992,
                                              993,994,995,996,997,998,999,
                                              1000,1001,1002,1003,1004,1005,
                                              1006,1007,1008,1009,1010,1011,
                                              1012,1013,1014,1015,1016,1017,
                                              1018,1019,1020,1021,1022,1023]])
[ns_server:debug,2012-09-08T1:18:02.161,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.16' is <18036.10781.2>
[couchdb:info,2012-09-08T1:18:02.162,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Stopping updater for set view `default`, main group `_design/d1`
[ns_server:debug,2012-09-08T1:18:02.176,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.24' is <18041.13682.2>
[couchdb:info,2012-09-08T1:18:02.179,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Updater, set view `default`, main group `_design/d1`, stopped with reason: {updater_error,
                                                                            shutdown}
[couchdb:info,2012-09-08T1:18:02.234,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Set view `default`, main group `_design/d1`, partition states updated
active partitions before: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
active partitions after: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
passive partitions before: []
passive partitions after: []
cleanup partitions before: [133,200,201,202,203,204,205]
cleanup partitions after: [133,137,200,201,202,203,204,205]
unindexable partitions: []
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
replicas on transfer before: []
replicas on transfer after: []
pending transition before:
  active: [200,201,202,203,204,205]
  passive: []
pending transition after:
  active: [200,201,202,203,204,205]
  passive: []

[ns_server:info,2012-09-08T1:18:02.238,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder:build_replicas_main:94]Got exit not from child ebucketmigrator. Assuming it's our parent: {'EXIT',
                                                                    <0.19493.2>,
                                                                    shutdown}
[couchdb:info,2012-09-08T1:18:02.241,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Starting updater for set view `default`, main group `_design/d1`
[couchdb:info,2012-09-08T1:18:02.242,ns_1@10.3.121.14:<0.19533.2>:couch_log:info:39]Updater for set view `default`, main group `_design/d1` started
Active partitions: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
Passive partitions: []
Cleanup partitions: [133,137,200,201,202,203,204,205]
Replicas to transfer: []
Pending transition:
    active: [200,201,202,203,204,205]
    passive: []
Initial build: false

[views:info,2012-09-08T1:18:02.242,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:460]
couch_set_view:set_partition_states([<<"default">>,<<"_design/d1">>,
                                     [103,104,105,106,107,108,109,110,111,112,
                                      113,114,115,116,117,118,119,120,121,122,
                                      123,124,125,126,127,128,129,130,131,132,
                                      138,139,140,141,142,143,144,145,146,147,
                                      148,149,150,151,152,153,154,155,156,157,
                                      158,159,160,161,162,163,164,165,166,167,
                                      168,169,170,171,172,173,174,175,176,177,
                                      178,179,180,181,182,183,184,185,186,187,
                                      188,189,190,191,192,193,194,195,196,197,
                                      198,199,200,201,202,203,204,205],
                                     [],
                                     [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
                                      16,17,18,19,20,21,22,23,24,25,26,27,28,
                                      29,30,31,32,33,34,35,36,37,38,39,40,41,
                                      42,43,44,45,46,47,48,49,50,51,52,53,54,
                                      55,56,57,58,59,60,61,62,63,64,65,66,67,
                                      68,69,70,71,72,73,74,75,76,77,78,79,80,
                                      81,82,83,84,85,86,87,88,89,90,91,92,93,
                                      94,95,96,97,98,99,100,101,102,133,134,
                                      135,136,137,206,207,208,209,210,211,212,
                                      213,214,215,216,217,218,219,220,221,222,
                                      223,224,225,226,227,228,229,230,231,232,
                                      233,234,235,236,237,238,239,240,241,242,
                                      243,244,245,246,247,248,249,250,251,252,
                                      253,254,255,256,257,258,259,260,261,262,
                                      263,264,265,266,267,268,269,270,271,272,
                                      273,274,275,276,277,278,279,280,281,282,
                                      283,284,285,286,287,288,289,290,291,292,
                                      293,294,295,296,297,298,299,300,301,302,
                                      303,304,305,306,307,308,309,310,311,312,
                                      313,314,315,316,317,318,319,320,321,322,
                                      323,324,325,326,327,328,329,330,331,332,
                                      333,334,335,336,337,338,339,340,341,342,
                                      343,344,345,346,347,348,349,350,351,352,
                                      353,354,355,356,357,358,359,360,361,362,
                                      363,364,365,366,367,368,369,370,371,372,
                                      373,374,375,376,377,378,379,380,381,382,
                                      383,384,385,386,387,388,389,390,391,392,
                                      393,394,395,396,397,398,399,400,401,402,
                                      403,404,405,406,407,408,409,410,411,412,
                                      413,414,415,416,417,418,419,420,421,422,
                                      423,424,425,426,427,428,429,430,431,432,
                                      433,434,435,436,437,438,439,440,441,442,
                                      443,444,445,446,447,448,449,450,451,452,
                                      453,454,455,456,457,458,459,460,461,462,
                                      463,464,465,466,467,468,469,470,471,472,
                                      473,474,475,476,477,478,479,480,481,482,
                                      483,484,485,486,487,488,489,490,491,492,
                                      493,494,495,496,497,498,499,500,501,502,
                                      503,504,505,506,507,508,509,510,511,512,
                                      513,514,515,516,517,518,519,520,521,522,
                                      523,524,525,526,527,528,529,530,531,532,
                                      533,534,535,536,537,538,539,540,541,542,
                                      543,544,545,546,547,548,549,550,551,552,
                                      553,554,555,556,557,558,559,560,561,562,
                                      563,564,565,566,567,568,569,570,571,572,
                                      573,574,575,576,577,578,579,580,581,582,
                                      583,584,585,586,587,588,589,590,591,592,
                                      593,594,595,596,597,598,599,600,601,602,
                                      603,604,605,606,607,608,609,610,611,612,
                                      613,614,615,616,617,618,619,620,621,622,
                                      623,624,625,626,627,628,629,630,631,632,
                                      633,634,635,636,637,638,639,640,641,642,
                                      643,644,645,646,647,648,649,650,651,652,
                                      653,654,655,656,657,658,659,660,661,662,
                                      663,664,665,666,667,668,669,670,671,672,
                                      673,674,675,676,677,678,679,680,681,682,
                                      683,684,685,686,687,688,689,690,691,692,
                                      693,694,695,696,697,698,699,700,701,702,
                                      703,704,705,706,707,708,709,710,711,712,
                                      713,714,715,716,717,718,719,720,721,722,
                                      723,724,725,726,727,728,729,730,731,732,
                                      733,734,735,736,737,738,739,740,741,742,
                                      743,744,745,746,747,748,749,750,751,752,
                                      753,754,755,756,757,758,759,760,761,762,
                                      763,764,765,766,767,768,769,770,771,772,
                                      773,774,775,776,777,778,779,780,781,782,
                                      783,784,785,786,787,788,789,790,791,792,
                                      793,794,795,796,797,798,799,800,801,802,
                                      803,804,805,806,807,808,809,810,811,812,
                                      813,814,815,816,817,818,819,820,821,822,
                                      823,824,825,826,827,828,829,830,831,832,
                                      833,834,835,836,837,838,839,840,841,842,
                                      843,844,845,846,847,848,849,850,851,852,
                                      853,854,855,856,857,858,859,860,861,862,
                                      863,864,865,866,867,868,869,870,871,872,
                                      873,874,875,876,877,878,879,880,881,882,
                                      883,884,885,886,887,888,889,890,891,892,
                                      893,894,895,896,897,898,899,900,901,902,
                                      903,904,905,906,907,908,909,910,911,912,
                                      913,914,915,916,917,918,919,920,921,922,
                                      923,924,925,926,927,928,929,930,931,932,
                                      933,934,935,936,937,938,939,940,941,942,
                                      943,944,945,946,947,948,949,950,951,952,
                                      953,954,955,956,957,958,959,960,961,962,
                                      963,964,965,966,967,968,969,970,971,972,
                                      973,974,975,976,977,978,979,980,981,982,
                                      983,984,985,986,987,988,989,990,991,992,
                                      993,994,995,996,997,998,999,1000,1001,
                                      1002,1003,1004,1005,1006,1007,1008,1009,
                                      1010,1011,1012,1013,1014,1015,1016,1017,
                                      1018,1019,1020,1021,1022,1023]]) returned ok in 80ms
[couchdb:info,2012-09-08T1:18:02.271,ns_1@10.3.121.14:<0.19539.2>:couch_log:info:39]Updater reading changes from active partitions to update main set view group `_design/d1` from set `default`
[views:info,2012-09-08T1:18:02.282,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:464]
Calling couch_set_view:add_replica_partitions([<<"default">>,<<"_design/d1">>,
                                               [0,1,2,3,4,5,6,7,8,9,10,11,24,
                                                25,26,27,28,29,36,37,38,39,40,
                                                41,42,43,44,45,46,47,48,49,50,
                                                51,58,59,60,61,62,63,80,81,82,
                                                83,84,85,92,93,94,95,96,97,98,
                                                99,100,101,102,206,207,208,
                                                209,210,211,218,219,220,221,
                                                222,223,224,225,226,227,228,
                                                229,281,282,283,284,285,286,
                                                321,322,323,324,325,326,327,
                                                328,329,330,331,332,333,334,
                                                335,336,337,338,339,340,341,
                                                342,343,344,356,357,358,359,
                                                360,361,374,375,376,377,378,
                                                395,396,397,398,399,423,424,
                                                425,426,427,428,429,430,431,
                                                432,433,514,515,516,517,518,
                                                519,525,526,527,528,529,530,
                                                531,532,533,534,535,627,628,
                                                629,630,631,632,633,634,635,
                                                636,637,729,730,731,732,733,
                                                734,735,736,737,738,739,832,
                                                833,834,835,836,837,838,839,
                                                840,841,842,854,855,856,857,
                                                858,933,934,935,936,937,938,
                                                939,940,941,942,943,944,1000,
                                                1001,1002,1003,1004,1005,1012,
                                                1013,1014,1015,1016,1017]])
[couchdb:info,2012-09-08T1:18:02.285,ns_1@10.3.121.14:<0.15832.2>:couch_log:info:39]10.3.121.22 - - POST /_view_merge/?stale=ok&limit=10 200
[couchdb:info,2012-09-08T1:18:02.296,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Set view `default`, main group `_design/d1`, defined new replica partitions: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
New full set of replica partitions is: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]

[ns_server:info,2012-09-08T1:18:02.319,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.14': [<<"replication_building_138_'ns_1@10.3.121.26'">>,
                                                       <<"replication_building_138_'ns_1@10.3.121.16'">>,
                                                       <<"replication_building_138_'ns_1@10.3.121.24'">>]
[ns_server:info,2012-09-08T1:18:02.321,ns_1@10.3.121.14:'janitor_agent-default':janitor_agent:handle_info:646]Undoing temporary vbucket states caused by rebalance
[user:info,2012-09-08T1:18:02.322,ns_1@10.3.121.14:<0.20487.1>:ns_orchestrator:handle_info:295]Rebalance exited with reason {bulk_set_vbucket_state_failed,
                              [{'ns_1@10.3.121.13',
                                {'EXIT',
                                 {{{timeout,
                                    {gen_server,call,
                                     [ns_config,
                                      {eval,
                                       #Fun<cluster_compat_mode.0.45438860>}]}},
                                   {gen_server,call,
                                    ['tap_replication_manager-default',
                                     {change_vbucket_replication,726,
                                      undefined},
                                     infinity]}},
                                  {gen_server,call,
                                   [{'janitor_agent-default',
                                     'ns_1@10.3.121.13'},
                                    {if_rebalance,<0.16591.2>,
                                     {update_vbucket_state,726,replica,
                                      undefined,undefined}},
                                    infinity]}}}}]}

[ns_server:debug,2012-09-08T1:18:02.322,ns_1@10.3.121.14:<0.16604.2>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_node_disco_events,<0.16591.2>} exited with reason {bulk_set_vbucket_state_failed,
                                                                                      [{'ns_1@10.3.121.13',
                                                                                        {'EXIT',
                                                                                         {{{timeout,
                                                                                            {gen_server,
                                                                                             call,
                                                                                             [ns_config,
                                                                                              {eval,
                                                                                               #Fun<cluster_compat_mode.0.45438860>}]}},
                                                                                           {gen_server,
                                                                                            call,
                                                                                            ['tap_replication_manager-default',
                                                                                             {change_vbucket_replication,
                                                                                              726,
                                                                                              undefined},
                                                                                             infinity]}},
                                                                                          {gen_server,
                                                                                           call,
                                                                                           [{'janitor_agent-default',
                                                                                             'ns_1@10.3.121.13'},
                                                                                            {if_rebalance,
                                                                                             <0.16591.2>,
                                                                                             {update_vbucket_state,
                                                                                              726,
                                                                                              replica,
                                                                                              undefined,
                                                                                              undefined}},
                                                                                            infinity]}}}}]}
[ns_server:debug,2012-09-08T1:18:02.341,ns_1@10.3.121.14:<0.16604.2>:ns_pubsub:do_subscribe_link:149]Deleting {ns_node_disco_events,<0.16591.2>} event handler: ok
[ns_server:debug,2012-09-08T1:18:02.345,ns_1@10.3.121.14:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:337]doing replicate_newnodes_docs
[ns_server:info,2012-09-08T1:18:02.354,ns_1@10.3.121.14:<0.19587.2>:diag_handler:log_all_tap_and_checkpoint_stats:126]logging tap & checkpoint stats
[error_logger:error,2012-09-08T1:18:02.343,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ns_single_vbucket_mover:mover/6
    pid: <0.19493.2>
    registered_name: []
    exception exit: {exited,
                     {'EXIT',<0.16591.2>,
                      {bulk_set_vbucket_state_failed,
                       [{'ns_1@10.3.121.13',
                         {'EXIT',
                          {{{timeout,
                             {gen_server,call,
                              [ns_config,
                               {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                            {gen_server,call,
                             ['tap_replication_manager-default',
                              {change_vbucket_replication,726,undefined},
                              infinity]}},
                           {gen_server,call,
                            [{'janitor_agent-default','ns_1@10.3.121.13'},
                             {if_rebalance,<0.16591.2>,
                              {update_vbucket_state,726,replica,undefined,
                               undefined}},
                             infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
    ancestors: [<0.16591.2>,<0.22331.1>]
    messages: [{'EXIT',<0.16591.2>,
                   {bulk_set_vbucket_state_failed,
                    [{'ns_1@10.3.121.13',
                      {'EXIT',
                       {{{timeout,
                          {gen_server,call,
                           [ns_config,
                            {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
                         {gen_server,call,
                          ['tap_replication_manager-default',
                           {change_vbucket_replication,726,undefined},
                           infinity]}},
                        {gen_server,call,
                         [{'janitor_agent-default','ns_1@10.3.121.13'},
                          {if_rebalance,<0.16591.2>,
                           {update_vbucket_state,726,replica,undefined,
                            undefined}},
                          infinity]}}}}]}}]
    links: [<0.19498.2>,<0.16591.2>]
    dictionary: [{cleanup_list,[<0.19520.2>]}]
    trap_exit: true
    status: running
    heap_size: 2584
    stack_size: 24
    reductions: 4491
  neighbours:


Link to diags of all nodes

https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz



 Comments   
Comment by Karan Kumar (Inactive) [ 10/Sep/12 ]
Seems similar to:-
http://www.couchbase.com/issues/browse/MB-6058

And I guess tap_replication_manager was recently added; making sure this is not a regression?
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
This one is clearly different and, just from looking at the error, again looks like paging. We're clearly timing out on ns_config access.
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
There must be a duplicate somewhere.

In order to confirm paging I need atop.
Comment by Thuan Nguyen [ 11/Sep/12 ]
Fresh install of Couchbase Server 2.0.0-1708 on a 10-node cluster. Verified swap usage = 0% and swappiness set to 10 (a sketch of that check follows the node list below).
Created 2 buckets, default (3GB) and saslbucket (3GB).
Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket).

10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23

Data path /data
View path /data
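A small sketch of how the swap/swappiness check mentioned above might be done on each node; these are standard CentOS 6 commands and an assumption about how the setting was applied, not something recorded in this ticket:

# Verify there is no swap in use and that swappiness is 10.
free -m                       # the "Swap:" line's "used" column should be 0
cat /proc/sys/vm/swappiness   # expected: 10
# Setting it persistently (assumed configuration step):
#   sysctl -w vm.swappiness=10
#   echo "vm.swappiness = 10" >> /etc/sysctl.conf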

I hit this bug again in build 1708 when failing over and rebalancing out node 13.
The failure error also points to node 14.


Rebalance exited with reason {{bulk_set_vbucket_state_failed,
[{'ns_1@10.3.121.22',
{'EXIT',
{{{timeout,
{gen_server,call,
[ns_config,
{eval,
#Fun<cluster_compat_mode.0.45438860>}]}},
{gen_server,call,
['tap_replication_manager-saslbucket',
{change_vbucket_replication,470,
undefined},
infinity]}},
{gen_server,call,
[{'janitor_agent-saslbucket',
'ns_1@10.3.121.22'},
{if_rebalance,<0.24261.36>,
{update_vbucket_state,579,replica,
undefined,undefined}},
infinity]}}}}]},
[{janitor_agent,bulk_set_vbucket_state,4},
{ns_vbucket_mover,
update_replication_post_move,3},
{ns_vbucket_mover,handle_info,2},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
ns_orchestrator002 ns_1@10.3.121.14 13:31:20 - Tue Sep 11, 2012

Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-10nodes-1708-reb-failed-bulk_set_vbucket_state_failed-20120911-140230.tgz

Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/10nodes-1708-reb-failed-bulk_set_vbucket_state_failed-20120911-140230.tgz
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
@Alk

Tony attached the relevant atop output in the comment above.
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
Thanks. At last some useful data. I observed quite a severe increase in the resident size of the Erlang VM. So major page faults are not there; we quite possibly are observing minor page faults being blocked on lack of memory.
Comment by Thuan Nguyen [ 11/Sep/12 ]
Link to the Erlang crash dumps taken when resident memory of beam.smp went over 1 GB on nodes 16 and 17 during the rebalance-out of node 13.

https://s3.amazonaws.com/packages.couchbase/erlang/orange/201209/2-erl-dump-Rm-over-1GB.tgz
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
May I ask for logs from this run?
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
Test
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
Saw sudden growth of the RSS of beam.smp from ~500 MB to ~1.2 GB on the node that timed out and caused the rebalance failure.

This is also related to the other timeouts we have been seeing (ddocs/ns_disco/ns_doctor) on the nodes that have a high RSS value.
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
Give me the logs please. I'm seeing very weird things in the crash dumps and would like to see the logs.
Comment by Thuan Nguyen [ 11/Sep/12 ]
Generating diags. Will upload them soon.
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
It's almost done.
You can access the cluster if you wish: 10.3.121.24:8091
Comment by Thuan Nguyen [ 11/Sep/12 ]
Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1708-res-mem-beam-over-1gb-20120911.tgz
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
Looking at the Erlang crash dump. It says the following are taints:

Taints "couch_view_parser,couch_ejson_compare,mapreduce,snappy,ejson,crypto"
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
Taints are fine. Basically it's third-party C code living in the same address space as Erlang; as far as I understand, any NIF will show up as a taint.
Comment by Karan Kumar (Inactive) [ 11/Sep/12 ]
Ok.

Another thing: the number of atoms is 20605. Seems large?
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
No. The really weird thing is the messages queued up in error_logger.
Comment by Aleksey Kondratenko [ 11/Sep/12 ]
Logs of .17 are too old. Latest message is:


[couchdb:info,2012-09-08T1:25:26.301,ns_1@10.3.121.17:<0.7480.2>:couch_log:info:39]10.3.121.13 - - POST /_view_merge/?stale=ok&limit=500 500
[views:info,2012-09-08T1:25:26.342,ns_1@10.3.121.17:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:464]
Calling couch_set_view:add_replica_partitions([<<"default">>,<<"_design/d1">>,
                                               [36,37,38,39,40,41,42,43,44,45,
                                                46,139,140,141,142,143,144,
                                                145,146,147,148,149,212,213,
                                                214,215,216,217,242,243,244,
                                                245,246,247,248,249,250,251,
                                                252,253,292,293,294,295,296,
                                                297,345,346,347,348,349,350,
                                                351,352,353,354,355,509,510,
                                                520,521,522,523,524,525,526,
                                                527,528,529,530,531,532,533,
                                                534,535,536,537,538,539,540,
                                                541,542,543,544,545,546,547,
                                                548,549,550,551,552,553,554,
                                                555,556,557,558,559,560,561,
                                                562,563,564,565,566,567,568,
                                                569,570,571,572,573,574,575,
                                                576,577,578,579,580,581,600,
                                                601,602,603,604,608,609,610,
                                                611,612,613,614,615,660,661,
                                                662,663,664,665,666,667,668,
                                                669,670,671,683,684,685,686,
                                                687,688,689,690,691,692,693,
                                                694,762,763,764,765,766,767,
                                                768,769,770,771,772,773,774,
                                                775,776,777,778,865,866,867,
                                                868,869,870,871,872,873,874,
                                                875,967,968,969,970,971,972,
                                                973,974,975,976,977,978,979,
                                                980,981,982,983]])
[couchdb:info,2012-09-08T1:25:26.382,ns_1@10.3.121.17:<0.26843.2>:couch_log:info:39]Updater reading changes from active partitions to update main set view group `_design/d1` from set `default`
Comment by Thuan Nguyen [ 11/Sep/12 ]
Here are two Erlang crash dumps: one from healthy node 24 and one from the problem node (14), where beam memory was over 1 GB.
Comment by Thuan Nguyen [ 11/Sep/12 ]
Link to diags for node 17, downloaded from the UI.
https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/10.3.121.17-diags-20120911195436.txt.gz
Comment by Farshid Ghods (Inactive) [ 12/Sep/12 ]
Adding this to the current sprint since I see Chiyoung and Alk are already working on and debugging this issue.
Comment by Aleksey Kondratenko [ 13/Sep/12 ]
May I ask you guys to re-run while periodically doing this:

wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> [erlang:garbage_collect(P) || P <- erlang:processes()], ok end, []])" http://myhost:8091/diag/eval

Running it on just one node should be enough.

I want to double-check that this is caused by weird GC.
Comment by Karan Kumar (Inactive) [ 13/Sep/12 ]
At what frequency do you want us to run the Erlang GC?
Comment by Aleksey Kondratenko [ 13/Sep/12 ]
Every 30 seconds. And monitor memory usage.

I've seen a single run halve memory usage. But it would be very interesting to see whether some growth still happens during rebalance in your case.
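
For convenience, a minimal shell sketch of the suggested workaround, assuming the Administrator credentials shown above and one reachable node (the hostname and password are placeholders, not values from this cluster):

{code}
#!/bin/sh
# Hedged sketch: every 30 seconds, force a full garbage collection of all
# Erlang processes cluster-wide via /diag/eval on a single node, then record
# the local beam.smp resident size so growth can be correlated with GC runs.
while true; do
  wget -q -O- --user=Administrator --password=asdasd \
    --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> [erlang:garbage_collect(P) || P <- erlang:processes()], ok end, []])" \
    http://myhost:8091/diag/eval
  echo "$(date '+%F %T') beam.smp RSS (KB): $(ps -o rss= -C beam.smp | head -1)"
  sleep 30
done
{code}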
Comment by Karan Kumar (Inactive) [ 14/Sep/12 ]
@Alk:

Running this every 30 seconds seems to work really well.
Memory usage for beam is around 30-40% lower.

On the same workload, we don't see the suspicious RSS growth.
Comment by Aleksey Kondratenko [ 14/Sep/12 ]
Great. Use this workaround for now. Once my memory usage optimizations land, we'll re-test and then consider whether we still need this workaround in the product, or whether some other tuning is needed.
Comment by Aleksey Kondratenko [ 14/Sep/12 ]
I want you to try the following alternative:

wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> erlang:system_flag(fullsweep_after, 0) end, []])" http://lh:9000/diag/eval

Run this once on one node. It will enable more aggressive GC on all nodes. I hope it will produce results very similar to the periodic full manual GC.

After testing the setting of 0 (and only if it works reasonably well) I want you to try 10.

wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> erlang:system_flag(fullsweep_after, 10) end, []])" http://lh:9000/diag/eval

That should provide very valuable information.
Comment by Aleksey Kondratenko [ 14/Sep/12 ]
Ignore the message above. We'll need this in the couchbase-server script:

    export ERL_FULLSWEEP_AFTER=0
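
As a quick sanity check after that script change, a hedged sketch (not part of the product change; process selection and paths are assumptions, and reading the environment may require root) to confirm that a running beam.smp actually inherited the variable:

{code}
# Dump the environment of the oldest beam.smp process on this node; if the
# couchbase-server script change took effect, ERL_FULLSWEEP_AFTER shows up here.
pid=$(pgrep -o beam.smp)
tr '\0' '\n' < /proc/"$pid"/environ | grep ERL_FULLSWEEP_AFTER || echo "ERL_FULLSWEEP_AFTER not set"
{code}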
Comment by Karan Kumar (Inactive) [ 14/Sep/12 ]
We are going to try the value 0 first, then the value 10.
Comment by Thuan Nguyen [ 17/Sep/12 ]
With ERL_FULLSWEEP_AFTER set to 0, I got a timeout again.
Link to atop file of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-12nodes-1728-reset-reb-status-since-not-running-20120917.tgz
Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/12nodes-col-info-1728-reset-reb-status-since-not-running-20120917.tgz

With ERL_FULLSWEEP_AFTER set to 10, I have not hit this bug yet. I will run more tests to confirm.
Second try: swap rebalance adding nodes 22 and 23 and removing nodes 14 and 15. The rebalance did not hit this bug.
Comment by Aleksey Kondratenko [ 19/Sep/12 ]
Thanks, I'll take a look ASAP.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
I looked at this and was unable to see anything specific that would cause the weird slowness.

There are no signs of swapping, yet we hit those timeouts.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
As noted above, I looked at the latest diags and could not spot anything specifically wrong in ns_server.

There is still some suspicion about the environment, as well as about the potential effect of CPU cycles spent in NIFs not being properly accounted for.

Please note that I do enjoy investigating tough cases like this. But:

*) apparently our system-test use case has changed and I have no idea if this is still relevant;

*) we may have other folks better equipped, and maybe with more time than myself, to work on this.

So I need an explicit "go ahead" in order to continue working on this.
Comment by Peter Wansch (Inactive) [ 04/Oct/12 ]
Farshid, can we hand this one to Mike to look at it from an ep_engine perspective?
Comment by Farshid Ghods (Inactive) [ 05/Oct/12 ]
Mike,

Can you please sync up with the system test team and look at the timeout issues from the ep-engine perspective?
Comment by Mike Wiederhold [ 05/Oct/12 ]
There is nothing in this issue that looks suspicious from the ep-engine side. If it is assigned back to the ep-engine team, please include the memcached logs. Karan has asked that this bug be assigned back to him for a follow-up.
Comment by Aleksey Kondratenko [ 17/Oct/12 ]
I forgot to ask when I saw this: what is the evidence that this is caused by garbage collection? The ticket clearly doesn't contain any so far.
Comment by Farshid Ghods (Inactive) [ 18/Oct/12 ]
Karan is trying 10 and 10,000.
The performance team has tried 16.
Comment by Farshid Ghods (Inactive) [ 18/Oct/12 ]
Pavel,

Can you update the ticket with the results you have from the view perf tests where ERL_FULLSWEEP_AFTER is set to 5, 10, 15, or other values?
Comment by Steve Yen [ 18/Oct/12 ]
Some results here...

  https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdGd2NmhLOThCRUc0ekFkZ0FDeGdLRXc#gid=0

{code}
Build 1782, aggressive workload
ERL_FULLSWEEP_AFTER  Set latency (90th)  Get latency (90th)  Query latency (80th)  Query throughput  Avg. mem beam.smp (MB)  Avg. drain rate  Runtime (h)
DEFAULT (64K)        1.39                1.7                 17.35                 1152              325.73                  700.35           3.25
5                    1.05                1.38                20.66                 984               232.05                  645.96           4.77
10                   1.66                1.18                20.27                 1057.9            249.89                  668.04           4.37
15                   1.64                1.38                20.07                 981.68            248.67                  746.57           4.87

Build 1842, aggressive workload (first column: ERL_FULLSWEEP_AFTER; remaining columns unlabeled in the source)
0    32.82  1017.95  151.41  5.55
512  19.51  1126.27  291.11  4.04
1K   19.23  1070.77  304.27  4.43
8K   23.94  1029.81  319.32  5.37
16K  20.64  1094.99  311.01  4.45
32K  26.69  1061.74  320.79  4.2

Build 1724, lite workload (first column: ERL_FULLSWEEP_AFTER; remaining columns unlabeled in the source)
0              10.95  464.07  169.62  5.04
64             9.89   531.2   289.73  3.93
256            10.31  513.05  307.09  4.19
DEFAULT (64K)  10.26  527.62  325.37  3.78
{code}

Comment by Farshid Ghods (Inactive) [ 18/Oct/12 ]
Karan,
Based on the performance test results, it seems like 512 is the best candidate that does not impact view latency and query throughput.

Has the system test team run any test with 512?
If we can use this value for different workloads and it passes system tests, we can assign this ticket to Alk and suggest the right value.
Comment by Aleksey Kondratenko [ 18/Oct/12 ]
It appears people are confusing timeouts (the software doesn't work) with performance numbers (how fast the software _works_).

The only effect of fullsweep is _supposedly_ full collections all the time. Coupled with Erlang's supposedly 'soft-realtime' design, where GC is run independently and separately on different Erlang process heaps, this IMHO doesn't explain the timeouts. At all.
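
If it helps to separate those two questions, here is a hedged diagnostic sketch (not part of the original investigation) that reuses the /diag/eval endpoint from earlier comments to list the ten largest Erlang process heaps on one node; the host and credentials are placeholders. If no individual heap is ballooning, fullsweep tuning alone is unlikely to explain the timeouts.

{code}
# Top 10 processes by total_heap_size (in words) on the node answering 8091.
wget -O- --user=Administrator --password=asdasd \
  --post-data="lists:sublist(lists:reverse(lists:keysort(2, [{P, THS} || P <- erlang:processes(), {total_heap_size, THS} <- [erlang:process_info(P, total_heap_size)]])), 10)" \
  http://myhost:8091/diag/eval
{code}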
Comment by Aleksey Kondratenko [ 18/Oct/12 ]
By the way, the spreadsheet linked above doesn't define its columns. I was unable to read people's minds on what the first column means in particular.
Comment by Karan Kumar (Inactive) [ 18/Oct/12 ]
The first column is the value of ERL_FULLSWEEP_AFTER.
Comment by Steve Yen [ 19/Oct/12 ]
Alk says there are two issues:

- a better full-sweep setting (e.g., the "512" from the spreadsheet; I'll create a separate JIRA issue for that)

- other mysterious Erlang timeouts that need root-cause understanding/analysis (Alk wants to own this, so assigning to him)

A better full-sweep setting can help but does not fully address the root cause.

Comment by Steve Yen [ 24/Oct/12 ]
See MB-6974 for the full-sweep 512 setting.
Comment by Steve Yen [ 24/Oct/12 ]
With the MB-6974 change (full-sweep 512 setting), is this bug (MB-6595) no longer blocker-level priority?
Comment by Aleksey Kondratenko [ 24/Oct/12 ]
Updated the title to reflect reality. We have no evidence that this is caused by GC.

Steve, answering your question: we had those weird and maybe GC-related timeouts before. Now we also have the danger of timeouts caused by a lack of async threads. We need to get on top of this. I think this is blocker #1, personally.

And by the way, I just hit an ns_config:get timeout on pretty harmless indexing of 9 million simple items.
Comment by Farshid Ghods (Inactive) [ 25/Oct/12 ]
Changing the priority to Critical because, with the new Erlang GC setting, we do not have reports of timeouts from system tests or performance tests.
Comment by Aleksey Kondratenko [ 25/Oct/12 ]
Farshid, yet. You don't have reports, yet.
Comment by Thuan Nguyen [ 25/Oct/12 ]
Tested with build 2.0.0-1862, with the Erlang GC setting at 10 and consistent views enabled; I don't see any timeouts during rebalance.
Comment by Farshid Ghods (Inactive) [ 26/Oct/12 ]
Tony,

please confirm that after running a 2.0 cluster on builds that have the MB-7002 fix (build 1888+) you don't see timeouts and rebalance failures.

Also, please do not change the default ERL_FULLSWEEP_AFTER value to 10 anymore, because those builds already have the value set to 512.
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
Also, stop tuning the kernel's swappiness.

Let me also add that we decided QE's normal rebalance tests would be run on physical hardware with a durable disk setup. That would be a better test, IMHO.
Comment by Farshid Ghods (Inactive) [ 26/Oct/12 ]
QE will run one of the system tests on a physical environment multiple times, but most tests will still be run on EC2 or on virtualized Xen or VMware environments.

Unfortunately we don't have enough hardware to run all tests on a physical environment.

Functional tests are run on a virtual environment; the statement applies only to system tests.
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
I'm fine with that. But I just wanted to note that this bug is not only for system tests anymore; it's for all kinds of unknown timeouts, because we have no idea what's causing them. Hopefully it is indeed GC.
Comment by Farshid Ghods (Inactive) [ 26/Oct/12 ]
Yes, that's a valid point. We don't see these timeouts happening at all in functional testing.

Also, I heard that if the swappiness value is not set to 10% there is a higher chance that Erlang uses some swap; we saw strange failures in XDCR and other system testing on EC2, and hence it is now set back to 10%.

If you think this is not necessary after the garbage collection fix, I will let the system test team know.
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
On swappiness: we should use whatever our customers will use. So if we are certain to recommend, or even demand, that our users decrease swappiness, I'm 100% fine with it.

I just want to make sure we test with the same settings that people will use.
Comment by Aleksey Kondratenko [ 31/Oct/12 ]
I'm seeing occasional timeouts in the logs of system tests.

Plus I just hit this again when trying to benchmark rebalance time with views.
Comment by Steve Yen [ 01/Nov/12 ]
Aliaksey A. is working on systemtap instrumentation.
Comment by Steve Yen [ 01/Nov/12 ]
Farshid reports not seeing timeouts from the user's viewpoint in system tests.

Assigning to Alk per the bug-scrub meeting.
Comment by Dipti Borkar [ 26/Nov/12 ]
This master bug has other similar bugs marked as sub-tasks.
Comment by Aleksey Kondratenko [ 02/Jan/13 ]
Aliaksey is looking at this
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Related fixes were merged, and build 138-rel+ has those fixes. The system test team will update the ticket if those issues are discovered again.

Aliaksey, is it okay to assign this to QE?
Comment by Aleksey Kondratenko [ 28/Jan/13 ]
We're fine with this being assigned to QE. But given there is still some ongoing investigation of current problems, I'm not sure bouncing this ticket around is the best idea.

We hope our remaining issues are confined either to Windows (where none of our fixes apply) or to our virtualization environment.

And I hope the ongoing investigation will confirm that.
Comment by Farshid Ghods (Inactive) [ 31/Jan/13 ]
We have not seen these timeouts on the Linux SSD Xen-virtualized cluster.

We did, however, see these behaviors significantly on the Windows VMware-virtualized cluster, but I recommend resolving this ticket since it was opened for Linux clusters.
Comment by Jin Lim [ 11/Feb/13 ]
Trond and Sri will be working on many optimizations and enhancements to make Couchbase Server run better on Windows overall. As QE (Farshid) pointed out, we should add this to the list of things to enhance/address on Windows and move on at this point.

Aliaksey - please advise whether it is OK to resolve this issue and add it to the Windows enhancement bin. Also, this bug is flagged as a release note candidate. Please advise whether the timeout behavior on Windows should be described in the release note, and whether we should recommend that users decrease swappiness (and by how much?). Thanks.
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
Indeed, let's create a new ticket for the Windows work.

Aliaksey figured out that the tuning option that helped decrease minor page faults on GNU/Linux does not apply to Windows, as apparently the equivalent of mmap is not used to allocate memory there.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
http://www.couchbase.com/issues/browse/MB-7658 has already been created for addressing the issues on Windows.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
QE has not seen this issue on the Linux cluster on any of the 2.0.1 builds (135 through 148).
Comment by Jin Lim [ 11/Feb/13 ]
Alk, thanks for the quick update. Do you concur with the idea of putting a suggestion to decrease swappiness in the release note? If so, by how much do we recommend users decrease the setting?
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
AFAIK all of the latest testing was done with the default swappiness. I cannot recommend any particular value.
Comment by Farshid Ghods (Inactive) [ 11/Feb/13 ]
Per the system test spec and other experiments, swappiness is set to 10%.

As observed in the system test graphs, swap usage grows without bound (Ketaki to attach some graphs) during the query and indexing phases, so reducing swappiness to 10% instead of the default of 60% has helped before.
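
For reference, a minimal sketch of the swappiness adjustment described above on a standard Linux node (the sysctl paths are the usual ones; the value 10 matches the system-test setup mentioned in this comment):

{code}
# Show the current value (the kernel default here is 60).
cat /proc/sys/vm/swappiness

# Lower it for the running kernel...
sudo sysctl -w vm.swappiness=10

# ...and make it persistent across reboots.
echo 'vm.swappiness = 10' | sudo tee -a /etc/sysctl.conf
{code}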
Comment by Ketaki Gangal [ 11/Feb/13 ]
Link to system-test results here (report_atop, pages 5/6 for swap used): https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/linux-views-ssd/2013_24_01/report_atop_10.6.2.42_default_simple-view_test.pdf
Link to the system test specs for these test runs: http://hub.internal.couchbase.com/confluence/display/QA/views-test
Comment by Ketaki Gangal [ 11/Feb/13 ]
swap results file attached.
Comment by Jin Lim [ 11/Feb/13 ]
Assigning it to Karen quickly so she has time to capture what needs to go into the release note. Karen, please review the above comments about swappiness. Thanks!
Comment by Jin Lim [ 11/Feb/13 ]
Please close this ticket once you have collected all the information required for the release note. Thanks much!
Comment by kzeller [ 13/Feb/13 ]
RN: The server experienced severe timeouts during
rebalance if views were being indexed or compacted at the same time.
This caused the rebalance to fail. This has been fixed.
Comment by kzeller [ 13/Feb/13 ]


Note from Alk: the guidance on the swappiness percentage should not be part of the note.
Comment by Aleksey Kondratenko [ 13/Feb/13 ]
Correction: this was fixed on GNU/Linux, but not on Windows.