[MB-6494] [longevity] rebalance fails with error "Got error while trying to send close confirmation: {error,enotconn} during rebalance" when adding a failed over node back to the cluster and failing over another node Created: 30/Aug/12  Updated: 10/Jan/13  Resolved: 20/Nov/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta
Fix Version/s: 2.0.1
Security Level: Public

Type: Bug Priority: Major
Reporter: Thuan Nguyen Assignee: Chiyoung Seo
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64 bit


 Description   
Cluster information:
- 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
- Each server has 10 GB RAM and a 150 GB disk.
- 8 GB RAM allocated to Couchbase Server on each node
- Each server has its own drive; no disk is shared with other servers.
- Cluster has 2 buckets, default (3 GB) and saslbucket (3 GB)
- Each bucket has one design doc and 2 views per doc.
- Loader 10.3.2.4

* Build 2.0.0-1645 on 11 nodes
* Create cluster with 10 nodes
10.3.121.13
10.3.121.14
10.3.121.15
10.3.121.16
10.3.121.17
10.3.121.20
10.3.121.22
10.3.121.24
10.3.121.25
10.3.121.23
* Data path /data
* View path /data

* Testing activities:
- Failover, rebalance, and swap rebalance (in and out).

* Latest steps leading to this failure:
- Failover node 14
- Rebalance, then stop the rebalance. As a result, node 14 was removed from the cluster.
- Add node 14 back to the cluster and failover node 13.
- Rebalance. The rebalance failed, and during the rebalance an error pointing to node 16 was reported:

 "Got error while trying to send close confirmation: {error,enotconn} 13:56:20 - Thu Aug 30, 2012 " on node 16.

Analyzing the diags of node 16 shows this error at 2012-08-30T13:56:20:

[ns_server:debug,2012-08-30T13:56:20.614,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:kill_tapname:932]killing tap named: replication_building_30_'ns_1@10.3.121.16'
[rebalance:info,2012-08-30T13:56:20.620,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:522]Starting tap stream:
[{vbuckets,[]},
 {checkpoints,[]},
 {name,<<"replication_building_30_'ns_1@10.3.121.16'">>},
 {takeover,false}]
{{"10.3.121.25",11209},
 {"10.3.121.16",11209},
 [{vbuckets,[30]},
  {takeover,false},
  {suffix,"building_30_'ns_1@10.3.121.16'"},
  {username,"saslbucket"},
  {password,"password"}]}
[rebalance:debug,2012-08-30T13:56:20.623,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:555]upstream_sender pid: <0.13244.35>
[rebalance:info,2012-08-30T13:56:20.665,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 18
[rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 24
[rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 25
[rebalance:error,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:confirm_sent_messages:674]Got error while trying to send close confirmation: {error,enotconn}

[views:info,2012-08-30T13:56:20.852,ns_1@10.3.121.16:'capi_ddoc_replication_srv-saslbucket':capi_set_view_manager:apply_index_states:351]
Calling couch_set_view:set_partition_states([<<"saslbucket">>,
                                             <<"_design/d11">>,
                                             [12,13,14,15,16,17,19,20,21,22,
                                              23,24,25,26,36,37,38,39,40,41,
                                              42,69,70,71,72,73,84,85,86,87,
                                              92,93,94,103,104,105,106,107,
                                              108,109,110,111,112,113,114,142,
                                              143,150,151,152,153,154,155,156,
                                              178,179,180,181,182,196,197,218,
                                              219,220,221,222,223,224,225,226,
                                              227,228,229,230,231,232,233,234,
                                              235,236,237,242,243,244,245,246,
                                              247,248,249,250,251,252,253,265,
                                              266,267,268,288,289,290,291,309,
                                              310,311,312,313,314,315,321,322,
                                              323,333,334,335,336,337,338,339,
                                              340,341,342,343,344,347,348,349,
                                              350,351,352,353,354,355,362,363,
                                              364,365,366,367,379,380,381,382,
                                              383,384,385,395,396,406,407,408,
                                              409,410,411,446,447,448,449,450,
                                              451,452,453,503,504,505,506,507,
                                              508,509,510,511,517,518,519,520,
                                              521,522,523,524,603,604,613,614,
                                              615,632,633,634,635,636,637,695,
                                              696,748,749,750,811,812,813,843,
                                              844,845,846,847,848,854,876,916,
                                              917,918,919,920,927,928,929,930,
                                              931,932,939,940,945,946,947,948,
                                              949,950,951,965,966,978,979,
                                              1004,1005,1012,1013,1014,1015,
                                              1016,1017,1018,1019,1020,1021,
                                              1022,1023],
                                             [],
                                             [0,1,2,3,4,5,6,7,8,9,10,11,18,27,
                                              28,29,30,31,32,33,34,35,43,44,
                                              45,46,47,48,49,50,51,52,53,54,
                                              55,56,57,58,59,60,61,62,63,64,
                                              65,66,67,68,74,75,76,77,78,79,
                                              80,81,82,83,88,89,90,91,95,96,
                                              97,98,99,100,101,102,115,116,
                                              117,118,119,120,121,122,123,124,
                                              125,126,127,128,129,130,131,132,
                                              133,134,135,136,137,138,139,140,
                                              141,144,145,146,147,148,149,157,
                                              158,159,160,161,162,163,164,165,
                                              166,167,168,169,170,171,172,173,
                                              174,175,176,177,183,184,185,186,
                                              187,188,189,190,191,192,193,194,
                                              195,198,199,200,201,202,203,204,
                                              205,206,207,208,209,210,211,212,
                                              213,214,215,216,217,238,239,240,
                                              241,254,255,256,257,258,259,260,
                                              261,262,263,264,269,270,271,272,
                                              273,274,275,276,277,278,279,280,
                                              281,282,283,284,285,286,287,292,
                                              293,294,295,296,297,298,299,300,
                                              301,302,303,304,305,306,307,308,
                                              316,317,318,319,320,324,325,326,
                                              327,328,329,330,331,332,345,346,
                                              356,357,358,359,360,361,368,369,
                                              370,371,372,373,374,375,376,377,
                                              378,386,387,388,389,390,391,392,
                                              393,394,397,398,399,400,401,402,
                                              403,404,405,412,413,414,415,416,
                                              417,418,419,420,421,422,423,424,
                                              425,426,427,428,429,430,431,432,
                                              433,434,435,436,437,438,439,440,
                                              441,442,443,444,445,454,455,456,
                                              457,458,459,460,461,462,463,464,
                                              465,466,467,468,469,470,471,472,
                                              473,474,475,476,477,478,479,480,
                                              481,482,483,484,485,486,487,488,
                                              489,490,491,492,493,494,495,496,
                                              497,498,499,500,501,502,512,513,
                                              514,515,516,525,526,527,528,529,
                                              530,531,532,533,534,535,536,537,
                                              538,539,540,541,542,543,544,545,
                                              546,547,548,549,550,551,552,553,
                                              554,555,556,557,558,559,560,561,
                                              562,563,564,565,566,567,568,569,
                                              570,571,572,573,574,575,576,577,
                                              578,579,580,581,582,583,584,585,
                                              586,587,588,589,590,591,592,593,
                                              594,595,596,597,598,599,600,601,
                                              602,605,606,607,608,609,610,611,
                                              612,616,617,618,619,620,621,622,
                                              623,624,625,626,627,628,629,630,
                                              631,638,639,640,641,642,643,644,
                                              645,646,647,648,649,650,651,652,
                                              653,654,655,656,657,658,659,660,
                                              661,662,663,664,665,666,667,668,
                                              669,670,671,672,673,674,675,676,
                                              677,678,679,680,681,682,683,684,
                                              685,686,687,688,689,690,691,692,
                                              693,694,697,698,699,700,701,702,
                                              703,704,705,706,707,708,709,710,
                                              711,712,713,714,715,716,717,718,
                                              719,720,721,722,723,724,725,726,
                                              727,728,729,730,731,732,733,734,
                                              735,736,737,738,739,740,741,742,
                                              743,744,745,746,747,751,752,753,
                                              754,755,756,757,758,759,760,761,
                                              762,763,764,765,766,767,768,769,
                                              770,771,772,773,774,775,776,777,
                                              778,779,780,781,782,783,784,785,
                                              786,787,788,789,790,791,792,793,
                                              794,795,796,797,798,799,800,801,
                                              802,803,804,805,806,807,808,809,
                                              810,814,815,816,817,818,819,820,
                                              821,822,823,824,825,826,827,828,
                                              829,830,831,832,833,834,835,836,
                                              837,838,839,840,841,842,849,850,
                                              851,852,853,855,856,857,858,859,
                                              860,861,862,863,864,865,866,867,
                                              868,869,870,871,872,873,874,875,
                                              877,878,879,880,881,882,883,884,
                                              885,886,887,888,889,890,891,892,
                                              893,894,895,896,897,898,899,900,
                                              901,902,903,904,905,906,907,908,
                                              909,910,911,912,913,914,915,921,
                                              922,923,924,925,926,933,934,935,
                                              936,937,938,941,942,943,944,952,
                                              953,954,955,956,957,958,959,960,
                                              961,962,963,964,967,968,969,970,
                                              971,972,973,974,975,976,977,980,
                                              981,982,983,984,985,986,987,988,
                                              989,990,991,992,993,994,995,996,
                                              997,998,999,1000,1001,1002,1003,
                                              1006,1007,1008,1009,1010,1011]])
[ns_server:info,2012-08-30T13:56:20.867,ns_1@10.3.121.16:ns_port_memcached:ns_port_server:log:169]memcached<0.674.0>: Thu Aug 30 20:56:20.666298 3: TAP (Consumer) eq_tapq:anon_56 - Failed to reset a vbucket 18. Force disconnect
memcached<0.674.0>: Thu Aug 30 20:56:20.666374 3: TAP (Consumer) eq_tapq:anon_56 - disconnected

Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/11ndoes-1645-reb-failed-close-confirmation-error-enotconn-20120830.tgz

Link to atop file of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/2.0.0/atop-11ndoes-1645-reb-failed-close-confirmation-error-enotconn-20120830.tgz


 Comments   
Comment by Karan Kumar (Inactive) [ 30/Aug/12 ]
Retrying rebalance results in the same error.
Comment by Aleksey Kondratenko [ 05/Oct/12 ]
Something really weird happens:

We have a replica-building downstream (in this case on node .16) closing the connection:

[error_logger:error,2012-08-30T13:56:20.656,ns_1@10.3.121.23:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <15202.13236.35> terminating
** Last message in was {tcp,#Port<15202.1659527>,
                            <<128,68,0,0,8,0,0,0,0,0,0,12,0,0,0,2,0,0,0,0,0,0,
                              0,0,0,4,0,1,255,0,0,0,0,0,0,0,128,68,0,0,8,0,0,
                              0,0,0,0,12,0,0,0,4,0,0,0,0,0,0,0,0,0,4,0,1,255,
                              0,0,0,0,0,0,2,128,68,0,0,8,0,0,18,0,0,0,12,0,0,
                              0,6,0,0,0,0,0,0,0,0,0,4,0,1,255,0,0,0,0,0,0,1,
                              128,68,0,0,8,0,0,24,0,0,0,12,0,0,0,8,0,0,0,0,0,
                              0,0,0,0,4,0,1,255,0,0,0,0,0,0,1,128,68,0,0,8,0,
                              0,25,0,0,0,12,0,0,0,10,0,0,0,0,0,0,0,0,0,4,0,1,
                              255,0,0,0,0,0,0,1,128,68,0,0,8,0,0,26,0,0,0,12,
                              0,0,0,12,0,0,0,0,0,0,0,0,0,4,0,1,255,0,0,0,0,0,
                              0,1,128,68,0,0,8,0,0,31,0,0,0,12,0,0,0,14,0,0,0,
                              0,0,0,0,0,0,4,0,1,255,0,0,0,0,0,0,1,128,68,0,0,
                              8,0,0,32,0,0,0,12,0,0,0,16,0,0,0,0,0,0,0,0,0,4,
                              0,1,255,0,0,0,0,0,0,1,128,68,0,0,8,0,0,47,0,0,0,
                              12,0,0,0,18,0,0,0,0,0,0,0,0,0,4,0,1,255,0,0,0,0,
                              0,0,1,128,68,0,0,8,0,0,48,0,0,0,12,0,0,0,20,0,0,
                              0,0,0,0,0,0,0,4,0,1,255,0,0,0,0,0,0,1>>}
** When Server state == {state,#Port<15202.1659527>,#Port<15202.1659523>,
                               #Port<15202.1659528>,#Port<15202.1659524>,
                               <15202.13244.35>,<<>>,<<>>,
                               {set,0,16,16,8,80,48,
                                    {[],[],[],[],[],[],[],[],[],[],[],[],[],
                                     [],[],[]},
                                    {{[],[],[],[],[],[],[],[],[],[],[],[],[],
                                      [],[],[]}}},
                               -1,false,false,0,
                               {1346,360180,624142},
                               not_started,undefined,
                               <<"replication_building_30_'ns_1@10.3.121.16'">>,
                               <15202.13236.35>,
                               {had_backfill,undefined,undefined,[]}}
** Reason for termination ==
** {{badmatch,{error,closed}},
    [{ebucketmigrator_srv,process_upstream,2},
     {ebucketmigrator_srv,process_data,3},
     {ebucketmigrator_srv,process_data,4},
     {ebucketmigrator_srv,handle_info,2},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]}


Here are the messages from this process. For some reason there is no open checkpoint on the source node (the current master), and, most strangely, we see backfill for some completely unrelated vbuckets. That may be why the destination closes the socket.

[ns_server:info,2012-08-30T13:56:20.574,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:470]Setting {"10.3.121.16",11209} vbucket 30 to state replica
[views:debug,2012-08-30T13:56:20.598,ns_1@10.3.121.16:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:452]Got set_vbucket event for saslbucket/27. Updated state: replica
[views:debug,2012-08-30T13:56:20.598,ns_1@10.3.121.16:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:452]Got set_vbucket event for saslbucket/27. Updated state: replica
[rebalance:info,2012-08-30T13:56:20.614,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:485]Some vbuckets were not yet ready to replicate from:
[30]

[ns_server:debug,2012-08-30T13:56:20.614,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:kill_tapname:932]killing tap named: replication_building_30_'ns_1@10.3.121.16'
[rebalance:info,2012-08-30T13:56:20.620,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:522]Starting tap stream:
[{vbuckets,[]},
 {checkpoints,[]},
 {name,<<"replication_building_30_'ns_1@10.3.121.16'">>},
 {takeover,false}]
{{"10.3.121.25",11209},
 {"10.3.121.16",11209},
 [{vbuckets,[30]},
  {takeover,false},
  {suffix,"building_30_'ns_1@10.3.121.16'"},
  {username,"saslbucket"},
  {password,"password"}]}
[rebalance:debug,2012-08-30T13:56:20.623,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:555]upstream_sender pid: <0.13244.35>
[rebalance:info,2012-08-30T13:56:20.665,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 18
[rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 24
[rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 25
[rebalance:error,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:confirm_sent_messages:674]Got error while trying to send close confirmation: {error,enotconn}
Comment by Aleksey Kondratenko [ 16/Oct/12 ]
Chiyoung, I think this probably happened because for some reason we actually passed an empty vbucket filter to the TAP start command. If that implies that ep-engine is going to send us _all_ vbuckets rather than nothing, please assign this to me so that we can fix it on the ns_server side.
Comment by Chiyoung Seo [ 16/Oct/12 ]
Yes, if the vbucket filter is empty, it means all vbuckets. As you probably know, this behavior has existed since 1.6.x.

I'd like to change this definition in ep-engine, but it might require changes in the TAP client applications that our customers developed for their own internal purposes.

If it doesn't require much work in ns_server, please go ahead and fix it on the ns_server side. Otherwise, please reassign it to me and I will work with Steve's team to assess the impact on our existing customers.
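To make that convention concrete, here is a minimal, hypothetical C++ sketch (not ep-engine source; names such as VBucketFilter and selectVBucketsToStream are illustrative) of a producer that treats an empty vbucket filter as "all vbuckets", plus the kind of guard a client could apply before starting a replica-building stream:

// Illustrative sketch: empty vbucket filter means "no filtering", i.e.
// every vbucket is streamed (the 1.6.x-era convention described above).
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

struct VBucketFilter {
    std::set<uint16_t> vbuckets;

    bool empty() const { return vbuckets.empty(); }

    // True if the given vbucket should be streamed under this filter.
    bool accepts(uint16_t vb) const {
        return empty() || vbuckets.count(vb) > 0;
    }
};

// A producer walking every vbucket it owns and applying the filter.
std::vector<uint16_t> selectVBucketsToStream(const VBucketFilter& filter,
                                             const std::vector<uint16_t>& owned) {
    std::vector<uint16_t> selected;
    for (uint16_t vb : owned) {
        if (filter.accepts(vb)) {
            selected.push_back(vb);
        }
    }
    return selected;
}

int main() {
    std::vector<uint16_t> owned = {18, 24, 25, 30};

    // Intended filter for a replica-building stream: vbucket 30 only.
    VBucketFilter intended{{30}};
    // Buggy filter: the vbucket list was dropped somewhere upstream.
    VBucketFilter buggy{};

    std::cout << "intended filter streams "
              << selectVBucketsToStream(intended, owned).size() << " vbuckets\n";  // 1
    std::cout << "empty filter streams "
              << selectVBucketsToStream(buggy, owned).size() << " vbuckets\n";     // all 4

    // A defensive check a replica builder could make before starting the
    // stream, since "empty" silently means "everything" on the producer side.
    if (buggy.empty()) {
        std::cerr << "refusing to start replica-building stream with an empty vbucket filter\n";
    }
    return 0;
}

Under that convention, a dropped vbucket list silently turns a single-vbucket replica-building stream into a full stream of every vbucket, which matches the "VBucket N not found for TAP cursor" spam seen on node 25.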
Comment by Aleksey Kondratenko [ 17/Oct/12 ]
Liang will take a look at why we thought vbucket 30 wasn't ready to replicate from the master. I'll create a separate bug for handling an empty ready-vbuckets set in ebucketmigrator.
Comment by Liang Guo (Inactive) [ 18/Oct/12 ]
The first incident of "Some vbuckets were not yet ready to replicate from: [30]" followed by {error,enotconn} happened at 2012-08-30T13:47:33.576 on node 16.

[rebalance:info,2012-08-30T13:47:33.576,ns_1@10.3.121.16:<0.8085.35>:ebucketmigrator_srv:init:485]Some vbuckets were not yet ready to replicate from:
[30]

On source node 25, the connection for replication_building_30_'ns_1@10.3.121.16' was closed by force, probably due to the error condition on node 16.

memcached<0.654.0>: Thu Aug 30 20:47:33.579731 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 84 not found for TAP cursor. Skip it...
memcached<0.654.0>: Thu Aug 30 20:47:33.579736 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 85 not found for TAP cursor. Skip it...
memcached<0.654.0>: Thu Aug 30 20:47:33.579740 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 86 not found for TAP cursor. Skip it...
memcached<0.654.0>: Thu Aug 30 20:47:33.579745 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 87 not found for TAP cursor. Skip it...

[ns_server:warn,2012-08-30T13:47:33.678,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:174]Dropped 2440 log lines from memcached
[rebalance:info,2012-08-30T13:47:33.733,ns_1@10.3.121.25:<0.28940.34>:ebucketmigrator_srv:do_confirm_sent_messages:650]Got close ack!

[rebalance:info,2012-08-30T13:47:33.857,ns_1@10.3.121.25:<0.28930.34>:ebucketmigrator_srv:do_confirm_sent_messages:650]Got close ack!

[ns_server:info,2012-08-30T13:47:33.903,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 20:47:33.703424 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Connection is close by force.

Interestingly, there was a "no_db_file" exception exit in couch_set_view_group right before this incident on source node 25. vbucket 30 was part of add_partitions[].

[error_logger:error,2012-08-30T13:47:33.518,ns_1@10.3.121.25:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: couch_set_view_group:init/1
    pid: <0.846.0>
    registered_name: []
    exception exit: {{{badmatch,{not_found,no_db_file}},
                      [{couch_db_set,'-handle_call/3-fun-1-',3},
                       {lists,foldl,3},
                       {couch_db_set,handle_call,3},
                       {gen_server,handle_msg,5},
                       {proc_lib,init_p_do_apply,3}]},
                     {gen_server,call,
                         [<0.6008.0>,
                          {add_partitions,
                              [30,31,32,52,53,54,55,56,57,58,59,74,75,76,77,
                               200,201,202,203,204,211,294,295,296,297,303,
                               304,305,306,307,308,358,359,360,368,369,370,
                               371,372,373,374,375,376,377,378,393,394,455,
                               456,469,470,471,472,473,474,477,478,479,480,
                               492,493,494,570,600,601,602,608,609,610,611,
                               612,616,617,618,619,620,621,622,627,628,629,
                               630,631,638,639,640,643,666,667,668,669,670,
                               671,680,681,682,683,684,685,686,687,688,689,
                               690,700,701,702,703,723,724,725,726,727,729,
                               730,731,732,733,734,735,736,737,738,739,740,
                               741,742,743,744,745,762,763,764,765,766,767,
                               768,769,797,798,799,800,801,839,840,841,859,
                               860,861,871,872,873,874,899,900,901,902,903,
                               921,936,937,938,955,980,981,982,983,1007,12,47,
                               48,49,50,51,88,157,158,159,160,301,302,356,357,
                               454]},
                          infinity]}}
      in function gen_server:terminate/6
    ancestors: [<0.845.0>]
    messages: [{'EXIT',<0.6008.0>,
                          {{badmatch,{not_found,no_db_file}},
                           [{couch_db_set,'-handle_call/3-fun-1-',3},
                            {lists,foldl,3},
                            {couch_db_set,handle_call,3},
                            {gen_server,handle_msg,5},
                            {proc_lib,init_p_do_apply,3}]}}]
    links: [<0.223.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 10946
    stack_size: 24
    reductions: 50046050
  neighbours:

[ns_server:error,2012-08-30T13:47:33.525,ns_1@10.3.121.25:'capi_set_view_manager-saslbucket':capi_set_view_manager:do_apply_vbucket_states:133]Failed to apply index states for the following ddocs:
[{<<"_design/d11">>,
  {'EXIT',
      {{{{badmatch,{not_found,no_db_file}},
         [{couch_db_set,'-handle_call/3-fun-1-',3},
          {lists,foldl,3},
          {couch_db_set,handle_call,3},
          {gen_server,handle_msg,5},
          {proc_lib,init_p_do_apply,3}]},
        {gen_server,call,
            [<0.6008.0>,
             {add_partitions,
                 [30,31,32,52,53,54,55,56,57,58,59,74,75,76,77,200,201,202,
                  203,204,211,294,295,296,297,303,304,305,306,307,308,358,359,
                  360,368,369,370,371,372,373,374,375,376,377,378,393,394,455,
                  456,469,470,471,472,473,474,477,478,479,480,492,493,494,570,
                  600,601,602,608,609,610,611,612,616,617,618,619,620,621,622,
                  627,628,629,630,631,638,639,640,643,666,667,668,669,670,671,
                  680,681,682,683,684,685,686,687,688,689,690,700,701,702,703,
                  723,724,725,726,727,729,730,731,732,733,734,735,736,737,738,
                  739,740,741,742,743,744,745,762,763,764,765,766,767,768,769,
                  797,798,799,800,801,839,840,841,859,860,861,871,872,873,874,
                  899,900,901,902,903,921,936,937,938,955,980,981,982,983,
                  1007,12,47,48,49,50,51,88,157,158,159,160,301,302,356,357,
                  454]},
             infinity]}},
       {gen_server,call,
           [<0.846.0>,
            {set_state,
                [30,31,32,47,48,49,50,51,52,53,54,55,56,57,58,59,74,75,76,77,
                 157,158,159,160,200,201,202,203,204,211,294,295,296,297,301,
                 302,303,304,305,306,307,308,356,357,358,359,360,368,369,370,
                 371,372,373,374,375,376,377,378,393,394,454,455,456,469,470,
                 471,472,473,474,477,478,479,480,492,493,494,570,600,601,602,
                 608,609,610,611,612,616,617,618,619,620,621,622,627,628,629,
                 630,631,638,639,640,643,666,667,668,669,670,671,680,681,682,
                 683,684,685,686,687,688,689,690,700,701,702,703,723,724,725,
                 726,727,729,730,731,732,733,734,735,736,737,738,739,740,741,
                 742,743,744,745,762,763,764,765,766,767,768,769,797,798,799,
                 800,801,839,840,841,859,860,861,871,872,873,874,899,900,901,
                 902,903,921,936,937,938,955,980,981,982,983,1007],
                "\fX",
                [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,
                 24,25,26,27,28,29,33,34,35,36,37,38,39,40,41,42,43,44,45,46,
                 60,61,62,63,64,65,66,67,68,69,70,71,72,73,78,79,80,81,82,83,
                 84,85,86,87,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,
                 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,
                 119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,
                 134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,
                 149,150,151,152,153,154,155,156,161,162,163,164,165,166,167,
                 168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,
                 183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,
                 198,199,205,206,207,208,209,210,212,213,214,215,216,217,218,
                 219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,
                 234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,
                 249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,
                 264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,
                 279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,
                 298,299,300,309,310,311,312,313,314,315,316,317,318,319,320,
                 321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,
                 336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,
                 351,352,353,354,355,361,362,363,364,365,366,367,379,380,381,
                 382,383,384,385,386,387,388,389,390,391,392,395,396,397,398,
                 399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,
                 414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,
                 429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,
                 444,445,446,447,448,449,450,451,452,453,457,458,459,460,461,
                 462,463,464,465,466,467,468,475,476,481,482,483,484,485,486,
                 487,488,489,490,491,495,496,497,498,499,500,501,502,503,504,
                 505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,
                 520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,
                 535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,
                 550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,
                 565,566,567,568,569,571,572,573,574,575,576,577,578,579,580,
                 581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,
                 596,597,598,599,603,604,605,606,607,613,614,615,623,624,625,
                 626,632,633,634,635,636,637,641,642,644,645,646,647,648,649,
                 650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,
                 665,672,673,674,675,676,677,678,679,691,692,693,694,695,696,
                 697,698,699,704,705,706,707,708,709,710,711,712,713,714,715,
                 716,717,718,719,720,721,722,728,746,747,748,749,750,751,752,
                 753,754,755,756,757,758,759,760,761,770,771,772,773,774,775,
                 776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,
                 791,792,793,794,795,796,802,803,804,805,806,807,808,809,810,
                 811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,
                 826,827,828,829,830,831,832,833,834,835,836,837,838,842,843,
                 844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,
                 862,863,864,865,866,867,868,869,870,875,876,877,878,879,880,
                 881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,
                 896,897,898,904,905,906,907,908,909,910,911,912,913,914,915,
                 916,917,918,919,920,922,923,924,925,926,927,928,929,930,931,
                 932,933,934,935,939,940,941,942,943,944,945,946,947,948,949,
                 950,951,952,953,954,956,957,958,959,960,961,962,963,964,965,
                 966,967,968,969,970,971,972,973,974,975,976,977,978,979,984,
                 985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,
                 1000,1001,1002,1003,1004,1005,1006,1008,1009,1010,1011,1012,
                 1013,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023]},
            infinity]}}}}]
Comment by Liang Guo (Inactive) [ 18/Oct/12 ]
Searching backward for vb_30 in the node 25 log, I saw that vb_30 had a "persisted_checkpoint_id" of <<"0">>. But this seems too far back in time.

[ns_server:info,2012-08-29T19:21:51.992,ns_1@10.3.121.25:<0.31883.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:checkpoint_extension">>,<<"false">>},
 {<<"vb_1007:num_items_for_persistence">>,<<"0">>},
 {<<"vb_1007:num_checkpoints">>,<<"1">>},
 {<<"vb_1007:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_1007:num_checkpoint_items">>,<<"1">>},
 {<<"vb_1007:num_tap_cursors">>,<<"0">>},
 {<<"vb_1007:last_closed_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:open_checkpoint_id">>,<<"100">>},
 {<<"vb_1007:state">>,<<"active">>},
 ....

{<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},
 
  
 
Comment by Chiyoung Seo [ 18/Oct/12 ]
Liang,

>On source node 25, the connection for replication_building_30_'ns_1@10.3.121.16' was closed by force, probably due to the error condition on node 16.

>memcached<0.654.0>: Thu Aug 30 20:47:33.579731 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 84 not found for TAP cursor. Skip it...
>memcached<0.654.0>: Thu Aug 30 20:47:33.579736 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 85 not found for TAP cursor. Skip it...
>memcached<0.654.0>: Thu Aug 30 20:47:33.579740 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 86 not found for TAP cursor. Skip it...
>memcached<0.654.0>: Thu Aug 30 20:47:33.579745 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - VBucket 87 not found for TAP cursor. Skip it...


This was caused by the empty vbucket filter for the TAP producer "replication_building_30_'ns_1@10.3.121.16'". If ns_server passes an empty vbucket filter to a given TAP producer, the producer assumes all vbuckets and iterates over them one by one.

The expected vbucket filter for the above producer should contain only vbucket 30.

Comment by Chiyoung Seo [ 18/Oct/12 ]
{<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},


These checkpoint stats are interesting. We don't expect an active vbucket to have 0 as the open checkpoint id. We reset the open checkpoint id to 0 for a replica vbucket right before it receives backfill items from the producer. After backfill is completed, the replica vbucket receives the positive open checkpoint id from the producer. Then we change the vbucket state from replica to active only if this is a vbucket takeover.
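As a rough, hypothetical illustration of that lifecycle (not ep-engine code; all names are made up): the open checkpoint id sits at 0 only while a replica is being backfilled, or after a delete/recreate, and the replica-to-active transition happens only on takeover.

// Illustrative sketch of the checkpoint-id lifecycle described above.
#include <cassert>
#include <cstdint>

enum class VBState { Active, Replica, Dead };

struct VBucket {
    VBState state = VBState::Replica;
    uint64_t openCheckpointId = 0;
};

// Right before backfill items arrive, the replica's open checkpoint id is
// reset to 0; an id of 0 therefore marks a vbucket mid-backfill (or one that
// has been deleted and recreated in a non-active state).
void onInitialVBucketStream(VBucket& vb) {
    vb.state = VBState::Replica;
    vb.openCheckpointId = 0;
}

// After backfill completes, the producer sends its (positive) open
// checkpoint id and the replica adopts it.
void onBackfillComplete(VBucket& vb, uint64_t producerOpenCheckpointId) {
    assert(producerOpenCheckpointId > 0);
    vb.openCheckpointId = producerOpenCheckpointId;
}

// The replica becomes active only as part of a takeover; an *active* vbucket
// with openCheckpointId == 0 (as seen in the stats above) is unexpected.
void onTakeover(VBucket& vb) {
    vb.state = VBState::Active;
}

int main() {
    VBucket vb;
    onInitialVBucketStream(vb);   // openCheckpointId == 0 while backfilling
    onBackfillComplete(vb, 100);  // producer hands over checkpoint id 100
    onTakeover(vb);               // replica -> active during takeover only
    assert(vb.state == VBState::Active && vb.openCheckpointId == 100);
    return 0;
}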
Comment by Liang Guo (Inactive) [ 18/Oct/12 ]
Actually, node_16 and node_22 hit the "empty vbucket filter" problem and {error,enotconn} even earlier. The first incident is right after a config change in the vbucket map w.r.t. vbucket 30.

[rebalance:info,2012-08-29T19:21:45.693,ns_1@10.3.121.16:<0.3763.3>:ebucketmigrator_srv:init:485]Some vbuckets were not yet ready to replicate from:
[30]

Here is the sequence of events, in time order, from the node_25 log:

1, When the replica chain was {node_13, node_16, node_22}:

[ns_server:debug,2012-08-29T19:11:10.271,ns_1@10.3.121.25:ns_config_log:ns_config_log:log_common:111]config change:
buckets ->
[{configs,[{"default",
...........
                              {30,[],
                               ['ns_1@10.3.121.13','ns_1@10.3.121.16',
                                'ns_1@10.3.121.22']},
                              {31,[],

The diag had vb_30:persisted_checkpoint_id = <<"0">>

[ns_server:info,2012-08-29T19:12:00.001,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
  <<"100">>},
........
 {<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},


Before the next config change w.r.t. vbucket 30, I saw that replication_building_30 to node_13, node_16, and node_22 completed and closed.

2, Now there is a config change with a new replica chain for vb_30: [node_25, node_16, node_22]. This is before node_22 received the not-ready-to-replicate-from [30] error.

[ns_server:debug,2012-08-29T19:21:44.660,ns_1@10.3.121.25:ns_config_log:ns_config_log:log_common:111]config change:
buckets ->
[{configs,[{"default",
.............
                              {30,[],
                               ['ns_1@10.3.121.25','ns_1@10.3.121.16',
                                'ns_1@10.3.121.22']},

3, Node_16 and node_22 both hit vbucket-not-ready [30] shortly after.

[rebalance:info,2012-08-29T19:21:45.693,ns_1@10.3.121.16:<0.3763.3>:ebucketmigrator_srv:init:485]Some vbuckets were not yet ready to replicate from:
[30]


4, At source node_25, due to receiving the empty vbucket filter, the producer was trying to send all vbuckets ...

[ns_server:info,2012-08-29T19:21:45.923,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:21:45.709565 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - VBucket 1 not found for TAP cursor. Skip it...
memcached<0.654.0>: Thu Aug 30 02:21:45.709583 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - VBucket 2 not found for TAP cursor. Skip it...
memcached<0.654.0>: Thu Aug 30 02:21:45.709590 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - VBucket 3 not found for TAP cursor. Skip it...

5, At destination nodes node_16 and node_22, {error,enotconn} happened thereafter.

6, The interesting thing is that vb_30 has persisted_checkpoint_id = "99" as shown by the diag:

[ns_server:info,2012-08-29T19:21:51.992,ns_1@10.3.121.25:<0.31883.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:checkpoint_extension">>,<<"false">>},
 .........
 {<<"vb_30:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_30:eq_tapq:replication_ns_1@10.3.121.26:cursor_checkpoint_id">>,
  <<"100">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"1">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"99">>},
 {<<"vb_30:open_checkpoint_id">>,<<"100">>},

7, The confusing thing is that the next diag, shortly after that, shows "0" for vb_30's persisted_checkpoint_id.

[ns_server:info,2012-08-29T19:21:51.992,ns_1@10.3.121.25:<0.31883.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:checkpoint_extension">>,<<"false">>},
.............
{<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},
Comment by Liang Guo (Inactive) [ 19/Oct/12 ]
Some additional findings on how vbucket 30's persisted_checkpoint_id became "0" on node_25.

1, At timestamp 08-29T19:11:59, vbucket 30 had persisted_checkpoint_id "98".

[ns_server:info,2012-08-29T19:11:59.841,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:default:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"97">>},
 {<<"vb_1007:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
  <<"98">>},
..............
 {<<"vb_30:persisted_checkpoint_id">>,<<"98">>},
 {<<"vb_30:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
  <<"99">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"63">>},
 {<<"vb_30:num_checkpoint_items">>,<<"64">>},
 {<<"vb_30:num_tap_cursors">>,<<"1">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"98">>},
 {<<"vb_30:open_checkpoint_id">>,<<"99">>},
 {<<"vb_30:state">>,<<"active">>},

2, A couchdb set view log shows vbucket 30 in the cleanup partitions, and then in the pending-active partitions. Apparently something happened here and memcached cleared the checkpoint id for vbucket 30.

[couchdb:info,2012-08-29T19:11:59.918,ns_1@10.3.121.25:<0.846.0>:couch_log:info:39]Set view `saslbucket`, main group `_design/d11`, partition states updated
active partitions before: [31,32,52,53,54,55,56,57,58,59,74,75,76,77,200,201,202,203,204,211,294,295,296,297,303,304,305,306,307,308,358,359,360,368,369,370,371,372,373,374,375,376,377,378,393,394,469,470,471,472,473,474,492,493,494,570,600,601,602,608,609,610,611,612,616,617,618,619,620,621,622,627,628,629,630,631,638,639,640,666,667,668,669,670,671,683,684,685,686,687,688,689,690,700,701,702,703,723,724,725,726,727,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,762,763,764,765,766,767,768,769,839,840,841,859,860,861,871,872,873,874,936,937,938,955,980,981,982,983]
active partitions after: [31,32,52,53,54,55,56,57,58,59,74,75,76,77,200,201,202,203,204,211,294,295,296,297,303,304,305,306,307,308,358,359,360,368,369,370,371,372,373,374,375,376,377,378,393,394,469,470,471,472,473,474,492,493,494,570,600,601,602,608,609,610,611,612,616,617,618,619,620,621,622,627,628,629,630,631,638,639,640,666,667,668,669,670,671,683,684,685,686,687,688,689,690,700,701,702,703,723,724,725,726,727,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,762,763,764,765,766,767,768,769,839,840,841,859,860,861,871,872,873,874,936,937,938,955,980,981,982,983]
passive partitions before: [47,48,49,50,51,301,302,356,357,479,480,899,900,901,902,903,921,1007]
passive partitions after: [47,48,49,50,51,301,302,356,357,479,480,899,900,901,902,903,921,1007]
cleanup partitions before: [30,189]
cleanup partitions after: [30,189]
unindexable partitions: []
replica partitions before: [0,18,24,25,26,27,47,48,49,50,51,60,61,62,88,89,99,100,101,102,121,122,123,124,125,126,133,134,135,136,137,138,157,158,159,160,198,199,225,226,227,248,249,250,251,252,253,276,277,278,279,280,292,293,298,299,300,301,302,356,357,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,429,430,431,432,433,454,455,456,457,458,459,460,461,462,477,478,479,480,488,489,490,491,495,496,497,498,509,510,511,513,533,534,535,554,555,565,566,567,568,569,582,583,584,585,586,587,595,596,597,598,599,623,624,625,626,641,642,643,649,650,651,652,653,654,655,656,657,658,659,672,673,674,675,676,677,678,679,680,681,682,697,698,699,710,711,712,718,719,720,721,722,728,757,758,759,760,761,773,774,775,776,777,778,779,785,786,787,788,789,790,791,792,797,798,799,800,801,819,820,821,822,823,824,825,826,827,832,833,834,835,836,837,838,842,877,878,879,881,882,883,884,886,893,894,895,896,897,899,900,901,902,903,910,911,912,913,914,921,962,963,964,967,968,969,970,971,1001,1002,1003,1007]
replica partitions after: [0,18,24,25,26,27,47,48,49,50,51,60,61,62,88,89,99,100,101,102,121,122,123,124,125,126,133,134,135,136,137,138,157,158,159,160,198,199,225,226,227,248,249,250,251,252,253,276,277,278,279,280,292,293,298,299,300,301,302,356,357,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,429,430,431,432,433,454,455,456,457,458,459,460,461,462,477,478,479,480,488,489,490,491,495,496,497,498,509,510,511,513,533,534,535,554,555,565,566,567,568,569,582,583,584,585,586,587,595,596,597,598,599,623,624,625,626,641,642,643,649,650,651,652,653,654,655,656,657,658,659,672,673,674,675,676,677,678,679,680,681,682,697,698,699,710,711,712,718,719,720,721,722,728,757,758,759,760,761,773,774,775,776,777,778,779,785,786,787,788,789,790,791,792,797,798,799,800,801,819,820,821,822,823,824,825,826,827,832,833,834,835,836,837,838,842,877,878,879,881,882,883,884,886,893,894,895,896,897,899,900,901,902,903,910,911,912,913,914,921,962,963,964,967,968,969,970,971,1001,1002,1003,1007]
replicas on transfer before: [47,48,49,50,51,301,302,356,357,479,480,899,900,901,902,903,921,1007]
replicas on transfer after: [47,48,49,50,51,301,302,356,357,479,480,899,900,901,902,903,921,1007]
pending transition before:
  active: [30]
  passive: [189]
pending transition after:
  active: [30]
  passive: []

3, Right after that, the replication_building_30_* tap queues were cleared, from node_25 to node_16, node_13, and node_22. Note that node_13 left the cluster at timestamp 08-29T19:21:42.

memcached<0.654.0>: Thu Aug 30 02:11:59.798390 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Clear the tap queues by force
memcached<0.654.0>: Thu Aug 30 02:11:59.798626 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Clear the tap queues by force
memcached<0.654.0>: Thu Aug 30 02:11:59.798687 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Clear the tap queues by force

4, Now the diag showed that vbucket 30's persisted_checkpoint_id had become "0".

[ns_server:info,2012-08-29T19:12:00.001,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
  <<"100">>},
 ...............
{<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},
 
Continuing with the log messages from my previous comment: node_13 left the cluster at timestamp 08-29T19:21:42, followed by a config change at 08-29T19:21:44. Hopefully we can get some idea now.
Comment by Liang Guo (Inactive) [ 19/Oct/12 ]
Going a bit further back than the last comment, there are more interesting findings:

1, All 3 replication_building_30_* tap connections were healthy at timestamp 08-29T19:11:41

[ns_server:info,2012-08-29T19:11:41.305,ns_1@10.3.121.25:<0.25209.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:

{<<"vb_30:persisted_checkpoint_id">>,<<"98">>},
 {<<"vb_30:eq_tapq:replication_building_30_'ns_1@10.3.121.22':cursor_checkpoint_id">>,
  <<"99">>},
 {<<"vb_30:eq_tapq:replication_building_30_'ns_1@10.3.121.16':cursor_checkpoint_id">>,
  <<"99">>},
 {<<"vb_30:eq_tapq:replication_building_30_'ns_1@10.3.121.13':cursor_checkpoint_id">>,
  <<"99">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"3">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"98">>},
 {<<"vb_30:open_checkpoint_id">>,<<"99">>},
 {<<"vb_30:state">>,<<"active">>},

2, replication_building_30_'node_13' closed because backfill of vbucket 30 completed. But I didn't see backfill complete for node_16 and node_22. Note that vbucket 30 showed up in the replication_ns_1@node_13 vbucket filter later.

[ns_server:info,2012-08-29T19:11:53.989,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:53.786917 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Backfill is completed with VBuckets 30,
memcached<0.654.0>: Thu Aug 30 02:11:53.787071 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30

3, For some reason, replication_building_30_ to node_22 and node_16 were closed by force at 08-29T19:11:59. It seems that backfill for vbucket 30 was not complete on either destination node.

[ns_server:info,2012-08-29T19:11:59.240,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:59.038795 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - disconnected, keep alive for 300 seconds
memcached<0.654.0>: Thu Aug 30 02:11:59.043319 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - disconnected, keep alive for 300 seconds
memcached<0.654.0>: Thu Aug 30 02:11:59.060846 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Connection is closed by force.
memcached<0.654.0>: Thu Aug 30 02:11:59.061353 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Connection is closed by force.
memcached<0.654.0>: Thu Aug 30 02:11:59.061713 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Connection is closed by force.
memcached<0.654.0>: Thu Aug 30 02:11:59.084733 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - disconnected

4, janitor_agent does "Undoing temporary vbucket states"; note that vbucket 30 was in the active partitions, as shown below. But it was moved to the cleanup partitions, which caused the persisted_checkpoint_id change. See the log message in the comment above.

[ns_server:info,2012-08-29T19:11:59.537,ns_1@10.3.121.25:'janitor_agent-saslbucket':janitor_agent:handle_info:640]Undoing temporary vbucket states caused by rebalance
[ns_server:info,2012-08-29T19:11:59.550,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:126]logging tap & checkpoint stats
[views:info,2012-08-29T19:11:59.545,ns_1@10.3.121.25:'capi_ddoc_replication_srv-saslbucket':capi_set_view_manager:apply_index_states:351]
Calling couch_set_view:set_partition_states([<<"saslbucket">>,
                                             <<"_design/d11">>,
                                             [30,31,32,47,48,49,50,51,52,53,
                                              54,55,56,57,58,59,74,75,76,77,

I think the dots are connected here ......
Comment by Liang Guo (Inactive) [ 19/Oct/12 ]
To sum it up:

1, There was replica building for vbucket 30 from node_25 to node_13, node_16, and node_22;
2, Backfill of vbucket 30 to node_13 completed;
3, Rebalance was somehow stopped before it even reached the takeover phase;
4, At this point, the node_25 diag had vbucket 30 persisted_checkpoint_id = "98" and open_checkpoint_id = "99";
5, couch_set_view was called to set partition states, with cleanup_partitions = {30};
6, Now the node_25 diag showed vbucket 30's persisted_checkpoint_id had become "0";
7, Node_13 left the cluster, and this resulted in a config change which caused replica building from node_25 to node_16 and node_22 again;
8, Vbucket 30 was not ready on node_25 (checkpoint id = "0") to replicate from.

Talked to Chiyoung. For a vbucket checkpoint id to become 0, there must be either 1) a vbucket deletion or 2) an initial vbucket stream for backfill. However, an initial vbucket stream to node_25 does not make sense because it is the current master. We need to understand whether there was a vbucket deletion by couch_set_view.

Comment by Liang Guo (Inactive) [ 19/Oct/12 ]
Hi Filipe,

I tried to understand what couch_set_view does for cleanup_partitions = {30}, but the Erlang code there totally puzzles me :). So I want to ask you what exactly couch_set_view_group does with a vbucket (30) in cleanup_partitions[]. Is there any vbucket deletion? Please see my comments above. Feel free to assign it back to me.

Thanks,
Liang
Comment by Filipe Manana [ 20/Oct/12 ]
Hi Liang,

Nope, the indexes only read from databases. They never write to databases, nor do they delete them.

Marking a vbucket (database) for cleanup in an index means deleting from the index all the indexed data that originated from that database (it doesn't touch database files, nor delete them).
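To restate that in code form, here is a small hypothetical sketch (all names made up, not the couch_set_view implementation): cleanup removes only the index rows that originated from a partition, while the underlying vbucket database is left untouched.

// Illustrative sketch: index cleanup for a partition.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <utility>

using PartitionId = uint16_t;
// Index rows keyed by (partition, doc id) -> indexed value.
using SetViewIndex = std::map<std::pair<PartitionId, std::string>, std::string>;

// Remove all indexed data originating from `partition`; returns rows removed.
std::size_t cleanupPartition(SetViewIndex& index, PartitionId partition) {
    std::size_t removed = 0;
    for (auto it = index.begin(); it != index.end();) {
        if (it->first.first == partition) {
            it = index.erase(it);
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;  // the vbucket database file itself is never touched
}

int main() {
    SetViewIndex index{{{30, "doc-a"}, "v1"}, {{30, "doc-b"}, "v2"}, {{31, "doc-c"}, "v3"}};
    std::cout << "removed " << cleanupPartition(index, 30)
              << " rows for partition 30; " << index.size() << " rows remain\n";
    return 0;
}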
Comment by Steve Yen [ 23/Oct/12 ]
Hi Liang,
Any news/status on this one?
Thanks,
Steve
Comment by Liang Guo (Inactive) [ 23/Oct/12 ]
I've narrowed it down to the point where vbucket 30's open checkpoint id went from 99 to 0 within a tiny window. There are no additional traces in the ns_server log for me to dig into. There is no memcached or ns_server log showing any vbucket state change on node_25. I don't know what else I can do with this.
Comment by Farshid Ghods (Inactive) [ 26/Oct/12 ]
Aliaksey/Liang,

Given that we did not see this rebalance failure in system testing, do you think this was a one-time issue that is very rare to occur?
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
It is possible that one of the recent ep-engine fixes closed this. Hard to say. Still, we'd like to understand it. Rareness is IMHO not a good argument.
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
I.e., there's a possibility of massive data loss at stake here. The entire master vbucket was somehow deleted.
Comment by Aliaksey Artamonau [ 26/Oct/12 ]
Here are my observations. Node 23 is the master node of the cluster. At some
point during the rebalance it decides to build replicas of vbucket 30 on nodes 13,
16, and 22. The current owner of this vbucket is node 25:

[rebalance:debug,2012-08-29T19:11:20.428,ns_1@10.3.121.23:<0.31805.466>:ns_single_vbucket_mover:mover_inner_old_style:190]child replicas builder for vbucket 30 is <0.31941.466>
[ns_server:debug,2012-08-29T19:11:20.436,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.13' is <15197.23780.3>
[ns_server:debug,2012-08-29T19:11:20.447,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.16' is <15202.29015.2>
[ns_server:debug,2012-08-29T19:11:20.453,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.22' is <15205.9270.2>


The destination nodes start ebucketmigrators for replica building. This means that
at this particular moment node 25's vbucket 30 still has a non-zero
checkpoint id:

[ns_server:debug,2012-08-29T19:11:20.487,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:kill_tapname:932]killing tap named: replication_building_30_'ns_1@10.3.121.13'
[rebalance:info,2012-08-29T19:11:20.507,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:init:522]Starting tap stream:
[{vbuckets,[30]},
 {checkpoints,[{30,0}]},
 {name,<<"replication_building_30_'ns_1@10.3.121.13'">>},
 {takeover,false}]
{{"10.3.121.25",11209},
 {"10.3.121.13",11209},
 [{vbuckets,[30]},
  {takeover,false},
  {suffix,"building_30_'ns_1@10.3.121.13'"},
  {username,"saslbucket"},
  {password,"password"}]}
[rebalance:debug,2012-08-29T19:11:20.528,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:init:555]upstream_sender pid: <0.23783.3>
[rebalance:info,2012-08-29T19:11:20.543,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 30


On node 25 we see that memcached accordingly schedules backfills to 13, 16 and 22:

memcached<0.654.0>: Thu Aug 30 02:11:20.515187 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Schedule the backfill for vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.515253 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.515279 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.515289 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.545809 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Schedule the backfill for vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.545856 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.545872 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.545882 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.551288 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Schedule the backfill for vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.551456 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.551477 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
memcached<0.654.0>: Thu Aug 30 02:11:20.551487 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
memcached<0.654.0>: Thu Aug 30 02:11:20.559238 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Suspend for 5.00 secs


And sooner or later those backfills complete:

[ns_server:info,2012-08-29T19:11:23.286,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:23.086272 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Backfill is completed with VBuckets 30,
memcached<0.654.0>: Thu Aug 30 02:11:23.086351 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30
[ns_server:info,2012-08-29T19:11:24.864,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:24.663901 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Backfill is completed with VBuckets 30,
memcached<0.654.0>: Thu Aug 30 02:11:24.670059 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30
[ns_server:info,2012-08-29T19:11:25.887,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:25.686315 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.13 - Suspend for 5.00 secs


[ns_server:info,2012-08-29T19:11:53.989,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:53.786917 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Backfill is completed with VBuckets 30,
memcached<0.654.0>: Thu Aug 30 02:11:53.787071 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30


Then the rebalance gets stopped by the user:

[ns_server:info,2012-08-29T19:11:59.054,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.25': [<<"replication_building_30_'ns_1@10.3.121.13'">>,
                                                       <<"replication_building_30_'ns_1@10.3.121.16'">>,
                                                       <<"replication_building_30_'ns_1@10.3.121.22'">>]

[user:info,2012-08-29T19:11:59.529,ns_1@10.3.121.23:<0.6719.464>:ns_orchestrator:handle_info:289]Rebalance stopped by user.


As part of the code that handles rebalance completion (for any reason), we
immediately log all tap and checkpoint stats on all the nodes. In these stats
on node 25 we can see that vbucket 30 already has a checkpoint id of zero:

[ns_server:info,2012-08-29T19:12:00.001,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
[{<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
 {<<"vb_1007:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
                 .....
 {<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
 {<<"vb_30:checkpoint_extension">>,<<"false">>},
 {<<"vb_30:num_items_for_persistence">>,<<"0">>},
 {<<"vb_30:num_checkpoints">>,<<"1">>},
 {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
 {<<"vb_30:num_checkpoint_items">>,<<"1">>},
 {<<"vb_30:num_tap_cursors">>,<<"0">>},
 {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
 {<<"vb_30:open_checkpoint_id">>,<<"0">>},
 {<<"vb_30:state">>,<<"active">>},


I could not find anything abnormal from the ns_server perspective during that
period of time, so it's probably something wrong in ep-engine. More logs from
ep-engine would help here; in particular, it would be good if vbucket resets
were logged.
Comment by Farshid Ghods (Inactive) [ 29/Oct/12 ]
Chiyoung,

Do you require more information from QE to proceed? If so, please assign the bug back to Tony.

Also, this issue is not easily reproduced in system testing, so I'm not sure Tony can reproduce it and collect the new logs; in that case, would you be okay with deferring this bug for now?
Comment by Chiyoung Seo [ 29/Oct/12 ]
This issue looks very strange to me. When the rebalance is stopped by the user, the current master should still hold the active vbucket 30 with a non-zero open checkpoint id. The open checkpoint id is reset to zero only if the vbucket gets reset by receiving a backfill stream from the master, or if it's deleted and recreated in a non-active state.

I will add more logs to ep-engine, but I don't see anything suspicious in ep-engine at this time. We will see if this issue happens again in the 2.0 full regressions. I'm fine with moving this to post-2.0 given that it's not easily reproducible.
Comment by Thuan Nguyen [ 31/Oct/12 ]
Integrated in github-ep-engine-2-0 #454 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/454/])
    MB-6494 Add more logs to vbucket reset and deletion functions (Revision 96d50b2038777e25f671475ef8fad76ba3a76683)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/ep_engine.cc
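For context on what extra logging in the vbucket reset and deletion paths buys here, below is a minimal hypothetical sketch (not the actual change to src/ep_engine.cc, whose logging API differs): logging the vbucket id and the prior open checkpoint id on every reset or deletion would leave a trace for a 99-to-0 transition like the one seen on node 25.

// Illustrative sketch of instrumentation around vbucket reset/deletion.
#include <cstdint>
#include <cstdio>

struct VBucketInfo {
    uint16_t id;
    uint64_t openCheckpointId;
};

// Hypothetical reset path: log before the checkpoint id is wiped.
void resetVBucket(VBucketInfo& vb) {
    std::fprintf(stderr,
                 "Resetting vbucket %u (open checkpoint id was %llu)\n",
                 static_cast<unsigned>(vb.id),
                 static_cast<unsigned long long>(vb.openCheckpointId));
    vb.openCheckpointId = 0;  // a reset drops the open checkpoint id back to 0
}

// Hypothetical deletion path: log the id and checkpoint id being discarded.
void deleteVBucket(const VBucketInfo& vb) {
    std::fprintf(stderr,
                 "Deleting vbucket %u (open checkpoint id %llu)\n",
                 static_cast<unsigned>(vb.id),
                 static_cast<unsigned long long>(vb.openCheckpointId));
    // ... actual deletion would happen here ...
}

int main() {
    VBucketInfo vb30{30, 99};
    resetVBucket(vb30);   // would have left a trace of the 99 -> 0 transition
    deleteVBucket(vb30);
    return 0;
}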
Comment by Chiyoung Seo [ 20/Nov/12 ]
We spent a lot of time debugging this issue but were not able to reproduce it. We added more logs to ep-engine. Let's open a new bug if we hit this issue again.