Details
Description
1) create 4 node cluster
2) load 10 million docs
3) create views and run queries
4) reboot 1 node, and wait for warmup
5) remove and rebalance out node
5) add back and rebalance in, reports fails with:
Rebalance exited with reason {{bad_return_value,
{badmatch,
{error,
{error,
{db_open_error,<<"default/128">>,
{not_found,no_db_file},
<<"Couldn't open database `default/128`, reason: {not_found,no_db_file}">>}}}}},
{gen_server,call,
[{'capi_set_view_manager-default',
'ns_1@10.1.2.223'},
{wait_until_added,128},
infinity]}}
2) load 10 million docs
3) create views and run queries
4) reboot 1 node, and wait for warmup
5) remove and rebalance out node
5) add back and rebalance in, reports fails with:
Rebalance exited with reason {{bad_return_value,
{badmatch,
{error,
{error,
{db_open_error,<<"default/128">>,
{not_found,no_db_file},
<<"Couldn't open database `default/128`, reason: {not_found,no_db_file}">>}}}}},
{gen_server,call,
[{'capi_set_view_manager-default',
'ns_1@10.1.2.223'},
{wait_until_added,128},
infinity]}}
-
- 10.1.2.222-8091-diag.txt.gz
- 02/May/12 5:47 PM
- 975 kB
- Tommie McAfee
-
- 10.1.2.223-8091-diag.txt.gz
- 02/May/12 5:47 PM
- 913 kB
- Tommie McAfee
-
- 10.1.3.181-8091-diag.txt.gz
- 02/May/12 5:47 PM
- 857 kB
- Tommie McAfee
-
- 10.1.3.182-8091-diag.txt.gz
- 02/May/12 5:47 PM
- 951 kB
- Tommie McAfee
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Hide
Tommie McAfee
added a comment -
next rebalance succeeded. I lost about 100k items after rebalance stopped during add back.
Show
Tommie McAfee
added a comment - next rebalance succeeded. I lost about 100k items after rebalance stopped during add back.
Hide
Aleksey Kondratenko
added a comment -
As far as data loss I don't yet know.
Aliaksey, please take a look at potential capi_set_view_manager or couch_set_views or integration issue if you have some cycles.
I'm still a bit behind in terms of understanding what those couch_set_view api calls actually do but maybe there's some evidence it's not doing what it's supposed to do (logs of .223 where main issue happens). Particularly _design/dev_ddoc1 seems to be thinking vbucket 128 is part of replica index, while we didn't seem to ask for that. And later that's causing bad problems when vbucket 128 is actually deleted.
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.16313.1>:couch_log:info:39] Reading changes (since sequence 34865) from passive partition default/19 to update replica set view group `_design/dev_ddoc1` from set `default`
[ns_server:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:ns_config_rep:ns_config_rep:do_pull:257] Pulling config from: 'ns_1@10.1.3.181'
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
[views:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:419]
couch_set_view:set_partition_states([<<"default">>,<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,74,75,76,
77,78,79,80,81,82,83,84,85,87,88,89,90,
91,92,93,94,95,96,97,98,99,100,101,102,
103,104,105,106,107,117,118,119,120,121,
122,123,124,125,126,127,149,150,151,152,
153,154,155,156,157,158,159,160,161,162,
163,164,165,166,167,168,169,234],
[129,130,131,132,133,134,135,136,137,138,
139,140,141,142,143,144,145,146,147,148],
[128]]) returned ok in 4152ms
[views:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:423]
Calling couch_set_view:add_replica_partitions([<<"default">>,
<<"_design/dev_ddoc1">>,
[0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,15,16,17,18,19,20,21,
213,214,215,216,217,218,219,
220,221,222,223,224,225,226,
227,228,229,230,231,232,233,
235,236,237,238,239,240,241,
242,243,244,245,246,247,248,
249,250,251,252,253,254,255]])
Aliaksey, please take a look at potential capi_set_view_manager or couch_set_views or integration issue if you have some cycles.
I'm still a bit behind in terms of understanding what those couch_set_view api calls actually do but maybe there's some evidence it's not doing what it's supposed to do (logs of .223 where main issue happens). Particularly _design/dev_ddoc1 seems to be thinking vbucket 128 is part of replica index, while we didn't seem to ask for that. And later that's causing bad problems when vbucket 128 is actually deleted.
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.16313.1>:couch_log:info:39] Reading changes (since sequence 34865) from passive partition default/19 to update replica set view group `_design/dev_ddoc1` from set `default`
[ns_server:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:ns_config_rep:ns_config_rep:do_pull:257] Pulling config from: 'ns_1@10.1.3.181'
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
[views:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:419]
couch_set_view:set_partition_states([<<"default">>,<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,74,75,76,
77,78,79,80,81,82,83,84,85,87,88,89,90,
91,92,93,94,95,96,97,98,99,100,101,102,
103,104,105,106,107,117,118,119,120,121,
122,123,124,125,126,127,149,150,151,152,
153,154,155,156,157,158,159,160,161,162,
163,164,165,166,167,168,169,234],
[129,130,131,132,133,134,135,136,137,138,
139,140,141,142,143,144,145,146,147,148],
[128]]) returned ok in 4152ms
[views:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:423]
Calling couch_set_view:add_replica_partitions([<<"default">>,
<<"_design/dev_ddoc1">>,
[0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,15,16,17,18,19,20,21,
213,214,215,216,217,218,219,
220,221,222,223,224,225,226,
227,228,229,230,231,232,233,
235,236,237,238,239,240,241,
242,243,244,245,246,247,248,
249,250,251,252,253,254,255]])
Show
Aleksey Kondratenko
added a comment - As far as data loss I don't yet know.
Aliaksey, please take a look at potential capi_set_view_manager or couch_set_views or integration issue if you have some cycles.
I'm still a bit behind in terms of understanding what those couch_set_view api calls actually do but maybe there's some evidence it's not doing what it's supposed to do (logs of .223 where main issue happens). Particularly _design/dev_ddoc1 seems to be thinking vbucket 128 is part of replica index, while we didn't seem to ask for that. And later that's causing bad problems when vbucket 128 is actually deleted.
[couchdb:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :<0.16313.1>:couch_log:info:39] Reading changes (since sequence 34865) from passive partition default/19 to update replica set view group `_design/dev_ddoc1` from set `default`
[ns_server:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :ns_config_rep:ns_config_rep:do_pull:257] Pulling config from: ' ns_1@10.1.3.181 '
[couchdb:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
[views:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:419]
couch_set_view:set_partition_states([<<"default">>,<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,74,75,76,
77,78,79,80,81,82,83,84,85,87,88,89,90,
91,92,93,94,95,96,97,98,99,100,101,102,
103,104,105,106,107,117,118,119,120,121,
122,123,124,125,126,127,149,150,151,152,
153,154,155,156,157,158,159,160,161,162,
163,164,165,166,167,168,169,234],
[129,130,131,132,133,134,135,136,137,138,
139,140,141,142,143,144,145,146,147,148],
[128]]) returned ok in 4152ms
[views:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:423]
Calling couch_set_view:add_replica_partitions([<<"default">>,
<<"_design/dev_ddoc1">>,
[0,1,2,3,4,5,6,7,8,9,10,11,12,
13,14,15,16,17,18,19,20,21,
213,214,215,216,217,218,219,
220,221,222,223,224,225,226,
227,228,229,230,231,232,233,
235,236,237,238,239,240,241,
242,243,244,245,246,247,248,
249,250,251,252,253,254,255]])
Hide
Aleksey Kondratenko
added a comment -
In order to understand possible reason of data loss we need logs of second succeeded rebalance
Show
Aleksey Kondratenko
added a comment - In order to understand possible reason of data loss we need logs of second succeeded rebalance
Hide
Aliaksey Artamonau
added a comment -
There seem to be two issues here. The one in ns_server (which I don't completely understand yet) and in set view indexes. The latter one was triggered by the former.
Log file from .223 shows that before failing rebalance attempt vbucket 128 was in replica index on that node:
[couchdb:info] [2012-05-02 15:21:09] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, defined new replica partitions: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
New full set of replica partitions is: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
Then during rebalance this vbucket was set to be passive in the index:
[views:info] [2012-05-02 15:22:41] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,
120,121,122,123,124,125,126,127,
149,150,151,152,153,154,155,156,
157,158,159,160,161,162,163,164,
165,166,167,168,169,234],
[128,129,130,131,132,133,134,135,
136,137,138,139,140,141,142,143,
144,145,146,147,148],
[]])
[couchdb:info] [2012-05-02 15:22:48] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: []
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
Here you can see that vbucket 128 is added to passive index but not removed from replica index. Later for some unclear reason this vbucket is deleted on the node:
[views:info] [2012-05-02 15:25:17] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,117,118,119,120,
121,122,123,124,125,126,127,149,
150,151,152,153,154,155,156,157,
158,159,160,161,162,163,164,165,
166,167,168,169,234],
[129,130,131,132,133,134,135,136,
137,138,139,140,141,142,143,144,
145,146,147,148],
[128]])
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
So it's successfully removed from from passive index but again left in a replica index. This seems to cause this exact failure. But another challenge is to understand why the vbucket was deleted on new master in the first place.
Log file from .223 shows that before failing rebalance attempt vbucket 128 was in replica index on that node:
[couchdb:info] [2012-05-02 15:21:09] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, defined new replica partitions: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
New full set of replica partitions is: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
Then during rebalance this vbucket was set to be passive in the index:
[views:info] [2012-05-02 15:22:41] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,
120,121,122,123,124,125,126,127,
149,150,151,152,153,154,155,156,
157,158,159,160,161,162,163,164,
165,166,167,168,169,234],
[128,129,130,131,132,133,134,135,
136,137,138,139,140,141,142,143,
144,145,146,147,148],
[]])
[couchdb:info] [2012-05-02 15:22:48] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: []
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
Here you can see that vbucket 128 is added to passive index but not removed from replica index. Later for some unclear reason this vbucket is deleted on the node:
[views:info] [2012-05-02 15:25:17] [ns_1@10.1.2.223:'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,117,118,119,120,
121,122,123,124,125,126,127,149,
150,151,152,153,154,155,156,157,
158,159,160,161,162,163,164,165,
166,167,168,169,234],
[129,130,131,132,133,134,135,136,
137,138,139,140,141,142,143,144,
145,146,147,148],
[128]])
[couchdb:info] [2012-05-02 15:25:21] [ns_1@10.1.2.223:<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
So it's successfully removed from from passive index but again left in a replica index. This seems to cause this exact failure. But another challenge is to understand why the vbucket was deleted on new master in the first place.
Show
Aliaksey Artamonau
added a comment - There seem to be two issues here. The one in ns_server (which I don't completely understand yet) and in set view indexes. The latter one was triggered by the former.
Log file from .223 shows that before failing rebalance attempt vbucket 128 was in replica index on that node:
[couchdb:info] [2012-05-02 15:21:09] [ ns_1@10.1.2.223 :<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, defined new replica partitions: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
New full set of replica partitions is: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
Then during rebalance this vbucket was set to be passive in the index:
[views:info] [2012-05-02 15:22:41] [ ns_1@10.1.2.223 :'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,
120,121,122,123,124,125,126,127,
149,150,151,152,153,154,155,156,
157,158,159,160,161,162,163,164,
165,166,167,168,169,234],
[128,129,130,131,132,133,134,135,
136,137,138,139,140,141,142,143,
144,145,146,147,148],
[]])
[couchdb:info] [2012-05-02 15:22:48] [ ns_1@10.1.2.223 :<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: []
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
Here you can see that vbucket 128 is added to passive index but not removed from replica index. Later for some unclear reason this vbucket is deleted on the node:
[views:info] [2012-05-02 15:25:17] [ ns_1@10.1.2.223 :'capi_set_view_manager-default':capi_set_view_manager:apply_ddoc_map:418]
Calling couch_set_view:set_partition_states([<<"default">>,
<<"_design/dev_ddoc1">>,
[64,65,66,67,68,69,70,71,72,73,
74,75,76,77,78,79,80,81,82,83,
84,85,87,88,89,90,91,92,93,94,
95,96,97,98,99,100,101,102,103,
104,105,106,107,117,118,119,120,
121,122,123,124,125,126,127,149,
150,151,152,153,154,155,156,157,
158,159,160,161,162,163,164,165,
166,167,168,169,234],
[129,130,131,132,133,134,135,136,
137,138,139,140,141,142,143,144,
145,146,147,148],
[128]])
[couchdb:info] [2012-05-02 15:25:21] [ ns_1@10.1.2.223 :<0.30728.0>:couch_log:info:39] Set view `default`, main group `_design/dev_ddoc1`, partition states updated
active partitions before: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
active partitions after: [64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,117,118,119,120,121,122,123,124,125,126,127,234]
passive partitions before: [128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
passive partitions after: [129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
cleanup partitions before: []
cleanup partitions after: [128]
replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
replicas on transfer before: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
replicas on transfer after: [149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169]
So it's successfully removed from from passive index but again left in a replica index. This seems to cause this exact failure. But another challenge is to understand why the vbucket was deleted on new master in the first place.
Hide
Aleksey Kondratenko
added a comment -
No we're doing another pass over diags and then we'll see
Show
Aleksey Kondratenko
added a comment - No we're doing another pass over diags and then we'll see
Hide
Aleksey Kondratenko
added a comment -
BTW, simply repeating all same tests with replica indexes _disabled_ may be helpful.
Show
Aleksey Kondratenko
added a comment - BTW, simply repeating all same tests with replica indexes _disabled_ may be helpful.
Hide
I figured out what's happened here.
a) we were incorrectly thinking you can transfer replica index to passive index. And did that. Which simply added passive index and kept vbucket in replica index
b) vbucket 128 was deleted (and than recreated) as part of incoming backfill into this vbucket. We normally handle (but note that stuff will have to be re-indexed because of that) it ok because when vbucket is deleted we remove it from index and when it's created back we add it back. But in this case because we were unaware that vbucket 128 is also in replica index we did not. And that crashed set index and us and rebalance.
Not using replica indexes seems to be right workaround. We'll work on that misunderstanding about set index api usage as part of DP5.
a) we were incorrectly thinking you can transfer replica index to passive index. And did that. Which simply added passive index and kept vbucket in replica index
b) vbucket 128 was deleted (and than recreated) as part of incoming backfill into this vbucket. We normally handle (but note that stuff will have to be re-indexed because of that) it ok because when vbucket is deleted we remove it from index and when it's created back we add it back. But in this case because we were unaware that vbucket 128 is also in replica index we did not. And that crashed set index and us and rebalance.
Not using replica indexes seems to be right workaround. We'll work on that misunderstanding about set index api usage as part of DP5.
Show
Aleksey Kondratenko
added a comment - - edited I figured out what's happened here.
a) we were incorrectly thinking you can transfer replica index to passive index. And did that. Which simply added passive index and kept vbucket in replica index
b) vbucket 128 was deleted (and than recreated) as part of incoming backfill into this vbucket. We normally handle (but note that stuff will have to be re-indexed because of that) it ok because when vbucket is deleted we remove it from index and when it's created back we add it back. But in this case because we were unaware that vbucket 128 is also in replica index we did not. And that crashed set index and us and rebalance.
Not using replica indexes seems to be right workaround. We'll work on that misunderstanding about set index api usage as part of DP5.
Hide
Farshid Ghods
added a comment -
yes we verified that rebalance does not fail with this error when replica indexes are disabled
Show
Farshid Ghods
added a comment - yes we verified that rebalance does not fail with this error when replica indexes are disabled
Show
Aleksey Kondratenko
added a comment - Fixed some time ago already
do we lose items after rebalance succeeds the second time ?