[MB-4500] reduce map show different count number at each node Created: 01/Dec/11  Updated: 09/Jan/13  Resolved: 18/Jan/12

Status: Closed
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0-developer-preview-3
Fix Version/s: 2.0-developer-preview-4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Aliaksey Artamonau
Resolution: Fixed Votes: 0
Labels: 2.0-DP3-release-notes, 2.0-dev-preview-4-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 5.4 64bit on ec2

Attachments: GZip Archive log153.gz     GZip Archive log187.gz     GZip Archive log206.gz    

 Description   
Install Couchbase Server 2.0.0r-266 on 3 nodes at EC2.
Use mcsoda to load 200000 items into the cluster.
Create a view named `one` and run some queries.
Shut down one node (A) and fail it over.
Check the reduce count on view `one`. Ok
Reinstall Couchbase Server 2.0.0r-266 on node A and add it back to the cluster.
Rebalance. Ok
Check the reduce count on view `one`. Ok
Shut down node B and fail it over.
Check the reduce count on the cluster. Ok
Reinstall Couchbase Server 2.0.0r-266 on node B and add it back to the cluster.
Rebalance. Ok
Check the reduce count on the full cluster. Failed
Restart Couchbase Server on all 3 nodes of the cluster.
The reduce count is different on each node.
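The check performed at each step above amounts to querying the same reduce view on every node and verifying that all nodes report the same count. A minimal sketch of that check (the node names and counts below are made up for illustration; in the actual test the counts came from querying view `one` on each node):

```python
def reduce_counts_consistent(counts_by_node):
    """Return True if every node reports the same reduce count."""
    return len(set(counts_by_node.values())) <= 1

# Hypothetical counts: a healthy cluster agrees on the total,
# while the failure mode above has each node reporting its own number.
healthy = {"node_a": 200000, "node_b": 200000, "node_c": 200000}
broken = {"node_a": 137000, "node_b": 152000, "node_c": 200000}

print(reduce_counts_consistent(healthy))  # True
print(reduce_counts_consistent(broken))   # False
```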



 Comments   
Comment by Filipe Manana [ 02/Dec/11 ]
So, grepping each log for the last occurrence of "Set view `default`, group `_design/dev_one`, partition states updated", I can see that node 187 has the index without any active partitions defined.

log187:

[couchdb:info] [2011-12-01 18:28:54] [ns_1@10.98.186.187:<0.19609.9>:couch_log:info:39] Set view `default`, group `_design/dev_one`, partition states updated
abitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, abitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
pbitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111, pbitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
cbitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000, cbitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000011111111111111111111111111111111111111111111000000000000000000000000000000000000000000111111111111111111111111111111111111111111

log153:

[couchdb:info] [2011-12-01 18:28:54] [ns_1@10.124.193.153:<0.32606.0>:couch_log:info:39] Set view `default`, group `_design/dev_one`, partition states updated
abitmask before 1111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000, abitmask after 1111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000
pbitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, pbitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
cbitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, cbitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

log206:

[couchdb:info] [2011-12-01 18:28:52] [ns_1@10.90.182.206:<0.31550.0>:couch_log:info:39] Set view `default`, group `_design/dev_one`, partition states updated
abitmask before 1000000000000000000000000000000000000000000000000000000000000000000000000000001110000011111111111111111111111111111111111111111100000000000000000000000000000000000000000001111111111111111111111111111111111111111111000000000000000000000000000000000000000000, abitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000001110000011111111111111111111111111111111111111111100000000000000000000000000000000000000000001111111111111111111111111111111111111111111000000000000000000000000000000000000000000
pbitmask before 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, pbitmask after 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
cbitmask before 0111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, cbitmask after 1111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

In an Erlang shell, we can see that across the whole cluster we only have 174 active partitions instead of 256:

2> N206 = couch_set_view_util:decode_bitmask(2#0000000000000000000000000000000000000000000000000000000000000000000000000000001110000011111111111111111111111111111111111111111100000000000000000000000000000000000000000001111111111111111111111111111111111111111111000000000000000000000000000000000000000000).
[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
 61,62,63,64,65,66,67,68,69,70|...]
3> N187 = couch_set_view_util:decode_bitmask(2#0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000).
[]
4> N153 = couch_set_view_util:decode_bitmask(2#1111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000).
[85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,
 103,104,105,106,107,108,109,110,111,112,113|...]
5>
5> io:format("~w~n", [N206]).
[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,175,176,177]
ok
6> io:format("~w~n", [N153]).
[85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255]
ok
7>
7> O1 = ordsets:from_list(N153).
[85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,
 103,104,105,106,107,108,109,110,111,112,113|...]
8> O2 = ordsets:union(O1, ordsets:from_list(N206)).
[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,
 61,62,63,64,65,66,67,68,69,70|...]
9>
9> io:format("~p~n", [O2]).
[42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,
 67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,
 92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,
 113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,
 132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,
 151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,
 175,176,177,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,
 229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,
 248,249,250,251,252,253,254,255]
ok
10> length(O2).
174
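The Erlang session above can be cross-checked outside the shell. Here is a sketch in Python: `decode_bitmask` reimplements the bit-to-partition-ID mapping (it is not the actual `couch_set_view_util` code), and the partition lists for nodes 153 and 206 are the decoded values shown above (node 187 had none):

```python
def decode_bitmask(mask):
    """Return the partition IDs whose bits are set (bit 0 = partition 0)."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

# Small example: bits 0, 2, and 5 set.
print(decode_bitmask(0b100101))  # [0, 2, 5]

# Active partitions decoded from the abitmasks in the logs above.
n153 = list(range(85, 128)) + list(range(213, 256))
n206 = list(range(42, 85)) + list(range(128, 170)) + [175, 176, 177]

# Node 187 contributes nothing, so the cluster-wide active set is the
# union of the other two nodes: only 174 of the 256 partitions.
union = sorted(set(n153) | set(n206))
print(len(union))  # 174
```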

Is it possible ns_server missed an index state update on node 187?
Comment by Dipti Borkar [ 08/Dec/11 ]
Is the reduce result incorrect only when a node fails over, the cluster gets rebalanced, and the node is added back?
Comment by Aleksey Kondratenko [ 19/Dec/11 ]
Tony, can you retest with the latest UI? We think we're hitting indexing timeouts. The newer UI will indicate that.

Also, as part of fixing this, we need to be able to specify very large timeouts so that any indexing activity can complete.
Comment by Steve Yen [ 18/Jan/12 ]
Marking this resolved, as Aliaksey A (standing over my desk here) believes it's fixed.
Generated at Thu Aug 21 10:19:13 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.