[MB-4692] Reduce view returns incorrect results Created: 24/Jan/12  Updated: 23/Jul/12  Resolved: 06/Feb/12

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-developer-preview-4
Fix Version/s: 2.0-developer-preview-4
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aliaksey Artamonau Assignee: Karan Kumar (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File add.py     File del.py     File incorrect_results.tar.bz2     File logs.tar.bz2     File ns-diag-20120124155027.txt.bz2    

 Description   
Created a 10 node cluster. Created a view {"reduce":{"map":"function (doc) {\n emit(doc._id, null);\n}","reduce":"_count"}} and uploaded 100k JSON items using mcsoda. Queried the view with stale=false; the result was correct. Started removing nodes one by one from the cluster while running view queries. After the second node was removed, the view started returning more than 100k items. I figured out that all duplicated rows come from a single node, and on this node all the duplicated rows come from three vbuckets: 215, 216, 217. There was a period of time when these vbuckets were reported by the set views as both passive and replica:

 Set view `default`, main group `_design/dev_test`, partition states updated
active partitions before: [73,74,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,101,102,103,240,241,242]
active partitions after: [73,74,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,101,102,103,240,241,242]
passive partitions before: [215,216,217]
passive partitions after: [215,216,217]
cleanup partitions before: []
cleanup partitions after: []
replica partitions before: [6,7,8,32,33,34,58,59,60,113,114,115,127,139,140,141,155,164,165,188,189,190,208,211,214,215,216,217,233,236,239,244,249]
replica partitions after: [6,7,8,32,33,34,58,59,60,113,114,115,127,139,140,141,155,164,165,188,189,190,208,211,214,215,216,217,233,236,239,244,249]
replicas on transfer before: [215,216,217]
replicas on transfer after: [215,216,217]

The sequence of calls performed by ns_server seems to be correct. I'm attaching full logs and diag from this node.
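
For reference, a minimal sketch (not from the ticket) of the stale=false reduce query described above. The bucket name "default" and design document "_design/dev_test" appear in the log excerpt; the port and the view name "test" are assumptions.

import json
import urllib.request

# Hypothetical endpoint: port and view name are assumptions; bucket
# "default" and design doc "_design/dev_test" are taken from the logs.
url = ("http://localhost:8092/default/_design/dev_test/"
       "_view/test?stale=false")

with urllib.request.urlopen(url) as resp:
    body = json.load(resp)

rows = body.get("rows", [])
# With a _count reduce, a single row is returned whose value is the total
# number of emitted rows; after loading 100k items it should be 100000.
print("reduce value:", rows[0]["value"] if rows else 0)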

 Comments   
Comment by Aliaksey Artamonau [ 26/Jan/12 ]
Following Filipe's advice, I added an additional check to couch_set_view:modify_bitmasks on the intersection between the active partitions in the main index and the partitions from the replica index. But these sets are disjoint even when the view gives an incorrect result.
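
As an illustration only (the real check is Erlang code inside couch_set_view:modify_bitmasks), the invariant being asserted is roughly the following, with the partition lists copied from the log excerpt above:

# Illustration only -- not the actual Erlang implementation. The invariant
# is that the main index's active partitions and the replica index's
# partitions never overlap. Lists copied from the log excerpt above.
main_active = {73, 74, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
               91, 92, 93, 94, 95, 96, 97, 98, 101, 102, 103, 240, 241, 242}
replica_parts = {6, 7, 8, 32, 33, 34, 58, 59, 60, 113, 114, 115, 127, 139,
                 140, 141, 155, 164, 165, 188, 189, 190, 208, 211, 214, 215,
                 216, 217, 233, 236, 239, 244, 249}

overlap = main_active & replica_parts
# The sets are disjoint here as well, matching the observation that the
# check never fires even though the view returns incorrect results.
assert not overlap, "active/replica overlap: %s" % sorted(overlap)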
Comment by Aliaksey Artamonau [ 26/Jan/12 ]
http://review.couchbase.org/#change,12711 does not make any difference for me.
Comment by damien [ 27/Jan/12 ]
This is an exact list of instructions for reproducing bad views with discrete, non-concurrent steps (100% reproducible, no need to try to make race conditions happen).

Get a fresh repo named couchbase, and copy the del.py and add.py files to couchbase/ep-engine/management/

cd couchbase
make
cd ns_server
make dataclean
./cluster_run --nodes=2

from another terminal:

cd couchbase/ns_server
./cluster_connect -n 1
cd ../ep-engine/management
python add.py

From the web ui, create a new view.
Click "Views" at top
Click "Create Development View" button
Enter test names for the design document and view
Edit the view and change map function to:
function (doc) {
emit(doc._id, 1);
}
For the reduce, click _count
Click "Save" button
Click "Full Cluster Data Set" button
Click the generated Url to open the raw json view in another browser window
Keep refreshing until Value is 100000 (or whatever you expect)
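
The "keep refreshing" step can also be scripted. A rough sketch; the URL is a placeholder for the generated development-view URL copied from the UI, and full_set/expected count are assumptions for this setup:

import json
import time
import urllib.request

# Placeholder for the generated development-view URL shown in the UI.
url = "http://localhost:8092/default/_design/dev_test/_view/test?full_set=true"
expected = 100000  # or whatever count you expect

while True:
    with urllib.request.urlopen(url) as resp:
        rows = json.load(resp).get("rows", [])
    value = rows[0]["value"] if rows else 0
    print("current reduce value:", value)
    if value == expected:
        break
    time.sleep(1)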

From previous terminal window:
python del.py

NOTE: DO NOT REFRESH THE VIEW FROM BROWSER YET!
From the web ui click "Server Nodes" at top.
Click "Add Server" button
Enter in the same IP address, and increment the port by one:
Example:
Server: 10.2.1.60:9001
Username: Administrator
Password: asdasd

Click "Add Server" button
Click "Rebalance" button

When rebalance finishes, go back to your raw JSON view in the browser, and refresh. Keep refreshing until the value stops changing.

The value will be non-zero. BUG!!!!!

Now go back to "Server Nodes" in web ui
Click the "Remove" button for the newly added node.
Click "Rebalance" button

When rebalance finishes, go back to your raw JSON view in the browser, and refresh. The value should be the same as before. This indicates the bad values are coming from the first node.

Comment by damien [ 27/Jan/12 ]
Attached add.py and del.py, used to reproduce the steps above.
Comment by Filipe Manana [ 27/Jan/12 ]
http://review.couchbase.org/#change,12767 fixes it
Comment by Aliaksey Artamonau [ 28/Jan/12 ]
Reproduced it with all the latest fixes using the same scenario (though it definitely happens less frequently). After another rebalance out, the view constantly returns more items than there are in the bucket. I figured out that one of the nodes returns items from vbucket 250, which is not activated in the index. It used to be active, but then set_partition_states with cleanup_partitions=[250] was called. Will attach full logs from this node soon.
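
(Not part of the original comment.) One way to confirm which vbucket a returned row's key maps to is the standard client-side CRC32 vbucket hash; a sketch, assuming 256 vbuckets, consistent with the partition numbers seen in the logs:

import zlib

NUM_VBUCKETS = 256  # assumption for this dev-preview cluster

def vbucket_of(key):
    # Standard libvbucket-style CRC32 mapping of a key to a vbucket id.
    crc = zlib.crc32(key.encode("utf-8")) & 0xffffffff
    return ((crc >> 16) & 0x7fff) & (NUM_VBUCKETS - 1)

# e.g. check whether the keys of unexpected rows hash to vbucket 250
print(vbucket_of("some-doc-id"))  # hypothetical document id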
Comment by Farshid Ghods (Inactive) [ 28/Jan/12 ]
./testrunner -i b/resources/dev-4-nodes.ini -t viewtests.ViewTests.test_count_reduce_100k_docs

It happens even with a single node, but less frequently than before.
Comment by Filipe Manana [ 29/Jan/12 ]
@Aliaksey

Need more info on how to reproduce this. Are the query results inconsistent during failover or rebalance (or both)? Are they temporary (only during rebalance or failover) or permanent?

Please make sure all your nodes have the following couchdb commit:
https://github.com/couchbase/couchdb/commit/43c6b744c8a110c5a1f6f9a2039fcc405cbff1a9


@Farshid

Farshid, I ran that test locally; it sometimes fails for me too.
One thing I notice is that the test's queries don't specify ?stale=false. I think this is what's making the test fail often.
I changed the test viewtests.ViewTests.test_count_reduce_100k_docs locally to add stale=false to all queries, and with that the test always passes for me:

http://friendpaste.com/5OUPCfOUHxEG4HBB0qU7r9

Can you verify that?
Comment by Aliaksey Artamonau [ 29/Jan/12 ]
Results were permanently inconsistent after rebalancing out several nodes. All the nodes were built with the commit you're referring to.
Comment by Steve Yen [ 06/Feb/12 ]
need repro?
Comment by Karan Kumar (Inactive) [ 06/Feb/12 ]
Confirmed that test_count_reduce_x_docs passes.