[MB-6967] dev-views index seems not to be built (at least for _count reduce) during rebalance with consistent view (1DD, 4 views) - build 1868 Created: 18/Oct/12  Updated: 06/Dec/12  Resolved: 21/Nov/12

Status: Resolved
Project: Couchbase Server
Component/s: documentation, ns_server, view-engine
Affects Version/s: 2.0-beta-2
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Sharon Barr (Inactive) Assignee: MC Brown (Inactive)
Resolution: Fixed Votes: 0
Labels: 2.0-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: MAC OS
build 1868

<manifest><remote name="couchbase" fetch="git://10.1.1.210/"/><remote name="membase" fetch="git://10.1.1.210/"/><remote name="apache" fetch="git://github.com/apache/"/><remote name="erlang" fetch="git://github.com/erlang/"/><default remote="couchbase" revision="master"/><project name="tlm" path="tlm" revision="ab70f6d42f46621ec576889e57cb37ac2d64a84b"><copyfile dest="Makefile" src="Makefile.top"/></project><project name="bucket_engine" path="bucket_engine" revision="70b3624abc697b7d18bf3d57f331b7674544e1e7"/><project name="ep-engine" path="ep-engine" revision="10b593cf4d97eaf062a6076878c5f8000d093ee9"/><project name="libconflate" path="libconflate" revision="2cc8eff8e77d497d9f03a30fafaecb85280535d6"/><project name="libmemcached" path="libmemcached" revision="ca739a890349ac36dc79447e37da7caa9ae819f5" remote="membase"/><project name="libvbucket" path="libvbucket" revision="00d3763593c116e8e5d97aa0b646c42885727398"/><project name="membase-cli" path="membase-cli" revision="7fe4121e7e83952a4cb032e25a2cb9fca1709354" remote="membase"/><project name="memcached" path="memcached" revision="06ab906e6702917c4b6b90a6b0051644719a357d" remote="membase"/><project name="moxi" path="moxi" revision="52a5fa887bfff0bf719c4ee5f29634dd8707500e"/><project name="ns_server" path="ns_server" revision="82c7c95e33445edcc6665e64216983e1027756e1"/><project name="portsigar" path="portsigar" revision="1bc865e1622fb93a3fe0d1a4cdf18eb97ed9d600"/><project name="sigar" path="sigar" revision="63a3cd1b316d2d4aa6dd31ce8fc66101b983e0b0"/><project name="couchbase-examples" path="couchbase-examples" revision="21e6161a1d064979b5c6aa99cd34ccc41c9d7aca"/><project name="couchbase-python-client" path="couchbase-python-client" revision="86b398e4fbc1f2e38d356e14df0c1bb4e3d2427b"/><project name="couchdb" path="couchdb" revision="69032c6dbcd64c056265416d2c49b1a9ee06f9c3"/><project name="couchdbx-app" path="couchdbx-app" revision="76d79be79c1454cff0f878d5a88a792270ec1b17"/><project name="couchstore" path="couchstore" 
revision="772f66a29a81be59d8bdaaa74b3898bb44fcc7e2"/><project name="geocouch" path="geocouch" revision="b0bd742551639c52030c070e5bf9390edbb536ba"/><project name="mccouch" path="mccouch" revision="88701cc326bc3dde4ed072bb8441be83adcfb2a5"/><project name="testrunner" path="testrunner" revision="621f470d068b2d20fce7b323e5f9fd44304db76c"/><project name="otp" path="otp" revision="b6dc1a844eab061d0a7153d46e7e68296f15a504" remote="erlang"/><project name="icu4c" path="icu4c" revision="26359393672c378f41f2103a8699c4357c894be7" remote="couchbase"/><project name="snappy" path="snappy" revision="5681dde156e9d07adbeeab79666c9a9d7a10ec95" remote="couchbase"/><project name="v8" path="v8" revision="447decb75060a106131ab4de934bcc374648e7f2" remote="couchbase"/><project name="gperftools" path="gperftools" revision="8f60ba949fb8576c530ef4be148bff97106ddc59" remote="couchbase"/><project name="pysqlite" path="pysqlite" revision="0ff6e32ea05037fddef1eb41a648f2a2141009ea" remote="couchbase"/></manifest>

Attachments: GZip Archive 10.3.3.104-8091-diag.txt.gz     GZip Archive 10.3.3.106-8091-diag.txt.gz     GZip Archive 10.3.3.107-8091-diag.txt.gz     GZip Archive 10.3.3.95-8091-diag.txt.gz     Zip Archive 204.236.154.91 - new node 2.txt.zip     Zip Archive 54.241.117.117 - new node 1.txt.zip    

 Description   
I tried the following scenario on the latest build:
1. Create a cluster of 2 nodes. Load 1M items, 1K each, with cbworkloadgen: ./cbworkloadgen -n localhost:8091 -i10000000 -t6 -j -s1000
2. Create 1 design document with 4 simple views, 2 of which had a _count reduce function so I could count all documents in that view.
3. I built the index and confirmed that the _count views returned 1M items, i.e. the index was built successfully.
4. I added 2 more nodes to the cluster and rebalanced.
5. During the rebalance, I refreshed the _count views.
Expected behavior (with consistent views enabled by default): I get the same 1M items during the rebalance.
Observed behavior: the number kept changing, decreasing over time to 900K, 800K, and so on.

Iryna, can you collect the necessary logs from this cluster and post them on this bug?
Cluster can be found at: http://184.169.209.178:8091/ (usual credentials)
Here are all the IP addresses:
ns_1@10.176.29.176 10.176.29.176 184.169.209.178 ns_1@10.176.9.41
ns_1@10.176.9.41 10.176.9.41 50.18.23.114 ns_1@10.176.9.41
ns_1@10.168.103.76 10.168.103.76 204.236.154.91 ns_1@10.176.9.41
ns_1@10.176.145.104 10.176.145.104 54.241.117.117 ns_1@10.176.9.41


I did a few things on this cluster, like creating and removing indexes/nodes, so look at the logs toward the end.




 Comments   
Comment by Sharon Barr (Inactive) [ 18/Oct/12 ]
More info:
At the end of the rebalance, there were 630K items in this view, and it seems the index was then rebuilt until it reached the 1M number.

It seems that the index is NOT being built during the rebalance.
Comment by Sharon Barr (Inactive) [ 18/Oct/12 ]
The 4 logs are 70MB compressed. Let me know where you want them to be uploaded.
I am continuing to mess with this cluster.
Comment by Sharon Barr (Inactive) [ 18/Oct/12 ]
And another important piece of information: there was NO load on the system, so the consistent views feature does not really have an impact here.
Comment by Sharon Barr (Inactive) [ 18/Oct/12 ]
Attached are logs from the new nodes that entered the cluster.
Logs from the original nodes are too large to attach.

The accurate observation is that at least the reduce views are not being built (or are only partially built). The count drops during the rebalance and stays low after it's done. Once I trigger the full view, it is built up again.


Comment by Filipe Manana [ 19/Oct/12 ]
What I see in the logs is that the initial index build is being done during rebalance.

During rebalance, vbucket state changes happen all the time, which causes the updater to be stopped and restarted after each state transition. Remember that the initial index build is not resumable, so each restart means it starts from scratch.

I don't think there's anything that can be done here. The only way I see is if it were possible for ns_server to know the final state of all vbuckets on each node, and then make a single bulk state transition request to the indexes.

This scenario has always been possible since the faster initial index build method was added months ago.
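The effect described above can be sketched with a toy model (plain JavaScript, not the actual Erlang view engine; all numbers are made up for illustration): a non-resumable build discards all progress on every interruption, so frequent per-vbucket state transitions multiply the total work.

```javascript
// Toy model of a NON-resumable initial index build: every vbucket
// state transition aborts the updater, and the next attempt starts
// again from document zero.
function buildWithInterruptions(totalDocs, docsPerTick, transitionTicks) {
  let indexed = 0;  // progress of the current build attempt
  let wasted = 0;   // documents indexed and then thrown away
  let tick = 0;
  const pending = new Set(transitionTicks);
  while (indexed < totalDocs) {
    tick++;
    indexed += docsPerTick;
    if (pending.has(tick) && indexed < totalDocs) {
      wasted += indexed; // non-resumable: all progress is discarded
      indexed = 0;
    }
  }
  return { tick, wasted };
}

// Uninterrupted, 100 docs at 10/tick would finish in 10 ticks.
// Two state transitions (at ticks 6 and 13) discard 60 + 70 docs
// of work, so the build only completes at tick 23.
const r = buildWithInterruptions(100, 10, [6, 13]);
```

If transitions arrive faster than a full build can complete, this model never terminates, which matches the observation that the index appears not to be built at all during rebalance.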
Comment by Deepkaran Salooja [ 19/Oct/12 ]
Couldn't access the cluster. Reproduced the same behavior on CentOS 64 with the latest build (1870). Attaching the diags from 4 nodes.

Steps followed:

Load 500K items in default bucket
Rebalance 1->2 nodes
Create 1 dev view:
map
function (doc, meta) {
  emit(meta.id, null);
}

reduce
_count

Query with full_set to create the index:
curl -X GET 'http://10.3.3.95:8092/default/_design/dev_d1/_view/v1?full_set=true&connection_timeout=60000&limit=10&skip=0'
{"rows":[
{"key":null,"value":500000}
]
}


Rebalance 2->4

During the rebalance, a full_set query returns < 500K in the output:

root@ubuntu1104-64:~# curl -X GET 'http://10.3.3.95:8092/default/_design/dev_d1/_view/v1?full_set=true&connection_timeout=60000&limit=10&skip=0'
{"rows":[
{"key":null,"value":472640}
]
}
root@ubuntu1104-64:~# curl -X GET 'http://10.3.3.95:8092/default/_design/dev_d1/_view/v1?full_set=true&connection_timeout=60000&limit=10&skip=0'
{"rows":[
{"key":null,"value":453125}
]
}

After the rebalance & indexing are finished, the full 500K are returned:
root@ubuntu1104-64:~# curl -X GET 'http://10.3.3.95:8092/default/_design/dev_d1/_view/v1?full_set=true&connection_timeout=60000&limit=10&skip=0'
{"rows":[
{"key":null,"value":500000}
]
}
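The dropping _count above can be modeled as a reduce that only sums the per-vbucket subtotals of partitions currently active in the index. This is a toy sketch (an assumed simplification, not the real couch_set_view code): partitions that are passive mid-transfer contribute nothing, so the total dips below the true document count.

```javascript
// Toy model: a _count reduce query sums per-vbucket subtotals for
// the partitions currently in the "active" set. Partitions still
// passive (e.g. being transferred during rebalance) are excluded.
function countReduce(vbuckets, activeSet) {
  let total = 0;
  for (const [vb, docCount] of Object.entries(vbuckets)) {
    if (activeSet.has(Number(vb))) total += docCount;
  }
  return total;
}

// 4 vbuckets, 125 docs each (500 docs total).
const vbuckets = { 0: 125, 1: 125, 2: 125, 3: 125 };

// Steady state: all partitions active, full count returned.
const steadyState = countReduce(vbuckets, new Set([0, 1, 2, 3]));

// Mid-rebalance: vbucket 3 is passive on this node, so the
// reduce result drops below the real total.
const midRebalance = countReduce(vbuckets, new Set([0, 1, 2]));
```

This matches the observed 472640 and 453125 values: the shortfall tracks however many partitions are out of the active set at query time.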
Comment by Filipe Manana [ 19/Oct/12 ]
Actually, it's not doing the initial build here, but what I said before is still valid.

For node .117 (node 1), what I see is that the index update is being stopped and restarted every 1 or 2 seconds due to vbucket state transition requests from ns_server:

https://friendpaste.com/4qH8PqvWO4tpGANlSxp0n3

In each transition, only one vbucket state changes.

I don't know if things could be better batched or not in ns_server.
Comment by Farshid Ghods (Inactive) [ 19/Oct/12 ]
Deep,

in your testing, were you running load during rebalancing?
Did you make sure the index update was completed before rebalancing?
Comment by Farshid Ghods (Inactive) [ 19/Oct/12 ]
Deep, Iryna and I had a conversation about this scenario; we will discuss with Filipe and update the ticket to understand consistent views and index updating better.
Comment by Deepkaran Salooja [ 19/Oct/12 ]
I didn't run any load during rebalancing. After the rebalance is finished, I can see indexing running if I do full_set queries.
Comment by Aleksey Kondratenko [ 19/Oct/12 ]
Every 1 or 2 seconds should be impossible, Filipe, at least for the main index. We are supposed to wait for index building completion for each vbucket movement.

Incoming vbucket transfers are not synchronized with respect to each other, and it's possible to have _some_ config changes due to that: i.e. we start moving _in_ vbucket 0, and 1 second later we can start moving _in_ vbucket 1, which AFAIK will cause all indexing progress to be thrown out if it's the initial index build.

But there's an inherent limit on concurrent incoming vbucket movements. Constant changes should be impossible.

The replica index is somewhat different currently. My code doesn't wait for replica indexing completion, which I'll be happy to fix. Does couch_set_view:monitor_partition_update/3 work for replica indexes as well?
Comment by Filipe Manana [ 20/Oct/12 ]
Well, unless the log timestamps are miscalculated, I see at least 1 state transition per second on average:

https://friendpaste.com/1Aoo4eeFhzLGFlpAW6lHQI

And no, it's not possible to wait for replica indexing of particular vbuckets the way it is for the main index.

I see lots of transitions (if not all, or at least the majority) changing only one vbucket's state. For example:

[couchdb:info,2012-10-19T0:39:28.779,ns_1@10.176.145.104:<0.20650.1>:couch_log:info:39]Set view `default`, main group `_design/dev_test`, partition states updated
active partitions before: [256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510]
active partitions after: [256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510]
passive partitions before: []
passive partitions after: [511]
cleanup partitions before: []
cleanup partitions after: []
unindexable partitions: []
replica partitions before: []
replica partitions after: []
replicas on transfer before: []
replicas on transfer after: []
pending transition before:
  active: []
  passive: []
  unindexable: []
pending transition after:
  active: []
  passive: []
  unindexable: []


Here it made only one change: it added vbucket 511 to the passive state.

Then 4 seconds later it makes a single change, moving vbucket 511 from the passive state to the active state:

[couchdb:info,2012-10-19T0:39:31.262,ns_1@10.176.145.104:<0.20650.1>:couch_log:info:39]Set view `default`, main group `_design/dev_test`, partition states updated
active partitions before: [256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510]
active partitions after: [256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511]
passive partitions before: [511]
passive partitions after: []
cleanup partitions before: []
cleanup partitions after: []
unindexable partitions: []
replica partitions before: []
replica partitions after: []
replicas on transfer before: []
replicas on transfer after: []
pending transition before:
  active: []
  passive: []
  unindexable: []
pending transition after:
  active: []
  passive: []
  unindexable: []


I understand it might not be possible for ns_server to batch things better, but this doesn't help.

I can revert some optimizations made several months ago for query performance, which would increase the incremental indexing checkpoint frequency. While that would likely help for such small datasets (at the cost of query performance in high-concurrency scenarios), for large datasets, or when there's a lot of load on the system (many ddocs, XDCR, bucket compaction, etc.), we'll very likely run into the same scenario as here.

Comment by Aleksey Kondratenko [ 20/Oct/12 ]
When I said impossible, I was referring to the initial index build, in which case waiting for a single vbucket will obviously wait for all of them.

For a future release we'll definitely seek better ways to interact with indexes. The problem, as I pointed out earlier, is that from our high-level perspective we're not very aware of the performance implications of what we do.
Comment by Aleksey Kondratenko [ 22/Oct/12 ]
By the way, an incorrect _count doesn't imply the index is not being built. I recommend actually checking whether the results you expect are there. We have that somewhat controversial "excluding" of reduction values during non-steady state, and I don't know whether we have heavily tested this path.
Comment by Farshid Ghods (Inactive) [ 22/Oct/12 ]
The user in this case is trying a reduce query, which should return the count as expected.

When you say we have controversial excluding of reduction values during non-steady state, is that expected behavior in the view engine?

Do we expect users to see the reduction return inconsistent results while the view results are always consistent?
Comment by Aleksey Kondratenko [ 22/Oct/12 ]
We expect it to be consistent. When I said "somewhat controversial" I was pointing out that it actually has to do quite a bit of work to return a reduction value in non-steady state, compared to steady state. And I was trying to say I have no idea whether we have tested this path a lot.
Comment by Iryna Mironava [ 23/Oct/12 ]
Please take into account that the issue can be reproduced only with development views; production views are working OK.
Comment by damien [ 23/Oct/12 ]
Apparently this only happens for development views, which are not supposed to be accurate across a cluster and rebalances.
Comment by Farshid Ghods (Inactive) [ 23/Oct/12 ]
Looked at the diags posted by Sharon and confirmed they are also on dev views:

               {design_documents,[<<"_design/dev_test">>]},
               {indexer_type,main},
               {set,<<"default">>},
               {signature,<<"701550d970bd93f8c7c9736d27d90b51">>},
               {started_on,1350606438},
               {type,blocked_indexer},
               {updated_on,1350606438}],
              [{pid,<<"<0.4324.6>">>},
               {design_documents,[<<"_design/dev_test">>]},
               {indexer_type,main},
               {set,<<"default">>},
               {signature,<<"701550d970bd93f8c7c9736d27d90b51">>},
               {started_on,1350606564},
               {type,blocked_indexer},
               {updated_on,1350606564}],
              [{pid,<<"<0.18271.6>">>},
               {design_documents,[<<"_design/dev_test">>]},
Comment by Farshid Ghods (Inactive) [ 23/Oct/12 ]
I'm not quite sure why this only happens on dev views and not on production views, though.
Comment by Farshid Ghods (Inactive) [ 23/Oct/12 ]
If this happens for dev views, we should note in the release notes that users might get inconsistent results during rebalancing with development views.
Comment by Filipe Manana [ 24/Oct/12 ]
From the view engine's perspective, there's no distinction between (or concept of) development and production views; they're all treated the same way.
ns_server, on the other hand, gives them different treatment, for example not triggering index updates for development views.
Comment by Farshid Ghods (Inactive) [ 25/Oct/12 ]
Alk,

from ns_server's point of view, w.r.t. rebalancing and consistent views, do we treat dev views any differently?
Comment by Aleksey Kondratenko [ 25/Oct/12 ]
Farshid, I could not completely understand your question.

Pure dev views are not at all affected during rebalance. After all, they only cover a single vbucket. Even if that vbucket is moved, I believe it will safely use another one.

If we're speaking about dev views with the full_set option set to true, they are no different from production views.
Comment by Farshid Ghods (Inactive) [ 25/Oct/12 ]
According to the bug description, we are getting inconsistent results during rebalancing, but this does not happen for production views.

The tester also waits until the index is built before running the rebalance, and full_set is used for all view queries.
If there is no difference between dev views and production views during rebalancing, then we should not be seeing different behavior.
Comment by Aleksey Kondratenko [ 25/Oct/12 ]
Well, there is _no_ difference between dev views with full_set and production views. Thus it appears that this bug is genuine and needs to be fixed.
Comment by Farshid Ghods (Inactive) [ 25/Oct/12 ]
ok thanks.
Comment by Rahim Yaseen [ 26/Oct/12 ]
Alk, if you think this is a genuine bug, can you please review and do some triage, and let me know what we should do with this bug?
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
OK, I was somewhat wrong. Here's the full explanation.

As per Frank's original idea, development design docs are supposed to be used on a subset of the data. People are supposed to play with map/reduce functions without throwing the entire cluster's power at building the full index. When that development process is done, people can try it out on the full data set a few times, which builds the full index. We expect that once this is done, people are satisfied and will publish the dev_ design document to production. Our (and stock) indexes have a nice feature where identical ddocs are represented by the same index file, so the work spent building the full index will not be lost, as the production design doc will be 100% the same as the dev_ design doc that was just indexed in full-set mode.

The key point is that the system explicitly avoids automagically building or updating that full index underneath dev_ design docs, because that would waste the cluster's resources without an explicit human request.

Thus, as part of rebalance, we do trigger and wait for index updates for all non-dev design docs, explicitly skipping development ddocs. Even then, the user should still be able to see consistent results with stale=false queries, which force an index update before returning results.

So indeed not a bug, and perhaps worth documenting.
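The dev_ behavior described above can be sketched as a toy model (assumed behavior for illustration, not the real view-engine API): the index of a dev_ ddoc never updates automatically, so its _count lags the bucket, while a stale=false query forces an update before answering.

```javascript
// Toy model of a dev_ design-doc index: it is never updated
// automatically; only an explicit update (or a stale=false query,
// which updates before answering) brings it up to date.
class DevViewIndex {
  constructor() { this.indexedDocs = 0; }
  update(bucketDocs) { this.indexedDocs = bucketDocs; }
  query(bucketDocs, { stale }) {
    if (stale === false) this.update(bucketDocs); // update-before-query
    return this.indexedDocs;                      // _count-style result
  }
}

const index = new DevViewIndex();
index.update(1000);      // user builds the full index once
const bucketDocs = 1500; // more docs arrive; no automatic update

const staleOk = index.query(bucketDocs, { stale: "ok" });     // lagging
const staleFalse = index.query(bucketDocs, { stale: false }); // fresh
```

Under this model, only stale=false queries against dev views are expected to be consistent during rebalance, which is the behavior Alk describes.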
Comment by Aleksey Kondratenko [ 26/Oct/12 ]
See above
Comment by Steve Yen [ 27/Oct/12 ]
Thanks Aliaksey.

Based on Aliaksey's explanation, marking this a non-blocker and leaving it for MC to document this non-obvious but "working as designed" behavior for dev views.
Comment by Farshid Ghods (Inactive) [ 29/Oct/12 ]
Please note that this behavior is only expected for development views (both full_set=true and partial set). It does not apply to production views.
Comment by MC Brown (Inactive) [ 21/Nov/12 ]
Documentation has been updated with this information in both the view types and auto-update sections of the manual.
Comment by kzeller [ 06/Dec/12 ]
Added to RN as:

"If you are using development views, be aware that you may see
inconsistent results if you query a development view during rebalance. For production views,
you can query during rebalance and get results consistent with those you would have received
if no rebalance were occurring."
Generated at Fri Aug 22 02:48:21 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.