[MB-6726] Rebalance is slow when indexing/compaction and query load are going on in parallel Created: 25/Sep/12  Updated: 01/Nov/13  Resolved: 01/Nov/13

Status: Closed
Project: Couchbase Server
Component/s: performance
Affects Version/s: 2.0-beta-2
Fix Version/s: 2.5.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Thuan Nguyen
Resolution: Won't Fix Votes: 0
Labels: 2.0.1-release-notes, pblock, system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64bit build 2.0.0-1746

Attachments: File master_events.log.svg     GZip Archive master_events.tar.gz     PNG File rebalance_progress.png    
Issue Links:
Dependency
depends on MB-8045 [RN 2.0.2] Try view performance test ... Resolved
Duplicate
is duplicated by MB-6769 Very slow rebalance progress while re... Closed

 Description   
Cluster information:
- 8 CentOS 6.2 64-bit servers with 4-core CPUs
- Each server has 32 GB RAM and a 400 GB SSD disk.
- SSD disk formatted ext4, mounted on /data
- Each server has its own drive; no disk is shared with other servers.
- Cluster has 2 buckets, default (11GB) and saslbucket (11GB), with consistent views enabled. The 2 buckets together use only 68% of the system's total RAM.
- Load 12 million items into saslbucket and 45 million items into the default bucket. Each key has a size from 512 bytes to 1500 bytes.
- Each bucket has one design doc with 2 views (default d1 and saslbucket d11); a minimal sketch of such a design doc follows below.
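
For reference, a minimal sketch of how a design doc with 2 views of that shape could be created through the CouchAPI port (8092) with a PUT request. The view names and map functions here are illustrative assumptions only, not the actual d1/d11 definitions used in this test, and the host and credentials are placeholders:

    # Hypothetical design doc with 2 views; the real d1/d11 definitions may differ.
    curl -X PUT -u Administrator:password \
         -H 'Content-Type: application/json' \
         http://10.6.2.37:8092/default/_design/d1 \
         -d '{"views":{
               "view1":{"map":"function (doc, meta) { emit(meta.id, null); }"},
               "view2":{"map":"function (doc, meta) { emit(doc.type, 1); }"}
             }}'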

* Create a cluster of 4 nodes with Couchbase Server 2.0.0-1746 installed:

10.6.2.37
10.6.2.38
10.6.2.39
10.6.2.40

* Data path /data
* View path /data

* Keep the load running at 6K ops/sec and queries at 400 to 500 per second
* Add 4 nodes to the cluster and rebalance:
10.6.2.42
10.6.2.43
10.6.2.44
10.6.2.45

* Rebalance hangs. Filed bug MB-6706.
* Try the rebalance again. The cluster rebalances saslbucket first, which took 9 hours to complete.
* The default bucket took 32 hours to complete.

Link to collect_info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/8nodes-col-info-1746-reb-slow-constn-enable-20120925-141314.tgz

Each atop file is 1.5 GB to 2.5 GB in size. If the atop files are needed, I will upload them on request.


 Comments   
Comment by Chiyoung Seo [ 25/Sep/12 ]
http://www.couchbase.com/issues/browse/MB-6714

The fix was just merged today. Please test it with the next build.
Comment by Ketaki Gangal [ 27/Sep/12 ]
Seeing slower rebalance as on MB-6769 with latest ep-engine changes.
Comment by Aleksey Kondratenko [ 27/Sep/12 ]
Ketaki, XDCR is a very different beast. I've seen it cause very severe CPU load and weird memory usage spikes, which could affect anything.

I need:

a) diags

b) I think brand new bug is a good idea here
Comment by Ketaki Gangal [ 28/Sep/12 ]
Ok, I had filed bug 6769 for the same issue, but after some discussion with other folks I closed it as a duplicate of this one.

We can have an XDCR-specific bug for understanding this behaviour.
Closing this one and re-opening 6769, with more logs.
Comment by Ketaki Gangal [ 02/Oct/12 ]
This bug was opened by Tony for slowness on rebalance and was fixed with http://www.couchbase.com/issues/browse/MB-6714.

Closed this bug since we have another rebalance-xdcr slow bug MB-6769.

Please re-open this if this case is valid for the above system-test scenarios.
Comment by Farshid Ghods (Inactive) [ 03/Oct/12 ]
Blocker for 2.0 beta-2, as consistent views need to be functional and perform reasonably fast for 2.0 beta-2.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
The _original_ bug was indeed fixed by the ep-engine folks.

I've heard we're still having some slowness with indexes. Please open a new bug with new and relevant diagnostics.
Comment by Karan Kumar (Inactive) [ 04/Oct/12 ]
We are still investigating slowness with indexes+rebalance+consistent views
Comment by Sharon Barr (Inactive) [ 05/Oct/12 ]
Isn't this a duplicate of MB-6796?
Comment by Farshid Ghods (Inactive) [ 09/Oct/12 ]
Once that's fixed we can retest.
Comment by Steve Yen [ 18/Oct/12 ]
Pavel's test yesterday shows views + rebalance performance of...

> Rebalance from 3 to 4 nodes with views (consistent view enabled).
> just 4.7h (2.0.0-1858) vs. 5.3 h (2.0.0-1792).

https://www.yammer.com/couchbase.com/#/Threads/show?threadId=224286191
Comment by Thuan Nguyen [ 22/Oct/12 ]
Re-tested the orange cluster with build 2.0.0-1862; swap rebalance took 43 hours to complete. Saslbucket took 20 hours and the default bucket took 23.

The cluster has 2 buckets (2 replicas), default and saslbucket. Each bucket has 15 million items with key sizes from 512 to 1024 bytes.
Do a swap rebalance: add 2 nodes and remove 2 nodes. During rebalance, maintain a load of about 6K ops on each bucket and about 200 to 300 queries per second on each bucket.
Comment by Thuan Nguyen [ 22/Oct/12 ]
Link to manifest file of build 2.0.0-1862 http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.0-1862-rel.rpm.manifest.xml
Comment by Chiyoung Seo [ 22/Oct/12 ]
Farshid,

I don't think there is anything that I can do in ep-engine to resolve this issue.

We know that this slowness is caused by enabling indexing / querying during rebalance with consistent views.

Please assign it to Filipe or Alk for further investigations.
Comment by Farshid Ghods (Inactive) [ 23/Oct/12 ]
Assigning this to Siri/Mike for further investigation and triage.
Comment by Steve Yen [ 23/Oct/12 ]
Assigning to Mike as Sriram is on a plane at the moment. Need to get Alk + Damien on this.
Comment by Steve Yen [ 01/Nov/12 ]
Latest update: Pavel, Alk & Steve have agreed on a new, larger test with more isolation. Awaiting results.
Comment by Steve Yen [ 01/Nov/12 ]
notes from bug-scrub discussion...

- farshid: QA is now trying system testing with fewer vbuckets (64), which is an unsupported configuration but is needed so test results can be turned around faster.

- alk is considering batching -- on a small scale his tests show it might help (but it's unknown what would happen at larger scale).

- alk: consider recommending users disable consistent views before they rebalance.

- yaseen: need to find out the boundary conditions around this.
Comment by Pavel Paulau [ 01/Nov/12 ]
Results for Alk's spec:

https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdFVINGEyZ0hzOGUtbWVHeUU3T3Rlc1E#gid=0

We should use more ddocs/views; otherwise indexing is too fast.
Comment by Pavel Paulau [ 01/Nov/12 ]
Conclusions:
-- indexing is not a bottleneck
-- compaction doesn't affect results significantly (again)
-- we have to try three additional dimensions: more items (20M), more design docs (2), fewer vbuckets (512).
Comment by Steve Yen [ 05/Nov/12 ]
Alk says no more experiment ideas at the moment for Pavel.
Comment by Steve Yen [ 14/Nov/12 ]
options reviewed in bug-scrub...

- sequential vs parallel fix for rebalancing (start next phase right after backfill's done) [alk]

- vbucket-id as prefix to back-index [filipe] - early laptop results look promising
Comment by Steve Yen [ 20/Nov/12 ]
bug-scrub moved to 2.0.1
Comment by Mike Wiederhold [ 17/Dec/12 ]
Editing the component to "view engine" and "ns_server". If there is anything that needs to be done for this on the ep-engine side please let us know.
Comment by Farshid Ghods (Inactive) [ 09/Jan/13 ]
The system test team needs to rerun the view tests (8-node SSD).
Comment by Pavel Paulau [ 10/Jan/13 ]
As of build 121 there is no significant improvement in the perf tests.
Comment by Farshid Ghods (Inactive) [ 10/Jan/13 ]
Can you paste the link to the results? (comparing this build against 2.0 GA)
Comment by Pavel Paulau [ 10/Jan/13 ]
2.0.0-1971:

http://qa.hq.northscale.net/job/eperf-graph-loop/1217/artifact/reb-vperf-10M.loop_2.0.0-1971-rel-enterprise_2.0.0-1971-rel-enterprise_run_1_Dec-04-2012_22:54:50.pdf

2.0.1-121:

http://qa.hq.northscale.net/view/eperf-graphs/job/eperf-graph-loop/1334/artifact/reb-vperf-10M.loop_2.0.1-121-rel-enterprise_2.0.1-121-rel-enterprise_run_1_Jan-09-2013_06%3A40%3A40.pdf

It's not quite a fair comparison, but rebalance time is still ~4 hours.
Comment by Farshid Ghods (Inactive) [ 10/Jan/13 ]
Rebalance is 10% faster, but that could be environment noise.

Looks better:
- CPU utilization is 5% better
- the memory usage numbers don't look quite right (ns-server avg reports 21235692 MB == 21 GB?)

Looks worse:
- runtime is 8 hours longer
- query ops/sec dropped from 537 to 339, but query latency is far better (48 msec in 2.0.1 compared to 58 in 2.0)
- drain rate is slower by 15%

We need to find out why query ops/sec is so much lower.

reb-vperf-10M.conf
# Perf-rebalance test with views
# 1 design ddoc, 8 views
# 8K ops/sec
# 80% reads, 20% writes (12% updates/deletes, 8% inserts)
# 10M dataset
# Rebalance from 3 to 4 nodes
Comment by Farshid Ghods (Inactive) [ 10/Jan/13 ]
Alk/Filipe,

There are initial results available for the litmus test. Can you also review the results and comment on what we need to change in the test case?

From my discussion with Alk/Filipe I heard that adding more ddocs would help. Is there anything else we want to change in the test?
Comment by Filipe Manana [ 10/Jan/13 ]
I do have measured results, and documented steps to reproduce, for rebalance-out cases.
What I see is that, especially for the 1 ddoc with 1 view case, the view engine (indexing and compaction) is poorly utilized:

https://docs.google.com/document/d/1eUV53B5pXj-5X5FPcbOA3jz1U6jOjh1hnZmZa9bUNs0/edit

So you have a bit more than theory there.
Comment by Pavel Paulau [ 11/Jan/13 ]
As I mentioned before, it is *not* a fair comparison.

After the 2.0 release I revised the config, and rebalance tests now use 3 ddocs with 3 views per ddoc.
Moreover, after the 2.0 release we discovered an issue with one of the SSD drives, so for many reasons we need a re-run.

But in any case, 4 hours doesn't look like something ultra fast.

We are doing more experiments; for instance, rebalance out 4 -> 3 with no ops/queries took 7h (40M items) and 14h (100M items). Does that sound good?
Comment by Filipe Manana [ 11/Jan/13 ]
New results, for build 2.0.1-126:

https://docs.google.com/document/d/1aHHhY-ami84-aEQyx42h95rlp7Vi6WTwlozaIstFgfA/edit

In the meantime, Pavel has already re-triggered the tests he mentioned for this new build.
Comment by Pavel Paulau [ 14/Jan/13 ]
Sorry, it's not faster at all.
The same regression as Ronnie reports:

https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0
Comment by Farshid Ghods (Inactive) [ 14/Jan/13 ]
Summarizing what I see in the Google doc above:

rebalance out 10M items: 2.0.1 (+A) is 15% slower than 2.0 GA (+S16)
rebalance in 10M items: 2.0.1 (+A) is 40% slower than 2.0 GA (+S16)

Pavel is planning on rerunning some tests with the +A option and posting more results.
Comment by Pavel Paulau [ 14/Jan/13 ]
We should thank Filipe; without the latest improvements it'd be much worse.
Comment by Filipe Manana [ 22/Jan/13 ]
evperf results for build 2.0.1-139:

https://docs.google.com/document/d/16BQu-sD32U4qZxEK8KcvVZMdcqH_klV90DEbnoCu-_E/edit

Especially for the 40M, 4 -> 3 rebalance-out case without load in parallel, indexing + index compaction is still far from being a bottleneck.

Previously, until a month or so ago, indexing was triggered sequentially on every node for rebalance out, so improving that would theoretically give close to a 3x faster rebalance. That didn't happen, as the rebalancer now triggers compactions and waits for them to finish before issuing more indexing work, amongst other details possibly. However, index compaction doesn't justify not seeing a significantly faster rebalance, let's say around 2x for example.

Talking with Pavel, no significant speedups were observed for evperf rebalance tests after those improvements were merged.

There's still something missing somewhere.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Based on recent perf runs from Pavel, the 2.0.1-139-rel runs are much faster than 2.0.0 but are marked as not consistent.

We are still waiting for Pavel to rerun these tests.
Comment by Farshid Ghods (Inactive) [ 22/Jan/13 ]
Looking at the chart:
2.0.1-139 IN 10M 3 x 3 ASYNC 2063
2.0.1-139 OUT 10M 3 x 3 ASYNC 2401

But both are marked as not consistent. Compared to 2.0.0-1976 GA:

2.0.0-1976 OUT 10M 3 x 3 ASYNC 18491, 15391
2.0.0-1976 IN 10M 3 x 3 ASYNC 12035
Comment by Pavel Paulau [ 23/Jan/13 ]
"not consistent" means rebalance aware index disabled.

Alk requested these results.
Comment by Filipe Manana [ 24/Jan/13 ]
Slightly related: MB-7413 brought a 3x speedup to incremental index updates, which was thought to be the bottleneck. But after it was added, it didn't seem to have any significant impact on the rebalance time.

There's now an evperf test to verify this, outside the rebalance context (added by CBD-767, thanks Pavel!).

It confirms the 3x speedup:


** build 2.0.1-120 (before MB-7413)

jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/62/console

incremental update time: 5676.96956992 seconds

** build 2.0.1-121 (when MB-7413 was first introduced)

jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/63/console

incremental update time: 1812.51270294 seconds

** build 2.0.1-140 (latest 2.0.1)

jenkins: http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/64/console

incremental update time: 1939.70245409 seconds

Comment by Pavel Paulau [ 26/Jan/13 ]
Build 143: master events logs for rebalance in 3->4, out 4->3, and swap 3->3.
Comment by Jin Lim [ 11/Feb/13 ]
Pavel, please provide the latest performance numbers based on the latest build. I believe we have made good progress since build 145?
Comment by Pavel Paulau [ 11/Feb/13 ]
Since build 148 actually.

http://dashboard.hq.couchbase.com/litmus/dashboard/
(search for reb-vperf)

+ another summary

https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0

In some cases we observe up to 20% improvement, but basically we just fixed MB-7676 regression.
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
Please point me to the logs and master events from these runs.
Comment by Pavel Paulau [ 11/Feb/13 ]
For instance, rebalance-in 60M items.

4DDOCS 2.0.0-1976:
http://172.23.96.10:8080/job/apollo-views/26/artifact/

4DDOCS 2.0.1-148:
http://172.23.96.10:8080/job/apollo-views/25/artifact/

1DDOC 2.0.0-1976:
http://172.23.96.10:8080/job/apollo-views/23/artifact/

1DDOC 2.0.1-148:
http://172.23.96.10:8080/job/apollo-views/22/artifact/
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
May I have all of them and not just "for instance" :) ?
Comment by Pavel Paulau [ 11/Feb/13 ]
Like 30 logs?
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
Sure. If there's some way for me to find all of them, feel free to explain it to me.
Comment by Pavel Paulau [ 11/Feb/13 ]
Well, I get these logs from 2 jobs:
http://172.23.96.10:8080/job/apollo-views/
http://qa.hq.northscale.net/view/eperf-parents/job/vesta-views/

You can check the parameters and artifacts of every build. Or you can wait, so that later I can grab all of them and upload a single archive.
This is something I'd not recommend doing manually.
Comment by Aleksey Kondratenko [ 11/Feb/13 ]
Unfortunately I cannot make sense of the stuff you linked to above.

I will not necessarily need all the logs. I'll start by looking at |2.0.1-153 IN 60M 1 x 3 YES 19806 5.50| and its corresponding 2.0.0 events. But I may find other stuff useful, and thus I may need the logs of any test run.
Comment by Pavel Paulau [ 12/Feb/13 ]
I added a *logs* column to that spreadsheet. In some cases there are no master events in the build artifacts.
Comment by Aleksey Kondratenko [ 12/Feb/13 ]
Thanks a lot. I'll take a look
Comment by Aleksey Kondratenko [ 12/Feb/13 ]
See master_events.log.svg, generated from the master events of the rebalance (cut-and-pasted from the spreadsheet):

2.0.1-153 IN 60M 1 x 3 YES 19806 5.5 no replica index http://172.23.96.10:8080/job/apollo-views/28/artifact/

The empty intervals (I've confirmed this from the logs) represent view compactions. Surprisingly, they eat about half of all the time, at least judging by the picture.

As a workaround I'd like you to try the same setup with an increased value for the number of vbucket moves allowed before a forced view compaction: http://i.imgur.com/T38iPCS.png The internal setting for this variable is named rebalanceMovesBeforeCompaction.

Let's try bumping it up a lot, say to 256, and see if that helps.
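
A minimal sketch of that workaround, assuming the standard admin REST port 8091 and placeholder host/credentials (the exact wget form of this POST is spelled out in a later comment):

    # Bump rebalanceMovesBeforeCompaction so view compaction is forced less
    # often during rebalance, trading extra disk usage by views for speed.
    # Host, port and credentials below are placeholders.
    curl -u Administrator:password \
         -d 'rebalanceMovesBeforeCompaction=256' \
         http://10.6.2.37:8091/internalSettings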

cc-ing Filipe, in case he wants to add anything.
Comment by Pavel Paulau [ 13/Feb/13 ]
Moving it to 2.1, since we see noticeable improvements in system tests and even the performance tests demonstrate some positive changes. Also, we don't expect any particular fix from Filipe and Alk for 2.0.1 and 2.0.2.

So it's not fast enough, but obviously this bug is not a 2.0.1/2.0.2 blocker anymore.
Comment by Aleksey Kondratenko [ 13/Feb/13 ]
Disagree with 2.0.2 a bit. There may still be reasonably low-hanging fruit left. I.e., currently it appears we're spending a bit too much time on view compaction. We can at least trade higher disk space usage during rebalance for a faster rebalance time at this moment.
Comment by Pavel Paulau [ 15/Feb/13 ]
I've tried 2.0.1-153 with rebalanceMovesBeforeCompaction=256. Brief comparison:
-- rebalance time 3h vs 5.5h with default settings and 5.4h with 2.0.0-1976
-- peak disk size is 415GB vs 366GB with default settings and 613GB with 2.0.0-1976

Build artifacts:
http://172.23.96.10:8080/job/apollo-views/28/artifact/

PDF Report:
http://172.23.96.10:8080/job/apollo-graph-loop/25/artifact/reb-vperf-60M-in-1d-256rmbc.loop_2.0.1-153-rel-enterprise_2.0.1-153-rel-enterprise_APPOLO_Feb-14-2013_19%3A52%3A00.pdf
Comment by Aleksey Kondratenko [ 15/Feb/13 ]
415 vs 366 includes data; the difference in views alone is apparently quite a lot bigger.

May I ask you to do a few runs with 32, 64 and 128?

It appears that we should consider bumping that value for a great gain in rebalance time, but it remains to be seen which value works best. That's going to be a trade-off between disk space usage and rebalance time.
Comment by Pavel Paulau [ 18/Feb/13 ]
No problem, will do.
Comment by Pavel Paulau [ 20/Feb/13 ]
I observe very strange results in recent builds. For instance, the "60M + 1 ddoc + load" case is 2x slower when compared with 2.0. See the spreadsheet for details (rows 16-18):
https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0

Alk, do you have any idea why it happens?
Comment by Pavel Paulau [ 20/Feb/13 ]
+ chart for rebalance progress 153 vs 160.
Comment by Ketaki Gangal [ 20/Feb/13 ]
Behaviour similar to http://www.couchbase.com/issues/browse/MB-7785
Comment by Aleksey Kondratenko [ 20/Feb/13 ]
Just spotted Pavel's message.

May I have the manifest of build 160, where we supposedly jumped backwards about 2x perf-wise?
Comment by Pavel Paulau [ 20/Feb/13 ]
Ok, the re-run of 160 was just 1.2x slower.

But it's still hard to tell whether it's environment variation or something software-specific.
Comment by Jin Lim [ 26/Feb/13 ]
From NS SRV team (Alk):

wget --post-data='rebalanceMovesBeforeCompaction=256' --user=Administrator --password=asdasd http://lh:9000/internalSettings

So that's the usual POST to the usual /internalSettings API, with the field name rebalanceMovesBeforeCompaction.
The default value is 16, and that's how many in- or out-going vbuckets will be moved per node before pausing all moves, triggering view compaction and awaiting its completion.
View compaction is forbidden during vbucket moves, so larger values mean less frequent view compactions. That means less time "spent" on compaction, but it also means greater disk usage by views.
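
A small companion sketch, assuming the standard admin REST port 8091 and placeholder host/credentials, for reading the current value back and restoring the default once the experiment is done (the GET on /internalSettings is expected to return the current internal settings as JSON, including rebalanceMovesBeforeCompaction):

    # Check the current value (returned as part of the internal settings JSON).
    # Host, port and credentials are placeholders.
    curl -u Administrator:password http://10.6.2.37:8091/internalSettings

    # Restore the documented default of 16 after the experiment.
    curl -u Administrator:password \
         -d 'rebalanceMovesBeforeCompaction=16' \
         http://10.6.2.37:8091/internalSettings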
Comment by Dipti Borkar [ 28/Feb/13 ]
For this PCI sprint, we need to find the best default and better understand the impact of changing this parameter. Pavel to drive.
Comment by Thuan Nguyen [ 07/Mar/13 ]
Integrated in ui-testing #11 (See [http://qa.hq.northscale.net/job/ui-testing/11/])
    MB-6726: add config with rebalance_moves_before_compaction=256 (Revision 7fc2baf69f359946f978d6d59f0b7f0458805ac1)
MB-6726: add config with tuned reb. moves before comp. (Revision 043ce1685781d5b7fa207ffc265698cd62db906d)

     Result = SUCCESS
pavelpaulau :
Files :
* conf/perf/reb-vperf-60M-in-1d-256rmbc.conf

pavelpaulau :
Files :
* conf/perf/reb-vperf-60M-in-1d-128rmbc.conf
* conf/perf/reb-vperf-60M-in-1d-32rmbc.conf
* conf/perf/reb-vperf-60M-in-1d-64rmbc.conf
Comment by Pavel Paulau [ 11/Mar/13 ]
I've created a separate task for rebalanceMovesBeforeCompaction.
Comment by Maria McDuff [ 25/Mar/13 ]
Deferred out of the 2.0.2 release.
Aliaksey and team need to investigate further.
Comment by Maria McDuff [ 11/Jul/13 ]
Tony,

Please refer to MB-8499. Pavel has verified the recommended number of replicators (now 64). Please verify whether you are still seeing the slowness in rebalance when indexing/compaction is happening.
Comment by Pavel Paulau [ 12/Jul/13 ]
Please note that it's still slow, 10-15x slower than the same setup without views.

Also, what do you mean by "number of replicators"? I verified the impact of rebalanceMovesBeforeCompaction.
Comment by Maria McDuff [ 01/Nov/13 ]
Re-open if this still exists in 2.5.