[MB-7523] rebalance performance regression in 2.0.1 vs 2.0.0 (apparently only under load) Created: 11/Jan/13 Updated: 18/Jan/13 Resolved: 18/Jan/13 |
|
| Status: | Resolved |
| Project: | Couchbase Server |
| Component/s: | ns_server |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Security Level: | Public |
| Type: | Bug | Priority: | Major |
| Reporter: | Ronnie Sun | Assignee: | Aleksey Kondratenko |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | linux | ||
| Description |
|
Recent benchmarks: http://dashboard.hq.couchbase.com/litmus/dashboard/
Summary:
Rebalance-in (2-4 nodes), 7M items, mixed workload:
  ec2: 2.0.0-1976 (RTM) took 1687 sec while 2.0.1-116 took 3581 sec
  thor (data center physical machines): 2.0.0-1976 (RTM) took 901 sec while 2.0.1-123 took 1553 sec
Rebalance-out (4-2 nodes), 7M items, mixed workload:
  ec2: 2.0.0-1976 (RTM) took 2854 sec while 2.0.1-116 took 5075 sec
  thor (data center physical machines): 2.0.0-1976 (RTM) took 1142 sec while 2.0.1-123 took 2773 sec |
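To put the timings in the summary on a common scale, here is a minimal sketch that computes the slowdown factor (2.0.1 time divided by 2.0.0 time) for each run. The names `RUNS`, `slowdown` and `report` are illustrative only and not part of any Couchbase tooling. Note that the ec2 and thor runs used different 2.0.1 builds (116 and 123), so the ratios are indicative rather than directly comparable.

```python
# Timings in seconds, copied from the summary in this ticket.
# Each entry: (scenario, environment, 2.0.0 seconds, 2.0.1 seconds)
RUNS = [
    ("rebalance-in", "ec2", 1687, 3581),
    ("rebalance-in", "thor", 901, 1553),
    ("rebalance-out", "ec2", 2854, 5075),
    ("rebalance-out", "thor", 1142, 2773),
]


def slowdown(baseline_sec, candidate_sec):
    """Return how many times longer the candidate build took than the baseline."""
    return candidate_sec / baseline_sec


def report(runs):
    """Return (scenario, environment, slowdown rounded to 2 decimals) per run."""
    return [
        (scenario, env, round(slowdown(base, cand), 2))
        for scenario, env, base, cand in runs
    ]


if __name__ == "__main__":
    for scenario, env, ratio in report(RUNS):
        print(f"{scenario:14s} {env:5s} {ratio:.2f}x")
```

On these numbers every run takes roughly 1.7x to 2.4x longer on 2.0.1, with the largest factor on thor rebalance-out.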
| Comments |
| Comment by Ronnie Sun [ 11/Jan/13 ] |
|
diags:
thor (data center machines), build 2.0.1-123-rel:
  reb-1: http://172.23.96.10:8080/job/thor-parent/112/
  reb-1-out: http://172.23.96.10:8080/job/thor-parent/110/ |
| Comment by Pavel Paulau [ 14/Jan/13 ] |
|
+ summary of tests with views:
https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0 |
| Comment by Aleksey Kondratenko [ 14/Jan/13 ] |
| Folks, I still see no results from 2.0.0 with +A. May I insist on having some? |
| Comment by Aleksey Kondratenko [ 14/Jan/13 ] |
| Alternatively we can try 2.0.1 _without_ +A |
| Comment by Aleksey Kondratenko [ 14/Jan/13 ] |
|
Did some runs comparing 2.0.0 and latest branch-2.0.1 and I've found 2.0.1 to be _faster_.
All my data fits in the page cache, and I didn't send any mutations during rebalance. I also restricted all nodes to a single (shared) CPU core. I'll rerun with each node bound to a different core. |
| Comment by Farshid Ghods [ 15/Jan/13 ] |
| assigning this back to Ronnie as Alk is expecting results for 2.0.0 with +A |
| Comment by Pavel Paulau [ 15/Jan/13 ] |
|
Hi Aleksey,
I added the first results for reb-out with +A; they show a >1.5x regression (see link above). I'm gathering more results, but it takes time. Anyway, I believe KV results with +A will be even more helpful. |
| Comment by Aleksey Kondratenko [ 15/Jan/13 ] |
| No. Let's have a _separate_ bug for rebalance with views. |
| Comment by Aleksey Kondratenko [ 15/Jan/13 ] |
|
Given that I've tried rebalance myself and saw no regression (in fact, a speedup) in 2.0.1, I think we can start assuming that the problem only occurs if rebalance is performed under load.
BTW, I've also personally confirmed that Erlang core performance has not regressed in 2.0.1 (i.e. we have added -fno-strict-aliasing to CFLAGS). |
| Comment by Aleksey Kondratenko [ 18/Jan/13 ] |
|
Fix is merged as part of the chain ending at: http://review.couchbase.org/24067
I've found that my original plan, to try ending the serial phase of a vbucket move at the end of backfill, helps a lot. |