[MB-7523] rebalance performance regression in 2.0.1 vs 2.0.0 (apparently only under load) Created: 11/Jan/13  Updated: 30/Jun/14  Resolved: 18/Jan/13

Status: Closed
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0.1
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Major
Reporter: Ronnie Sun (Inactive) Assignee: Aleksey Kondratenko
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: linux


 Description   
Recent benchmarks: http://dashboard.hq.couchbase.com/litmus/dashboard/

Summary:

Rebalance-in (2-4 nodes), 7M items, mixed workload:

ec2: 2.0.0-1976 (RTM) took 1687 sec while 2.0.1-116 took 3581 sec
thor (data center physical machines): 2.0.0-1976 (RTM) took 901 sec while 2.0.1-123 took 1553 sec

Rebalance-out (4-2 nodes), 7M items, mixed workload:

ec2: 2.0.0-1976 (RTM) took 2854 sec while 2.0.1-116 took 5075 sec
thor (data center physical machines): 2.0.0-1976 (RTM) took 1142 sec while 2.0.1-123 took 2773 sec
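
To make the size of the regression easier to compare across runs, the durations above can be turned into slowdown factors. This is only a small sketch: the dictionary layout and the function names `slowdown` and `slowdown_table` are illustrative and not part of any test harness. The numbers are copied from the summary, and the 2.0.1 builds differ between environments (116 on ec2, 123 on thor).

```python
# Rebalance durations in seconds, copied from the summary above.
# Keys are (scenario, environment); values are (2.0.0 time, 2.0.1 time).
# The layout and names here are illustrative assumptions.
DURATIONS = {
    ("rebalance-in", "ec2"): (1687, 3581),
    ("rebalance-in", "thor"): (901, 1553),
    ("rebalance-out", "ec2"): (2854, 5075),
    ("rebalance-out", "thor"): (1142, 2773),
}


def slowdown(baseline_sec, candidate_sec):
    """Return how many times longer the candidate build took than the baseline."""
    return candidate_sec / baseline_sec


def slowdown_table(durations):
    """Map each (scenario, environment) pair to its slowdown, rounded to 2 places."""
    return {key: round(slowdown(old, new), 2) for key, (old, new) in durations.items()}


if __name__ == "__main__":
    for (scenario, env), factor in slowdown_table(DURATIONS).items():
        print(f"{scenario:13} {env:4} {factor:.2f}x slower")
```

On these numbers 2.0.1 is roughly 1.7x to 2.4x slower than 2.0.0, with the largest gap on rebalance-out on thor.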



 Comments   
Comment by Ronnie Sun (Inactive) [ 11/Jan/13 ]
diags:

thor (data center machines), build 2.0.1-123-rel:

reb-1: http://172.23.96.10:8080/job/thor-parent/112/

reb-1-out: http://172.23.96.10:8080/job/thor-parent/110/
Comment by Pavel Paulau [ 14/Jan/13 ]
+ summary of test results with views:

https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0
Comment by Aleksey Kondratenko [ 14/Jan/13 ]
Folks, I still see no results from 2.0.0 with +A. May I insist on having some?
Comment by Aleksey Kondratenko [ 14/Jan/13 ]
Alternatively, we can try 2.0.1 _without_ +A.
Comment by Aleksey Kondratenko [ 14/Jan/13 ]
I did some runs comparing 2.0.0 and the latest branch-2.0.1 and found 2.0.1 to be _faster_.

All my data fits in the page cache, and I didn't send any mutations during rebalance. I also restricted all nodes to the same single CPU core. I'll rerun with each node bound to a different core.
Comment by Farshid Ghods (Inactive) [ 15/Jan/13 ]
Assigning this back to Ronnie, as Alk is expecting results for 2.0.0 with +A.
Comment by Pavel Paulau [ 15/Jan/13 ]
Hi Aleksey,

I added the first results for reb-out with +A; they show a >1.5x regression (see link above). I'm gathering more results, but it takes time.

Anyway, I believe KV results with +A will be even more helpful.
Comment by Aleksey Kondratenko [ 15/Jan/13 ]
No. Let's have a _separate_ bug for rebalance with views.
Comment by Aleksey Kondratenko [ 15/Jan/13 ]
Given that I've tried rebalance myself and saw no regression (in fact, a speedup) in 2.0.1, I think we can start assuming that the problem only occurs if rebalance is performed under load.

BTW, I've also confirmed personally that Erlang core performance has not regressed in 2.0.1 (i.e. we have added -fno-strict-aliasing to CFLAGS).
Comment by Aleksey Kondratenko [ 18/Jan/13 ]
The fix is merged as part of the chain ending at: http://review.couchbase.org/24067

I've found that my original plan, to try to end the serial phase of a vbucket move at the end of backfill, helps a lot.
Generated at Fri Sep 19 14:01:29 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.