[MB-7147] swap rebalance slowness Created: 09/Nov/12  Updated: 03/Jan/13  Resolved: 21/Nov/12

Status: Resolved
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta-2
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Ronnie Sun (Inactive) Assignee: Ronnie Sun (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 10.2.1.58-11142012-1140-diag.zip     Zip Archive 10.2.1.61-11142012-1139-diag.zip     Zip Archive 10.2.1.63-11142012-1142-diag.zip     Zip Archive info_168.zip     Zip Archive info_169.zip     Zip Archive info.zip     Zip Archive ns-diag-168.txt.zip     PDF File reb-swap-6-1.loop_1.8.1-938-rel-enterprise_2.0.0-1954-rel-enterprise_orange_Nov-21-2012_10-07-08.pdf    

 Description   
____________________
From: Ronnie Sun
Sent: Friday, November 09, 2012 6:34 PM
To: Steve Yen
Subject: Re: swap rebalance perf?

Hi Steve,

Got some preliminary results showing 2.0 is 10x slower, which falls into the same area we've seen for reb-litmus tests.

1.8.1-938: 203 seconds
2.0.0-1939: 2983 seconds

Spec: (reb-litmus-swap-2) 3M mixed, start with 2 nodes, swap 1 node, with 9 clients firing 1k ops each.

The current Jenkins automation for this has some problems, so I don't have comparison graphs.

Thanks,
Ronnie


 Comments   
Comment by Steve Yen [ 09/Nov/12 ]
Adding affected version and fix version values so this will show up in the right filters for the 2.0 bug scrub.

Also, changing the priority from Major; if it's a Blocker, we'll raise it in the next bug-scrub meeting.
Comment by Chiyoung Seo [ 10/Nov/12 ]
Ronnie,

I need cbcollect_info output from the newly added nodes.
Comment by Ronnie Sun (Inactive) [ 10/Nov/12 ]
Chiyoung,

I repeated the test on VMs; this time the rebalance took 1 hour.

I left the cluster up for you: 10.2.2.168.

Note that the IP addresses shown in the UI are internal IPs.

Here are the mappings, FYI:

10.2.2.167 : 192.168.0.20 (reb out)
10.2.2.168 : 192.168.0.21 (new node)
10.2.2.169 : 192.168.0.22

Comment by Ronnie Sun (Inactive) [ 10/Nov/12 ]
btw: info.zip is from 10.2.2.168
Comment by Chiyoung Seo [ 12/Nov/12 ]
The following are the timings of vbucket checkpoint persistence from the newly added node:

 chk_persistence_cmd (1336 total)
    16ms - 32ms : ( 0.07%) 1
    32ms - 65ms : ( 0.15%) 1
    65ms - 131ms : ( 0.37%) 3
    131ms - 262ms : ( 0.97%) 8
    262ms - 524ms : ( 13.70%) 170 #################
    524ms - 1s : ( 34.66%) 280 #############################
    1s - 2s : ( 53.22%) 248 ##########################
    2s - 4s : ( 68.49%) 204 #####################
    4s - 8s : ( 78.89%) 139 ##############
    8s - 16s : ( 95.66%) 224 #######################
    16s - 33s : (100.00%) 58 ######

From the above timings, we can easily see they took more than 50 minutes overall.
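
(A quick back-of-the-envelope check of that figure, summing each histogram bucket at its midpoint; this is just a sanity-check sketch in Python, not part of cbstats:)

# Rough cumulative time for the chk_persistence_cmd histogram above,
# weighting each bucket by its midpoint in seconds.
buckets = [
    (0.016, 0.032, 1), (0.032, 0.065, 1), (0.065, 0.131, 3),
    (0.131, 0.262, 8), (0.262, 0.524, 170), (0.524, 1, 280),
    (1, 2, 248), (2, 4, 204), (4, 8, 139), (8, 16, 224), (16, 33, 58),
]
total_s = sum((lo + hi) / 2 * n for lo, hi, n in buckets)
print(f"{sum(n for _, _, n in buckets)} commands, ~{total_s / 60:.0f} minutes cumulative")
# -> 1336 commands, ~103 minutes cumulative (well over 50 minutes)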
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: chiyoung still diagnosing underlying issue
Comment by Chiyoung Seo [ 12/Nov/12 ]
Alk,

I have a quick question on checkpoint persistence. I saw that ep-engine sometimes receives two checkpoint_persistence commands for the same vbucket even though it doesn't send any timeout (tmp_fail) response to the ebucketmigrator.

For example, these are the timings from the newly added node during a swap rebalance in a two-node cluster:

cbstats 10.2.2.168:11210 timings
 chk_persistence_cmd (1354 total)
    32ms - 65ms : ( 0.22%) 3
    65ms - 131ms : ( 0.59%) 5
    131ms - 262ms : ( 1.92%) 18 #
    262ms - 524ms : ( 14.03%) 164 ##############
    524ms - 1s : ( 35.08%) 285 #########################
    1s - 2s : ( 50.89%) 214 ###################
    2s - 4s : ( 71.34%) 277 #########################
    4s - 8s : ( 88.04%) 226 ####################
    8s - 16s : ( 97.71%) 131 ###########
    16s - 33s : (100.00%) 31 ##


There were no timeout responses to the ebucketmigrator. In this case, I would expect 1024 checkpoint_persistence commands (512 for active and 512 for replica vbuckets).

Please let me know if this is still fine and reassign it back to me.
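
(For reference, the expected count is straightforward arithmetic, assuming the default 1024 vbuckets and one replica, so the incoming node takes 512 active plus 512 replica vbuckets; a tiny illustrative Python check:)

# Expected checkpoint_persistence commands on the newly added node, assuming
# 1024 vbuckets total and 1 replica: 512 active + 512 replica, one wait each.
num_vbuckets = 1024
expected = num_vbuckets // 2 + num_vbuckets // 2   # 512 + 512 = 1024
observed = 1354                                    # from the cbstats output above
print(expected, observed, observed - expected)     # 1024 1354 -> 330 extra commands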
Comment by Chiyoung Seo [ 12/Nov/12 ]
An example of such duplicate checkpoint_persistence requests is for vbucket 37 on 10.2.2.168:

memcached<0.4538.0>: Mon Nov 12 12:52:59.092677 PST 3: Notified the completion of checkpoint persistence for vbucket 37, cookie 0x5c62000
...
memcached<0.4538.0>: Mon Nov 12 12:53:01.096198 PST 3: Notified the completion of checkpoint persistence for vbucket 37, cookie 0x5c63600
Comment by Ronnie Sun (Inactive) [ 12/Nov/12 ]
Collected info for Chiyoung's comments.
Comment by Chiyoung Seo [ 13/Nov/12 ]
Reassigned it to me, as I was told that it's expected behavior.
Comment by Aleksey Kondratenko [ 13/Nov/12 ]
As part of the meeting on understanding views and rebalance, we decided Ronnie will run the same rebalance but with no mutations, and the same rebalance but with rebalance_index_waiting_disabled set to false, to see if that 10x is due to the need to persist deltas from ongoing mutations.
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
Ronnie just posted on some internal forum that without mutations we're still 4x slower. I'm still waiting for newer diags.

I also lack information about the environment. It looks like this was run on VMware, but details about the environment are missing.
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
Here's related messages about vbucket 37:

[rebalance:debug,2012-11-12T12:52:58.562,ns_1@192.168.0.21:<0.7340.0>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 2 in vbucket 37
[ns_server:info,2012-11-12T12:52:58.572,ns_1@192.168.0.21:ns_port_memcached<0.4538.0>:ns_port_server:log:171]memcached<0.4538.0>: Mon Nov 12 12:52:58.371396 PST 3: TAP (Consumer) eq_tapq:anon_137 - disconnected
memcached<0.4538.0>: Mon Nov 12 12:52:58.453295 PST 3: TAP (Consumer) eq_tapq:anon_138 - Reset vbucket 37 was completed succecssfully.

memcached<0.4538.0>: Mon Nov 12 12:52:59.092677 PST 3: Notified the completion of checkpoint persistence for vbucket 37, cookie 0x5c62000

I.e., we can see our initial wait for the checkpoint on vbucket 37 completes in about 1 second. That's where we expect the bulk of the items to get persisted. After that we create another checkpoint on the source and wait for its persistence. It's needed for view consistency. In this case, even though we don't have any views, we still do that second checkpoint. But we always expect that second checkpoint to be persisted very quickly because it'll have only a few items to persist.

Here's relevant messages:

[rebalance:debug,2012-11-12T12:53:00.225,ns_1@192.168.0.21:<0.7363.0>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 3 in vbucket 37

(hm, we see about a 1-second delay from ep-engine's log message to ns_server proceeding to the next step, which is a potential problem in ns_server or memcached)

[ns_server:info,2012-11-12T12:53:01.296,ns_1@192.168.0.21:ns_port_memcached<0.4538.0>:ns_port_server:log:171]memcached<0.4538.0>: Mon Nov 12 12:53:01.096198 PST 3: Notified the completion of checkpoint persistence for vbucket 37, cookie

We see another 1-second delay. Ronnie mentioned (not on the ticket, as usual) that the load is about 9k ops per second, which per vbucket is about 9 items per second. So that's at most a few tens of items that we needed to persist for that second "delta" checkpoint.
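
(To make the sequence described above explicit, here is a minimal illustrative sketch in Python; the step names are hypothetical, not the actual ns_server/ep-engine API, and the delta-checkpoint arithmetic is just the ~9k ops/s figure spread over 1024 vbuckets:)

# Illustrative per-vbucket takeover sequence as described above.
# Step names are hypothetical; the real logic lives in ns_server's
# janitor_agent and in ep-engine.
def move_vbucket(vb, ops_per_sec=9000, num_vbuckets=1024):
    steps = [
        "backfill bulk data over TAP",
        "wait for persistence of checkpoint 2 (bulk of the items, ~1 s above)",
        "create a new checkpoint on the source",
        "wait for persistence of checkpoint 3 (the small 'delta' checkpoint)",
        "complete takeover",
    ]
    # The delta checkpoint only holds mutations that arrived in the meantime:
    delta_rate = ops_per_sec / num_vbuckets   # ~9 items/s per vbucket
    print(f"vb {vb}: ~{delta_rate:.0f} items/s trickle into the delta checkpoint")
    for step in steps:
        print(f"vb {vb}: {step}")

move_vbucket(37)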
Comment by Ronnie Sun (Inactive) [ 14/Nov/12 ]
Repeated the test (physical cluster); it took ~840 seconds.

cbcollect_info stats attached (including diags).

Starting rebalance, KeepNodes = ['ns_1@10.2.1.63','ns_1@10.2.1.58'], EjectNodes = ['ns_1@10.2.1.61']
Comment by Chiyoung Seo [ 14/Nov/12 ]
I think this is a design defect in the consistent view. I don't want to adapt the ep-engine flusher any more at this time. I think we really should revisit the overall design of consistent views.

For a safer and more reliable workaround, we should disable the consistent view during rebalance if there are no views defined.
Comment by Ronnie Sun (Inactive) [ 14/Nov/12 ]
And the test with index_waiting disabled took 2580 seconds.

Comment by Aleksey Kondratenko [ 14/Nov/12 ]
Actually, there is something interesting and non-flusher-related in the timestamps from the latest diags (without ongoing mutations).

E.g.:

[rebalance:debug,2012-11-14T11:10:17.702,ns_1@10.2.1.58:<0.17937.1>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 2 in vbucket 511
[ns_server:info,2012-11-14T11:10:17.899,ns_1@10.2.1.58:ns_port_memcached<0.2008.0>:ns_port_server:log:171]memcached<0.2008.0>: Wed Nov 14 11:10:17.698884 PST 3: TAP (Consumer) eq_tapq:anon_1537 - Reset vbucket 511 was completed succecssfully.

[ns_server:debug,2012-11-14T11:10:17.951,ns_1@10.2.1.58:janitor_agent-default<0.2070.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.17937.1> (ok)
[ns_server:info,2012-11-14T11:10:18.151,ns_1@10.2.1.58:ns_port_memcached<0.2008.0>:ns_port_server:log:171]memcached<0.2008.0>: Wed Nov 14 11:10:17.951680 PST 3: Notified the completion of checkpoint persistence for vbucket 511, cookie 0x13fef080

[ns_server:info,2012-11-14T11:10:18.656,ns_1@10.2.1.58:ns_port_memcached<0.2008.0>:ns_port_server:log:171]memcached<0.2008.0>: Wed Nov 14 11:10:18.455993 PST 3: Schedule cleanup of "eq_tapq:anon_1536"

[rebalance:debug,2012-11-14T11:10:19.365,ns_1@10.2.1.58:<0.17944.1>:janitor_agent:handle_call:651]Going to wait for persistence of checkpoint 2 in vbucket 511
[ns_server:debug,2012-11-14T11:10:19.366,ns_1@10.2.1.58:janitor_agent-default<0.2070.0>:janitor_agent:handle_info:682]Got done message from subprocess: <0.17944.1> (ok)

We see that we actually get a reply quite quickly from ep-engine in both invocations. The second is especially quick, as it's a NOP in this case. But there's some weird 1-second delay in between those two calls. The only significant activity in between is a stats request for "checkpoint <vbucket-id>" and possibly a create_new_checkpoint request.
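
(One way to quantify those gaps across a whole diag is to diff the timestamps of successive "Going to wait for persistence" lines per vbucket; a hedged Python sketch, with the log-line format taken from the excerpts above and a hypothetical file name:)

# Measure gaps between successive "Going to wait for persistence" lines per
# vbucket in an ns_server debug log. Line format is taken from the excerpts
# above; the log file name is hypothetical.
import re
from collections import defaultdict
from datetime import datetime

pattern = re.compile(
    r"\[rebalance:debug,([0-9T:.\-]+),"
    r".*Going to wait for persistence of checkpoint \d+ in vbucket (\d+)"
)

waits = defaultdict(list)
with open("ns_server.debug.log") as f:            # hypothetical path in the diag
    for line in f:
        m = pattern.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            waits[int(m.group(2))].append(ts)

for vb, stamps in sorted(waits.items()):
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    if gaps:
        print(vb, ["%.2fs" % g for g in gaps])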
Comment by Chiyoung Seo [ 14/Nov/12 ]
Steve,

From now on, I won't work on any rebalance slowness issues due to consistent views. As I mentioned, depending on persistence for view updates is NOT a scalable approach.

Initially, I suggested streaming incoming mutations or takeover items to the indexer through TAP without waiting for persistence, but heard from someone that it wouldn't make any difference compared with the current approach. However, I recently showed that the current approach is very slow due to disk slowness and the fsync per vbucket.

I don't expect that we will change the current architecture post 2.0, and am not interested in giving any suggestions to people who are so stubborn.

Again, my suggestion is to disable the consistent view if there are no views defined.

Please assign it to people who are willing to work on this issue.

Post 2.0, I will remove this prioritization because it's not a good approach, but rather a workaround. I will send a notice to the cluster and XDCR teams after 2.0.
Comment by Steve Yen [ 15/Nov/12 ]
Chiyoung made more changes to ep-engine flusher priority with a toy build. 4x faster. But a more aggressive flusher can likely lead to more starvation of writes.
Comment by Steve Yen [ 15/Nov/12 ]
Ronnie, reassigning to you: please rerun the tests with Chiyoung's toy build, and help cover for Pavel while he's traveling.

From Chiyoung's email...

Ronnie, Pavel,

I made more changes in ep-engine so that the flusher can work on persisting high-priority vbuckets much more aggressively. In my tests, I observed a ~4x faster rebalance.

However, my main concern with this new change is that it might cause starvation of flushing for regular vbuckets and consequently grow the number of dirty items on those regular vbuckets (i.e., the disk write queue might grow a lot, especially in XDCR tests, because we prioritize flushing 32 vbuckets once every 30 minutes).

You can download the toy build from

http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-chiyoung-x86_64_2.0.0-2002-toy.rpm

Ronnie, can you please test it with the swap rebalance?

Pavel, can you test it with the XDCR?

Thanks,
Chiyoung
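
(The trade-off in that email is the usual strict-priority starvation problem; a minimal, hypothetical Python sketch of the policy, not the actual ep-engine flusher code:)

# Hypothetical sketch of the aggressive-flusher policy described above:
# always drain high-priority (takeover) vbuckets first, which can starve
# regular vbuckets and let their dirty-item backlog (disk write queue) grow.
from collections import deque

def flush_batch(high, regular, batch=100):
    """Strict priority: regular vbuckets are flushed only when 'high' is empty."""
    flushed = 0
    while high and flushed < batch:
        high.popleft(); flushed += 1
    while regular and flushed < batch:
        regular.popleft(); flushed += 1
    return flushed

high = deque(range(100))      # dirty items on takeover vbuckets
regular = deque(range(500))   # dirty items on regular vbuckets
flush_batch(high, regular)
print(len(high), len(regular))   # 0 500 -- regular queue untouched this round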
Comment by Steve Yen [ 16/Nov/12 ]
Priority to Critical, per bug-scrub, but we still want larger-scale test runs (more nodes and more items).

Ronnie,
please send a test case to Tony for a larger-scale retest, and please reassign.
Thanks,
Steve
Comment by Ronnie Sun (Inactive) [ 16/Nov/12 ]
test 1:

- 8 nodes, swap 1
- 20M items.
- 30 clients with 500 ops per second (set:get = 50:50)

test 2:
- 8 nodes, swap 3
- 20M items.
- 30 clients with 500 ops per second
Comment by Steve Yen [ 19/Nov/12 ]
See also: http://www.couchbase.com/issues/browse/CBD-682
Comment by Chisheng Hong (Inactive) [ 19/Nov/12 ]
[global]
#username:root
username:root
password:couchbase
port:8091

data_path=/data

[servers]
1:10.6.2.37
2:10.6.2.38
3:10.6.2.39
4:10.6.2.40
5:10.6.2.42
6:10.6.2.43
7:10.6.2.44
8:10.6.2.45

#9:10.3.121.24
#10:10.3.121.25
#11:10.3.121.26
#12:10.3.121.27

[membase]
rest_username:Administrator
rest_password:password


This is the cluster information for the orange cluster. Ronnie, you can run the tests on it.
Comment by Ronnie Sun (Inactive) [ 19/Nov/12 ]
Due to cluster limitations, changing the test plan to:

1) 4 nodes, swap 3.

2) 6 nodes, swap 1.
Comment by Steve Yen [ 20/Nov/12 ]
bug-scrub - moved to 2.0.1
Comment by Ronnie Sun (Inactive) [ 20/Nov/12 ]
Early results for swap rebalance with a larger number of nodes (swap 3 nodes out of a 4-node cluster).

The no-foreground-load case looks promising.

1.8.1-938: 988.58 seconds.
2.0.0-1954: 974.67 seconds.
Comment by Ronnie Sun (Inactive) [ 21/Nov/12 ]
4-3 swap rebalance:

without foreground load (1x, identical):

1.8.1-938: 988.58 seconds.
2.0.0-1954: 974.67 seconds.

with ~10k ops/s foreground load (7x slower):

1.8.1-938: 222 sec
2.0.0-1954: 1723 sec

6-1 swap rebalance:

with ~10k ops/s foreground load (1x, identical):
1.8.1-938: 639.52 sec
2.0.0-1954: 654.22 sec
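
(The multipliers quoted above follow straight from the raw times; a quick Python cross-check:)

# Cross-check of the slowdown factors quoted above (2.0 time / 1.8.1 time).
results = {
    "4-3 swap, no load":   (988.58, 974.67),
    "4-3 swap, ~10k load": (222.0, 1723.0),
    "6-1 swap, ~10k load": (639.52, 654.22),
}
for name, (t_181, t_200) in results.items():
    print(f"{name}: {t_200 / t_181:.1f}x")   # 1.0x, 7.8x, 1.0x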
Comment by Ronnie Sun (Inactive) [ 21/Nov/12 ]
Based on 6-1 results (started with 6 nodes, swap 1 node):

2.0 has a similar rebalance time to 1.8.1, while latencies are ~30% worse.


Comment by Steve Yen [ 21/Nov/12 ]
https://www.yammer.com/couchbase.com/#/Threads/show?threadId=236527848

Steve Yen: @Aliaksey Kandratsenka, @Chiyoung Seo

"4-3 swap rebalance" == 4 nodes initially and swap 3 of them.

"6-1 swap rebalance" == 6 nodes initially and swap 1 of them.

@Ronnie Sun also reports latencies were worse in 2.0 during swap rebalance.

This seems to indicate that we don't need @Chiyoung Seo's patch that prioritizes the vbucket takeover even more.

This was on the system test cluster (Xen VMs) with SSDs, a key-value mixed workload (no views), with consistent views enabled.
Comment by Filipe Manana [ 22/Nov/12 ]
No views defined, so updating component list.