[MB-6232] ep-engine needs 1.5 minutes to create 1k vbuckets. Seems too slow (faster with barrier=0) Created: 15/Aug/12  Updated: 28/Jul/14

Status: Reopened
Project: Couchbase Server
Component/s: couchbase-bucket, storage-engine
Affects Version/s: 2.0, 2.0.1, 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Sundar Sridharan
Resolution: Unresolved Votes: 0
Labels: PM-PRIORITIZED, customer, devX, performance, supportability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive debug.1.gz    
Triage: Triaged
Is this a Regression?: Yes

 Description   
See the issue summary above.

Just try a single-node dev cluster and try to load the beer sample (it'll fail, perhaps for a different reason).

You can observe this in the logs:

We're doing set_vbucket_state requests from the janitor (the ns_memcached messages), and these are the first messages from mccouch about the just-created vbuckets (the capi_set_view_manager and mc_connection messages). Note: I've added extra log messages at the beginning and end of sync_notify to see whether it's couchdb's or ns_server's fault. From the timestamps it's clear that it's not.

[ns_server:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
[ns_server:info,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 740 state to active
[views:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/1023. Updated state: active
[ns_server:info,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 739 state to active
[ns_server:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done
[ns_server:info,2012-08-15T9:47:40.310,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 738 state to active
[ns_server:info,2012-08-15T9:47:40.310,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 737 state to active

And the last messages are at:

[ns_server:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
[views:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/1. Updated state: active
[ns_server:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done
[ns_server:debug,2012-08-15T9:49:14.501,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
[views:debug,2012-08-15T9:49:14.502,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/0. Updated state: active
[ns_server:debug,2012-08-15T9:49:14.502,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done


I've just retried after remounting with barrier=0 and it was blazing fast. What this mount option changes is that short fsyncs become nearly instant, because they merely send data to the disk's buffer without waiting until the data actually hits the platter.

So that's evidence of excessive use of fsyncs somewhere during vbucket db file creation, and that's clearly on the ep-engine/couchstore side.
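
For illustration only, here is a minimal timing sketch (hypothetical file names and payload size, not Couchbase code) of what roughly 1k sequential per-file fsyncs cost on a disk where fsync truly waits for the media; at 100-150 IOPS this alone is on the order of 7-10 seconds, and if several fsyncs per vbucket are needed that multiplies accordingly:

#!/usr/bin/env python
# Minimal sketch, not Couchbase code: create N small files with one fsync each
# to show how per-file fsync latency dominates on a spinning disk with barriers on.
import os, tempfile, time

N = 1024                                   # one file per vbucket, for illustration
d = tempfile.mkdtemp(prefix="fsync-demo-")
start = time.time()
for i in range(N):
    fd = os.open(os.path.join(d, "%d.couch.1" % i),
                 os.O_CREAT | os.O_WRONLY, 0o644)
    os.write(fd, b"\0" * 4096)             # token payload, size is arbitrary
    os.fsync(fd)                           # the synchronous IO being measured
    os.close(fd)
print("%d files, one fsync each: %.1f s" % (N, time.time() - start))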

Setting this to blocker as it'll affect people trying to create buckets on real hardware: hard disks and any modern Linux distro (where barriers are on by default).


 Comments   
Comment by Matt Ingenthron [ 20/Aug/12 ]
Note that on MB-6305 this caused quite a bit of trouble. Currently one cannot go through a delete/create bucket cycle in any reliable fashion using public interfaces. Sometime between build 1495 and build 2.0.0c 709, the delete/recreate cycle started taking multiple minutes.

In automated testing some SDKs used to use flush_all(). At the recommendation of engineering, we moved to delete/create. Now using delete/create is intractable. This has broken our testing again, unfortunately.

The larger concern is that our users will probably see this pretty quickly too. What we were doing here isn't all that unusual.
Comment by Aleksey Kondratenko [ 20/Aug/12 ]
Matt, I think you can try a workaround:

mount / -o remount,nobarrier

And if it's not helping in your case, then there's something else that makes it slower than needed.
Comment by Aleksey Kondratenko [ 06/Sep/12 ]
BTW, I recall now having a similar problem pre-DP4. I fixed it by creating all vbuckets in parallel. But that was, of course, when mccouch was at its full power.
Comment by Matt Ingenthron [ 06/Sep/12 ]
@alk, I presume that parallelism should still be in there, so it would seem the issue is further down.

thanks for the additional info
Comment by Tommie McAfee [ 06/Sep/12 ]
Just tried to reproduce this with 2.0 build 1689, and loading the beer sample times out:

[ns_server:error,2012-09-06T10:49:42.436,ns_1@127.0.0.1:<0.752.0>:menelaus_web:handle_post_sample_buckets:506]Loading sample buckets timed out

The same error is shown in the UI. If I click to retry, it says the data is already loaded; however, if I inspect the db there are no documents in the bucket. Not sure if this is a similar issue... attaching the diag.
Comment by Farshid Ghods (Inactive) [ 06/Sep/12 ]
more info here : http://www.couchbase.com/issues/browse/MB-6305
Comment by Aleksey Kondratenko [ 06/Sep/12 ]
Matt, when I referenced mccouch, that was the old mccouch that was our persistence path. This is not the case anymore, so the same issue is still hitting us. We're seemingly doing (potentially a few) fsyncs per vbucket, and we're doing all of that sequentially.

Tommie, the docloader thing is most likely a distinct issue. I think we already have a bug for that.
Comment by Mike Wiederhold [ 25/Sep/12 ]
Alk,

I investigated this issue further and found that the latency of sending a vbucket state transition to mccouch was 100ms. Interestingly, a state transition to dead was an order of magnitude quicker, so I think there might be something going on on the mccouch side here.
Comment by Aleksey Kondratenko [ 25/Sep/12 ]
Mike, looking at the timestamps in the logs, I'm pretty sure there are no delays on the mccouch side.
Comment by Peter Wansch (Inactive) [ 25/Sep/12 ]
Mike, can you continue driving the analysis of this?
Comment by Mike Wiederhold [ 04/Oct/12 ]
The fix for this bug will require changes on both the ep-engine and mccouch sides, and requires modifying how we store vbucket state data and how we notify mccouch about vbucket state changes. The changes will not be small and will be risky for the 2.0 release. I recommend this change be moved to 2.0.1.
Comment by Chiyoung Seo [ 04/Oct/12 ]
For more details: we currently store a vbucket's state in its vbucket database as a local doc, which means every state change requires one fsync per vbucket. I was trying to replace those individual local docs with a global vbucket state doc stored in the "master" meta database, but realized that it requires changes in the interactions between ep-engine and erlang, and also affects our 2.0 backup and restore tools at this time.

This is a design defect and limitation, and the changes are too risky for 2.0 at this time.
Comment by Aleksey Kondratenko [ 04/Oct/12 ]
Can we consider not using fsync for initial vbucket creation?
Comment by Mike Wiederhold [ 18/Dec/12 ]
Also see MB-6234 which is a duplicate of this issue.
Comment by Aleksey Kondratenko [ 04/Mar/13 ]
MB-6234 is not necessarily a dup. This bug is about metadata ops.

In particular, why does ep-engine even need tons of fsyncs to create brand-new vbuckets?

Why can't ep-engine just drop those ops on the floor when a flush arrives during that 1.5 minutes?

I think at least those actions are possible, and perhaps more, and the SDK folks will finally have a working flush.
Comment by Dipti Borkar [ 04/Mar/13 ]
We need to better understand this issue in ep_engine and couchstore and come up with options to fix it in the 2.0.2 / 2.1 timeframe.
Comment by Matt Ingenthron [ 04/Mar/13 ]
Yay!
Comment by Mike Wiederhold [ 27/Mar/13 ]
There are some other issues still being worked on in 2.0.2 that will likely help improve this one, but I am moving this task out to 2.1.
Comment by Maria McDuff (Inactive) [ 03/Apr/13 ]
Created a CBQE ticket for QE to add to the functional automation suite: http://www.couchbase.com/issues/browse/CBQE-1192
Comment by Matt Ingenthron [ 11/Jul/13 ]
I removed the note in the summary about the barrier/disk sync because we have some evidence that in some environments even a ramdisk does not make the problem go away.
Comment by Aleksey Kondratenko [ 11/Jul/13 ]
Matt, unfortunately you're not quite correct. 1.5 minutes for bucket creation is only possible on a spinning disk with barriers on. Customer SM's issues are an order of magnitude smaller.
Comment by Matt Ingenthron [ 11/Jul/13 ]
That's a fair criticism, but it still wasn't correct to say it "gets fast". Well, it may be if we are talking strictly about ep-engine. I've changed it back.
Comment by Aleksey Kondratenko [ 11/Jul/13 ]
"fast" is always not specific enough. For my human feelings fews seconds to create 1k vbuckets without barriers is fast. For automated tests that do flush (and that requires more than just creating 1k vbuckets) it's not fast.

IMHO a bit too many of our bugs, admittedly including this one, are not specific enough. Most importantly, we have a bunch of tickets along the lines of "XXX is too slow".
Comment by Matt Ingenthron [ 11/Jul/13 ]
Much agreed. The request from users here would be sub-second, mostly due to the flush-testing use case. I can't speak to the practicality of that if 1024 individual IOs are needed and the disk does only 120 IOPS (1024 synchronous IOs at 120 IOPS is roughly 8.5 seconds at best), but that's what users want. I'm certain it can be refactored so it isn't 1024 synchronous IOs.
Comment by Chiyoung Seo [ 24/Jul/13 ]
Points: 3
Comment by Chiyoung Seo [ 24/Jul/13 ]
The ep-engine team will discuss the possible approaches with the other teams to resolve this issue.
Comment by Chiyoung Seo [ 25/Jul/13 ]
Let's discuss this issue offline to see how we can tackle it.
Comment by Chiyoung Seo [ 08/Aug/13 ]
The engine team discussed this issue, but as we mentioned earlier in this ticket, maintaining a vbucket_state local doc per vbucket database file is one of the major reasons for the bucket creation slowness with barriers on.

We plan to consolidate all the vbucket state local docs into a single local doc and manage it in the "master.couch" database file. Then it won't require 1024 sync IOs. However, this fix touches not only the engine side but also backup/restore and the compactor. Note that the DB compactor doesn't compact the "master.couch" file.

We plan to address it in the next major or minor release.
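
To make the proposed change concrete, here is a conceptual sketch (hypothetical helper functions and JSON payloads; not the couchstore API or on-disk format) contrasting one local-doc write plus fsync per vbucket file with a single consolidated states doc flushed once:

# Conceptual sketch only: hypothetical helpers, not the couchstore API.
import json, os

def write_states_per_vbucket(states, data_dir):
    """Current scheme: one state write (and fsync) per vbucket database file."""
    for vbid, state in states.items():
        path = os.path.join(data_dir, "%d.couch.1" % vbid)
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
        os.write(fd, json.dumps({"state": state}).encode("utf-8"))
        os.fsync(fd)              # ~1024 synchronous IOs for a fresh bucket
        os.close(fd)

def write_states_consolidated(states, data_dir):
    """Proposed scheme: all vbucket states in one doc, one fsync total."""
    path = os.path.join(data_dir, "master.couch.1")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.write(fd, json.dumps(states).encode("utf-8"))
    os.fsync(fd)                  # single synchronous IO
    os.close(fd)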
Comment by Filipe Manana [ 08/Aug/13 ]
I would suggest a dedicated database file for keeping those states. Using the master database will require some coordination between ep-engine and ns_server/couchdb (to prevent concurrent btree updates), leading to all sorts of issues and complexity. A dedicated file shouldn't add any significant overhead.
Comment by Sundar Sridharan [ 10/Oct/13 ]
Just a quick note/update that we don't seem to be seeing this issue anymore with the current master branch. Bucket creation on a single node takes no more than a minute.
I think Chiyoung and Filipe's suggestion of a separate file for storing the vbucket snapshots would be very useful.
Reducing the priority from Blocker to Minor.
thanks
Comment by Dipti Borkar [ 10/Oct/13 ]
Would be good for someone else to validate this.
Comment by Matt Ingenthron [ 10/Oct/13 ]
Sundar: why reduce the priority? The priority should remain the same until it's fixed.
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
I'm willing to bet a few hundred that it's _not_ fixed, and that on a durable spinning-disk setup it still takes at least tens of seconds to create a bucket.

Whether that's a blocker or minor or whatever is hard to say, given that UPR will de-couple views and XDCR from disk persistence, so the folks mainly affected by this slowness today (i.e. those running tests involving lots of bucket creations) will not hit it as hard as before.
Comment by Matt Ingenthron [ 10/Oct/13 ]
Whether rubles or dollars, I'm not taking that bet.

Regarding the UPR/XDCR decoupling, if that's the resolution that delivers what users need, that's fine. I don't expect it would fix things, so the issues where this came up in the first place would still be affected.
Comment by Sundar Sridharan [ 10/Oct/13 ]
I just created a bucket on a single-node Couchbase server without any replicas on 10.5.2.36 - the bucket became ready in 8 seconds.
It appears to be a spinning disk:
cat /sys/block/sda/queue/rotational
1
I repeated the test on 10.5.2.34 and then again on 10.5.2.33, and the bucket creation times were all well under 10 seconds.
If you still see this issue, please let me know the machine name and login credentials and I'll take it from there.
thanks
Comment by Aleksey Kondratenko [ 10/Oct/13 ]
Just tried on my box. It took about 50 seconds with barriers enabled.

In order to see that slowness you need a box where fsync actually waits for data to hit the disk. Our VMs are not "honest" about fsync, and we also have a few boxes with battery-backed cache where fsync can legitimately be quick.

Get yourself a Linux or Windows laptop with an HDD and out of the box you'll observe the slowness.
Comment by Matt Ingenthron [ 10/Oct/13 ]
You can figure out if something is buffering for you pretty easily. A real disk does somewhere from 80-150 IOPS. With dd and the oflag argument, you can ask it to sync between each write. If the result is in that range, then it's proper spinning rust. If it's not, something is buffering for you.

For example:
[ingenthr@centosa ~]$ time dd if=/dev/zero of=testfile bs=4096 count=1200 oflag=dsync
1200+0 records in
1200+0 records out
4915200 bytes (4.9 MB) copied, 11.8134 seconds, 416 kB/s

real 0m11.833s
user 0m0.004s
sys 0m0.057s
[ingenthr@centosa ~]$ time dd if=/dev/zero of=testfile bs=4096 count=1200
1200+0 records in
1200+0 records out
4915200 bytes (4.9 MB) copied, 0.01124 seconds, 437 MB/s

real 0m0.014s
user 0m0.000s
sys 0m0.013s

Note the first run did about 100 IOPS with dsync on, and the second did roughly 1000x better, simulating the buffering you might get from a VM hypervisor that lies, etc.

(this is a CentOS instance on KVM with two 5900 RPM "green" disks under it, on ZFS on SmartOS)
Comment by Aleksey Kondratenko [ 05/Mar/14 ]
Saw this marked as a regression, which is probably a mistake. This has always been the case since the switch to couch*.
Comment by Pavel Paulau [ 05/Mar/14 ]
I guess every old ticket gets marked as "Regression" when you update it.
Somebody added this new field to the MB project in JIRA recently... and it's probably not configured properly.
Comment by Benjamin ter Kuile [ 02/Apr/14 ]
Leaving out the flush command in a test suite changes the time from 2 minutes and 43 seconds to just 9 seconds. Flushing is important for isolated unit tests.
Comment by Sundar Sridharan [ 28/Jul/14 ]
As per Pavel's performance testing, this issue is only seen on spinning disks without any RAID setup. That setup is for single-developer use only; production environments typically have some form of RAID or SSD setup, where we do not see this issue. Hence, after consultation with PM, the decision is not to support the single-HDD case, since optimizing for it could lower performance in the more typical RAID/SSD case.
thanks
Comment by Matt Ingenthron [ 28/Jul/14 ]
Sundar: Could you re-assign this to the PM who made this decision? I'd like to discuss it. This is a significant issue for developer experience. It's caused countless support requests and confusion. It affects multi-spindle and SSD deployments as well, so that's a false comparison.
Comment by Chiyoung Seo [ 28/Jul/14 ]
There was a miscommunication among the teams. We identified the root cause and still need to fix this issue, but not for 3.0, as it requires changing the way we store vbucket_state docs and create individual vbucket database files during bucket instantiation.
Comment by Matt Ingenthron [ 28/Jul/14 ]
Chiyoung: who is the appropriate owner for this?
Comment by Chiyoung Seo [ 28/Jul/14 ]
Sundar still owns this ticket. I already communicated with him regarding how we can resolve this issue.