[MB-7160] if flush times out final stage (janitor creating vbuckets back) it returns success causing clients to see TMPFAIL after flush succeeds (WAS: there are reports that even after invoking FLUSH nodes return TMPFAIL...) Created: 11/Nov/12  Updated: 20/Mar/14

Status: Reopened
Project: Couchbase Server
Component/s: ns_server
Affects Version/s: 2.0-beta-2, 2.0.1, 2.1.0
Fix Version/s: bug-backlog
Security Level: Public

Type: Bug Priority: Major
Reporter: Farshid Ghods (Inactive) Assignee: Aleksey Kondratenko
Resolution: Unresolved Votes: 0
Labels: customer, ns_server-story
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File test.rb     Text File test.txt    
Issue Links:
Relates to
relates to MB-8956 [RN'd] - if flush times out final sta... Closed
Triage: Untriaged
Is this a Regression?: Yes

 Description   
"reported by SDK team"
After a flush (through REST), server nodes keep returning tmpfail for some time. This is not expected; it is annoying from an application's point of view and/or causes problems with automated tests.

update:

I've updated the subject. Indeed this is possible. But IMHO, given that clients should always be prepared to handle TMPFAIL from any operation, I've lowered this to minor.

further update:

I disagree on "always be able to handle TMPFAIL", especially in the case of running tests on the SDK side. At the moment we ask our users to handle tmpfail directly. That's intentional and makes sense as a pressure relief valve in steady state, but around these flushes, and especially in unit tests, it'd be best if the cluster could just block either the operation request or the flush response until the flush is complete.

Raising this to major owing to end-user reports of trouble here.


 Comments   
Comment by Farshid Ghods (Inactive) [ 11/Nov/12 ]
Deep,

can you please reproduce this case after coordinating with the SDK team?
Comment by Farshid Ghods (Inactive) [ 11/Nov/12 ]
maybe the enable-traffic command is async instead of sync?

Matt,
if this is easy to reproduce can you please add more information:
1- how many nodes
2- how many buckets
3- hardware info (CPU, RAM)

and upload diags if possible
Comment by Matt Ingenthron [ 12/Nov/12 ]
Sergey: this issue originally came from you.

Can you:
1) let us know if you still see this issue and
2) let us know if you'd opened a bug previously?

Thanks.
Comment by Sergey Avseyev [ 12/Nov/12 ]
The issue is still there. Take a look at the attached files.

1 node
1 bucket
default number of vbuckets
Comment by Steve Yen [ 12/Nov/12 ]
bug-scrub: mike described a workaround to sdk team.
Comment by Sergey Avseyev [ 12/Nov/12 ]
Where can I find that description?
Comment by Aleksey Kondratenko [ 13/Nov/12 ]
Sergey, I see potential for confusion. Is this related to the slow flush you're seeing? Are you sure you're getting 200 back?

Please describe your case better.

As for the workaround, it's basically the same as for any tmpfail error: just keep retrying.
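
For illustration, a minimal retry sketch in PHP (same SDK as the attached test script; the probe key, retry count, and back-off interval are arbitrary placeholders, not part of any reported reproduction):

<?php
   // Minimal sketch: after a flush, keep retrying a simple operation until
   // the node stops answering with a temporary failure. Key, value, retry
   // count, and back-off interval are arbitrary examples.
   $cb = new Couchbase("localhost", "Administrator", "password");

   $attempts = 0;
   while ($attempts < 50) {
      try {
         $cb->set("probe-key", "probe-value");
         echo "Writes accepted again after " . $attempts . " retries\n";
         break;
      } catch (CouchbaseException $e) {
         // Most likely TMPFAIL while the janitor is recreating vbuckets.
         $attempts++;
         usleep(200000); // back off 200 ms before retrying
      }
   }
?>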
Comment by Sergey Avseyev [ 14/Nov/12 ]
As the title says, the spec says that after a flush all operations should be OK. The attached script shows that the cluster might return a 500 error to flush.
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
Sergey, this just can't be fair. 5xx reply means all bets are off.

Real bug you're after is: MB-6232
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
See MB-6232 as pointed out above
Comment by Aleksey Kondratenko [ 14/Nov/12 ]
And the workaround is to run off a RAM disk.
Comment by kzeller [ 06/Dec/12 ]
Added to RN as:

Several incidents have been reported that after using Flush on nodes, Couchbase
Server returns TMPFAIL even after a successful flush.
Comment by Matt Ingenthron [ 07/Mar/13 ]
Accidentally thought this was the right one to re-open. It actually is a dupe here, but there's a related issue.
Comment by Aleksey Kondratenko [ 07/Mar/13 ]
Matt also said the following over email:

BUT recent testing by our team indicates the fsync is not the underlying cause. UPDATE, as of 3/7, MB-6232 has now been moved from Backlog to 2.0.2. Need to reopen MB-7160 with current findings

I'm waiting for those details that could indicate there's some other reason for the tmpfails.
Comment by Matt Ingenthron [ 21/Mar/13 ]
This may or may not be directly related to the specific issue reported here. It seems to be a contributing factor anyway.

Trond recently carried out a series of experiments to test whether or not the flush-time problems were owing to the underlying MB-6232, as previously suspected. MB-6232 says fsyncs and serially doing a lot of work with vbuckets cause much of the creation/deletion slowness we see.

To isolate this and find a solution for a particular deployment, a number of tests were run. In the last round, there were two:
1) Reduce the number of vbuckets to 8, but otherwise use a real filesystem, disks, etc. (~2 sec)
2) Use ramfs (9.5 sec)

1) Reduce the number of vbuckets:

I uninstalled 2.0.1-build 170 from my box and removed /opt/couchbase before reinstalling it and stopping it immediately. Then I edited /opt/couchbase/bin/couchbase-server and added (right under the license header):

COUCHBASE_NUM_VBUCKETS=8
export COUCHBASE_NUM_VBUCKETS

This sets the server to use only 8 vbuckets instead of the default of 1024. This number _HAS_ to be a power of two, and should not be less than the number of nodes you have in your cluster (otherwise you'll have nodes without any vbuckets).

Running the same program as I used earlier (which does a flush of an empty bucket) now takes ~2 secs on a filesystem mounted with "barrier=1" and less than a second (0.7 s) on a filesystem mounted with "barrier=0".
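
As a side note, one way to confirm the override took effect is to count the vBucketMap entries in the bucket details REST response; a quick sketch, assuming the default bucket and the credentials used above:

<?php
   // Sketch: count the vbuckets by reading the bucket's vBucketMap from
   // the REST bucket details. Expect 8 after the override above.
   $json = file_get_contents("http://Administrator:password@localhost:8091/pools/default/buckets/default");
   $bucket = json_decode($json, true);
   echo count($bucket["vBucketServerMap"]["vBucketMap"]) . " vbuckets\n";
?>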

2) Using ramfs

I created the ramfs by running:

mount -t ramfs -o size=100m ramfs /tmp/couchbase_data
chown couchbase:couchbase /tmp/couchbase_data

Then installed couchbase and had it use /tmp/couchbase_data for storage.

Running flush takes ~9.5 sec for an empty bucket. Reducing the number of vbuckets to 8 as described in (1) brought this down to 0.36 s.

--- --- ---

In an earlier set of experiments, there was more measurement of where time was going. I've edited this slightly from Trond:

Test rig: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 4GB memory.
The system uses ext4 as the filesystem and gives 2GB of memory to Couchbase (the two sample databases were installed during installation).

The test was run with the following PHP script, which was intended to simulate the series of flushes performed as unit tests run:

<?php
   // Run 10 flushes back to back against the default bucket and count
   // how many of them fail (simulating a test suite flushing between tests).
   $counter = 1;
   $errors = 0;

   $cb = new Couchbase("localhost", "Administrator", "password");

   while ($counter < 11) {
      try {
         echo "\rFlushing: " . $counter . " Errors: " . $errors;
         $cb->flush();
      } catch (CouchbaseException $e) {
         $errors++;
      }

      $counter++;
   }
?>

When I initially ran the script it reported:
trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
Flushing: 10 Errors: 8
real 5m36.613s
user 0m0.012s
sys 0m0.020s

Please note that we can't really "trust" the numbers above when we had an error, because I don't check the reason _why_ it failed (the next one could, for instance, fail because we've already got a flush running, etc.).
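
As a sketch of what checking the reason could look like (not part of the original test; getMessage() is just the standard PHP exception accessor, so the exact server wording isn't guaranteed):

<?php
   // Sketch only: same loop as above, but logging _why_ each flush failed
   // so an "already flushing" error can be told apart from other failures.
   $cb = new Couchbase("localhost", "Administrator", "password");

   for ($counter = 1; $counter <= 10; $counter++) {
      try {
         $cb->flush();
      } catch (CouchbaseException $e) {
         error_log("flush #" . $counter . " failed: " . $e->getMessage());
      }
   }
?>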

When I remounted the filesystem with barrier=0 I got:
trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
Flushing: 10 Errors: 0
real 3m4.609s
user 0m0.028s
sys 0m0.008s

At least all of the flush calls succeed, but look at the time: it takes more than 3 _minutes_ to run 10 flush commands. For a user running 500 unit tests, each separated by a flush, just running the flushes would mean roughly 2.5 hours of waiting across the 500-test cycle (and depending on how much data they add, it may take longer...).

What I do find interesting is that when I run top while doing this I see beam.smp (the Erlang VM) using 100-170% CPU (i.e. almost two cores), whereas memcached is relatively idle. From what I can see, flushing a bucket sets all of the vbuckets to "dead" before activating them again.

I know we don't have stats for _everything_ we do, but from running the cbstats "timings" command we see (omitting the printout of the distribution):

 set_vb_cmd (2048 total)
    Avg : ( 27us)
 del_vb_cmd (2048 total)
    Avg : ( 44us)
 disk_vbstate_snapshot (1622 total)
    Avg : ( 6ms)
 disk_vb_del (1024 total)
    Avg : ( 3ms)

If I add all of these up (assuming they're all done in sequence) we end up at roughly 13 secs, but the "time" report for the php process doing a single flush says:
real 0m23.613s
user 0m0.016s
sys 0m0.012s

So there are about 10 secs not accounted for, but this seems to vary (starting php and doing a single set takes 0m0.033s, so it's not PHP runtime overhead).

I'm not entirely sure what "disk_vbstate_snapshot" is used for, but it accounts for roughly 9 of the 13 secs.
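
For reference, multiplying each average by its operation count reproduces both figures (a back-of-envelope check, assuming the averages apply uniformly to every call):

<?php
   // Back-of-envelope check of the ~13 sec total and the ~9 sec share
   // attributed to disk_vbstate_snapshot, from the cbstats output above.
   $set_vb   = 2048 * 27e-6;  // ~0.06 s
   $del_vb   = 2048 * 44e-6;  // ~0.09 s
   $snapshot = 1622 * 6e-3;   // ~9.7 s
   $vb_del   = 1024 * 3e-3;   // ~3.1 s
   printf("snapshot: %.1f s, total: %.1f s\n",
          $snapshot, $set_vb + $del_vb + $snapshot + $vb_del);
   // prints roughly: snapshot: 9.7 s, total: 12.9 s
?>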

When I disabled that function in ep-engine, the example program above ran those 10 flushes in:

Flushing: 10 Errors: 0
real 0m22.411s
user 0m0.020s
sys 0m0.012s

Interestingly enough, beam.smp is now down at ~100% CPU... For fun I tried to run the script with 500 flushes (to see if it changed over time), and we're down to less than 18 minutes of flushing (which isn't that bad ;-))

I know we added that call to disk_vbstate_snapshot for some reason, so just removing it isn't an alternative. I am however pretty sure we don't need to snapshot the vbucket states 1600+ times while we're running a flush (I would have guessed that two would do the trick: one when they are all disabled, one when they are all back up again). It is _ns_server_ and not ep-engine that needs to ensure that the clustermap is consistent after a potential crash during the process. To be honest, ns_server could just persist a flag saying that it is doing a flush before it starts, then nuke the flag when it receives the response that all of the vbuckets are dead, if it wants to protect itself from coming back up in a "hosed" configuration should a crash occur during the flush.

Personally I think we should change our code to let ns_server send a _SINGLE_ message to ep-engine listing _ALL_ of the vbuckets it wants to shut down, and it should get a single return message when that is in place (we could then take a _SINGLE_ vbucket state snapshot once we've disabled all of them "atomically"), then a SINGLE enable message (resulting in a single snapshot_vbucket_state).

Right now we're also adding FLUSH markers to our tap connections. Given that ns_server is the organizer of the entire process, it could just shut down the TAP connections before running the flush command and then restart the replication chains when it's done (the database should be empty anyway, and we wouldn't have to add special flush logic to the tap streams).

It would be interesting to know if all of this message passing and constant vbucket state change is part of its heavy CPU usage during the flush.
Comment by Matt Ingenthron [ 23/Aug/13 ]
Given that this was deferred, we should at least release note it.
Comment by Aleksey Kondratenko [ 08/Oct/13 ]
I don't get the "or" above. This is not fixed yet. It is fixable relatively easily (assuming 204 is OK). But I won't be crying if it doesn't make 3.0.
Comment by Aleksey Kondratenko [ 08/Oct/13 ]
Matt, let's be specific here. This ticket is only about returning 204 if flush takes longer than 30 seconds.

There's another ticket on my team _to investigate_ reported slowness for cases where we know it's not fsyncs. Once we investigate we might or might not be able to improve it.
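
For context, a rough sketch of how a client could react if the REST flush signalled an incomplete (timed-out) final stage with a distinct status such as 204; the doFlush URI is the standard bucket flush endpoint, while the 204 branch is only the proposal under discussion, not current behaviour:

<?php
   // Sketch under the assumption discussed above: 200 = flush fully done,
   // 204 (proposed) = flush accepted but the final stage (janitor
   // recreating vbuckets) did not finish within 30 seconds, so TMPFAILs
   // may still follow for a while.
   $ch = curl_init("http://localhost:8091/pools/default/buckets/default/controller/doFlush");
   curl_setopt($ch, CURLOPT_POST, true);
   curl_setopt($ch, CURLOPT_USERPWD, "Administrator:password");
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   curl_exec($ch);
   $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
   curl_close($ch);

   if ($status == 200) {
      echo "flush complete\n";
   } elseif ($status == 204) {
      echo "flush still finishing; expect TMPFAIL until vbuckets are back\n";
   } else {
      echo "flush failed with HTTP " . $status . "\n";
   }
?>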
Generated at Fri Jul 25 02:39:07 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.