Rebalance is stuck

While I would like you to get the logs and see if Couchbase staff can help you fix this once and for all, my backup was created from 3.0.3 AFTER I upgraded.
I did have backups before, but due to the bug in cbbackup they were incomplete (it stalls before completing), and I didn't dare do anything like the situation you are in now.
After the entire process of upgrading to 3.0.3 and deleting/restoring the bucket, I lost only 1 document, which I assume is the corrupted one.

Hi Itay,

When trying to use cbbackup, are you giving it the URL of the cluster/node or the path to the data files directly? If you’re trying to back up from the cluster and failing, then doing a backup directly from the data files might actually succeed.

The syntax for backing up from the data files directly is:
cbbackup couchstore-files:///path/to/couchbase/data/ /backup/folder -b <bucket> -u <Administrator> -p <password>

The files have an internal checksum, so while it would take some work, it should be possible to distinguish between valid and invalid records. I have not done it recently, but it used to be possible to move the data files to a new node and get it to use them: define the bucket, shut the service down, copy the files into place, and start it again. There is also the cbtransfer tool, which if I recall correctly (and the documentation seems to agree) can read data files.
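
For example, here is a rough sketch of transferring straight from the data files into a new cluster (I haven't verified this recently; the path, host name, and bucket names are just placeholders, and the destination bucket needs to exist before the transfer starts):

cbtransfer couchstore-files:///opt/couchbase/var/lib/couchbase/data/ http://newcluster:8091 -b <source-bucket> -B <destination-bucket> -u <Administrator> -p <password>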

Regarding S3, that was just a suggestion in case you were using AWS and had the data on S3 or EBS. The thought there (which can apply to other environments as well) is that at least you’d have a snapshot of the state of the systems.

Hi @davido, how are you?

Thanks for trying to help.

My backup command is pretty simple:
./cbbackup http://localhost:8091 c:/backups -u $uid -p $pwd --bucket-source=C

I’ll try to back up from the files and, if I succeed, I’ll spawn a new cluster and try to cbtransfer to it.

I ran cbtransfer.

Bucket A did not complete; it reached 99.9%.
Bucket B did not complete; it reached 96.2%.
Bucket C did not complete; it reached 99.1% and froze.

Update:

I created a new cluster with an empty default bucket.
I copied the entire var/lib/couchbase/data folder from the corrupted cluster to the new one.
The new cluster still shows only the default bucket.
I also copied the var/lib/couchbase/config/config.dat file from old to new; still, only the default bucket is visible in the new cluster.

Questions:

  1. What else should I copy from the corrupted cluster to the new one?
  2. Do I need to restart the new cluster? How?
  3. Will it work on 3.0.1 CE as well, or only on 3.0.3 EE?

P.S. I’m also starting to lose docs from views in the corrupted cluster. However, these docs are still accessible directly (by key) from the API or the console.

Please advise quickly.
Thanks.

Update:

I manually created buckets for A, B, C in the new cluster and they immediately started to populate with docs from the data folder.

However, cbbackup still hangs for bucket C at 70.7% :disappointed_relieved:

This means that the data folder contains corrupted data.
Perhaps if I knew which docs are corrupted, I could manually delete them, unblock cbbackup, and then the rebalance would hopefully succeed.

  1. How can I detect the corrupted docs and restore data integrity ASAP?
  2. If I repeat the same process on CE 3.0.3, will the results be different?

@davido,

I’m trying to run:

./cbbackup couchstore-files:///C:\Program Files\couchbase\server\var\lib\couchbase\data c:\backups -u $uid -p $pwd -b C

but get:

Error: please provide both a source and a backup_dir

I tried with 2 slashes and with a subfolder for the target.

What am I doing wrong?

I believe we have had the same issue here. We don’t have any way of intentionally triggering the problem, and in our case I didn’t notice it during an attempted rebalance; the cluster simply decided on its own to run a compaction and then hung during the compact. Maybe this is a separate issue, but I’m going to post here anyway in case it is related.

Symptoms: One of the nodes in the cluster starts a “couch_view_grou” thread (presumably couch_view_group, truncated in top) which consumes 100% of one core (not multi-threaded, apparently), and either leaks or consumes RAM continuously until it reaches the total RAM of the box, then crashes. That node then stops and restarts Couchbase, i.e., in the management GUI you see it go from green to red and back. This cycles over and over again. The GUI shows a green panel with “Compacting” and a progress bar, but it never progresses. When the node crashes and restarts, the Compacting dialog briefly goes away, then comes back.

Memcache-style buckets seem to be unaffected while this is going on. I think the particular Couchbase-style bucket that it’s stuck on is affected, but I haven’t gotten confirmation from my testers. The other node in the cluster (this happens to be a small cluster of 2!) was unaffected by any of this.

I tried giving the box more CPUs and more RAM, and it just eats up RAM until it hits the new, higher limit, so the crash-and-restart is less frequent, but it doesn’t fix the problem. What did fix the problem, though, was dropping and recreating the index views! So maybe this is why Itay found that adding some meaningless whitespace fixed the problem: what that was really doing was deleting and recreating the index in the process of submitting the change. My developer just made me a simple 4-line script of “curl” commands that uses the REST interface to DELETE and then PUT the views in question.

So basically we have a response procedure we think we can use the next time it happens, and I think that response procedure would not involve loss of data, because we’re not deleting any documents, just deleting and recreating the views on the documents. Correct?

If one of the developers wants to look at logs for this, I will be happy to provide them… except that the “Collect Logs” process also seems to fail on the very node that’s having the issue. So I can get logs from the other node, but not from the one that’s cycling. But I can pull logs out of Couchbase’s logs directory manually, so just let me know how to provide them.

Jeff Saxe, SNL Financial, Charlottesville, Virginia

Which version are you running @JeffSaxe?

Sorry, @ingenthr, I should have mentioned. The problem has occurred now on three different clusters, each of which is running Community Edition 3.0.1. It is right now happening on two clusters, so if anyone wants some interesting log excerpts or some tar-gzip’ed copies of an entire /opt/couchbase directory, I’m happy to provide them; please contact me, JSaxe@SNL.com. As I mentioned above, I can’t post the “collect info” zip file from the malfunctioning node, because that Collect button just times out and complains that the collect process failed on that node (although it succeeds on the other node).

I have installed 3.0.3 Enterprise Edition on a separate cluster to see if the problem surfaces there. Unfortunately, we do not have any way to intentionally trigger the problem, so even though “no news is good news”, I cannot definitively say that 3.0.3 fixes it; I have no idea. I suspect that upgrading the cluster from 3.0.1 CE to 3.0.3 EE would fix it, but not necessarily because of any code difference related to this issue; mostly because the upgrade process would involve completely uninstalling/reinstalling and removing/re-adding the nodes to the cluster.

Aha… interesting update: I left both of the currently-malfunctioning clusters malfunctioning overnight, and on one of the clusters as I type this, both of the nodes have become affected. Both of them have “couch_view_grou” processes listed in top, chewing up CPU and RAM and continuously recycling. The management GUI still seems to be responding, and my Memcache calls to a couple of ports still seem to be working. Also, the “collect logs” request now no longer works on either of the stuck nodes. Is this relevant? Is the same thread that’s responsible for completing or closing out a Compact-in-progress also responsible for responding to requests to collect logs? Fascinating. Anyone is welcome to email me urgently for details on this; it certainly feels like a bug to me, and I am normally very good at troubleshooting. Thanks, helpful community!

Another update, again in case it helps get someone closer to the compaction code (@ingenthr or otherwise!) to talk to me sooner. As I’m watching this, the compacting thread is clearly writing to this particular subdirectory:

root@DMZASHCouchST1A:~# ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions
total 2749856
-rw-rw---- 1 couchbase couchbase     271407 Jun  3 11:57 main_3d6e6044988cb6a1c780d673cd7ab747.view.127
-rw-rw---- 1 couchbase couchbase 2815572496 Jun  4 12:44 main_3d6e6044988cb6a1c780d673cd7ab747.view.127.compact
root@DMZASHCouchST1A:~# ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions
total 2755360
-rw-rw---- 1 couchbase couchbase     271407 Jun  3 11:57 main_3d6e6044988cb6a1c780d673cd7ab747.view.127
-rw-rw---- 1 couchbase couchbase 2821208592 Jun  4 12:44 main_3d6e6044988cb6a1c780d673cd7ab747.view.127.compact
root@DMZASHCouchST1A:~# ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions
total 2780832
-rw-rw---- 1 couchbase couchbase     271407 Jun  3 11:57 main_3d6e6044988cb6a1c780d673cd7ab747.view.127
-rw-rw---- 1 couchbase couchbase 2847291920 Jun  4 12:44 main_3d6e6044988cb6a1c780d673cd7ab747.view.127.compact

Then, after it crashes-and-burns and starts over, it recreates this file:

root@DMZASHCouchST1A:~# ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions
total 563232
-rw-rw---- 1 couchbase couchbase    271407 Jun  3 11:57 main_3d6e6044988cb6a1c780d673cd7ab747.view.127
-rw-rw---- 1 couchbase couchbase 576469520 Jun  4 12:46 main_3d6e6044988cb6a1c780d673cd7ab747.view.127.compact
root@DMZASHCouchST1A:~# ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions
total 664736
-rw-rw---- 1 couchbase couchbase    271407 Jun  3 11:57 main_3d6e6044988cb6a1c780d673cd7ab747.view.127
-rw-rw---- 1 couchbase couchbase 680409616 Jun  4 12:46 main_3d6e6044988cb6a1c780d673cd7ab747.view.127.compact
root@DMZASHCouchST1A:~# 

So really, not only is it hammering one CPU and chewing up RAM until it gets killed, but it is also writing a lot to disk, then deleting that file and starting over and writing to disk. If it’s supposed to be reading a puny little 271K file and “compacting” it to a fresh file, and it writes out many gigabytes of data, then it’s pretty clearly in some kind of infinite loop.
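
(If anyone wants to watch this live, something as simple as the following works; the path is just the one from the listings above:)

watch -n 5 'ls -l /opt/couchbase/var/lib/couchbase/data/@indexes/throttle-service-actions'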

One more update before I leave for the day: The little 4-line curl script that our developer made doesn’t actually work as desired. It’s basically this:

curl -X DELETE -u Administrator:SecretPassword http://DMZASHCouchST1:8092/throttle-service-actions/_design/dev_views
curl -X DELETE -u Administrator:SecretPassword http://DMZASHCouchST1:8092/throttle-service-actions/_design/views
curl -X PUT -u Administrator:SecretPassword -H "Content-Type: application/json" http://DMZASHCouchST1:8092/throttle-service-actions/_design/dev_views --data @Views.json
curl -X PUT -u Administrator:SecretPassword -H "Content-Type: application/json" http://DMZASHCouchST1:8092/throttle-service-actions/_design/views --data @Views.json

…where “Views.json” is a small file on the local machine (the workstation executing these curl commands) containing the text of the views. If I run the DELETE commands, the views do in fact disappear from the GUI, the Compacting dialog goes away, and the cluster nodes stop showing the weird symptom. But then I run the two PUT commands, and a few seconds later it actually starts doing it again! Dang it. But we figured out that if we use the Delete button in the Views GUI to perform the delete, and then use just the lower two curl commands to put them back, then it works and stays stable. So I don’t know if this gives any more clues as to what is getting stuck in its mental work queue… I don’t actually know the effective difference between clicking Delete and using the DELETE REST call, but I guess there is some difference under the hood.

I haven’t directed any of these posts at @cihangirb specifically, so now I am, just in case this issue is of interest to him. Thanks, folks.

Couchbase is getting really bad recommendations from me because of this (and other) issues with CE 3.0.1… when can we expect CE 3.0.3 to be available so Couchbase is actually usable again?
Can we at least compile CE 3.0.3 from GitHub on our own and get the bug fixes that are part of EE 3.0.3?

OK, my developer and I have done some more work, and I think I have been raising a red herring (is that even an expression?). The compacting / indexing code has been consuming RAM and CPU and disk because it calls our Map/Reduce code on the view, and our code seems to be trying to do something that is impossible, or at least impossible in Couchbase, or at least that we can’t do in the way we want or that we think is most obvious. Very briefly, we’re essentially trying to do a UNION DISTINCT on some elements in arrays in documents, and we’re trying to return from the Reduce function an array of short strings, and it will return 10 or 100 but completely chokes returning 1000.

Anyway, I will (later tonight, since I need to leave work soon) post a new, fresh topic on what our actual problem is, including some code snippets. But y’all can consider everything I’ve said here in this topic as not necessarily relevant to the original post. I apologize for the distraction; I do try to be relevant and not clutter up the discussion. Thanks for your time and attention, everyone.

@JeffSaxe You might be interested in a little-known setting:
curl -X POST http://$USER:$PASS@localhost:8091/diag/eval -d 'rpc:eval_everywhere(erlang, apply, [fun() -> couch_config:set("mapreduce", "max_kv_size_per_doc", "$SIZE_IN_BYTES") end, []]).'
The default $SIZE_IN_BYTES is 1048576 (1 MiB), but you should try a much smaller value… this will make the indexer abort the current document after emitting that many bytes and log the event. In the logs you will find the document ID, view name, bucket name, etc.
This should give you an idea of what document/view is having problems.
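
For example, to try a 64 KiB limit (the value and credentials here are only illustrative):

curl -X POST http://Administrator:password@localhost:8091/diag/eval -d 'rpc:eval_everywhere(erlang, apply, [fun() -> couch_config:set("mapreduce", "max_kv_size_per_doc", "65536") end, []]).'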

@itay: This reply is very late… however, it might help people who are stuck with a rebalance on any version of CB Server. I am testing out a few scenarios in a CB cluster, and a rebalance got hung up at 50.x% on various cluster nodes with a bucket of ~50M records.

I looked into the bucket metrics, and the “Intra Replication Queue” was constant at 12.5M and all replication had stopped. 12.5M was very close to the ulimit of open files I had configured.

Since the ulimit was reached, new files could not be opened for the “disk write queue” and other required resources.

You might need to check which system resource limit you are hitting; that should help.

I reduced the replicas and rebalanced multiple times to work back up to 3 replicas. Increasing the ulimit would also help, if that is becoming a bottleneck.
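
For reference, a rough sketch of checking and raising the open-file limit on a typical Linux node (the numbers and paths are only examples and vary by distribution):

# Soft limit in the current shell
ulimit -n
# Limit actually applied to a running Couchbase process (e.g. memcached)
grep 'open files' /proc/$(pgrep -x memcached | head -1)/limits
# To raise it persistently, add lines like these to /etc/security/limits.conf:
#   couchbase soft nofile 40960
#   couchbase hard nofile 40960
# then restart couchbase-server so the new limit takes effect.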

The good thing about rebalance is that it can be stopped and started multiple times :slight_smile: