JSON decoding problems after a rebalance and an upgrade to Enterprise Edition 7.1.4 build 3601

I had to replace a faulty node, and after 10 hours of downtime (rebalancing the cluster) I am starting to randomly get errors similar to this one:

```
Fatal error: Uncaught JsonException: Syntax error in /var/www/dev-web/vendor/couchbase/couchbase/Couchbase/JsonTranscoder.php:96
Stack trace:
#0 /var/www/dev-web/vendor/couchbase/couchbase/Couchbase/JsonTranscoder.php(96): json_decode()
#1 /var/www/dev-web/vendor/couchbase/couchbase/Couchbase/GetResult.php(63): Couchbase\JsonTranscoder->decode()
/var/www/dev-web/vendor/couchbase/couchbase/Couchbase/JsonTranscoder.php on line 98
```

Sometimes it works after a retry.

What could be the problem?

I don’t see any errors in the Couchbase logs.

LATER EDIT
I did some additional tests. A simple document get (on a specific document ID) from the PHP SDK client sometimes returns a result and sometimes just returns an error: Fatal error: Uncaught Couchbase\Exception\UnambiguousTimeoutException
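
The test is essentially this minimal sketch (the connection string, credentials, bucket name, and document ID below are placeholders, not the real ones):

```php
<?php
use Couchbase\Cluster;
use Couchbase\ClusterOptions;
use Couchbase\Exception\UnambiguousTimeoutException;

// Placeholder connection details.
$options = new ClusterOptions();
$options->credentials('Administrator', 'password');
$cluster = new Cluster('couchbase://cb-node-1', $options);
$collection = $cluster->bucket('my-bucket')->defaultCollection();

try {
    // Plain KV get of one specific document ID.
    var_dump($collection->get('my-document-id')->content());
} catch (UnambiguousTimeoutException $e) {
    // This is the error that shows up intermittently.
    echo 'timeout: ' . $e->getMessage() . PHP_EOL;
}
```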

I wonder if the PHP SDK client retrieves the document from different servers randomly and one of the servers doesn’t have the document, or if it is something else.

Not sure how I can investigate further… but this seems very bad, because the cluster finished the rebalance without any error and everything looks correct from the cluster’s perspective.

Can anybody from Couchbase help me investigate this further?

You’ll get a timeout if the response doesn’t come back within the timeout (the default is 2.5 seconds for KV operations). It shouldn’t result in a transcoder error. You can either increase the timeout or retry.
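
For example, a longer timeout can be passed per operation. Here's a minimal sketch assuming PHP SDK 4.x, where GetOptions::timeout() takes milliseconds, reusing the $collection handle and placeholder document ID from the snippet above:

```php
<?php
use Couchbase\GetOptions;
use Couchbase\Exception\UnambiguousTimeoutException;

// Raise the KV timeout for this get from the 2.5 s default to 10 s.
$getOptions = new GetOptions();
$getOptions->timeout(10000); // milliseconds

try {
    $doc = $collection->get('my-document-id', $getOptions)->content();
    var_dump($doc);
} catch (UnambiguousTimeoutException $e) {
    // Still timed out even with the larger budget - retry or investigate.
    error_log('get timed out: ' . $e->getMessage());
}
```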

It’s not random. It gets from a particular node based on the key and the configuration. If it has a configuration that is out of date, it could attempt the get against the wrong server, and that should result in a NOT_MY_VBUCKET exception. That operation should be retried (the SDK might retry it internally, in which case it would retry until either (1) it has an updated config and it succeeds, or (2) it continues to fail and eventually times out). The config is updated frequently (every 10 seconds? 2.5 seconds?), but it’s possible that the operation could time out.
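
If you want to retry at the application level rather than rely only on the SDK’s internal retries, a small wrapper like this sketch would do it (getWithRetry is a hypothetical helper, not an SDK function):

```php
<?php
use Couchbase\Collection;
use Couchbase\Exception\UnambiguousTimeoutException;

// Hypothetical helper: retry a KV get a few times with a short backoff
// before giving up, rather than surfacing the first timeout to the caller.
function getWithRetry(Collection $collection, string $id, int $attempts = 3)
{
    for ($i = 1; $i <= $attempts; $i++) {
        try {
            return $collection->get($id)->content();
        } catch (UnambiguousTimeoutException $e) {
            if ($i === $attempts) {
                throw $e; // out of attempts - let the caller decide
            }
            usleep(200000 * $i); // back off 200 ms, 400 ms, ...
        }
    }
}
```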

@mreiche Thanks for the response.

The problem is that this is not correcting itself, so it seems that the DB is corrupted. Is there a way to trigger a check? I think some documents are missing from some servers.

If I run a script that gets a specific document 20 times in a minute, around half of the requests will trigger an error.

If I run a script that gets a specific document 20 times in a minute, around half of the requests will trigger an error.

What error? I assume timeout.
If the SDK was going to the wrong node to get that specific document, wouldn’t it fail every time? On the times it succeeds, is the elapsed time close to the timeout? If it is, this behavior sounds like the server (or network, or client) is oversubscribed, and specifying a longer timeout would solve the problem.
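
To check that, you could time each attempt; a rough sketch (again reusing the $collection handle and placeholder document ID from your snippet):

```php
<?php
use Couchbase\Exception\UnambiguousTimeoutException;

// How long does each get take, and which ones fail?
for ($i = 1; $i <= 20; $i++) {
    $start = microtime(true);
    try {
        $collection->get('my-document-id');
        $outcome = 'ok';
    } catch (UnambiguousTimeoutException $e) {
        $outcome = 'timeout';
    }
    $elapsedMs = (microtime(true) - $start) * 1000;
    printf("attempt %2d: %-7s %.1f ms\n", $i, $outcome, $elapsedMs);
    sleep(3); // spread the 20 gets over roughly a minute
}
```

If the successful gets are taking close to 2500 ms, the operations are only barely making it under the default timeout.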

Also, can you try version 4.1.0 of the PHP SDK? It has a change to reduce the number of getConfig requests to the server, which will decrease the server load and network traffic.

Also, since you are an Enterprise customer, you can open a case with Customer Support.

The servers are at 2% load, and the network is 1 Gbps with less than 1% usage.

This is happening with just a handful of documents, not with all of them.

This happened after the rebalance (I added back a node that had failed after a hardware failure). After 10 hours the rebalance finished.

BTW, why was the cluster unusable while it was rebalancing? Is there any way I can rebalance (add/remove nodes) without downtime?

The first thing you should do is try using the latest SDK. This will avoid troubleshooting issues that have already been addressed.

The next thing you should do is open a case with customer support to investigate your issues and answer your questions.

This is happening with just a handful of documents, not with all of them.

But it is only happening sometimes with those documents, right? “If I run a script that gets a specific document 20 times in a minute, around half of the requests will trigger an error.” So that’s puzzling for me too.

BTW, why was the cluster unusable while it was rebalancing? Is there any way I can rebalance (add/remove nodes) without downtime?

It shouldn’t be unusable. It might be slow.

I already have the latest SDK (4.1.2), so this should be out of the question.

I have restarted the cluster and now it seems that all the problems are gone. This is very weird.

Regarding the rebalancing: what was strange to me was that when one of the servers went down, the cluster was not responding to all document requests (some were failing with an ambiguous timeout).

So definitely something was not working correctly, as the server had already been auto-failed-over for more than 30 minutes.

From what I understand, when this happens (server down + auto-failover), the client (PHP SDK) should continue normal operations, which didn’t happen.

Also, after I replaced the server, the new server’s network went to 100%, which makes it very hard for any client (PHP SDK) to use the cluster (even if it would normally work with a missing server): if the other servers are bombarding the new server with data, there is no more room for normal users of the cluster to get/set data on it.

How can you limit the rebalance CPU/bandwidth consumption in such moments?
