I’m having an issue where Couchbase is returning stale data only first thing in the morning. Let me explain…
I’m running production Couchbase servers (4.6) with 3 nodes. In this scenario, I have 2 web requests that are hit by a client app in succession:
1. An endpoint that (over)writes a Couchbase document with data pulled from an external source.
2. An endpoint that reads the content of the document whose key was provided in the response from the first endpoint.
So there are 2 separate requests on the web server cluster, and each request is likely to hit a different web server, which then connects to my Couchbase cluster.
First thing in the morning, the 2nd endpoint returns old data. If I run the process over and over again, I cannot reproduce this behavior: the first endpoint updates the data and the second one reads the updated data as expected. It’s only on the first run of the morning (say, after 12 hours of no activity) that I see the problem.
I should also note that in the first endpoint hit, after writing the data, I read it back and log the contents, which are correct. So the 1st endpoint is working properly, and seems to be writing the data properly. This is consistent even on the first attempt of the morning. However the 2nd endpoint still shows the stale data. This specific functionality is to transfer an end user’s data when they move periodically. I have to be able to rely on the durability of the data on the first try after long periods of inactivity.
I have tried passing replicate_to = -1 and persist_to = -1 (a comment in the PHP Couchbase library says that “-1” = all active nodes).
I don’t see anything that stands out, but I am curious why you are specifying an expiry and why you’re using getAndTouch. Is there anything in your app that is supposed to create documents with a TTL or use document expiration in some way? Since I see “expiry=0”, I’m wondering why you’re using getAndTouch. I can’t think of a reason why this would cause the behavior, but it does appear curious to me.
@matthew.groves, the expiry setting is just because I’ve pulled these lines from wrapper classes that provide the option to set an expiry. In these specific cases the expiry is set to 0.
Also, I added the “getAndTouch” in a desperate attempt to “shake” the system into propagating the data more reliably. It hasn’t had an effect. What I’ve done now is add a “sleep(1)” after my transfer step, which I will test tomorrow morning. (Again, there’s no point testing it now: since I already ran it earlier today, it will work as expected until I leave it alone for a long period of time.)
Could there be a configuration issue with our Couchbase cluster that might cause this behavior? We run 3 nodes in the cluster, and the bucket is set to 1 replica. Both the web servers and couchbase servers are hosted in AWS and communicate directly using private DNS in a VPC.
I’ve been able to confirm that the problem is on the read. Yesterday I set up 3 tests that break (stop execution) between the first and second request. All 3 tests resulted in the correct data being written to Couchbase. I manually checked Couchbase using the admin portal after each of the initial requests went through (again, having disabled the second request, which reads the data in quick succession). This tells me the problem is that the second request reads stale data from Couchbase, even as much as 1 second later.
The current code that saves the data is still passing “replicate_to=1” to the ‘upsert’ options.
@matthew.groves, I’m trying new tests every morning now. This morning I changed back to using “persist_to=1” (using the explicit replica value rather than -1) and made 2 reads on the 2nd request, on the assumption that the first read might “wake it up” and the second read would return the real data.
I now set up 3 tests each day to run the next morning. In each case I change the source data (a different data center where the transferred-in data originates). I then run one of the tests through the client game (Android app), which hits the first request, takes the key ID provided in the response, and includes it in the 2nd request. For the other 2 tests I hit the link manually so that the 2nd request is never fired at all.
The test run through the game results in stale data, but the 2 tests run manually, which never hit the 2nd endpoint (the read endpoint), result in the correct data in Couchbase (when viewed through the admin portal).
Would it help for me to force a CAS value change? Is there a way for me to do that, or is it already being done when I save the data in the first request? Are there any server configurations that could affect this? For example, is there logging that could show every read/write request along with metadata about each?
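On the CAS question: in Couchbase, every successful mutation gives the document a new CAS value, so the upsert in the first request already changes it. That makes the CAS usable as a staleness check: the first endpoint can return the CAS along with the key, and the second endpoint can log and compare the CAS it actually reads. Below is a minimal Python sketch of that idea. It uses an in-memory stand-in rather than the Couchbase SDK, and the names `FakeBucket`, `write_endpoint`, and `read_endpoint` are hypothetical, chosen only for illustration. A mismatch means the read did not see that exact write; it could be stale data, or a later write by someone else.

```python
import itertools
import time
from dataclasses import dataclass


@dataclass
class Stored:
    value: dict
    cas: int


class FakeBucket:
    """In-memory stand-in for a key-value bucket (NOT the Couchbase SDK)."""

    def __init__(self):
        self._docs = {}
        self._cas_counter = itertools.count(1)

    def upsert(self, key, value):
        # Each mutation gets a fresh CAS, even if the value is unchanged.
        cas = next(self._cas_counter)
        self._docs[key] = Stored(dict(value), cas)
        return cas

    def get(self, key):
        return self._docs[key]


def write_endpoint(bucket, key, value):
    """First request: write the document and hand back key + CAS."""
    cas = bucket.upsert(key, value)
    print(f"WRITE key={key} cas={cas}")
    return {"key": key, "cas": cas}


def read_endpoint(bucket, key, expected_cas, retries=3, delay=0.0):
    """Second request: read, log the CAS seen, and retry on mismatch."""
    for attempt in range(1, retries + 1):
        doc = bucket.get(key)
        print(f"READ key={key} attempt={attempt} "
              f"cas={doc.cas} expected={expected_cas}")
        if doc.cas == expected_cas:
            return doc.value
        time.sleep(delay)
    raise RuntimeError(f"key={key}: never saw CAS {expected_cas}")
```

With this shape, the client passes both the key and the CAS from the first response into the second request. The log lines then show, per attempt, whether the read saw the morning's write or an older version of the document.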
Thanks in advance, I appreciate any direction you can point me in…