4.1.X: all nodes at 100% load + strange memcached log

Hello,

  1. 3 nodes, undersized, many expired documents
  2. UI: “Data Buckets -> bucket.name -> Documents” hangs, with 100% load across the 3-node cluster (all nodes return to normal after 2-3 minutes); memcached.log.0.txt (below):
    QUESTION: is this normal, or should it be investigated?

UPDATE: my assumption above was wrong: this happens each time I click “Documents” (even after compaction, though I don’t see the log messages now).

UPDATE2: the same doc-fetch activity is repeated every time:

UPDATE3: after the activity, no docs are found (but there should be 5):

UPDATE4: in fact, the 5 documents do exist. But the problem of “massive fetches with 100% load on every click” remains.

UPDATE5: rebooting all nodes does not help. The button-press call “GET /pools/default/buckets/bucket.name/docs?inclusive_end=false&skip=0&include_docs=true&limit=6&_=1464290940620” simply failed, which is why nothing was displayed.

UPDATE6: I see the same repeated messages in babysitter.log each time I press the “Documents” button.

UPDATE7: removing views does not help. The problem can be described like this: “the list of all previously created (and now expired) docs is rescanned every time, even though the docs no longer exist”.

@vmx, @jwalker, any “initial evaluation”? (Maybe this is already known?)

:sweat: seems like “reinstall with 4.1.0-EE + prove it with a reproduction” awaits me again.

2016-05-26T15:35:53.361267Z WARNING (bucket.name) Warning: failed to fetch data from database, vBucket=5 key=|here was key value| error=document not found [none]
2016-05-26T15:35:53.361432Z WARNING (bucket.name) Warning: failed to fetch data from database, vBucket=217 key=|here was key value| error=document not found [none]

Is it normal, or should it be investigated?

So, as far as I know, the UI gets documents to display by hitting a REST endpoint (you can try this yourself: http://<your host name>:8091/pools/default/buckets/[bucket_name]/docs?inclusive_end=false&skip=0&include_docs=true&limit=11).

This first gets some keys from ep-engine (the data service) and then fetches the documents associated with these keys.
As far as I can tell, in versions prior to 4.5 the endpoint used to get the keys (the same as the one I linked previously, just without the include_docs=true parameter) also returns the keys of items that have been deleted but whose tombstone still exists on disk.
As a result, every call that the UI makes to display documents also tries to fetch many documents which no longer exist, hence the “Warning: failed to fetch data from database” messages.
This means that in scenarios such as yours, where there are large numbers of expired or deleted items on disk, this call will time out and no documents will be displayed in the UI.
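
For reference, here is a rough sketch of hitting that same endpoint outside the UI (plain libcurl; the host, bucket name and credentials are placeholders, so adjust them for your cluster). It makes it easier to time the call and see whether it really times out:

```cpp
// Sketch only: query the same REST endpoint the UI uses for the "Documents" page.
// "localhost", "default" and the credentials below are placeholders.
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t collect(char *ptr, size_t size, size_t nmemb, void *userdata) {
    static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    std::string body;
    // Same call the UI makes when you open "Data Buckets -> bucket -> Documents".
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://localhost:8091/pools/default/buckets/default/docs"
        "?inclusive_end=false&skip=0&include_docs=true&limit=11");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "Administrator:password");
    curl_easy_setopt(curl, CURLOPT_TIMEOUT, 120L);   // give it time on a slow scan
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) {
        std::cout << body << std::endl;               // JSON with the returned rows
    } else {
        std::cerr << "request failed: " << curl_easy_strerror(rc) << std::endl;
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```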

To avoid this you will want to lower the metadata purge interval for this bucket, which decreases the amount of time that the deleted items are left on disk before they are fully purged, assuming that you do not have XDCR or views that rely on these ‘tombstones’ being present.
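
If you prefer to script that change rather than use the UI, a minimal sketch over the REST API is below. Note that the endpoint and field names (setAutoCompaction, purgeInterval, parallelDBAndViewCompaction) are from memory rather than the 4.1 documentation, so please verify them against the REST reference for your version before relying on this:

```cpp
// Sketch only: lower the cluster-wide metadata purge interval over REST.
// Endpoint and parameter names are quoted from memory, not from the 4.1 docs.
#include <curl/curl.h>
#include <iostream>

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://localhost:8091/controller/setAutoCompaction");
    curl_easy_setopt(curl, CURLOPT_USERPWD, "Administrator:password");
    // 0.04 days is the minimum the UI allows (roughly 1 hour).
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS,
        "purgeInterval=0.04&parallelDBAndViewCompaction=false");
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK) {
        std::cerr << "request failed: " << curl_easy_strerror(rc) << std::endl;
    }
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

The lower interval only has an effect once a compaction actually runs, which matches the observation further down the thread that setting it to 0.04 and then compacting made the symptoms disappear.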

What is interesting from my research is that this issue does not seem to exist on 4.5, because the endpoint http://<your host name>:8091/pools/default/buckets/[bucket_name]/docs?inclusive_end=false&skip=0&limit=11 no longer includes deleted items.
I would suggest making the metadata purge interval change described above if possible; if not, then you should test on 4.5 to verify that the issue has been resolved.

Unfortunately I have been unable thus far to pinpoint the exact change that was made to alter this behaviour in 4.5 compared to earlier versions. I will let you know if I am able to.

@matt.carabine,
thanks, this is a very useful explanation.
So, what do we have:

  1. this problem exists for 4.1.X but not for 4.5, so this is a “backport problem”.
  2. sounds like the solution is “find the fragment of ep-engine’s source code that returns tombstoned items and change this behavior to match the corresponding code in 4.5”.
  3. I’ll check the workaround with the “metadata purge interval”, but the strange thing is that a lot of time (more than 24 hours) has passed since the docs expired. And, imho, “this is not a solution”.

P.S. Well, at least this is just a backport problem :sweat_smile:

@matt.carabine,
via UI: "metadata purge interval is too small. Allowed range is 0.04 - 60"
ok, it was “3” => i.e. 3/0.04 = 75 hours => ~3 days.
Is there an option to set metadata purge interval to lower values then 0.04, for example 0.01?

I haven’t investigated much, but I can definitely confirm that opening Data Buckets -> bucket_name -> Documents in the Web Console takes a fair amount of time to load, sometimes up to a minute, even when there are no active (non-expired) documents in the database.
Changing the Metadata Purge Interval setting to fix this doesn’t sound like a good idea, of course.
I’m using Couchbase v4.0.

@dmitryb,
strange; I see no such behavior on 4.1.X source-based builds.

The metadata purge interval is in ‘days’, so 0.04 days is about 1 hour, if that’s what you want to set the value to!

@matt.carabine,
yep, this workaround works: I set it to 0.04, ran a compaction, and the “signs of the problem” are gone.
But, the problem itself remains.

[TL;DR] => nothing useful below; everyone can skip this particular post ;)

A little code discovery, for your attention:

sherlock branch of ep-engine:

  1. The only place where “failed to fetch data from database” is found is 2151@couch-kvstore.cc.
  2. The condition for printing this message:
couchstore_error_t errCode = cbCtx->cks.fetchDoc(..);
if (errCode != COUCHSTORE_SUCCESS && !meta_only) { /* the warning is logged here */ }

3. there are 2 places where errCode is changed within cks.fetchDoc():

a)

errCode = couchstore_open_doc_with_docinfo()

(returns something different from COUCHSTORE_SUCCESS?)

and b)

if (docinfo->deleted) errCode = COUCHSTORE_ERROR_DOC_NOT_FOUND;

(the call to couchstore_open_doc_with_docinfo() is successful, but because the document is deleted, errCode is changed)

We can set case “b” aside, because there is no difference in this chunk of code between 4.1.1 and master (and, as you said, 4.5 is not affected).
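
To keep the two cases straight, here is how I read that fetch path as a whole. This is a paraphrase, not the literal 4.1.1 source: the types and the couchstore_open_doc_with_docinfo() call come from libcouchstore’s public header, everything else is my reconstruction of cases “a” and “b” above.

```cpp
// Paraphrase of how errCode seems to be produced inside cks.fetchDoc();
// NOT a copy of the actual ep-engine source.
#include <libcouchstore/couch_db.h>

static couchstore_error_t fetchDocSketch(Db *db, DocInfo *docinfo,
                                         Doc **doc, bool metaOnly) {
    couchstore_error_t errCode = COUCHSTORE_SUCCESS;
    if (!metaOnly) {
        // case a): couchstore itself fails to open the document body
        // (for example docinfo->bp == 0, see below)
        errCode = couchstore_open_doc_with_docinfo(db, docinfo, doc,
                                                   DECOMPRESS_DOC_BODIES);
        // case b): the open succeeds, but the item is a tombstone
        if (errCode == COUCHSTORE_SUCCESS && docinfo->deleted) {
            errCode = COUCHSTORE_ERROR_DOC_NOT_FOUND;
        }
    }
    // the caller (getMultiCb) logs "failed to fetch data from database"
    // whenever this is not COUCHSTORE_SUCCESS and !meta_only
    return errCode;
}
```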

So, suppose couchstore_open_doc_with_docinfo() returns something other than COUCHSTORE_SUCCESS.

Let’s take a look at 624@couchstore/src/couch_db.cc.
What is suspicious is that sherlock’s couchstore_open_doc_with_docinfo() has no diff from master.
The points where errCode changes are:
a):

if (docinfo->bp == 0) return COUCHSTORE_ERROR_DOC_NOT_FOUND; // line 632

b):

errcode = bp_to_doc(pDoc, db, docinfo->bp, options); // line 640

Let’s check ep-engine/couch-kvstore.cc for case “a)” (docinfo->bp == 0). We can see that
“CouchKVStore::getMultiCb()” is wrapped as the static “getMultiCbC”.

511@couch-kvstore.cc:

 errCode = couchstore_docinfos_by_id(db, ids, itms.size(), 0, getMultiCbC, &ctx);

Let’s take a look at couchstore_docinfos_by_id(), 989@couch_db.cc.
It’s a wrapper around iterate_docinfos(); I should also mention here that there is no diff between sherlock and master for this call.

As I see from the code, iterate_docinfos() has this signature:

static couchstore_error_t iterate_docinfos(Db *db,
                                       const sized_buf keys[],
                                       unsigned numDocs,
                                       node_pointer *tree,
                                       int (*key_ptr_compare)(const void *, const void *),
                                       int (*key_compare)(const sized_buf *k1, const sized_buf *k2),
                                       couchstore_changes_callback_fn callback,
                                       int fold,
                                       void *ctx)

Inside it, the callback (“getMultiCbC” in our case) is packed into the lookup context, attached to the lookup request, and then btree_lookup() is called:

lookup_context cbctx = {... callback ...};
rq.callback_ctx = &cbctx;
btree_lookup(&rq, tree->pointer);

… and “my brain stack has overflowed” here, examining btree_lookup_inner :imp:
It’s easier to put “LOG” everywhere, re-compile and re-test, to understand exactly where COUCHSTORE_ERROR_DOC_NOT_FOUND comes from.
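
For anyone who wants to try that brute-force route: nothing couchstore-specific is needed, a throwaway trace like the one below (plain fprintf with the standard __func__/__LINE__ macros), pasted next to every assignment of an error code, would do. The macro name is mine, purely illustrative:

```cpp
// Throwaway trace for the "re-compile and re-test" approach:
// paste next to every assignment of a couchstore_error_t and grep the output.
#include <cstdio>

#define TRACE_ERRCODE(code) \
    fprintf(stderr, "TRACE %s:%d errcode=%d\n", __func__, __LINE__, (int)(code))

// usage example (hypothetical spot inside couchstore):
//   errcode = bp_to_doc(pDoc, db, docinfo->bp, options);
//   TRACE_ERRCODE(errcode);
```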

@matt.carabine,
https://issues.couchbase.com/browse/MB-19758

Dupe of https://issues.couchbase.com/browse/MB-16663

@drigby,
Cool!
Could you please backport this to 4.1.x? (As I understand it, the fix is 1 LOC: http://review.couchbase.org/gitweb?p=ep-engine.git;a=blobdiff;f=src/couch-kvstore/couch-kvstore.cc;h=bb29402349b4bb4d51e70ad6645ca169dffd2e97;hp=7c856ad9b2375bffe080a54f6c6326d868d3ed30;hb=bf22617cb39d333e548aef177f5c7b8538193331;hpb=191fda08d638dae93f5dfcd1146340ffefae87ce )

We generally try to constrain backports to issues which are sufficiently severe - there isn’t just the code change cost but all the additional QE / test / etc effort - and of course the opportunity cost.

Given this is a relatively minor issue (essentially a perf issue on an uncommon workflow) I can’t see a backport happening.

Given 4.5 will be released quite soon I suggest you wait for that.

@drigby,

We generally try to constrain backports to issues which are sufficiently severe - there isn’t just the code change cost but all the additional QE / test / etc effort - and of course the opportunity cost.

The test is already there: http://review.couchbase.org/#/c/56505/

Given this is a relatively minor issue (essentially a perf issue on an uncommon workflow) I can’t see a backport happening.

ok, agreed: 28-Oct-2015 … 27-May-2016 = 7 months, and I’m the one who filed the “dupe”. But maybe I’m just the one who actually investigated and reported this, while others did not?

Given 4.5 will be released quite soon I suggest you wait for that.

The problem is that 4.0/4.1.X is declared stable now, and MB-16663 is not in the list of “known issues” (4.1: Couchbase SDKs, 4.0: Couchbase SDKs).

Yes, it’s OK, “we will fix it in 4.5”. BUT in that case, please update the “known issues” section for Couchbase SDKs.

@matt.carabine, @drigby, @cihangirb,
… any comments?

And one more try …
@matt.carabine, @drigby, @cihangirb,
… any comments?

Pinging people every 24 hours isn’t particularly polite.

I already explained why I don’t see us backporting this to a 4.1.x release any time soon. Of course you can always backport the fix yourself - that’s one of the advantages of Open Source.

Hi @egrep, sorry, I was confused about the comment above. Did you want us to put this into the known issues for 4.1? We can certainly do that if that is what you need.
thanks
-cihan

@drigby,
well, imho, totally ignoring someone is not polite either.
I asked a very simple question:

  1. There was an issue with “affected version = 4.1.0”.
  2. Someone (long before the 4.1.1 release) decided “we will fix this in 4.5”.
  3. Then 4.1.1 was released, and it was declared that:

It further fortifies Couchbase Server 4.1.0 and includes some critical bug fixes.

( Couchbase SDKs )

  4. There is no MB-16663 in the “list of known issues”.
  5. In fact, 4.5 just does not exist right now. It’s a beta. Something “really unpredictable”.

Summary, as I see it: there is a bug in the previous stable version (4.1.0); there is a current, “more stable” version (4.1.1), but it has no fix. OK, no fix merged, I get it, it’s an “opportunity cost” to change 1 line of code and adapt the test.
But what stops you from saying “ok, we missed this info” and adding information about this bug to “Couchbase SDKs”?

@drigby, please try to understand my position: it’s not a “personal attack” or anything like that. This question is about the “reaction to an uncertain situation”. And underneath this particular “uncertain situation” there is, presumably, a defined bug-handling policy that I would like to clarify. Imho, it’s completely unfair to know about a bug in 4.1.0 and not fix it in 4.1.1. It’s only “imho” because “there is no justice in the world”. Either way, all users should be aware of this before making any decisions.

@cihangirb
Yes, please. That would be fair enough.
And all 86 others too: https://issues.couchbase.com/browse/MB-16890?jql=project%20%3D%20MB%20AND%20issuetype%20%3D%20Bug%20AND%20affectedVersion%20%3D%204.1.0%20and%20fixVersion%20>%3D%204.5.0%20and%20fixVersion%20!%3D%204.1.1%20and%20fixVersion%20!%3D%204.1.2%20and%20((priority%20%3D%20Critical)%20or%20(priority%20%3D%20Major))%20and%20((status%20%3D%20Closed)%20or%20(status%20%3D%20Open)%20or%20(status%20%3D%20Resolved))%20and%20createdDate%20%3C%20%272016%2F04%2F01%27

For the release notes we do pick and choose the issues we think users are likely to hit. We thought about adding the queries to the end so everyone could query the live JIRA, but it could be hard to find the things you are looking for in a larger pile. Created an issue to add this to Docs for the 4.1 release notes.
Thanks for reporting these.
-cihan