[MB-6714] ep-engine needs around 1 seconds to mark empty and closed checkpoint as persisted Created: 24/Sep/12  Updated: 26/Oct/12  Resolved: 25/Sep/12

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: None
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   
I'm pretty sure this was not a case up until very recently. Suspect this is caused by recent XDCR changes where we enabled closing of empty checkpoints. (BTW there's no need IMHO)

Steps to reproduce:

* grab everything latest and enable consistent views

* rebalance 1-2 or 1->4

* observe how slow it is

* in logs of ns_server you would see a lot of this:


[rebalance:info,2012-09-25T6:54:32.463,n_0@192.168.1.2:<0.18358.0>:janitor_agent:get_replication_persistence_checkpoint_id:470]default: Doing get_replication_persistence_checkpoint_id call for vbucket 127 on n_0@192.168.1.2
[ns_server:debug,2012-09-25T6:54:32.463,n_0@192.168.1.2:'janitor_agent-default':janitor_agent:handle_call:661]After creating new checkpoint here's what we have: {2,3,4}
[ns_server:debug,2012-09-25T6:54:32.464,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.464,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.475,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.475,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.486,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.486,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.497,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.497,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.508,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.508,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.519,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.519,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.550,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.550,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.581,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.581,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.612,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.612,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.643,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.643,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.674,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.674,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.775,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.775,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.876,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.876,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.977,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:32.977,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.078,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.078,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.179,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.179,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.280,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.280,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.381,n_2@192.168.1.2:<0.12504.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.381,n_3@192.168.1.2:<0.12508.0>:janitor_agent:do_wait_checkpoint_persisted:746]Waiting persisted checkpoint 3 on 127, now 2
[ns_server:debug,2012-09-25T6:54:33.482,n_3@192.168.1.2:'janitor_agent-default':janitor_agent:handle_info:677]Got done message from subprocess: <0.12508.0> (ok)

P.S.: given that it'll greatly slow down testing I'd raise this to blocker, but not sure people would understand me so set this to just critical.


 Comments   
Comment by Chiyoung Seo [ 25/Sep/12 ]
I understood what you mentioned above.

I pushed a couple of changes to address this issue:

http://review.couchbase.org/#/c/21067/
http://review.couchbase.org/#/c/21068/

Please reopen the bug if you still have the same issue.
Comment by Thuan Nguyen [ 25/Sep/12 ]
Integrated in github-ep-engine-2-0 #434 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/434/])
    MB-6714 Wake up the flusher after creating a checkpoint by force (Revision 63062710c9790aad0dabc9896f952de6f33cae3e)
MB-6714 Create a new checkpoint for the non-empty open checkpoint (Revision 31c007f6d5c73d96db242425b6fc299abc3ae9cf)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/ep_engine.cc

Chiyoung Seo :
Files :
* src/checkpoint.cc
Comment by Junyi Xie (Inactive) [ 26/Sep/12 ]
http://review.couchbase.org/#/c/21086/

Comment by Thuan Nguyen [ 27/Sep/12 ]
Integrated in github-ep-engine-2-0 #435 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/435/])
    MB-6714 Persisted chk id is included in checkpoint_creation reply (Revision 97ef3af84057703c2be262aa17aaa4704964dba0)

     Result = SUCCESS
Chiyoung Seo :
Files :
* src/ep_engine.cc
Comment by kzeller [ 26/Oct/12 ]
RN: Couchbase Server had experienced a one second delay to mark empty
and closed checkpoings on a persistence queue. Performance improvements
made to eliminate delay.
Generated at Fri Oct 31 22:10:11 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.