[MB-6084] ep-engine is not marking checkpoint as persisted in stats Created: 01/Aug/12  Updated: 08/Aug/12  Resolved: 08/Aug/12

Status: Resolved
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: None
Fix Version/s: None
Security Level: Public

Type: Bug Priority: Critical
Reporter: Aleksey Kondratenko Assignee: Chiyoung Seo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: GZip Archive diag1.gz     GZip Archive diag.gz    

 Description   
When running my wip commit that ensures things are persisted on replica and new master before initiating takeover (http://review.couchbase.org/18798) I sometimes get rebalance 'stuck'.

Seemingly because items are replicated & persisted (verified by looking at .couch files), but affected checkpoint is not marked as persisted.

In this particular case it happened to vbucket 13.

What my change is doing is I'm creating checkpoint (using create checkpoint command) after establishing replication stream. And then I'm waiting until checkpoint prior to that is persisted. But in this case stats tell me that checkpoint 0 is persisted instead of 2.

Relevant stats:


root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12000 raw 'checkpoint 13'
 vb_13:checkpoint_extension: false
 vb_13:eq_tapq:replication_building_13_'n_1@10.17.2.163':cursor_checkpoint_id: 3
 vb_13:last_closed_checkpoint_id: 2
 vb_13:num_checkpoint_items: 1
 vb_13:num_checkpoints: 1
 vb_13:num_items_for_persistence: 0
 vb_13:num_open_checkpoint_items: 0
 vb_13:num_tap_cursors: 1
 vb_13:open_checkpoint_id: 3
 vb_13:persisted_checkpoint_id: 2
 vb_13:state: active
root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12002 raw 'checkpoint 13'
 vb_13:checkpoint_extension: false
 vb_13:last_closed_checkpoint_id: 2
 vb_13:num_checkpoint_items: 1
 vb_13:num_checkpoints: 1
 vb_13:num_items_for_persistence: 0
 vb_13:num_open_checkpoint_items: 0
 vb_13:num_tap_cursors: 0
 vb_13:open_checkpoint_id: 3
 vb_13:persisted_checkpoint_id: 0
 vb_13:state: replica
root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12002 raw 'tap'
 ep_tap_ack_grace_period: 300
 ep_tap_ack_interval: 1000
 ep_tap_ack_window_size: 10
 ep_tap_backoff_period: 5
 ep_tap_bg_fetch_requeued: 0
 ep_tap_bg_fetched: 0
 ep_tap_bg_max_pending: 500
 ep_tap_count: 2
 ep_tap_deletes: 0
 ep_tap_fg_fetched: 0
 ep_tap_noop_interval: 20
 ep_tap_queue_backfillremaining: 0
 ep_tap_queue_backoff: 0
 ep_tap_queue_drain: 0
 ep_tap_queue_fill: 0
 ep_tap_queue_itemondisk: 0
 ep_tap_throttle_queue_cap: 1000000
 ep_tap_throttle_threshold: 90
 ep_tap_throttled: 0
 ep_tap_total_backlog_size: 0
 ep_tap_total_fetched: 0
 ep_tap_total_queue: 0
 eq_tapq:anon_1:connected: true
 eq_tapq:anon_1:created: 14
 eq_tapq:anon_1:num_checkpoint_end: 0
 eq_tapq:anon_1:num_checkpoint_end_failed: 0
 eq_tapq:anon_1:num_checkpoint_start: 13
 eq_tapq:anon_1:num_checkpoint_start_failed: 0
 eq_tapq:anon_1:num_delete: 0
 eq_tapq:anon_1:num_delete_failed: 0
 eq_tapq:anon_1:num_flush: 0
 eq_tapq:anon_1:num_flush_failed: 0
 eq_tapq:anon_1:num_mutation: 0
 eq_tapq:anon_1:num_mutation_failed: 0
 eq_tapq:anon_1:num_opaque: 2
 eq_tapq:anon_1:num_opaque_failed: 0
 eq_tapq:anon_1:num_unknown: 0
 eq_tapq:anon_1:num_vbucket_set: 0
 eq_tapq:anon_1:num_vbucket_set_failed: 0
 eq_tapq:anon_1:pending_disconnect: false
 eq_tapq:anon_1:reserved: 0
 eq_tapq:anon_1:supports_ack: true
 eq_tapq:anon_1:type: consumer
 eq_tapq:anon_2:connected: true
 eq_tapq:anon_2:created: 422
 eq_tapq:anon_2:num_checkpoint_end: 0
 eq_tapq:anon_2:num_checkpoint_end_failed: 0
 eq_tapq:anon_2:num_checkpoint_start: 1
 eq_tapq:anon_2:num_checkpoint_start_failed: 0
 eq_tapq:anon_2:num_delete: 0
 eq_tapq:anon_2:num_delete_failed: 0
 eq_tapq:anon_2:num_flush: 0
 eq_tapq:anon_2:num_flush_failed: 0
 eq_tapq:anon_2:num_mutation: 119
 eq_tapq:anon_2:num_mutation_failed: 0
 eq_tapq:anon_2:num_opaque: 4
 eq_tapq:anon_2:num_opaque_failed: 0
 eq_tapq:anon_2:num_unknown: 0
 eq_tapq:anon_2:num_vbucket_set: 0
 eq_tapq:anon_2:num_vbucket_set_failed: 0
 eq_tapq:anon_2:pending_disconnect: false
 eq_tapq:anon_2:reserved: 0
 eq_tapq:anon_2:supports_ack: true
 eq_tapq:anon_2:type: consumer
root@beta:~/src/altoros/moxi/ns_server/couch/0/default#
root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12000 raw 'tap'
 ep_tap_ack_grace_period: 300
 ep_tap_ack_interval: 1000
 ep_tap_ack_window_size: 10
 ep_tap_backoff_period: 5
 ep_tap_bg_fetch_requeued: 0
 ep_tap_bg_fetched: 0
 ep_tap_bg_max_pending: 500
 ep_tap_count: 2
 ep_tap_deletes: 0
 ep_tap_fg_fetched: 119
 ep_tap_noop_interval: 20
 ep_tap_queue_backfillremaining: 0
 ep_tap_queue_backoff: 0
 ep_tap_queue_drain: 119
 ep_tap_queue_fill: 0
 ep_tap_queue_itemondisk: 0
 ep_tap_throttle_queue_cap: 1000000
 ep_tap_throttle_threshold: 90
 ep_tap_throttled: 0
 ep_tap_total_backlog_size: 0
 ep_tap_total_fetched: 139
 ep_tap_total_queue: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_log_size: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_seqno: 129
 eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_window_full: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':backfill_completed: true
 eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_jobs_completed: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_jobs_issued: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_result_size: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':connected: true
 eq_tapq:replication_building_13_'n_1@10.17.2.163':created: 422
 eq_tapq:replication_building_13_'n_1@10.17.2.163':flags: 85 (ack,backfill,vblist,checkpoints)
 eq_tapq:replication_building_13_'n_1@10.17.2.163':has_queued_item: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':idle: true
 eq_tapq:replication_building_13_'n_1@10.17.2.163':paused: 1
 eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_backfill: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_disconnect: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_disk_backfill: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen_high_pri: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen_low_pri: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_backfillremaining: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_backoff: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_drain: 119
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_fill: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_itemondisk: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_memory: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':rec_fetched: 124
 eq_tapq:replication_building_13_'n_1@10.17.2.163':recv_ack_seqno: 128
 eq_tapq:replication_building_13_'n_1@10.17.2.163':reserved: 1
 eq_tapq:replication_building_13_'n_1@10.17.2.163':seqno_ack_requested: 128
 eq_tapq:replication_building_13_'n_1@10.17.2.163':supports_ack: true
 eq_tapq:replication_building_13_'n_1@10.17.2.163':suspended: false
 eq_tapq:replication_building_13_'n_1@10.17.2.163':total_backlog_size: 0
 eq_tapq:replication_building_13_'n_1@10.17.2.163':total_noops: 17
 eq_tapq:replication_building_13_'n_1@10.17.2.163':type: producer
 eq_tapq:replication_building_13_'n_1@10.17.2.163':vb_filter: { 13 }
 eq_tapq:replication_building_13_'n_1@10.17.2.163':vb_filters: 1
 eq_tapq:replication_n_1@10.17.2.163:ack_log_size: 0
 eq_tapq:replication_n_1@10.17.2.163:ack_seqno: 18
 eq_tapq:replication_n_1@10.17.2.163:ack_window_full: false
 eq_tapq:replication_n_1@10.17.2.163:backfill_completed: true
 eq_tapq:replication_n_1@10.17.2.163:bg_jobs_completed: 0
 eq_tapq:replication_n_1@10.17.2.163:bg_jobs_issued: 0
 eq_tapq:replication_n_1@10.17.2.163:bg_result_size: 0
 eq_tapq:replication_n_1@10.17.2.163:connected: true
 eq_tapq:replication_n_1@10.17.2.163:created: 14
 eq_tapq:replication_n_1@10.17.2.163:flags: 85 (ack,backfill,vblist,checkpoints)
 eq_tapq:replication_n_1@10.17.2.163:has_queued_item: false
 eq_tapq:replication_n_1@10.17.2.163:idle: true
 eq_tapq:replication_n_1@10.17.2.163:paused: 1
 eq_tapq:replication_n_1@10.17.2.163:pending_backfill: false
 eq_tapq:replication_n_1@10.17.2.163:pending_disconnect: false
 eq_tapq:replication_n_1@10.17.2.163:pending_disk_backfill: false
 eq_tapq:replication_n_1@10.17.2.163:qlen: 0
 eq_tapq:replication_n_1@10.17.2.163:qlen_high_pri: 0
 eq_tapq:replication_n_1@10.17.2.163:qlen_low_pri: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_backfillremaining: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_backoff: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_drain: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_fill: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_itemondisk: 0
 eq_tapq:replication_n_1@10.17.2.163:queue_memory: 0
 eq_tapq:replication_n_1@10.17.2.163:rec_fetched: 15
 eq_tapq:replication_n_1@10.17.2.163:recv_ack_seqno: 17
 eq_tapq:replication_n_1@10.17.2.163:reserved: 1
 eq_tapq:replication_n_1@10.17.2.163:seqno_ack_requested: 17
 eq_tapq:replication_n_1@10.17.2.163:supports_ack: true
 eq_tapq:replication_n_1@10.17.2.163:suspended: false
 eq_tapq:replication_n_1@10.17.2.163:total_backlog_size: 0
 eq_tapq:replication_n_1@10.17.2.163:total_noops: 36
 eq_tapq:replication_n_1@10.17.2.163:type: producer
 eq_tapq:replication_n_1@10.17.2.163:vb_filter: { [0,12] }
 eq_tapq:replication_n_1@10.17.2.163:vb_filters: 13


 Comments   
Comment by Peter Wansch (Inactive) [ 02/Aug/12 ]
Hi Chiyoung, can you look into this symptom which is required for the consistent views feature complete? Thank you.
Comment by Chiyoung Seo [ 03/Aug/12 ]
I don't think I should fix this issue by this sprint. Persistence and checkpoint are working fine at this time. I don't care any new issues introduced by external new features at this time. Please forward this comment to the entire team if you guys want to broadcast.
Comment by Chiyoung Seo [ 08/Aug/12 ]
http://review.couchbase.org/#change,19355
Comment by Thuan Nguyen [ 08/Aug/12 ]
Integrated in github-ep-engine-2-0 #389 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/389/])
    MB-6084 Reset checkpoint cursors after receiving backfill streams (Revision bb22966b48eeb2e16256cf3b40ea379ab8e72a00)

     Result = SUCCESS
Chiyoung Seo :
Files :
* checkpoint.cc
Generated at Sat Apr 19 23:15:32 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.