Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 1.6.5.1
Fix Version/s: Backlog
Component/s: couchbase-bucket
Security Level: Public
Labels: None
Description
I loaded a two node cluster with 13.7M items.
Replication is caught up on both nodes (total items is double curr_items), but on one node (10.2.1.13) the bg_backlog_size keeps growing: it started at around 100K, reached 4M at some point, and went down to ~2M. There is no traffic on that system:
Sharon-Barrs-MacBook-Pro:scripts sharonbarr$ ./stats 10.2.1.13:11210 tap
ep_tap_ack_grace_period: 300
ep_tap_ack_interval: 1000
ep_tap_ack_window_size: 10
ep_tap_backoff_period: 1
ep_tap_bg_fetch_requeued: 0
ep_tap_bg_fetched: 7047333
ep_tap_bg_max_pending: 500
ep_tap_count: 1
ep_tap_deletes: 0
ep_tap_fg_fetched: 26839455
ep_tap_keepalive: 300
ep_tap_noop_interval: 20
ep_tap_throttled: 2003363
ep_tap_total_fetched: 33891978
ep_tap_total_queue: 1867398
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:ack_log_size: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:ack_playback_size: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:ack_seqno: 26871351
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:ack_window_full: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_backlog_size: 1867398
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_jobs_completed: 6636686
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_jobs_issued: 6636686
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_queue_size: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_queued: 6636686
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_result_size: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_results: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:bg_wait_for_results: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:complete: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:connected: true
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:disconnects: 187
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:empty: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:flags: 20 (ack,vblist)
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:has_item: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:has_queued_item: true
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:idle: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:num_tap_nack: 14062256
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:num_tap_tmpfail_survivors: 14062256
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:paused: 1
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:pending_backfill: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:pending_disconnect: false
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:pending_disk_backfill: true
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:qlen: 1867398
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:qlen_high_pri: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:qlen_low_pri: 0
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:rec_fetched: 20234665
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:reconnects: 187
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:recv_ack_seqno: 26871350
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:suspended: true
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:vb_filter: { [512,1023] }
eq_tapq:r-ns_1@10.2.1.14-ns_1@10.2.1.13-1299541139.347474:vb_filters: 512
Sharon-Barrs-MacBook-Pro:scripts sharonbarr$ ./stats 10.2.1.14:11210 all | egrep "curr|ep_flush|mem|queue"
curr_connections: 27
curr_items: 6855859
curr_items_tot: 13692060
ep_dbname: /var/opt/membase/1.6.5.2r/data/ns_1/default
ep_flush_duration: 354
ep_flush_duration_highwat: 365
ep_flush_duration_total: 7014
ep_flush_preempts: 0
ep_flusher_state: running
ep_flusher_todo: 108003
ep_mem_high_wat: 24499716096
ep_mem_low_wat: 19599772876
ep_queue_age_cap: 900
ep_queue_size: 391290
ep_store_max_concurrency: 10
ep_tap_bg_fetch_requeued: 0
ep_total_enqueued: 24570524
mem_used: 19602399852
Sharon-Barrs-MacBook-Pro:scripts sharonbarr$
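The "total items is double curr_items" check can be done mechanically on the pasted output. The following is a minimal sketch, not part of the membase tooling: `parse_stats` and `replica_gap` are hypothetical helper names, and it assumes one replica and an evenly balanced two-node cluster, so that curr_items_tot should be close to twice curr_items on each node.

```python
def parse_stats(text):
    """Parse 'key: value' lines as printed by the stats script into a dict.

    Splits on the last ': ' so that keys which themselves contain colons
    (for example the eq_tapq:...:qlen entries) are kept whole.
    Lines without ': ' (such as shell prompts) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(": ")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def replica_gap(stats):
    """Return 2 * curr_items - curr_items_tot.

    Under the balanced, one-replica assumption this is a rough estimate
    of how many items are still missing from the replica set.
    """
    return 2 * int(stats["curr_items"]) - int(stats["curr_items_tot"])


# Values copied from the 10.2.1.14 output above.
sample = """curr_items: 6855859
curr_items_tot: 13692060"""

stats = parse_stats(sample)
gap = replica_gap(stats)
print(gap)
```

With the numbers above the difference is small relative to the item count (well under 1%), which is consistent with replication being essentially caught up on that node.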
The backlog queue drained, with 17M tmp failures. This could be due to a slow write rate on the receiver side.
It would be good to verify why the queue size went up at the beginning (maybe because we were still in the backfill).
We also discussed increasing the sleep time when getting a tmp fail, but that needs more discussion.
Decreasing the severity for now.
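To make the "increase the sleep time on tmp fail" idea concrete, here is one possible shape for it: a capped exponential backoff. This is only an illustrative sketch, not the ep-engine implementation; `backoff_schedule` and its `base`, `factor`, and `cap` parameters are hypothetical, and the units are assumed to match whatever ep_tap_backoff_period uses.

```python
def backoff_schedule(base, factor, cap, failures):
    """Return the sleep time to use before each retry.

    base     -- sleep after the first temporary failure
    factor   -- multiplier applied after every further failure
    cap      -- upper bound on the sleep time
    failures -- number of consecutive temporary failures to plan for
    """
    delays = []
    delay = min(base, cap)
    for _ in range(failures):
        delays.append(delay)
        delay = min(delay * factor, cap)
    return delays


# Start from the current fixed backoff of 1, double on each tmp fail,
# and never sleep longer than 8.
schedule = backoff_schedule(base=1.0, factor=2.0, cap=8.0, failures=5)
print(schedule)
```

The cap keeps a long run of tmp fails from stalling replication indefinitely, while the growth reduces the number of rejected sends when the receiver is slow to persist. A real change would also need to reset the delay after a successful send.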