Periodic OPS drop on Couchbase Server 3.0.1 every hour

An OPS drop occurs periodically between the 44th and 49th minute of every hour.
Is there any configuration for a periodic job in Couchbase Server?

I am trying to find the reason for this periodic OPS drop at 44–50 minutes past every hour. What I have ruled out so far:

  • Network traffic (1Gbps)? => probably not (the Couchbase 2.2 server is fine)
  • Garbage collection on the WebApi (which uses the Couchbase client)? => not periodic
  • A periodic system batch job? => there is no crontab entry

Observed drop windows:

2014.12.19 09:44 ~ 09:50
2014.12.21 11:44 ~ 11:50

Couchbase 2.2 is fine (no OPS drop occurred) during the same window:
2014.12.21 11:44 ~ 11:50

How often is compaction running? Also, are you setting TTLs on documents? The Expiry Pager runs once per hour by default, but typically it doesn't cause a degradation in performance.
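For reference, a minimal sketch of what "setting a TTL" looks like with the Java SDK 2.0.x (host, bucket name, and document ID here are hypothetical):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class TtlExample {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("127.0.0.1"); // hypothetical host
        Bucket bucket = cluster.openBucket("default");          // hypothetical bucket

        // A non-zero expiry (in seconds, here 1 hour) marks the document for
        // expiration; the hourly Expiry Pager is what actually purges expired keys.
        JsonDocument doc = JsonDocument.create("user::123", 3600,
                JsonObject.create().put("name", "test"));
        bucket.upsert(doc);

        cluster.disconnect();
    }
}
```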

Thx for your answer, @tgreenstein

I am not setting TTL’s on documents and I don’t use auto compaction option. :frowning:

Has anyone else experienced this issue on 3.0.1?

At 44 and 48 minutes past every hour, many timeouts occur on the Couchbase clients (see the sketch after the list below).
I use the Couchbase Java SDK v2.0.2, and I have two clusters whose data I join on the client side.

  • There’s no periodic throughput down like this on couchbase server 2.2
  • And there’s no batch job on all nodes.
  • It occurs 44min and 48min every hour exactly no
    matter with gc and request loads (it acts like alarm ;()
  • There’s no network effects like high traffic periodically.
  • I never set TTL on documents and never configure auto compaction option.
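For context, a minimal sketch of how my client-side lookup across the two clusters roughly works, assuming hypothetical hostnames, bucket names, and keys. It raises the KV timeout above the default and retries once, which is how the hourly stall surfaces as retries or timeouts (SDK 2.0.x API):

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class TwoClusterLookup {
    public static void main(String[] args) {
        // Raise the default 2.5s KV timeout so the periodic stall shows up
        // as slow responses instead of immediate timeouts.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .kvTimeout(10000) // milliseconds
                .build();

        // Two separate clusters, joined at the application level.
        Cluster clusterA = CouchbaseCluster.create(env, "cluster-a-node1"); // hypothetical
        Cluster clusterB = CouchbaseCluster.create(env, "cluster-b-node1"); // hypothetical
        Bucket bucketA = clusterA.openBucket("bucketA");
        Bucket bucketB = clusterB.openBucket("bucketB");

        // Fetch from cluster A, then use one of its fields to look up cluster B;
        // retry(1) retries the async get once if it fails.
        JsonDocument left = bucketA.async().get("order::1").retry(1)
                .toBlocking().singleOrDefault(null);
        if (left != null) {
            String rightKey = "customer::" + left.content().getString("customerId");
            JsonDocument right = bucketB.async().get(rightKey).retry(1)
                    .toBlocking().singleOrDefault(null);
            System.out.println(left + " / " + right);
        }

        clusterA.disconnect();
        clusterB.disconnect();
    }
}
```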

Given everything you’re saying, it sounds like an issue we’ve not seen yet. Can you file one against Couchbase Server on our issue tracker? Please include a cbcollect_info for the nodes, which can be generated from the console.


@ingenthr

Happy new year !

I’ve filed this issue on http://issues.couchbase.com/browse/MB-13032.

:smile: I’m so sorry too late for update

I’ve found 2 factors about this issue.

  1. data size
  2. node count

One more thing: I assume that key length is also related to the periodic OPS drop.

I tested by loading increasing amounts of data into the cluster and recorded the failure counts at each size (a sketch of the test loop follows the results).

100 million - no failures
400 million - no failures, but some retries
850 million - many failures (10k)
==> added 8 nodes (16 nodes total)
850 million - only 8 failures (failures greatly reduced)
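A minimal sketch of this kind of load test (host, bucket name, key scheme, and document shape are all hypothetical; failure counting is simplified):

```java
import java.util.concurrent.TimeUnit;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonObject;

public class LoadTest {
    public static void main(String[] args) {
        Cluster cluster = CouchbaseCluster.create("node1"); // hypothetical host
        Bucket bucket = cluster.openBucket("testbucket");   // hypothetical bucket

        long total = 100_000_000L; // grow per test run: 100M, 400M, 850M ...
        long failures = 0;

        for (long i = 0; i < total; i++) {
            String key = "doc::" + i; // key length also varied in later tests
            try {
                bucket.upsert(JsonDocument.create(key,
                        JsonObject.create().put("seq", i)),
                        2500, TimeUnit.MILLISECONDS);
            } catch (RuntimeException e) {
                // Timeouts surface as RuntimeExceptions from the blocking API.
                failures++;
            }
        }
        System.out.println("failures: " + failures);
        cluster.disconnect();
    }
}
```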

=========
Test Information

*** Server Information ***
nodes: 8

*** Node spec ***
OS: Linux (2.6.32-358.6.2.el6.x86_64), 64-bit
CPU: Intel® Xeon® E5-2420 @ 1.90GHz (6 cores) * 2
RAM: 128GB (DDR3 1333MHz, 16384MB * 8)
DISK: LSI MegaRAID SAS PCI Express ROMB [F/W: 3.340.05-2939] (1024MB cache), 299.0GB * 4

*** Bucket spec ***
RAM quota: 858GB
data size: 1.27 billion documents (1,270,000,000), 284GB, all data resident in memory
replicas: 1
disk I/O optimization: Low
auto-compaction: OFF
flush: enabled