Client at 100% CPU usage when using get_multi()

Hi there,

We’re running a python worker inside k8s, and during our load test we found that our machine would spin up to 100%CPU and remain there unresponsive until we killed it. On further inspection, we noticed that the issue was always being caused by a get_multi() call. We isolated that call as follows and it will always crash after a couple minutes (sometimes a few seconds) of repeat calls:

collection_connection.get_multi(
    keys=['doc001', 'doc002', .....  , 'doc079', 'doc080'],
    GetMultiOptions(
        timeout=timedelta(seconds=15)
    )
)

We are using the python library version 4.0.3 (we needed v4+ for collection name length > 30 characters) and Couchbase Community version 7.0.2

We are confronted with C-level errors like “malloc” so I assume the CPU spike is due to corrupted memory, however we are running just a python process in AWS on a t3.small instance with no memory limits (should be up to 2GB of memory to use up which it doesn’t even come close to). I appreciate this may be a vague question considering you don’t know my exact setup, but does anyone have any idea if this is a known issue?

Thanks,
Michael

For reference, our connection is created as follows:

cluster = Cluster(
    'couchbases://192.168.123.123',
    ClusterOptions(
        PasswordAuthenticator(
            'user',
            'password'
        ),
        timeout_options=ClusterTimeoutOptions(
            kv_timeout=timedelta(10)
        )
    )
)
bucket_connection = cluster.bucket('bucket_name')
scope_connection = bucket_connection.scope('scope_name')
collection_connection = scope_connection.collection('collection_name')

We have attempted caching connections at bucket and collection level, and recreating the connection before each call, neither seems to make a difference.

It seems our cache size is expanding. There’s no reason why the get / get_multi calls would keep stuff in memory, right?

Can you provide the smallest piece of code the reproduces the issue?

Hi @mreiche My original post is what triggers it. Running that get_multi() call repeatedly. There is no other code run.

We are using the python library version 4.0.3

I don’t know of any memory fixes after 4.0.3, but please use the latest version - 4.0.5

It seems our cache size is expanding.

What cache?

does anyone have any idea if this is a known issue?

You can search for known issues at issues.couchbase.com. You may need to create an account.

https://issues.couchbase.com/issues/?jql=project%3DPYCBC%20%26%26%20affectedVersion%20>%3D%204.0.3%20ORDER%20BY%20created%20DESC

My original post is what triggers it. Running that get_multi() call repeatedly.

Yes. But how many times? Are exceptions being thrown/caught/ignored?

Sometimes it’s useful to begin with 1 call, then incrementally increase the number of iterations in order to observe the changing behavior.

We are confronted with C-level errors like “malloc”

How large are the documents? The call requests 81 documents. With the maximum document size being 20m, that would make 1.62 GB. It would be useful to see those C-level errors. Maybe it’s just running out of memory.

I tried reproducing the issue with this. It completed in about 8 seconds without any problems.

# populate with 
# cbc-pillowfight  --password password --username Administrator  --num-items 100 --num-threads 1 --min-size 64 --max-size 64  --json --spec "couchbase://127.0.0.1/my_bucket" --populate-only
import time
from couchbase.cluster import Cluster, ClusterOptions
from couchbase.auth import PasswordAuthenticator
from couchbase.options import GetMultiOptions
from datetime import datetime, timedelta

# get a reference to our cluster
cluster = Cluster('couchbase://127.0.0.1', ClusterOptions(
  PasswordAuthenticator('Administrator', 'password')))
bucket_connection = cluster.bucket('my_bucket')
scope_connection = bucket_connection.scope('_default')
collection_connection = scope_connection.collection('_default')

d='000000000000000000';
keys = ['a'] * 81;
for i in range(10, 91):
  keys[i-10]=d+str(i);

t = time.time()
for i in range(0,1000):
    result = collection_connection.get_multi(
        keys=keys,
        options = GetMultiOptions(
            timeout=timedelta(seconds=15)
        )
    )

print ( (time.time()-t) );

@mreiche Thanks for trying that. I’ve checked within an individual python instance and I also can run this without issue. It must be to do with how we’re threading our worker, maybe the Couchbase++ process underneath is being shared across multiple python threads somehow. I’ll check our orchestration to see if I find any issues, and I’ll close this topic as soon as I find the issue.

@mreiche , we have done the LoadTest for our backend which using python couchbase sdk (version:4.0.3).
While we were running the LoadTest concurrent agents by less than 5 ones, that was okay, but while we reached to 10 concurrent ones (RPS: +/- 10), we met “segment failure” issue, after this C++ error occurred, our python backend crashed.
This is the following infomration for your reference:
A. each request, we were quering 40 document, and total size is 50Kb via API “get_multi()”
B. there is no any special characters in our document
C. we also tried the SDK version 4.1.0, same issue.
D. we also checked our memory usage, we allocated our backend runtime as 1Gb, it never used more than 20%.
The following are:
A. the crash traceback during loadtest:

B. this is our OS version:
NAME=“Amazon Linux”
VERSION=“2”
ID=“amzn”
ID_LIKE=“centos rhel fedora”
VERSION_ID=“2”
PRETTY_NAME=“Amazon Linux 2”
ANSI_COLOR=“0;33”
CPE_NAME=“cpe:2.3:o:amazon:amazon_linux:2”
HOME_URL=“https://amazonlinux.com/

This doesn’t seem related to the original post. Can you open a case with customer support for the issue you are having? You can also search issues.couchbase.com yourself. I cannot read the screenshot. When you open the customer support case, provide text instead of a screen shot.