Fetching keys across all docs in a bucket from the CLI

Is there an efficient way to dump all the keys of a bucket into a file on the server itself? For example, could they be read directly from the index files (.fdb files)?
Couchbase version: Community Edition 6.0.0 build 1693
Number of keys in a bucket: ~150 Million

Use cbq to execute the query:

select raw meta().id from bucketName

I tried cbq, but it's not able to return all the document keys. It returned ~1 million keys and then failed with this error:

"errors": [
    {
        "code": 5000,
        "msg": " Index scan timed out - cause:  Index scan timed out"
    }
]

Do you have a solution for this?

Are you using a PRIMARY INDEX to support the statement or have you defined a secondary index and added a suitable inclusive condition to the statement?

You can try adding an ORDER BY and use pagination to avoid the index timeout:

SELECT RAW meta().id FROM bucketName ORDER BY meta().id OFFSET 0 LIMIT 500000

(And of course repeat increasing OFFSET by the LIMIT amount each time.)

Depending on your platform, if you have CURL installed you could script this, e.g.

#!/usr/bin/bash
# Fetch the keys in pages of 500000, writing one output file per page.
for ((i=0;i<150000000;i+=500000))
do
  curl -su user:pass http://localhost:8093/query/service --data-urlencode "statement=SELECT RAW meta().id FROM bucketName ORDER BY meta().id OFFSET ${i} LIMIT 500000" --output "keys_${i}.json"
done

Of course you’d have to stitch the multiple result sets together.
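As a rough sketch of that stitching step, assuming jq is installed and the keys_<offset>.json files written by the loop above are in the current directory, something like the following would produce one key per line. The function names stitch_keys and extract_keys and the output file name are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: merges keys_<offset>.json pages into one newline-delimited file.
# Assumes jq is installed and each file holds a standard query service response.

# Print every entry of the "results" array in one response file, one per line.
extract_keys() {
  jq -r '.results[]' "$1"
}

# stitch_keys TOTAL STEP OUTFILE
# Visits the pages in the same numeric order the download loop produced them.
stitch_keys() {
  local total=$1 step=$2 out=$3 i
  for ((i = 0; i < total; i += step)); do
    [[ -s keys_${i}.json ]] || continue   # skip missing or empty pages
    extract_keys "keys_${i}.json"
  done > "$out"
}
```

With the numbers used in the loop above, the call would be `stitch_keys 150000000 500000 all_keys.txt`. Walking the offsets numerically avoids relying on shell glob order, which would sort keys_1000000.json before keys_500000.json.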
HTH.

Thanks for the quick reply.

Yes, this is one of the approaches we could go with, but it has one drawback: the time taken by each iteration keeps increasing as the offset grows, so it is not suitable for a large dataset. The reason lies in how OFFSET queries are processed: the entries before the offset still have to be scanned and skipped.

A better approach would be to filter on the last key seen instead of using OFFSET:

key = ''
repeat until result is empty
  result = SELECT RAW meta().id FROM bucket WHERE meta().id > $key ORDER BY meta().id LIMIT 1000000;
  key = last element of result
done
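For what it's worth, the loop above could be scripted roughly as follows. This is only a sketch: it assumes curl and jq are installed, that a primary index exists on the bucket, and that no key contains a newline. The URL, credentials, bucket name, page size, and the function names fetch_page and dump_all_keys are placeholders. The last key is passed as a named query parameter ($last) rather than spliced into the statement, and an ORDER BY is included so that "the last key of the page" is well defined.

```bash
#!/usr/bin/env bash
# Keyset pagination sketch: page through keys with meta().id > $last
# instead of OFFSET. All settings below are placeholder assumptions.
QUERY_URL="${QUERY_URL:-http://localhost:8093/query/service}"
CB_AUTH="${CB_AUTH:-user:pass}"
BUCKET="${BUCKET:-bucketName}"
PAGE_SIZE="${PAGE_SIZE:-100000}"

# Print one page of keys (one per line) that sort strictly after "$1".
fetch_page() {
  local last_key=$1
  curl -su "$CB_AUTH" "$QUERY_URL" \
    --data-urlencode "statement=SELECT RAW meta().id FROM \`${BUCKET}\` WHERE meta().id > \$last ORDER BY meta().id LIMIT ${PAGE_SIZE}" \
    --data-urlencode "\$last=$(jq -n --arg k "$last_key" '$k')" \
    | jq -r '.results[]'
}

# Print every key, requesting a new page after the last key seen so far.
dump_all_keys() {
  local last_key="" page
  while page=$(fetch_page "$last_key") && [[ -n $page ]]; do
    printf '%s\n' "$page"
    last_key=${page##*$'\n'}   # last line of the page just printed
  done
}
```

After sourcing the script, `dump_all_keys > keys.txt` would write all keys to a single file, so there is nothing to stitch together afterwards. Note that the loop also stops when a request fails and returns no results, so the output should be checked against the expected key count.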

Check this reply

I’m aware that the cost of seeking to the OFFSET grows as the offset gets larger, but hopefully your system can handle seeking to the last large document block. Nevertheless, using a filter is indeed more efficient, just slightly more cumbersome to illustrate with a short generic script snippet. :)

https://www.google.com/search?q=couchbase+cbq+timeout

According to Google, Couchbase's cbq has a --timeout parameter.

There are also other solutions, such as using a map-reduce view (although views are deprecated).
