SGW queries timing out in CB Server

Hey there,

currently our sync-gateway is failing to sync data to mobile couchbase lite clients. We believe that’s due to internal queries timing out in couchbase server, since our sg_warn.log is filled with dozens of lines like the following whenever we try to sync.

2024-06-04T08:32:00.259Z [WRN] c:[1841182f] Error retrieving changes for channel "<ud>dn_f092c223-d485-4fe3-9d17-70cdcd4e51be</ud>": Timeout performing Query -- db.(*Database).buildRevokedFeed.func1() at changes.go:223

We can also observe that all our CPU cores on all machines go up to about 95% and stay there for about 30-60s when syncing.

It would be amazing if someone could give me ideas on how to debug this problem since I’m hitting a wall here with my limited knowledge of CB internals. I’ll drop my limited insights below and would be greatful if someone could point us in the right direction. It would probably be helpful if someone could tell us the exact query that is used to query such a channel so we can debug the query itself.

Our test setup:

  • 3 Machines with CB Server in a cluster, 4 Cores / 8GB ram each
  • Fairly small bucket with about 140k docs
  • One sync-gateway (originally 3) to help with debugging

Additional infos:

  • Longpolling changes with the same affected users seems to work without problems
  • Using the SGW API does seem to work without problems (fetching docs etc.)
  • User with access to only one document is sometimes able to sync (I assume when there is no timeout) but also pushes the cluster to nearly 100% workload and sync often fails

Only observable queries in the CB query monitor interface are statements like

PREPARE SELECT [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][1] AS seq, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][2] AS rRev, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][3] AS rDel, meta(my-bucket).xattrs._sync.rev AS rev, meta(my-bucket).xattrs._sync.flags AS flags, META(my-bucket).id AS id FROM my-bucket USE INDEX (sg_channels_x1) UNNEST OBJECT_PAIRS(meta(my-bucket).xattrs._sync.channels) AS op WHERE ([op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] BETWEEN [$channelName, $startSeq] AND [$channelName, $endSeq]) ORDER BY [op.name, LEAST(meta(my-bucket`).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] LIMIT 5000

I tried to use such a query, but with the PREPARE it’s done in 5 ms and as normal SELECT with a channel from the (timeout) logs above and 1 and our current sequence number as $startSeq and $endSeq, it’s done in about 7ms. So no timeouts there.

Also there are dozens of other lines in the query monitor getting attributed to the user “local:sync-gateway” in which the query column is empty. When I click Edit on one of those, I get an empty query window. Not sure if UI bug though. I’m assuming those are also PREPARE SELECT statements.

All sync-gateway related indexes seem to be live and well (sg_access_x1, sg_allDocs_x1, sg_channels_x1, sg_roleAccess_x1, sg_syncDocs_x1, sg_tombstones_x1)

I would be very grateful if someone could tell me whats going on there. The setup is pretty fresh, although we did some testing with rebalancing etc. Settings should mostly be standard.
Could only include one picture, sorry for the lack of more screenshots.

Best
Marc

1 Like

Some additional infos:

  • SG and CB Server are running on the latest versions (3.1 & 7.6).
  • I couldn’t find any indication of a problem in CBs debug.log, query.log, error.log etc
  • Restarting SG or CB servers didn’t help

Additional log lines found in sg_warn:

2024-06-03T13:31:10.494Z [WRN] c:[4481e8b3] Error retrieving changes for channel “C_DOCUMENT_12a4c6b1-49d0-447d-9f14-04569171bc72”: Post “https://10.38.1.3:18093/query/service”: stream error: stream ID 2907; PROTOCOL_ERROR | {“statement”:“SELECT [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][1] AS seq, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][2] AS rRev, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][3] AS rDel, meta(my-bucket).xattrs._sync.rev AS rev, meta(my-bucket).xattrs._sync.flags AS flags, META(my-bucket).id AS id FROM my-bucket USE INDEX (sg_channels_x1) UNNEST OBJECT_PAIRS(meta(my-bucket).xattrs._sync.channels) AS op WHERE ([op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] BETWEEN [$channelName, $startSeq] AND [$channelName, $endSeq]) ORDER BY [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] LIMIT 5000”,“client_context_id”:“6d8607c2-fdb2-4159-9ab0-96ee6fb12aad”} – db.(*Database).changesFeed.func1() at changes.go:407

2024-06-03T13:31:10.491Z [WRN] c:[4481e8b3] Error retrieving changes for channel “C_DOCUMENT_64d26059-ba16-42e6-816a-dd29cc429e55”: Post “https://10.38.1.1:18093/query/service”: stream error: stream ID 2881; PROTOCOL_ERROR | {“statement”:“SELECT [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][1] AS seq, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][2] AS rRev, [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)][3] AS rDel, meta(my-bucket).xattrs._sync.rev AS rev, meta(my-bucket).xattrs._sync.flags AS flags, META(my-bucket).id AS id FROM my-bucket USE INDEX (sg_channels_x1) UNNEST OBJECT_PAIRS(meta(my-bucket).xattrs._sync.channels) AS op WHERE ([op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] BETWEEN [$channelName, $startSeq] AND [$channelName, $endSeq]) ORDER BY [op.name, LEAST(meta(my-bucket).xattrs._sync.sequence, op.val.seq),IFMISSING(op.val.rev,null),IFMISSING(op.val.del,null)] LIMIT 5000”,“client_context_id”:“d776160a-b6c7-4ba9-aa42-b0eb7aa7f750”} – db.(*Database).changesFeed.func1() at changes.go:407

Those appear for several similiar channels and for all 3 CB server instances (10.38.1.1-3)

However, as described above, using this query directly in CB server with the logged channel names results in correct looking results in 10ms, EXPLAIN describes the (assumingly) correct use of the index.