RetryLoop for WriteCasWithXattr with key <doc-id> giving up after 11 attempts

Hi everyone

I have a brand new cluster running CB CE 6.0.0 and SG 2.1 that I have replicated data to using XDCR (around 6 million docs, about 6 GB in size). After most of the data had replicated, I started SG and expected it to need some time to catch up, but it has been about 30 hours now and I still can't access docs via SG. In the SG logs I keep getting the following (with a new doc ID every time):

2019-07-13T11:41:22.210Z [INF] DCP: Backfill in progress: 9% (580102 / 5933424)
2019-07-13T11:41:54.831Z [WRN] RetryLoop for WriteCasWithXattr with key -nh-fBFuYjUWgK2E78-hH-Q giving up after 11 attempts -- base.RetryLoop() at util.go:352
2019-07-13T11:41:54.831Z [INF] Import: Error importing doc "-nh-fBFuYjUWgK2E78-hH-Q": operation has timed out

The CB service has also logged some errors, but I have searched the web and these forums and can't find anything pointing me in the right direction. I am able to query data fine from the Admin UI, but the Documents tab does not return anything (it just hangs with "Retrieving docs" spinning). I have restarted the server twice in the last couple of days to see if that resolves anything, but it doesn't seem to help.

[ns_server:error,2019-07-13T08:32:22.526Z,ns_1@catchandmatch.prod:ns_doctor<0.24720.84>:ns_doctor:update_status:316]The following buckets became not ready on node 'ns_1@catchandmatch.prod': ["primary"], those of them are active ["primary"]
[ns_server:error,2019-07-13T08:39:22.813Z,ns_1@catchandmatch.prod:service_stats_collector-fts<0.24898.84>:rest_utils:get_json_local:63]Request to (fts) api/nsstats failed: {error,timeout}
[ns_server:error,2019-07-13T08:39:22.815Z,ns_1@catchandmatch.prod:service_agent-fts<0.4170.85>:service_agent:handle_info:231]Linked process <0.4404.85> died with reason {timeout,
{gen_server,call,
[<0.4376.85>,
{call,
"ServiceAPI.GetCurrentTopology",
#Fun<json_rpc_connection.0.125340786>},
60000]}}. Terminating
[ns_server:error,2019-07-13T08:39:22.815Z,ns_1@catchandmatch.prod:service_agent-fts<0.4170.85>:service_agent:terminate:260]Terminating abnormally
[ns_server:error,2019-07-13T08:39:22.913Z,ns_1@catchandmatch.prod:service_stats_collector-index<0.24922.84>:rest_utils:get_json_local:63]Request to (indexer) stats?async=true failed: {error,timeout}
[ns_server:error,2019-07-13T08:39:22.815Z,ns_1@catchandmatch.prod:service_agent-fts<0.4170.85>:service_agent:terminate:265]Terminating json rpc connection for fts: <0.4376.85>
[ns_server:error,2019-07-13T08:39:22.917Z,ns_1@catchandmatch.prod:query_stats_collector<0.24851.84>:rest_utils:get_json_local:63]Request to (n1ql) /admin/stats failed: {error,timeout}
[ns_server:error,2019-07-13T08:39:22.929Z,ns_1@catchandmatch.prod:service_agent-index<0.4123.85>:service_agent:handle_info:231]Linked process <0.4427.85> died with reason {timeout,
{gen_server,call,
[<0.4494.85>,
{call,
"ServiceAPI.GetTaskList",
#Fun<json_rpc_connection.0.125340786>},
60000]}}. Terminating
[ns_server:error,2019-07-13T08:39:22.929Z,ns_1@catchandmatch.prod:service_agent-index<0.4123.85>:service_agent:terminate:260]Terminating abnormally
[ns_server:error,2019-07-13T08:39:22.929Z,ns_1@catchandmatch.prod:service_agent-index<0.4123.85>:service_agent:terminate:265]Terminating json rpc connection for index: <0.4494.85>

Has anyone had a similar issue, or any ideas what the problem here might be? In the meantime, I am building a completely new cluster to see if I can get something else up and running.

Would appreciate any suggestions or advice.

Best,
Norval

Hi Norval,
a) Did you configure your Sync Gateway with import_docs (e.g. "continuous") and enable_shared_bucket_access: true in your config file? Something like the sketch after this list.
b) What is the size of the cluster? How much memory is installed? What is the resident ratio?
c) What is the CPU utilization?
d) Do you see import messages on the Sync Gateway side?
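
For reference, a minimal config sketch with both settings enabled would look something like this; the database name, bucket, server address, and credentials below are just placeholders, so substitute your own values:

{
  "databases": {
    "db": {
      "server": "http://localhost:8091",
      "bucket": "primary",
      "username": "sg_user",
      "password": "password",
      "enable_shared_bucket_access": true,
      "import_docs": "continuous"
    }
  }
}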

Hi @roikatz

Thanks for the response 🙂

a) import_docs is set to "continuous" and enable_shared_bucket_access to true
b) The cluster is only a single node at the moment, with 4 vCPUs and 16 GB RAM. I'm not sure what you mean by resident ratio, though. Could you explain a little more?
c) CPU utilization was moving around quite a bit but was generally around 40-60%
d) I have already tried a few things to get the node up and running, so I no longer have the logs from that time, but there were definitely docs coming into SG

Best,
Norval

A few more questions, then:
a) Did you have a lot of rapid upserts of the same document?
b) Did you have a failed node? Any port or connectivity issues?
c) Do you use views or GSI as your mobile index?
d) Did you have enough disk space (or did you run out)?
The resident ratio is the number of docs held in memory divided by the total number of docs.
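
If you're not sure where to find it, one quick way to check (assuming a default Linux install, the standard ports, and your own bucket name and credentials substituted in) is with the cbstats tool that ships with Couchbase Server:

# adjust host, bucket and credentials to your environment
/opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b primary all | grep resident

As a rough example: if only about 1.5M of your ~6M docs fit in memory, the resident ratio is around 25%, and a very low ratio can make KV operations slow enough to time out the way your import log shows.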