FTS index creation on a datetime field causes continuous FTS service restarts (5.0 beta)

Hello,

In our POC we have a single bucket with 4.4 million documents, running on a two-node (SuSE 11) cluster.

With both releases of the 5.0 beta, we are unable to create an FTS index on datetime data.

We have created FTS indexes on text fields without any problems.

However, whenever we try to create an FTS index on a datetime field, CPU usage on both nodes in the cluster spikes, and the logs show the FTS service continuously restarting (see below).

Each time the FTS service restarts, the console’s Search page shows all three indexes’ progress reset to zero & then increase slowly. However, the date index’s progress never gets past 60% - often the cycle restarts at around 30%.

The date index is based on a single field inside a child object (JSON below).

The only way I have found to break the FTS restart cycle is to delete the date index & wait.

Help!

The logs show:

Event:
Service 'fts' exited with status 1. Restarting. Messages:
2017-07-26T10:48:46.239-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_6ddbfb54-34b925c, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2017-07-26T10:48:49.581-04:00 INFO cbdatasource: server: nodeA:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_f4e0a48a-922c046, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 171-341
2017-07-26T10:48:49.588-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_18572d87-31b31f56, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 684-854
2017-07-26T10:48:51.204-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_6ddbfb54-7209144f, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2017-07-26T10:49:06.929-04:00 FATA moss OnError, ...show
Module Code: ns_log 000
Server Node: ns_1@nodeA
Time: 10:49:12 AM   Wed Jul 26, 2017
Event:
Service 'fts' exited with status 1. Restarting. Messages:
2017-07-26T10:51:16.740-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_54820232-7dfee170, worker, looping beg, vbucketState: "running" (has 1 vbuckets), 512
2017-07-26T10:51:24.281-04:00 INFO cbdatasource: server: nodeA:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_54820232-3783a26f, worker, looping beg, vbucketState: "running" (has 170 vbuckets), 342-511
2017-07-26T10:51:24.943-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_aa574717-488c0ad7, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 513-683
2017-07-26T10:51:24.943-04:00 INFO cbdatasource: server: nodeA:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_13aa53f3-3b0fd91b, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 0-170
2017-07-26T10:51:26.519-04:00 FATA moss OnError, treating this as f ...show
Module Code: ns_log 000
Server Node: ns_1@nodeB
Time: 10:51:29 AM   Wed Jul 26, 2017
Event:
Service 'fts' exited with status 1. Restarting. Messages:
2017-07-26T10:54:14.158-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_18572d87-33f6af1a, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 684-854
2017-07-26T10:54:14.984-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-dob_2ff4a120ab67507a_18572d87-3fb4b8d8, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 684-854
2017-07-26T10:54:15.135-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_18572d87-4496be5d, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 684-854
2017-07-26T10:54:16.498-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_6ddbfb54-66776559, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2017-07-26T10:54:24.829-04:00 FATA moss ...show
Module Code: ns_log 000
Server Node: ns_1@nodeA
Time: 10:54:25 AM   Wed Jul 26, 2017
Event:
Service 'fts' exited with status 1. Restarting. Messages:
2017-07-26T10:58:10.317-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_18572d87-1aea4200, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 684-854
2017-07-26T10:58:10.340-04:00 INFO cbdatasource: server: nodeA:11210, uprOpenName: fts:doc-dob_2ff4a120ab67507a_f4e0a48a-329d0a9e, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 171-341
2017-07-26T10:58:10.411-04:00 INFO cbdatasource: server: nodeA:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_f4e0a48a-7357f9e5, worker, looping beg, vbucketState: "running" (has 171 vbuckets), 171-341
2017-07-26T10:58:10.492-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-dob_2ff4a120ab67507a_6ddbfb54-13c7d216, worker, looping beg, vbucketState: "running" (has 169 vbuckets), 855-1023
2017-07-26T10:58:21.993-04:00 FATA moss OnError, treating this as fat ...show
Module Code: ns_log 000
Server Node: ns_1@nodeA
Time: 10:58:22 AM   Wed Jul 26, 2017

The date index JSON:

{
 "name": "doc-dob",
 "type": "fulltext-index",
 "params": {
  "mapping": {
   "default_mapping": {
    "enabled": true,
    "dynamic": false,
    "properties": {
     "childObjA": {
      "enabled": true,
      "dynamic": false,
      "properties": {
       "dob": {
        "enabled": true,
        "dynamic": false,
        "fields": [
         {
          "name": "dob",
          "type": "datetime",
          "analyzer": "",
          "store": true,
          "index": true,
          "include_term_vectors": true,
          "include_in_all": true
         }
        ]
       }
      }
     }
    }
   },
   "type_field": "type",
   "default_type": "_default",
   "default_analyzer": "standard",
   "default_datetime_parser": "dateTimeOptional",
   "default_field": "_all",
   "store_dynamic": false,
   "index_dynamic": true
  },
  "store": {
   "kvStoreName": "mossStore"
  },
  "doc_config": {
   "mode": "type_field",
   "type_field": "type",
   "docid_prefix_delim": "",
   "docid_regexp": null
  }
 },
 "sourceType": "couchbase",
 "sourceName": "GD",
 "sourceUUID": "bdcac3db3bef45193664139f91d01be0",
 "sourceParams": {},
 "planParams": {
  "maxPartitionsPerPIndex": 171,
  "numReplicas": 0
 },
 "uuid": ""
}

Hi jodum,
Looks like that error message is getting clipped. Usually, when there’s a moss subsystem error, it’s related to disk/persistence issues, such as running out of disk space. If there’s more to that error message, that would be the next clue to track down.

You can also look or grep for that error message in fts.log.
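
For example, something like this should pull the full message with a little surrounding context (the log path is assumed for a default Linux install):

    grep -B2 -A2 "moss OnError" /opt/couchbase/var/lib/couchbase/logs/fts.log*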

cheers,
steve

Unfortunately the error message in that log file isn’t particularly helpful. Was that message from the fts log file? Or just the one visible in the UI?

Is it possible for you to upload the full cbcollect_info to a JIRA ticket?

Also, can you provide a sample JSON so that we can see the date format?

Thanks
marty

Steve, thanks for the advice! I did not notice “…show” meant the full error text was truncated, but it seems obvious now :slight_smile:

There was no matching error in fts.log, but both info.log & debug.log show this:

> 2017-07-26T10:49:06.929-04:00 FATA moss OnError, treating this as fatal, err: store: doLoadSegments mmap.Map(), **err: cannot allocate memory** -- main.initMossOptions.func1()
> [goport(/opt/couchbase/bin/cbft)] 2017/07/26 10:49:11 child process exited with status 1

And there is a corresponding item in babysitter.log (below).

The nodes are configured with 3GB memory quotas for the Data and Search services, and 512MB for the Index Service.

(And in case it matters, the date field values are formatted like “1925-10-29T00:00:00-05:00”.)

Does this error indicate that the Search service needs a larger memory quota?

Does it make sense that building a datetime index would need more memory than building indexes on text fields?

Thanks very much for your help!

Here is the babysitter.log entry:

> [error_logger:error,2017-07-26T10:49:12.077-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]** Generic server <0.12376.1> terminating
> ** Last message in was {<0.12382.1>,{exit_status,1}}
> ** When Server state == {state,<0.12382.1>,
>                             {fts,"/opt/couchbase/bin/cbft",
>                                 ["-cfg=metakv",
>                                  [45,117,117,105,100,61|
>                                   <<"7ce848ec7ea5ee23cde6af17b26a8aa5">>],
>                                  "-server=http://127.0.0.1:8091",
>                                  [45,98,105,110,100,72,116,116,112,61,
>                                   "172.xx.x.x",58,"8094",44,48,46,48,46,48,46,
>                                   48,58,"8094"],
>                                  "-dataDir=/opt/couchbase/var/lib/couchbase/data/@fts",
>                                  "-tags=feed,janitor,pindex,queryer,cbauth_service",
>                                  "-auth=cbauth",
>                                  [45,101,120,116,114,97,61,"172.xx.x.x",58,
>                                   "8091"],
>                                  "-options=startCheckServer=skip,slowQueryLogTimeout=5s,defaultMaxPartitionsPerPIndex=171,bleveMaxResultWindow=10000,failoverAssignAllPrimaries=false,hideUI=true,cbaudit=true,ftsMemoryQuota=3145728000,maxReplicasAllowed=3,bucketTypesAllowed=membase:ephemeral",
>                                  "-bindHttps=:18094",
>                                  "-tlsCertFile=/opt/couchbase/var/lib/couchbase/config/ssl-cert-key.pem",
>                                  "-tlsKeyFile=/opt/couchbase/var/lib/couchbase/config/ssl-cert-key.pem"],
>                                 [via_goport,exit_status,stderr_to_stdout,
>                                  {env,
>                                      [{"GOTRACEBACK","crash"},
>                                       {"CBAUTH_REVRPC_URL",
>                                        "http://%40:755f8db40d027a5e4674e97871b4a62e@127.0.0.1:8091/fts"}]}]},
>                             {ringbuffer,1399,1024,
>                                 {[{<<"[goport(/opt/couchbase/bin/cbft)] 2017/07/26 10:49:11 child process exited with status 1\n">>,
>                                    89},
>                                   {<<"2017-07-26T10:49:06.929-04:00 FATA moss OnError, treating this as fatal, err: store: doLoadSegments mmap.Map(), err: cannot allocate memory -- main.initMossOptions.func1() at init_moss.go:108\n">>,
>                                    192},
>                                   {<<"2017-07-26T10:48:51.204-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_6ddbfb54-7209144f, worker, looping beg, vbucketState: \"running\" (has 169 vbuckets), 855-1023\n">>,
>                                    228},
>                                   {<<"2017-07-26T10:48:49.588-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_18572d87-31b31f56, worker, looping beg, vbucketState: \"running\" (has 171 vbuckets), 684-854\n">>,
>                                    227},
>                                   {<<"2017-07-26T10:48:49.581-04:00 INFO cbdatasource: server: 172.xx.x.x:11210, uprOpenName: fts:doc-address_1c8a440325d8d5fc_f4e0a48a-922c046, worker, looping beg, vbucketState: \"running\" (has 171 vbuckets), 171-341\n">>,
>                                    215}],
>                                  [{<<"2017-07-26T10:48:46.239-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_18572d87-7db5c785, worker, looping beg, vbucketState: \"running\" (has 171 vbuckets), 684-854\n2017-07-26T10:48:46.239-04:00 INFO cbdatasource: server: nodeB:11210, uprOpenName: fts:doc-name_257270ea67b8c6f2_6ddbfb54-34b925c, worker, looping beg, vbucketState: \"running\" (has 169 vbuckets), 855-1023\n">>,
>                                    448}]}},
>                             fts,undefined,[],0}
> ** Reason for termination ==
> ** {abnormal,1}

> [error_logger:error,2017-07-26T10:49:12.078-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
> =========================CRASH REPORT=========================
>   crasher:
>     initial call: ns_port_server:init/1
>     pid: <0.12376.1>
>     registered_name: []
>     exception exit: {abnormal,1}
>       in function  gen_server:terminate/6 (gen_server.erl, line 744)
>     ancestors: [<0.12375.1>,<0.12374.1>,ns_child_ports_sup,
>                   ns_babysitter_sup,<0.56.0>]
>     messages: [{'EXIT',<0.12382.1>,normal}]
>     links: [<0.12375.1>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 4185
>     stack_size: 27
>     reductions: 1166237
>   neighbours:

> [error_logger:error,2017-07-26T10:49:12.079-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]** Generic server <0.12375.1> terminating
> ** Last message in was {die,{abnormal,1}}

> ** When Server state == {state,fts,5000,
>                                {1500,992676,912113},
>                                undefined,infinity}
> ** Reason for termination ==
> ** {abnormal,1}

> [error_logger:error,2017-07-26T10:49:12.079-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
> =========================CRASH REPORT=========================
>   crasher:
>     initial call: supervisor_cushion:init/1
>     pid: <0.12375.1>
>     registered_name: []
>     exception exit: {abnormal,1}
>       in function  gen_server:terminate/6 (gen_server.erl, line 744)
>     ancestors: [<0.12374.1>,ns_child_ports_sup,ns_babysitter_sup,<0.56.0>]
>     messages: []
>     links: [<0.12374.1>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 4185
>     stack_size: 27
>     reductions: 2049
>   neighbours:

> [error_logger:error,2017-07-26T10:49:12.079-04:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
> =========================CRASH REPORT=========================
>   crasher:
>     initial call: erlang:apply/2
>     pid: <0.12374.1>
>     registered_name: []
>     exception exit: {abnormal,1}
>       in function  restartable:loop/4 (src/restartable.erl, line 69)
>     ancestors: [ns_child_ports_sup,ns_babysitter_sup,<0.56.0>]
>     messages: []
>     links: [<0.72.0>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 4185
>     stack_size: 27
>     reductions: 1521
>   neighbours:

…And in case the failed allocation is about disk rather than memory: nodeA has 10GB of free space, and nodeB has 209GB.

(BTW, the Server page of the UI can’t display that much free space - it shows “remaining (-3123210320 B)” for nodeB.)

Hi jodum,
Hmmm, I’m wondering how much RAM your servers have. Also, double-checking – it’s 64-bit Linux, right?

Thanks Marty – as you reminded me, all these questions would be answered by grabbing a cbcollect_info.

– steve

I don’t have anywhere to upload the full zip, but below are the obvious system/memory/disk sections from couchbase.log generated by cbcollect_info on nodeA. Let me know if there are any other items that might be helpful.

And thanks again for your time & help!

==============================================================================
uname
uname -a
==============================================================================
Linux nodeA 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
[...]

==============================================================================
Process list snapshot
export TERM=''; top -Hb -n1 || top -H n1
==============================================================================
top - 14:33:27 up 46 days, 17:02,  1 user,  load average: 0.73, 0.67, 0.68
Tasks: 479 total,   2 running, 477 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.4%us,  1.2%sy,  0.0%ni, 87.5%id,  0.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8063040k total,  6335104k used,  1727936k free,   204932k buffers
Swap:  2103292k total,   140704k used,  1962588k free,  3614728k cached
[...]

==============================================================================
Swap configuration
free -t
==============================================================================
             total       used       free     shared    buffers     cached
Mem:       8063040    6336756    1726284          0     204940    3616484
-/+ buffers/cache:    2515332    5547708
Swap:      2103292     140704    1962588
Total:    10166332    6477460    3688872
==============================================================================
Swap configuration
swapon -s
==============================================================================
Filename                Type        Size    Used    Priority
/dev/sda1                               partition   2103292 140704  -1
[...]

==============================================================================
Distro version
cat /etc/SuSE-release
==============================================================================
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3
[...]

==============================================================================
Installed software
rpm -qa
==============================================================================
[...]
couchbase-server-5.0.0-3217
[...]


==============================================================================
Filesystem
df -ha
==============================================================================
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        48G   36G   11G  77% /
proc               0     0     0    - /proc
sysfs              0     0     0    - /sys
debugfs            0     0     0    - /sys/kernel/debug
udev            3.9G   92K  3.9G   1% /dev
tmpfs           3.9G   72K  3.9G   1% /dev/shm
devpts             0     0     0    - /dev/pts
fusectl            0     0     0    - /sys/fs/fuse/connections
securityfs         0     0     0    - /sys/kernel/security
none               0     0     0    - /proc/sys/fs/binfmt_misc
none               0     0     0    - /var/lib/ntp/proc
[...]

Thanks – looks like an 8GB RAM box.

The next question I’m wondering about is how much memory the process actually attempted to memory-map (mmap()) when it failed.

The logged error message doesn’t say, but I see there’s already been a recent fix so that the error message will report the actual mmap parameters in future builds… https://github.com/couchbase/moss/blame/master/store_footer.go#L303

In the meanwhile, I see from your index definition that you have some FTS features enabled that will use more storage…

      "store": true, // The index keeps a copy of the original field,
                     // used for search result highlighting/snippets.
      "include_term_vectors": true, // Allows for search highlighting
      "include_in_all": true // Allows for query string searches
                             // that don't need a field name, like searching for
                             // "2017" instead of "dob:2017"

and…

   "index_dynamic": true // Indexes all the fields

Turning those to false, if you don’t need them, will allow the index to be smaller, perhaps enough to allow mmap() to succeed.
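
For example (adapted from the index JSON you posted – just illustrating which knobs to flip), the dob field entry would end up looking something like:

      "fields": [
       {
        "name": "dob",
        "type": "datetime",
        "analyzer": "",
        "store": false,
        "index": true,
        "include_term_vectors": false,
        "include_in_all": false
       }
      ]

with…

   "index_dynamic": false

in the mapping params.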

Related to your questions: the FTS memory quota is a quota on its dirty write queue, so it is likely not directly involved here, and changing it probably wouldn’t help with an mmap() error.

Also, indeed, datetime fields (and numeric fields) can take up more storage in FTS than text fields (especially, say, a text field that is treated as a single term). The reason is that a datetime or numeric value is analyzed into multiple terms – what I think of as bucketing the datetime/numeric value into different buckets of precision granularity.
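
Just to make that bucketing idea concrete, here’s a rough sketch of the concept in Python – this is not bleve’s actual datetime encoding, only an illustration of why one datetime value turns into several indexed terms while a single-term text field produces just one:

   # Rough sketch only -- NOT bleve's real encoding. One datetime value
   # yields one term per precision level, so it stores several terms
   # where a single-term text field would store just one.
   from datetime import datetime, timezone

   def bucket_terms(dt, shifts=(0, 8, 16, 24, 32)):
       epoch = int(dt.replace(tzinfo=timezone.utc).timestamp())
       # Mask off low-order bits to get coarser and coarser "buckets".
       return ["%d:%d" % (shift, (epoch >> shift) << shift) for shift in shifts]

   print(bucket_terms(datetime(1925, 10, 29)))  # 5 terms for a single dob value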

– steve

An excellent idea, but unfortunately it was not enough. With all those options set to false, the first creation run got to 37%, the next got all the way to 75%, then 46% …

I will investigate whether more memory can be added to one or both servers.

Do you happen to know which likely requires more memory for datetime FTS: inserting docs with an existing index in place, or inserting docs with no index and creating the index after all the inserts are complete?

Once again, thank you very much for your help!

I stand corrected. After typing the above reply I went to delete the date index, and it was at 100% complete :smile:

Hi jodum,

That’s good news.

It still sounds like a bit of a mystery, as if it took a step “backwards” along the way – 37… 75… 46(?)… then 100? So perhaps there’s still something odd going on here.

> Do you happen to know which likely requires more memory for datetime FTS: inserting docs with an existing index in place, or inserting docs with no index and creating the index after all the inserts are complete?

Roughly handwaving here, w.r.t. overall system usage or performance: inserting all the docs first likely has the benefit that all system resources (like I/O bandwidth, etc.) can be dedicated to just getting the doc loading done. Then, when you next create an FTS index, the system resources can be dedicated to the indexing work.

Also, in between, there might be some benefit to waiting a bit for the KV doc mutations to all persist to disk before creating the FTS index.
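
(One rough way to tell when the write queue has drained is to watch the bucket’s disk write queue stat in the console, or to poll the ep_queue_size stat with cbstats – the path, credentials, and flags below are assumptions for a default 5.0 Linux install:)

   # flags/credentials are assumptions -- adjust for your install
   /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GD all | grep ep_queue_size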

(A true engineer on the internet, at this point, would insert some fine-print verbiage here about “testing and experimenting on your own system / your perf may differ / yadda yadda.”) :slight_smile:

Finally, appreciate your tire kicking on FTS and your feedback!

cheers,
steve

Just in case I wasn’t clear, the progress didn’t go backwards - it restarted several times. The first run went from 0 to 37%. Then it restarted and went from 0 to 75%. Then it restarted and went from 0 to 46%. At that point I stopped watching it to type my earlier reply. When I went back to the console it was at 100%.

So yes, there is still a mystery. But at least the index is there & working :slight_smile:

Jeff