Dataset mutation never finishes


Running Couchbase Server 6.0.4.
We are trying to set up datasets shadowing our bucket data. When testing on small data sizes (under 10,000 records), it works great: the dataset gets created, and we can see the mutation counts go down quickly and finish populating the dataset.

However, when we moved to a larger data size (around 9,000,000 records), we noticed that the mutation counts for the datasets start to go down, but when we check back periodically on the status, we sometimes see that the link is disconnected, so we go and run the connect link statement again. It never finishes; even when the count goes down to 0, it seems to cycle back up to 9,000,000 again. And when doing a select count(*) on the dataset, we get 0 records.

What could cause this behavior?

Hi @JayZhang,

Some failure must be happening that causes ingestion to restart (e.g. a document is encountering a failure during the evaluation of the “where” condition of a dataset).

  • How many datasets do you have?
  • Are you using complex filters in the “where” statements?
  • Are there any on-going operations on the source bucket?

If you can share the logs, I can tell you exactly what is happening.

Hi mhubail,

imagine a document:

{
  "varA": "valueA",
  "varB": "valueB"
}

my query to create the datasets:

Create Dataset testDataset on bucketA Where varA = 'valueA' and varB = 'valueB';

I don’t think the above is a very complex filter?

There are other operations on the bucket, but looking at the server node’s CPU and memory usage for the bucket, they are under 10% and 50% respectively.

Even with just one dataset we run into this issue.

using the same where filters and running the query:

Select count(*) from bucketA Where varA = 'valueA' and varB = 'valueB';

will return the count with no issues, so I’m not sure it’s an issue with the where clause.
I currently cannot get my hands on the logs, so that will have to wait.

Your filter looks fine. I asked about the filter because we have a known issue in 6.0.x related to that, with symptoms similar to what you described. For example, if a condition in the “where” statement of a dataset’s filter expects a field to be an array, and a document arrives in which that field isn’t an array, that document will cause ingestion to restart. This behavior was changed in 6.5.x: the keys of documents that cause failures in the “where” statement are now logged.
To make sure this isn’t the issue, you can create a dataset without any filter (i.e. just “Create Dataset testDataset on bucketA;”) and see if the same issue is encountered or not.
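For reference, a sketch of the full sequence for recreating the dataset without a filter (assuming the default Local link and the dataset and bucket names used earlier in this thread):

disconnect link Local;
drop dataset testDataset if exists;
create dataset testDataset on bucketA;
connect link Local;

If ingestion completes normally with the unfiltered dataset, the problem is almost certainly in the filter’s evaluation against some documents.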

How often does the dataset cycle back to 9,000,000? Certain cluster topology changes, when performed on the Data Service, can cause the Analytics datasets to roll back some or all data and stream again from the Data Service to ensure correctness of the data. However, if no topology changes are performed on either the Data Service or the Analytics Service, this should not happen. Similarly, once a link is connected, it should stay connected until it is manually disconnected or a bucket is dropped. Dropping a bucket will disconnect the datasets created on it.

I asked about on-going operations on the bucket just to verify that the bucket isn’t being flushed (all docs deleted), or deleted and then re-created with the same name. Other operations should be fine.

Once you share the logs, I should be able to help you determine the issue.

Hi mhubail,

Got my hands on the logs, seeing these on the Analytics server:

ERRO [ActiveNotificationHandler] Active Job JID:0.2270 failed
org.apache.hyracks.api.exceptions.HyracksDataException: HYR0002: Error in processing tuple 8 in a frame
at org.apache.hyracks.api.exceptions.HyracksDataException.create( ~[hyracks-api.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.std.AssignRuntimeFactory$1.produceTuple( ~[algebricks-runtime.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.std.AssignRuntimeFactory$1.nextFrame( ~[algebricks-runtime.jar:6.0.4-3082]
at ~[hyracks-dataflow-common.jar:6.0.4-3082]
at ~[hyracks-dataflow-common.jar:6.0.4-3082]
at org.apache.hyracks.dataflow.common.comm.util.FrameUtils.appendToWriter( ~[hyracks-dataflow-common.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.appendTupleToFrame( ~[algebricks-runtime.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.std.StreamSelectRuntimeFactory$1.nextFrame( ~[algebricks-runtime.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.meta.AlgebricksMetaOperatorDescriptor$1.nextFrame( ~[algebricks-runtime.jar:6.0.4-3082]
at org.apache.hyracks.dataflow.common.comm.util.FrameUtils.flushFrame( ~[hyracks-dataflow-common.jar:6.0.4-3082]
at org.apache.hyracks.dataflow.std.base.AbstractReplicateOperatorDescriptor$ReplicatorMaterializerActivityNode$1.nextFrame( ~[hyracks-dataflow-std.jar:6.0.4-3082]
at ~[hyracks-dataflow-common.jar:6.0.4-3082]
at org.apache.asterix.external.util.DataflowUtils.addTupleToFrame( ~[asterix-external-data.jar:6.0.4-3082]
at org.apache.asterix.external.dataflow.TupleForwarder.addTuple( ~[asterix-external-data.jar:6.0.4-3082]
at org.apache.asterix.external.dataflow.FeedRecordDataFlowController.parseAndForward( ~[asterix-external-data.jar:6.0.4-3082]
at org.apache.asterix.external.dataflow.FeedRecordDataFlowController.start( ~[asterix-external-data.jar:6.0.4-3082]
at org.apache.asterix.external.dataset.adapter.FeedAdapter.start( ~[asterix-external-data.jar:6.0.4-3082]
at ~[cbas-connector.jar:6.0.4-3082]
at ~[asterix-active.jar:6.0.4-3082]
at org.apache.hyracks.api.rewriter.runtime.SuperActivityOperatorNodePushable.lambda$runInParallel$0( ~[hyracks-api.jar:6.0.4-3082]
at ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$ ~[?:?]
at [?:?]
Caused by: org.apache.asterix.runtime.exceptions.TypeMismatchException: ASX0037: Type mismatch: expected value of type object, but got the value of type array
at org.apache.asterix.runtime.evaluators.functions.records.FieldAccessNestedEvalFactory$_EvaluatorGen.evaluate( ~[asterix-runtime.jar:6.0.4-3082]
at org.apache.asterix.runtime.evaluators.functions.CastTypeEvaluator$_EvaluatorGen.evaluate( ~[asterix-runtime.jar:6.0.4-3082]
at org.apache.hyracks.algebricks.runtime.operators.std.AssignRuntimeFactory$1.produceTuple( ~[algebricks-runtime.jar:6.0.4-3082]
… 22 more

Hi @JayZhang,
As the following error message suggests:

Type mismatch: expected value of type object, but got the value of type array

It looks like a “WHERE” statement filter issue. Some field in your filter is expected to be a JSON object, but it was a JSON array in one or more documents. As mentioned earlier, in CB Server 6.0.x, every time that document is encountered it will cause ingestion to restart from a certain point. If your application expects this field to always be an object, the best thing to do is to fix the source data in the Data Service. The Analytics Service can help you identify the data issues: create a dataset without any filters and then run queries similar to this:

SELECT meta().id FROM some_dataset WHERE IS_OBJECT(suspected_field_name) = false;

This will give you the IDs of those documents, and you can then fix their data in the Data Service. After that, you can create the dataset in Analytics with the proper filter.
Another option would be to add the condition IS_OBJECT(expected_object_field_name) at the beginning of your filter, to ensure that the field won’t be accessed unless it is of the expected type. However, I don’t recommend this solution unless you can’t fix the source data, since it adds some overhead and depends on the execution order of the conditions in the “WHERE” statement.
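As a sketch of that second option (the field and value names here are hypothetical placeholders, not from your actual data), the guarded filter would look something like:

Create Dataset testDataset on bucketA
  Where IS_OBJECT(suspected_field_name)
    and suspected_field_name.sub_field = 'valueA';

With the guard first, documents where suspected_field_name is an array (or any non-object) fail the first condition and are simply excluded, rather than causing an evaluation error.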

You don’t encounter this issue in your tests when the data size is small, most likely because your sample data is all well-formed. As mentioned earlier, the behavior of restarting ingestion on such failures was improved in CB Server 6.5.x: ingestion now continues, and the IDs of documents that fail to be ingested are logged along with the cause of the failure.

Hope this helps.

Thank you for the reply. Maybe it’s the index creation and not the dataset creation that’s causing this issue for me. I wonder if a null object would cause this issue?

for example:

Document in bucket beer:

{
  "type": "typeA",
  "person": {
    "fName": "Joe"
  }
}

creating this dataset and index:

disconnect link Local;
create dataset doc on beer where type = 'typeA';
create index idx_doc on doc (person.fName: string);
connect link Local;

Would this run into the errors we are seeing if person is null, or if person.fName is null, in some of the documents?
Does it behave differently if person is null vs. if person.fName is null?

The secondary index should work just fine in both of these cases. Documents in which person or person.fName is null or missing will not be indexed.
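If you want to see which documents the index would skip, a query along these lines should work (a sketch, reusing the doc dataset and the field names from the example above; IS UNKNOWN matches both null and missing values):

select meta().id from doc
where person IS UNKNOWN or person.fName IS UNKNOWN;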

The stack trace in the log you shared does not point to an issue in secondary indexes, but to filtering documents before they are ingested.
Was it an old log? Do (or did) you have other datasets with different filters?

It is not an old log. These logs were being generated while the mutation count was going down. Once I disconnected the link and dropped the dataset, the errors stopped.

There are also no other datasets on the Analytics server during the test. I’ll investigate more, but I don’t think we have documents where type is something other than a string.
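One way to verify that assumption, following the unfiltered-dataset approach suggested earlier in the thread (allDocs is a hypothetical name for a dataset created on the bucket with no filter):

create dataset allDocs on beer;
connect link Local;

select meta().id from allDocs where is_string(type) = false;

Any IDs returned are documents whose type field is not a string and could trip the filter during ingestion.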