I have a case where I get a file generated by Spark containing document keys of cold data that should be deleted. These files usually contain around 6,000,000 records/keys.
I wrote a Python script which reads the file line by line, builds batches of e.g. 250 lines, and then deletes them using remove_multi(), providing a dictionary with the keys and their expected CAS values.
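For context, the driver loop looks roughly like this. This is only a minimal sketch: the helper name delete_from_file, the comma delimiter, and the assumption that each line holds a key and its CAS are illustrative; only the 250-line batch size and the remove_batch_data() function (shown further down) come from my actual script.

BATCH_SIZE = 250  # number of keys passed to remove_multi() per call

def delete_from_file(path):
    total_removed = 0
    batch = {}
    with open(path) as f:
        for line in f:
            # Assumption: each line holds "<key>,<cas>"; the real delimiter may differ
            key, cas = line.strip().split(',')
            batch[key] = int(cas)
            if len(batch) >= BATCH_SIZE:
                total_removed += remove_batch_data(batch)
                batch = {}
    if batch:
        # Flush the last, possibly smaller batch
        total_removed += remove_batch_data(batch)
    return total_removed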
As far as I can see, remove_multi() does not immediately remove the documents. In the Couchbase UI, I can see there are no “deletes per sec.” for quite a while; then, after a few minutes, they go up to some thousands per second and the disk write queue fills up too.
Am I missing something like a commit to force the removal after each batch of 250 documents?
Any advice would be very much appreciated.
import logging

from couchbase.bucket import Bucket
import couchbase.exceptions as cb_errors

def remove_batch_data(rm_dict):
    """Delete one batch of documents; returns the number of successful removals."""
    suc = 0
    cb = Bucket(CB_CONNSTR, password=CB_PASSWORD)
    if len(rm_dict) > 0:
        if str2bool(dryrun_mode):
            # Dry run: only log what would be deleted
            for key in rm_dict:
                logging.info('Key: ' + key + ' - CAS: ' + str(rm_dict[key]))
        else:
            try:
                # quiet=True suppresses NotFoundError for keys that are already gone
                cb.remove_multi(rm_dict, quiet=True)
                suc += len(rm_dict)
            except cb_errors.KeyExistsError as exc:
                # CAS mismatch on at least one key; count the removals that did succeed
                for k, res in exc.all_results.items():
                    if not res.success:
                        logging.warning('Removal failed: ' + k)
                    else:
                        suc += 1
    return suc