Is CB cluster automatically update statistics?

Hi,

From CB7.1 doc, in order to use CBO, we need to run “update statistics” to gather distribution statistics for the expressions used in the query. My questions are:

  1. After the initial run of “update statistics”, do we need to schedule a job to run it to refresh statistics or CB cluster automatically refresh statistics if it’s staled?

  2. Is the statistics stored in system-like collection N1QL_CBO_STATS purged regularly?

Thanks,

Matthew

  1. Currently optimizer statistics are gathered via UPDATE STATISTICS/ANALYZE command issued by the user, thus user needs to manually refresh optimizer statistics with UPDATE STATISTICS/ANALYZE command. The frequency of this refresh will depend on the changes of the bucket/collection since the last time UPDATE STATISTICS/ANALYZE is run. Active bucket/collection (with lots of mutations) will require more frequent refresh than a more stale bucket/collection (few mutations). Future enhancement is being evaluated to make this process more automatic.
  2. Optimizer statistics stored in N1QL_CBO_STATS is purged only when a bucket/collection is being dropped, or when an index is being dropped. It is also overwritten by newer optimizer statistics collected with new UPDATE STATISTICS/ANALYZE commands. Use can also manually purge optimizer statistics by using the DELETE option for UPDATE STATISTICS/ANALYZE.
1 Like

Hi,

Thank you for your quick response.

Does update statistics command have the parameter to only refresh stale documents or it always update stats on all documents in bucket?

At this moment, our production only uses default scope and collection. It has about 20B documents. I am afraid of the time will be very long for entire statistics update.

Thanks,

Matthew

Currently UPDATE STATISTICS does not have any way of differentiate “stale” vs “non-stale” bucket/collection. An UPDATE STATISTICS command currently always gather new optimizer statistics (from the entire bucket/collection). There is currently no way to only incrementally update optimizer statistics with newer documents. This aspect is also under evaluation for future enhancement.
In terms of the size of a bucket/collection, that’s not directly related to how long an UPDATE STATISTICS command will run. UPDATE STATISTICS uses random sampling to gather optimizer statistics, it does not need to scan the entire bucket/collection. Therefore even for large bucket/collection UPDATE STATISTICS should not take a very long time.

Hi,

Based on this thread How to Disable Cost Based Optimizer via REST API cmd

I can disable CBO using a command. do I need to run this command on every query node?

Thanks,

Matthew

The curl command given in that thread is a cluster-level setting, i.e. it affects the entire cluster. You don’t need to run it for each query node.
For more information on settings see:

1 Like