Server-side map geospatial clustering

We're working on implementing a web map UI for a large dataset, potentially hundreds of thousands of documents. We're exploring how best to achieve this with Couchbase. Geospatial FTS indexes seem like the obvious foundation, but we're still working through which queries and algorithms give the most performant results. Is this a use case anyone could share experience with?

The UI will need to query on a lat/lon visible bounding box for a particular zoom level and receive a set of markers, each representing either a cluster of documents or a solitary document without close neighbors. Clusters should include the number of markers they contain. Additionally, users will be able to narrow the markers with additional filter criteria on document properties (e.g. user, status, etc.).

The initial algorithm I've put together in a POC is (a rough code sketch follows the list):

  1. Calculate a grid of sub-regions covering the map's visible rect
  2. In parallel, for each grid sub-region, execute an FTS query on that rect
  3. If more than one result for a grid region, return a cluster with the row count and average lat/lon of documents
  4. If exactly 1 result, return the document itself as a marker
  5. Gather the grid region results and return them to the client
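
Here's a simplified Kotlin sketch of that flow (not the actual POC code; Rect, GeoPoint, MapMarker, the grid math, and the searchCell callback are all illustrative stand-ins for the real implementation):
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Illustrative types; the real POC models these differently.
data class Rect(val ulLat: Double, val ulLon: Double, val brLat: Double, val brLon: Double)
data class GeoPoint(val lat: Double, val lon: Double)
sealed class MapMarker {
    data class Single(val point: GeoPoint) : MapMarker()
    data class Cluster(val center: GeoPoint, val count: Int) : MapMarker()
}

// 1. Calculate a rows x cols grid of sub-regions covering the visible rect
//    (assumes the rect doesn't cross the antimeridian).
fun gridCells(visible: Rect, rows: Int, cols: Int): List<Rect> {
    val latStep = (visible.ulLat - visible.brLat) / rows
    val lonStep = (visible.brLon - visible.ulLon) / cols
    return (0 until rows).flatMap { r ->
        (0 until cols).map { c ->
            Rect(
                ulLat = visible.ulLat - r * latStep,
                ulLon = visible.ulLon + c * lonStep,
                brLat = visible.ulLat - (r + 1) * latStep,
                brLon = visible.ulLon + (c + 1) * lonStep
            )
        }
    }
}

// 2-5. Query each cell in parallel, then reduce each cell's hits to a cluster or single marker.
// searchCell is a placeholder for the FTS bounding-box query shown later in this thread; it
// returns the geo points of the documents found in the cell. The cluster count here uses the
// returned rows; the real implementation could use the query's total hit count instead.
suspend fun markers(
    visible: Rect, rows: Int, cols: Int,
    searchCell: suspend (Rect) -> List<GeoPoint>
): List<MapMarker> = coroutineScope {
    gridCells(visible, rows, cols)
        .map { cell -> async { searchCell(cell) } }
        .awaitAll()
        .mapNotNull { points ->
            when {
                points.isEmpty() -> null
                points.size == 1 -> MapMarker.Single(points.first())
                else -> MapMarker.Cluster(
                    center = GeoPoint(
                        lat = points.map { it.lat }.average(),
                        lon = points.map { it.lon }.average()
                    ),
                    count = points.size
                )
            }
        }
}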

This algorithm seems to perform OK, but timings range from 10 ms to 10 s. Some of this could be due to local environment performance, but feedback would be helpful if there are alternatives to consider or optimizations that could improve on it.

I haven't delved into the additional document filtering yet either. Is there a way to achieve this with the FTS search, or would it require a N1QL query on top of it?

Would additional search index nodes improve parallel query performance?

@jeff.lockhart Sharing your FTS index definition and some sample queries will help us help you better when it comes to questions on index/search performance.

In your algorithm you mention …

If more than one result for a grid region, return a cluster with the row count and average lat/lon of documents

What do you hope to gain from determining the average lat/lon of the documents returned by your search request? I take it that this calculation is made within your application as Couchbase FTS does not support arithmetic operations over document content. (This can be achieved by running your search request from N1QL though.)
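
For example, something along these lines from the query service would compute the centroid server-side (a rough sketch with placeholder bucket/index names and illustrative bounding-box coordinates; the geo bounding box query is passed to SEARCH() in its JSON form):
SELECT AVG(t.geo.lat) AS lat,
       AVG(t.geo.lon) AS lon,
       COUNT(1)       AS cnt
FROM `your_bucket` AS t
WHERE SEARCH(t, {
        "field": "geo",
        "top_left":     {"lon": -112.0, "lat": 40.5},
        "bottom_right": {"lon": -111.8, "lat": 40.4}
      }, {"index": "your_geo_index"});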

Document filtering and obtaining document properties are supported with Couchbase FTS. If you can get us a sample document along with the index definition and indicate how you wish to filter documents, we could look into …

  • Optimizing the index to only hold those documents that are relevant
  • Compounding your search to further narrow down the number of documents that qualify

Would additional search index nodes improve parallel query performance?

By partitioning your search index, you will be able to run your query in parallel across multiple partitions of the index via a scatter-gather operation. If CPU utilization on a single node is reaching saturation under your query workload, you could look into adding more search nodes; once the new nodes are rebalanced in, your index's partitions will be distributed across all search nodes in the cluster (evenly balanced if the number of partitions is a multiple of the number of search nodes).
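
For reference, the partition count lives in the index definition's planParams. An illustrative example for a 6-partition index on a bucket with the default 1024 vbuckets (maxPartitionsPerPIndex is roughly 1024 / indexPartitions and is normally computed for you by the UI):
"planParams": {
  "maxPartitionsPerPIndex": 171,
  "indexPartitions": 6
}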

What do you hope to gain from determining the average lat/lon of the documents returned by your search request?

As the cluster marker needs a lat/lon to display on the map, taking the average of the documents represented by the cluster achieves this. Alternatively, we could simply take the center point of the grid rect, but this is a bit poorer UX, as it ends up creating a distinct grid pattern of cluster markers.

I take it that this calculation is made within your application as Couchbase FTS does not support arithmetic operations over document content.

Yes, I'm currently performing this math in the server application. The document geo is stored in the FTS index. I've experimented with various limits on the number of document rows returned to perform the average on. I didn't see a noticeable difference in performance between a limit of 10, 100, or 1,000, or even compared with simply taking the center point of the grid.

Upon further testing, this algorithm works well at high zoom levels, taking milliseconds even with tens of millions of documents. But the searches become longer-running at low zoom levels, taking multiple seconds starting at the country level. Continent or entire-world map views are extremely slow, timing out with a large number of documents.

This may have to do with needing to load nearly the entire FTS index into memory in order to search a large portion of the map, which, being done in parallel, is likely causing contention between queries accessing different parts of the index. Maybe there's an algorithm that performs better at these low zoom levels. It's also possible we just won't support displaying markers at very low zoom levels.

For 30M documents, the FTS index disk size is 5.35 GB. I've allocated 16 GB to the search service, but I'm still testing with a single-node local server. There may be optimizations to shrink the size of the index, and that might make a difference. Perhaps it could be more performant to not store some of the properties and instead rely on the data service to get them, since they're only needed when rendering non-clustered markers.

Sharing your FTS index definition and some sample queries will help us help you better when it comes to questions on index/search performance.

Here's the FTS index I'm currently using:
{
  "type": "fulltext-index",
  "name": "lead_geo",
  "uuid": "72c6391d703e8503",
  "sourceType": "gocbcore",
  "sourceName": "salesrabbit",
  "sourceUUID": "e7e269b73b9eb60d5a60a74ca274abc2",
  "planParams": {
    "maxPartitionsPerPIndex": 1024,
    "indexPartitions": 1
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": false,
      "index_dynamic": false,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "lead": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "channels": {
              "dynamic": false,
              "enabled": true,
              "properties": {
                "orgUnit": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "index": true,
                      "name": "orgUnit",
                      "type": "text"
                    }
                  ]
                },
                "users": {
                  "dynamic": false,
                  "enabled": true,
                  "fields": [
                    {
                      "index": true,
                      "name": "users",
                      "store": true,
                      "type": "text"
                    }
                  ]
                }
              }
            },
            "geo": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "geo",
                  "store": true,
                  "type": "geopoint"
                }
              ]
            },
            "status": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "status",
                  "store": true,
                  "type": "text"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 15
    }
  },
  "sourceParams": {}
}
Here's a simplified version of the document model:
{
  "type": "lead",
  "geo": {
    "lat": 40.4210126,
    "lon": -111.8836315
  },
  "status": "status1",
  "channels": {
    "users": ["user1", "user2"],
    "orgUnit": "ou1"
  },
  //...
}
Here's the query I'm currently using:
cluster.searchQuery(
    "lead_geo",
    GeoBoundingBoxQuery(grid.ulLon, grid.ulLat, grid.brLon, grid.brLat)
        .field("geo"),
    SearchOptions.searchOptions()
        .fields("geo", "status", "channels.users")
        .limit(10)
)

How can I modify this query to additionally filter on status and/or channels.users? For example:
WHERE status IN ("status2", "status3") AND (ARRAY_CONTAINS(channels.users, "user3") OR ARRAY_CONTAINS(channels.users, "user5"))
Is it possible to do this entirely with the FTS index? I have the properties indexed and stored, along with geo.

Another question: is it possible to use a subdocument property as the type filter? For example:

{
  "subDoc": {
    "type": "docType"
  }
}

using subDoc.type as the type filter in the FTS query? I tried this syntax and, while I didn't get an error, my test queries didn't use the index the way they did when I used a root property for type instead.

I figured out how to do the additional filtering by status and channels.users with conjuncts, disjuncts, and match queries:
cluster.searchQuery(
    "lead_geo",
    conjuncts(
        geoBoundingBox(grid.ulLon, grid.ulLat, grid.brLon, grid.brLat)
            .field("geo"),
        disjuncts(
            match("status2").field("status"),
            match("status3").field("status")
        ),
        disjuncts(
            match("user3").field("channels.users"),
            match("user5").field("channels.users")
        )
    ),
    searchOptions()
        .fields("geo", "status", "channels.users")
        .limit(10_000)
)

I changed the FTS index’s default analyzer to keyword as well, since I’m interested in exact ID matches.

However, there are a couple of issues with how this is working. First, I noticed that when a document lies just outside a grid's bounds, it may still show up in the search results. It may have to do with the precision of the geo index.

For example, a document with coordinates:

"geo":{
  "lon":-90.9506681188084,
  "lat":42.22834101185747
}

shows up in adjacent grid bounding box queries:

ulLat=42.229759216308594
ulLon=-90.95199584960938
brLat=42.227848052978516
brLon=-90.95066833496094 <-- adjacent edge

and

ulLat=42.229759216308594
ulLon=-90.95066833496094 <-- adjacent edge
brLat=42.227848052978516
brLon=-90.94933319091797

The document's lon matches the shared lon of both grids up to 6 decimal places. But it truly lies on only one side of the line, to the right, within the second grid. Both -90.95066833496094 and -90.9506681188084 are 64-bit doubles, which the SDK API expects. Cast to 32-bit floats, however, they are both equal to -90.95067.

This is problematic, as the algorithm depends on each document within the visible screen rect being returned exactly once across all of the smaller grid rect queries. Since the FTS bounding box query appears to be inclusive of the bounding box lat/lon, the grid coordinates need enough precision that there's an extremely low probability of a document exactly matching a boundary; otherwise a document on the shared edge of two grids is included in both query results.

As a workaround, I've needed to increase the search result limit to the max of 10,000 and manually filter out document geos that aren't actually within the grid bounding box. However, besides the additional overhead, this can't filter out results beyond the 10,000-document limit that may fail the check when there are many documents within a grid, which is possible at low zoom levels.
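
One idea I'm considering for that post-filter is to treat each grid cell as half-open, so a point that lands exactly on a shared edge is assigned to only one cell. A sketch (the FTS query itself is still inclusive on both edges, so this only dedupes the results after the fact):
// Half-open containment: include the top/left edges, exclude the bottom/right edges,
// so a document on the shared edge of two adjacent cells is kept by exactly one of them.
// (Cells on the outer edge of the visible rect would still need to include that edge.)
fun containsHalfOpen(
    ulLat: Double, ulLon: Double, brLat: Double, brLon: Double,
    lat: Double, lon: Double
): Boolean =
    lat <= ulLat && lat > brLat &&
    lon >= ulLon && lon < brLon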

There may be a better way to determine the grid boundaries that avoids this duplication between adjacent bounding box queries. Could you explain how the precision of the FTS geopoint index works?

The second and more problematic issue I'm seeing is that the query results aren't consistent. Repeating the same bounding box query for the same region often returns a different set of results. This happens with or without the additional conjuncts/disjuncts filters. When it happens, it's continuously reproducible: back-to-back repeated queries keep returning different sets of results that are fewer than the total number of expected results. I've reproduced it with both the Java SDK and N1QL SEARCH() queries in the admin console. The queries seem to work as expected immediately after the FTS index is built, or after not querying the cluster for a while, but after panning and zooming briefly on the map, performing many queries in the process, the queries begin to misbehave and return incomplete results.

Along with the incomplete results from the queries, the SDK is logging this repeatedly:

2021-11-22 21:33:44.563 [cb-events] WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly {"circuitBreaker":"DISABLED","coreId":"0xec75650000000001","local":"<server_ip>:49238","remote":"<cluster_ip>:8094","type":"SEARCH"}
2021-11-22 21:33:44.563 [cb-events] WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly {"circuitBreaker":"DISABLED","coreId":"0xec75650000000001","local":"<server_ip>:49230","remote":"<cluster_ip>:8094","type":"SEARCH"}
2021-11-22 21:33:44.563 [cb-events] WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly {"circuitBreaker":"DISABLED","coreId":"0xec75650000000001","local":"<server_ip>:49292","remote":"<cluster_ip>:8094","type":"SEARCH"}
2021-11-22 21:33:44.563 [cb-events] WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly {"circuitBreaker":"DISABLED","coreId":"0xec75650000000001","local":"<server_ip>:49277","remote":"<cluster_ip>:8094","type":"SEARCH"}
2021-11-22 21:33:44.563 [cb-events] WARN com.couchbase.endpoint - [com.couchbase.endpoint][UnexpectedEndpointDisconnectedEvent] The remote side disconnected the endpoint unexpectedly {"circuitBreaker":"DISABLED","coreId":"0xec75650000000001","local":"<server_ip>:49299","remote":"<cluster_ip>:8094","type":"SEARCH"}

@jeff.lockhart ,

Your use case demands a thorough analysis of the details, and I am skeptical about doing that effectively over forum ping-pong. So, if you have the option, please create a support ticket with all the details and the cbcollect_info logs.

Precision => FTS should give precise results up to a granularity of a few meters. We could dig into your use case if you share your exact query and the missing document's coordinates.

Inconsistent results => this looks improbable, but I can't confirm anything without checking the logs. Not sure whether there are any server write timeouts or other connection/gateway errors happening here (as per the SDK error logs).
Were there any external network elements that act like a circuit breaker or throttler here?
Were those queries taking more than 60 secs to finish?
What happens when you hit the FTS backend directly over curl calls upon the first occurrence of such inconsistent results?

Additional status/channel fields within the FTS index => this is indeed possible and should help improve your performance.

Performance => having more partitions would help take advantage of the CPU cores available on the same node (or of additional nodes) to parallelize the query and thereby improve performance.

10K limit => this is configurable - bleveMaxResultWindow | Couchbase Docs
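
It's a per-search-node manager option set over the REST port, something along these lines (see the linked docs page for the exact call):
curl -XPUT -u <Administrator>:<password> \
  http://<search_node>:8094/api/managerOptions \
  -H "Content-Type: application/json" \
  -d '{"bleveMaxResultWindow": "100000"}'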

Skip the scoring => since you are doing keyword-based exact matches, tf-idf-based relevancy scoring doesn't seem to make sense for your use case. You may skip the scoring to gain some performance.
ref- Scoring | Couchbase Docs
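
With the Java SDK this should just be a search option, assuming a 3.x SDK version that exposes disableScoring (it maps to "score": "none" in the REST request) - a sketch:
import com.couchbase.client.java.search.SearchOptions

// Sketch: skip tf-idf scoring on the search request.
val options = SearchOptions.searchOptions()
    .fields("geo", "status", "channels.users")
    .limit(10_000)
    .disableScoring(true)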

Cheers!
Sreekanth

@jeff.lockhart If you think the results produced from N1QL are not consistent, what would help us diagnose the situation for you are the actual queries that you ran from the N1QL workbench (using SEARCH()) or the sample SDK code (like you shared last time) that produces the inconsistent results.

My suspicion is that your N1QL query is treating the index as non-covering, and during the verification phase it possibly doesn't have enough context to verify documents correctly. See - Search Functions | Couchbase Docs

While running your N1QL query, would you add a 3rd argument within the search function like this (introducing the FTS index name within the options argument)…

SEARCH(salesrabbit, ..<search request>.., {"index": "lead_geo"})

I do agree with @sreeks - it might do good for you to reach out to Couchbase support to register a ticket with them - you’ll be given access to a platform to upload logs of your cluster and other necessary details for us.

Thank you for the helpful info. I've been juggling a lot. I'll direct those who are working further on this to reach out to support if necessary to debug further. Some notes from what I've found are below:

Were there any external network elements that act like a circuit breaker or throttler here?

I don’t believe so. Each query receives a result response, they’re just not the same between subsequent requests.

Were those queries taking more than 60 secs to finish?

No, the query results were returned quickly (if the bounding box area was small). The inconsistent query result behavior was the same regardless of the bounding box size.

What happens when you hit the FTS backend directly over curl calls upon the first occurrence of such inconsistent results?

I didn’t attempt this, but I did see the same inconsistent results using both the Java SDK as well as the admin console N1QL query interface.

You may skip the scoring to gain some performance.

Interestingly, I found that after disabling scoring, the inconsistent query results stop occurring. If I perform the queries with scoring for a while, I eventually get into the state where repeating the same query returns different results. Then, if I perform that same query with scoring disabled, the results are still inaccurate the first time, but subsequent queries with scoring disabled begin to return the same accurate results. Does this give any clues as to why the query with scoring is not returning accurate results?

having more partitions would help take advantage of the CPU cores available on the same node (or of additional nodes) to parallelize the query and thereby improve performance.

I tried increasing the index partitions from 1 to 16, since there are 16 CPU cores on the single-node test server I'm working with, but I found a decrease in performance. A query that previously took ~2 seconds consistently took over 15 seconds after increasing the partitions. Changing the partitions to 2 saw the same query take ~4 seconds. How should I determine the number of partitions? Would it only be of benefit with multiple nodes running the search service, and therefore not in this single-node test environment?

I missed this conversation; not sure whether the issue still persists.
What was the FTS RAM quota set to here? An insufficient RAM quota could dent performance with higher partition counts too. Without delving into the product's fts.log, it would be difficult to predict what is happening with the system.