The Basics

Before you read any further, please take a few minutes and read the excellent post on geospatial search in Couchbase, as published by my friend and colleague Brian Kane:

Go ahead; I’ll wait.

Now that you’re back, you will know that one great way to leverage Couchbase’s full text search engine is to pass to it a series of vertices which identify a polygon (usually irregular) describing a geographic region. Brian’s example uses ten pairs of lat/long points:

These points roughly bound a region in Tennessee, south of the highway, north of the National Park Boundary, and within a single county…good enough for the required analysis, and easy to paste into your request. Given these, the Couchbase full text search (FTS) index engine can easily return all of the required data elements associated with points inside (or outside) the perimeter. (Brian gives a great example of this in his post.)

The Wrench (or Maybe, the Wrench-shaped District)

But what if your polygonal region is extremely detailed and complex, maybe requiring thousands of pairs of lat/long points to describe? Do we have ready examples of these? Yes! Thanks to the hard work of the fifty State Legislatures and/or their surrogates, we have plenty of examples of regions like this in the form of U.S. Congressional districts. And thanks to Couchbase N1QL and FTS geospatial search, we have the means to manage the data with ease.

The average U.S. Congressional district requires 8,694 vertices to define it. Reasons for this are both practical (all of them are expected to comprise approximately the same number of citizens), political (the parties in power may contort district boundaries in such a way that the voters there will keep them that way–this is called gerrymandering) and geographical (a lot of them are based in part on rivers, lakes, ocean shores, mountains, and other natural boundaries). The most geographically complex district (i.e. the one requiring the largest number of vertices to describe) is the 5th Congressional District of Virginia, which takes a whopping 40,145 lat/long pairs to describe (and looks like a reverse T. Rex rampant). The simplest, requiring only 422, is the 36th Congressional District of New York, which looks like a submarine sneaking away from Lake Erie.

The Data

Clearly, then, we’re going to want to store and retrieve our geo points from a database if we want to implement queries against them on a large scale. And because the points are likely to be found in the form of an embedded array, a JSON document in Couchbase is just the ticket. Below is an example of such a document:

Why is the data shaped like this, you may wonder, with the polygon points embedded in a nameless single-element array, embedded in another “coordinates” one, embedded in a “geometry” object? The simple answer is that sometimes you just work with the data you have. (It is based on the public source I was able to find, one which was remarkably easy to import into Couchbase. Maybe I’ll write a separate post describing that process.) And even though the data is bit cumbersome, the N1QL language, as we will see below, makes it easy to retrieve what we need.

The other dataset which concerns us comprises the main portion of our example. It is a list of millions of registered voters (don’t worry; I faked the names and addresses), along with the party affiliation and voting history for each. A sample document looks like this:

The Use Case and the Set-up

Finally, then, our use case: Given an individual constituent on the phone, how does a member of Congress quickly determine whether or not the person is a member of his or her voting district? We will solve the problem with FTS and N1QL.

First we must prepare the FTS index. In our case, we will index all documents based on the type field _type. We will index the Name field as a keyword, and the Geo field as a geopoint. Here is what it looks like on my console:

(Brian’s post goes into more detail on the steps you will follow to build an index.)

Once this index is built, we will be able to pass it a series of polygon points and receive a series of hits. Following Brian’s lead, I tested this using a curl:

This shows me that a search against a simple polygonal region can and will return a list of names. Theoretically, we could stop there and let the app (or even the user) search through the hits to see if it finds the individual voter name in question. But we can do better. Let’s let the search engine narrow it down. We do this via a “conjunct” search. (Think of a conjunct as a logical AND and a disjunct as a logical OR.) Below is the example curl:

You can read this as “If the geopoint is within the polygon boundaries and the name matches the voter name, return the hit.” Works like a charm, so we know our FTS index is properly defined.

The Extraction

Now, then, we need to test the retrieval of an individual district’s boundaries from the database. Our first pass at this entails just a simple inspection of the data we are likely to use, maybe just for a single district:

This returns a result like this:

…and on and on and on for another 1.3MB of a result set. No wonder we don’t want to cut and paste this.

Our goal, remember, is to end up with something which looks like this:

Here’s how we end up with just that:

This is quite a mouthful, so let’s unpack it. Remember that we’re working with the data that we have, rather than what we might ideally want, and the polygon points we’re after are embedded in a nameless single-element array, embedded in another “coordinates” one, embedded in a “geometry” object. We need to unwind them one-by-one. First, let’s eliminate the nameless array wrapper. We do this by simply requesting that only the single (first, or “zeroth”) member of the array be returned:

The returned JSON object from this query looks like this:

We can convert this to an array (as opposed to a JSON object) by using select value:

Now we have the very large array we’re after, still wrapped in the single element of another array:

Let’s select from that return set:

This yields a bunch of little objects we can bend to our will:

Now that we can address them let’s convert the types and perform our concatenation:

The resulting objects look like this:

Now use select value to receive them as an array:

And we have the results we’re looking for:

The Ease of CTEs

The last trick up our sleeve to pull this all together is a good one. We need a way to reference the array containing the geopoints as a component of a larger SQL statement. Fortunately, N1QL provides us with the means to do so in the form of Common Table Expressions (CTE).  CTE, which are added to a query via the with clause, are evaluated once per query block and can be introduced before a select. This is exactly what we’re looking for:

We now have access to an evaluated return set “geopoints” which can be referenced in subsequent (or multiple subsequent) SQL statements. Perfect. Here it is used in the final query:

There it is, then:  A simple single-screen code block which retrieves the complex boundaries of a district and leverages them as part of a N1QL-driven geospatial search.  Give the technique a try and conquer your own geographical challenges.

Thanks very much to Brian Kane for his original post and to Dmitry Lychagin for help in unraveling the nested arrays.


Posted by Peter Reale

Peter Reale is a Senior Solutions Engineer at Couchbase and has been in the data business since 1984. He is based in Los Angeles.

Leave a reply