What Is This About
So, you have a time series use case? So do I, and this blog post is proof of it.
When I learn about a new subject, I like using the method of loci to memorize and connect the new concepts. With this method, you mentally walk through a familiar place (your house, for instance) and fill it with the things you want to remember. If I want my memory to hold my schedule for tomorrow, I can start at the entrance and “hang” my 09:00 doctor’s appointment next to my huge colorful sombrero. As I “walk” into the kitchen, I will see my 10:30 Zoom call “frying” on top of a skillet with scrambled eggs. The weird imagery helps us remember things better: our brains cling to the unusual. Interestingly enough, this example is a time series, too: each event on my schedule happens at a certain time, and each event has certain properties (name, place, etc.).
Let’s continue this “walk through a house” analogy and try to make the new time series knowledge stick.
The Green Room
Green is a happy color: nature, renewal, growth, fresh ideas. Our definition of time series belongs in the green room.
When your child has a fever and you take his or her temperature every hour, you have a series of measurements over time. When you clandestinely attach a GPS tracker to a criminal’s car, you get a series of locations where the car stops over weeks or months. How about a Raspberry Pi project your kid built for a school science fair that collects temperature and humidity readings once a second? Install it in your cannabis farm to stream this data to your Couchbase cluster! You can then use the data with a Couchbase Mobile app to ensure the optimal conditions for the best harvest of this important crop.
As you can see, time series use cases extend well beyond the old boring stock ticker and server log examples. They still boil down to a list (sequence, series) of values (readings, measurements, data points) collected over time. Time series use cases are common nowadays in consumer products, industrial technologies, and business services. This popularity inspired the appearance of time series databases, which include optimizations for working specifically with time series data. For example, we provide a Prometheus exporter (written in Python) that converts Couchbase Server metrics into the Prometheus time series format, so you can build your monitoring dashboards with Grafana.
We are leaving our green room with a melted clock hanging from the ceiling (it needs to be weird, right?) and a melting white board with a series of distorted time-stamped measurements as an example of the time series data from a summer day in Arizona (in degrees Fahrenheit):
2019-07-27 07:11:22, temperature=90, humidity="what humidity?", feeling="sweaty"
2019-07-27 10:21:22, temperature=98, humidity="this word again...", feeling="the pain"
2019-07-27 12:34:56, temperature=107, humidity="gotta look up this word in a dictionary", feeling="desperate"
2019-07-27 15:00:27, temperature=120, feeling="boiling inside, fried on the outside"
2019-07-27 15:32:15, temperature=, feeling="confused", thermometer-status="melted"
The Orange Room
Orange is the color that attracts our attention; it usually evokes feelings of enthusiasm and warmth. So, it is a good place to put the details of how we work with time series data in Couchbase Server.
Be prepared to handle a lot of writes! Real-life time series use cases produce thousands of readings – per hour, per minute, per second. Multiply that by the number of devices or applications generating the data, and you will quickly reach millions of new writes per day. Couchbase Server optimizes data ingestion with its memory-first, asynchronous architecture. JSON documents are compressed in memory, on disk, and on the wire.
Couchbase Server handles all types of writes – inserts, updates, and upserts (insert if the key does not exist; otherwise, update) – with equal efficiency. You can further optimize your updates with our sub-document operations, available in every supported SDK (e.g., the Java SDK sub-document API).
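To make the upsert definition above concrete, here is a minimal plain-Python sketch of the three write types. This is not Couchbase SDK code – the real calls go through the SDK’s key-value and sub-document APIs – just an illustration of the semantics, with an in-memory dict standing in for the bucket:

```python
def insert(store: dict, key: str, doc: dict) -> None:
    # Insert fails if the key already exists
    if key in store:
        raise KeyError(f"document already exists: {key}")
    store[key] = doc

def upsert(store: dict, key: str, doc: dict) -> None:
    # Upsert: insert if the key does not exist; otherwise, update
    store[key] = doc

def subdoc_upsert(store: dict, key: str, path: str, value) -> None:
    # A sub-document operation mutates a single field,
    # so the client never re-sends the whole document
    store[key][path] = value

bucket = {}
upsert(bucket, "sensor::temp::2020-01-02T12:34:00",
       {"ts": "2020-01-02 12:34:00", "temperature": 112.4})
subdoc_upsert(bucket, "sensor::temp::2020-01-02T12:34:00", "temperature", 112.6)
```

The payoff of the sub-document form is on the wire: only the mutated path travels to the server, which matters when your documents grow to sixty readings each.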
The next step is aggregation. To cut down on write and storage overhead, we went from one small document per reading, like these:

key = "sensor::temp-press::2020-01-02T12:34:00"
{
  "ts": "2020-01-02 12:34:00",
  "temperature": 112.4,
  "pressure": 21.7
}

key = "sensor::temp-press::2020-01-02T12:34:01"
{
  "ts": "2020-01-02 12:34:01",
  "temperature": 110.8,
  "pressure": 21.2
}

...

key = "sensor::temp-press::2020-01-02T12:34:58"
{
  "ts": "2020-01-02 12:34:58",
  "temperature": null,
  "pressure": 22.8
}

key = "sensor::temp-press::2020-01-02T12:34:59"
{
  "ts": "2020-01-02 12:34:59",
  "temperature": 113.1,
  "pressure": 22.5
}

to a single document with all the readings, like the one below:

key = "sensor::tps-001::1577968440"
{
  "v": 1,
  "t": [112.4, 110.8, ... null, 113.1],
  "p": [21.7, 21.2, ... 22.8, 22.5]
}
Apart from eliminating duplicate data (timestamps, long JSON field/attribute names), we have also made the following changes:
- replaced the ISO timestamp with an epoch value in the document key. The epoch value corresponds to the minute (2020-01-02 12:34) for which we collected the per-second readings;
- added the sensor name to the document key. This way we can query the values for a specific sensor using key-value operations, which is always the fastest way to work with documents;
- listed the temperature and pressure data in 60-element arrays, one measurement per second;
- added a JSON schema version to the document.
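The key and document construction described above can be sketched in a few lines of Python. The function names are ours (not an SDK API), and the schema version field name "v" is an assumption based on the example document:

```python
from datetime import datetime, timezone

def minute_key(sensor: str, ts: datetime) -> str:
    # Truncate the timestamp to the minute, then use its epoch value in the key
    minute = ts.replace(second=0, microsecond=0)
    return f"sensor::{sensor}::{int(minute.timestamp())}"

def aggregate_minute(readings) -> dict:
    # readings: iterable of (second, temperature, pressure) tuples;
    # seconds with no reading stay null, as in the example document
    t, p = [None] * 60, [None] * 60
    for second, temperature, pressure in readings:
        t[second] = temperature
        p[second] = pressure
    return {"v": 1, "t": t, "p": p}

key = minute_key("tps-001", datetime(2020, 1, 2, 12, 34, 56, tzinfo=timezone.utc))
# key == "sensor::tps-001::1577968440"
```

Note that truncating to the minute means all sixty per-second readings map to the same key, which is exactly what lets us upsert them into one document.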
The flexibility of the JSON format makes it easy to get your first version out quickly (in the spirit of the “Worse is better” principle) and evolve your document schema as your application matures. Specialized time series databases are much more rigid in this respect. However, it is a best practice to keep track of the schema version in your documents. This helps ensure backward compatibility in your applications. It also allows you to migrate documents to a new schema by running an N1QL UPDATE query.
It is time to do something useful with our time series data. Here are the options that the Couchbase data platform offers:
- Key-value reads in under a millisecond are what Couchbase Server does best, all day long. After all, one of the main reasons we aggregated our data was to fetch it in a single read operation.
- SQL queries for your JSON data. The N1QL language purposely inherits SQL syntax to make it easy for you to learn.
- Full text search is another way to work with data that’s available with Couchbase Server. Depending on your use case, the search can be a better alternative for language-aware, numeric range, date range, and geospatial queries. Better yet, you can combine search and N1QL in a single query.
- Analytics is a natural fit for time series data. As part of our data platform, we offer the Analytics service, which allows you to obtain a wide variety of business insights from your data. Couchbase Analytics runs as part of the same cluster where your data resides, so no ETL (Extract, Transform, Load) operations are necessary. You execute efficient parallel queries against up-to-date shadow copies of the data.
- Couchbase also plays nicely with others. We offer supported big data connectors for popular systems like Spark, Kafka, and Elasticsearch.
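To illustrate the first option, here is a sketch of the read path under the aggregated schema: derive the minute key from the timestamp, fetch the minute document with a single key-value get, and index into the arrays by second. A dict lookup stands in for the SDK call; the function name is ours:

```python
from datetime import datetime, timezone

def read_second(bucket: dict, sensor: str, ts: datetime):
    # One key-value get returns the whole minute document
    minute = ts.replace(second=0, microsecond=0)
    key = f"sensor::{sensor}::{int(minute.timestamp())}"
    doc = bucket[key]  # in production: a single SDK get(key) call
    # The second within the minute is the array index
    return doc["t"][ts.second], doc["p"][ts.second]
```

One get per minute of data is the whole read path – no index, no query planner – which is why key-value access stays in sub-millisecond territory.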
Sooner or later, it will be time to move on. How long do you need to keep your data? We always ask this question when we help our customers size their clusters. With Couchbase, it is easy to scale up or down. An online gaming company may want to start with a few extra nodes when launching a new game. On the other hand, a startup on a tight budget may have to be extra vigilant about how much data they keep around.
How can we delete data in Couchbase?
- Set document expiration (also known as TTL, time to live). TTL values are part of the document metadata: if the TTL is zero, the document will not expire; a value greater than zero is the number of seconds after which the document will be marked as deleted. TTL can be set and updated via SDK methods or (as of Couchbase Server 6.5.1) via an N1QL query.
- Set a TTL on a bucket. All new documents entering the bucket will get that TTL assigned, unless one is already set on the document.
- Delete documents via an N1QL query or an Eventing function. Eventing functions can also be executed by timers.
- Flush or drop the entire bucket.
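The TTL rules in the first bullet can be expressed as a tiny helper – ours, purely for illustration, since the server enforces expiration itself:

```python
def is_expired(ttl_seconds: int, created_at: float, now: float) -> bool:
    # A TTL of zero means the document never expires;
    # otherwise the document is marked as deleted ttl_seconds after creation
    if ttl_seconds == 0:
        return False
    return now >= created_at + ttl_seconds
```

For time series data, a per-bucket TTL equal to your retention window turns "Move on" into a no-op: old minute documents simply age out.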
Prior to deleting data from a Couchbase cluster, you may want to archive or share it for some of the reasons below:
- Archive documents to another Couchbase cluster by using XDCR (cross data center replication). The documents in the source bucket can have TTLs, which can be removed before entering the destination bucket. This option is configurable for XDCR replications.
- Archive documents to a cheaper storage system (e.g., AWS S3) for longer retention.
- Move documents to a different system for further analysis (e.g., a data lake for long-term scientific research).
As we leave our orange room, let’s keep this W.A.R.M. feeling (Write, Aggregate, Read, Move on) in our memory. The life cycle of time series data is shown with these capital letters – one letter per wall – and squiggly arrows connecting them.
More Rooms Coming Soon
Well, yellow is the color of hope, but it’s very hard to read on a white background. I’ll use it here once as a symbol of my hope to share Episode 2 of this Series on Time Series with you soon. Thank you for your time!