If you have been following the earlier posts on the Couchbase blog, you have seen content about Couchbase Shell latest release. I am trying different things and today I wanted to see what I could do with time series datasets.
Ingest time series data
Browsing Kaggle, I found temperature data per City. I have downloaded it. It’s about 500Mb and that makes it hard to manipulate as a whole. But, of course, it’s a text file so we can easily peek at the data structure like so:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
> 열기 GlobalLandTemperaturesByCity.csv | 먼저 10 ╭───┬────────────┬────────────────────┬───────────────────────────────┬───────┬─────────┬──────────┬───────────╮ │ # │ dt │ AverageTemperature │ AverageTemperatureUncertainty │ City │ Country │ Latitude │ Longitude │ ├───┼────────────┼────────────────────┼───────────────────────────────┼───────┼─────────┼──────────┼───────────┤ │ 0 │ 1743-11-01 │ 6.07 │ 1.74 │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 1 │ 1743-12-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 2 │ 1744-01-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 3 │ 1744-02-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 4 │ 1744-03-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 5 │ 1744-04-01 │ 5.79 │ 3.62 │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 6 │ 1744-05-01 │ 10.64 │ 1.28 │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 7 │ 1744-06-01 │ 14.05 │ 1.35 │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 8 │ 1744-07-01 │ 16.08 │ 1.40 │ Århus │ Denmark │ 57.05N │ 10.33E │ │ 9 │ 1744-08-01 │ │ │ Århus │ Denmark │ 57.05N │ 10.33E │ ╰───┴────────────┴────────────────────┴───────────────────────────────┴───────┴─────────┴──────────┴───────────╯ |
1 2 3 |
> 열기 GlobalLandTemperaturesByCity.csv | 길이 8599212 |
8599212 lines, this is going to be 8599211 documents. My end goal is to see a temperature graph of several cities over the years. To do that, I am going to import everything to an import bucket first, then I will transform the data into time series.
If I import this naïvely, it gives me one doc per city/month group. So the key of my doc will be like Århus:1743-11-01. Say I only want to upload the first line, it should look like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
> 열기 GlobalLandTemperaturesByCity.csv | 먼저 1 | wrap 콘텐츠 | 삽입 id { |x| $"($x.content.City):($x.content.dt)" } ╭───┬───────────────────┬──────────────────╮ │ # │ content │ id │ ├───┼───────────────────┼──────────────────┤ │ 0 │ {기록 7 필드} │ Århus:1743-11-01 │ ╰───┴───────────────────┴──────────────────╯ > 열기 GlobalLandTemperaturesByCity.csv | 먼저 1 | wrap 콘텐츠 | 삽입 id { |x| $"($x.content.City):($x.content.dt)" } |doc 업서트 ╭───┬───────────┬─────────┬────────┬──────────┬─────────╮ │ # │ 처리 │ 성공 │ 실패 │ 클러스터 │ 실패 │ 클러스터 │ ├───┼───────────┼─────────┼────────┼──────────┼─────────┤ │ 0 │ 1 │ 1 │ 0 │ │ 아카펠라 │ ╰───┴───────────┴─────────┴────────┴──────────┴─────────╯ |
And now that I know it works, let’s get everything, and batch it:
1 |
열기 GlobalLandTemperaturesByCity.csv | par-각 -t 5 {|x| wrap 콘텐츠 | 삽입 id {$"($x.City):($x.dt)" } } | doc 업서트 |
I have created a 시리즈 collection in which I import the result of a 선택 aggregation, getting all the dates as timestamps and all the AverageTemperature as value:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
쿼리 ' INSERT INTO series(KEY _k, VALUE _v) SELECT a.City _k, {"City": a.City , "ts_start" : MIN(STR_TO_MILLIS(a.dt)), "ts_end" : MAX(STR_TO_MILLIS(a.dt)), "ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v FROM import a WHERE a.dt BETWEEN "1700-01-01" AND "2020-12-31" GROUP BY a.City; ' 삽입 INTO 시리즈(KEY _k, VALUE _v) 선택 a.도시 _k, {"City": a.도시 , "ts_start" : MIN(STR_TO_MILLIS(a.dt)), "ts_end" : MAX(STR_TO_MILLIS(a.dt)), "ts_data" : ARRAY_AGG([STR_TO_MILLIS(a.dt), a.AverageTemperature]) } _v FROM 가져오기 a 어디 SUBSTR(a.도시,0,1) = "Z" AND a.dt 사이 "2010-01-01" AND "2020-12-31" 그룹 BY a.도시; |
And now all the data is available as time series. Say I want the data from Paris, I can use the _timeseries function like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
> let ts_ranges = [946684800000,1375315200000]; 쿼리 $"SELECT t._t time, t._v0 `value` FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city = \"Paris\" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \);" | 거부 클러스터 | 에 csv 시간,값 946684800000,3.845 949363200000,6.587000000000001 951868800000,7.872000000000001 954547200000,10.067 957139200000,15.451 959817600000,17.666 962409600000,16.954 965088000000,19.512 967766400000,16.548000000000002 970358400000,11.675999999999998 .... |
To speed things up you can create the following index: CREATE INDEX ix1 ON series(City, ts_end, ts_start);
사용법에 유의하세요. ts_ranges variable at the beginning. You can easily reuse those values in a template 문자열. They start with a $ sign and variables must be in parenthesis like ($my_variable). Which also means that now you need to escape parenthesis character as well as double quotes.
Plot time series
There are a variety of terminal plotting libraries, here I am using youplot:
1 |
> let ts_ranges = [946684800000,1375315200000]; 쿼리 $"SELECT t._t date, t._v0 `value` FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city = \"Paris\" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \);" | 거부 클러스터 | 에 csv | uplot 라인 -d, --xlim 946684800000,1375315200000 --ylim -20,35 --title "global temperature in Paris" --xlabel 날짜 --ylabel 온도` |
Now that is all great but ideally I want to have several cities in there to compare them. Any additional CSV columns are picked up automatically as long as you run youplot lines. Doing things gradually, let’s start with the support of multiple cities in the query. So a couple changes here:
-
- grouping the data by time
- 추가 d.city IN ($city) to the where clause. This works because the city array is a String literal of a JSON array. Let’s take a look at the answer as a JSON doc:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
> let ts_ranges = [946684800000,1375315200000]; let 도시 = ['"Aba"','"Berlin"','"London"','"Paris"']; 쿼리 $"SELECT t._t date, \( ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) \) FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" | 거부 클러스터 | 먼저 |에 json [ { "date": "2000-01-01", "$1": [ { "city": "Aba", "temp": 26.985000000000007 }, { "city": "Berlin", "temp": 1.3239999999999998 }, { "city": "London", "temp": 4.6930000000000005 }, { "city": "Paris", "temp": 3.845 } ] } ] |
But this cannot turn into a CSV, even if you flatten everything like this:
1 2 3 4 5 6 7 |
> let ts_ranges = [946684800000,1375315200000]; let 도시 = ['"Aba"','"Berlin"','"London"','"Paris"']; 쿼리 $"SELECT t._t date, \( ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) \) FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |거부 클러스터 | 먼저 |평평하게|평평하게|에 csv 날짜,도시,temp 2000-01-01,Aba,26.985000000000007 2000-01-01,Berlin,1.3239999999999998 2000-01-01,런던,4.6930000000000005 2000-01-01,파리,3.845 |
So we can transform it into an object like so: Object v.city : v.temp FOR v IN ARRAY_AGG({"city": d.city, "temp":t._v0}) when v IS NOT MISSING END
1 2 3 4 5 6 7 8 9 10 11 12 |
> let ts_ranges = [946684800000,1375315200000]; let 도시 = ['"Aba"','"Berlin"','"London"','"Paris"']; 쿼리 $"SELECT t._t date, \( 개체 v.도시 : v.temp FOR v IN ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) when v IS NOT MISSING END\) FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |거부 클러스터 | 먼저 | 에 json { "date": "2000-01-01", "$1": { "Aba": 26.985000000000007, "Berlin": 1.3239999999999998, "London": 4.6930000000000005, "Paris": 3.845 } } |
Which can now be flatten into a CSV:
1 2 3 4 5 |
> let ts_ranges = [946684800000,1375315200000]; let 도시 = ['"Aba"','"Berlin"','"London"','"Paris"']; 쿼리 $"SELECT t._t date, \( 개체 v.도시 : v.temp FOR v IN ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) when v IS NOT MISSING END\) FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |거부 클러스터 | 먼저 | 평평하게| 에 csv 날짜,Aba,Berlin,런던,파리 2000-01-01,26.985000000000007,1.3239999999999998,4.6930000000000005,3.845 |
And with that we are ready to plot several lines:
1 |
> let ts_ranges = [946684800000,1375315200000]; let 도시 = ['"Aba"','"Berlin"','"London"','"Paris"']; 쿼리 $"SELECT t._t date, \( 개체 v.도시 : v.temp FOR v IN ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) when v IS NOT MISSING END\) FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t WHERE d.city IN ($city) AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \) GROUP BY t._t ORDER BY t._t;" |거부 클러스터 | 평평하게 | 에 csv | uplot 라인 -H -d, --xlim 946684800000,1375315200000 --ylim -10,50 -w 50 --title "temperature per month" --xlabel 날짜 --ylabel 온도 |
To make that easier, you can use functions. Create a .nu file, like temp.nu, with the following content:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
def tempGraph [$ts_ranges, $도시] { cb-환경 버킷 cbsh cb-환경 범위 globaltemp cb-환경 컬렉션 시리즈 let 쿼리 = ($"SELECT t._t date, \( 개체 v.도시 : v.temp FOR v IN ARRAY_AGG\({\"도시\": d.도시, \"temp\":t._v0}\) when v IS NOT MISSING END\)" + $" FROM series AS d UNNEST _timeseries\(d, {\"ts_ranges\":($ts_ranges)}\) AS t " + $" WHERE d.city IN ($city) " + $" AND \(d.ts_start <= ($ts_ranges.1) AND d.ts_end >= ($ts_ranges.0) \)" + $" GROUP BY t._t ORDER BY t._t;" ) let csv = 쿼리 $쿼리 |거부 클러스터 | 평평하게 | 에 csv $csv | uplot 라인 -H -d, --xlim 946684800000,1375315200000 --ylim -10,50 -w 50 --title "temperature per month" --xlabel 날짜 --ylabel 온도 } |
Then if you source it it becomes much easier to use:
1 2 3 |
> 출처 ./testingcbshell/temp.nu > tempGraph [946684800000,1375315200000] ['"Aba"','"Berlin"','"London"','"Paris"'] |
I hope this gave you a quick overview of Couchbase time series support, of Couchbase Shell data manipulation, and how you can use other shell commands like youplot to make things more integrated and interesting. You could easily generate a full report in various formats from a GitHub Actions for instance, endless possibilities!
-
- Read more posts about 카우치베이스 셸.