reduce vs client-side vs elasticsearch
I am storing a series of flat JSON records which summarize some external search results. Each record represents one search term and one day. Each record has a combination of string and numeric fields, but for the purposes of aggregation only the numeric fields matter.
Crucially, repeated searches for the same term+day are common, and more recently generated results are preferred, but old results never expire. The key for each record contains the term, the day, and a timestamp. I need to support two aggregations:
1. Take the N most recent records for a given term+day, and average each of their numeric fields.
2. Effectively take the output of the above for each day in a range of days, and average the results.
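To make the two aggregations concrete, here is a minimal client-side sketch in Python. The row layout (a `(term, day, timestamp)` key paired with a flat dict) and all function names are assumptions for illustration, not the actual schema.

```python
from collections import defaultdict
from numbers import Number


def average_numeric_fields(docs):
    """Average every numeric field across docs; string fields are ignored."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        for field, value in doc.items():
            # bool counts as a Number in Python, so exclude it explicitly
            if isinstance(value, Number) and not isinstance(value, bool):
                sums[field] += value
                counts[field] += 1
    return {field: sums[field] / counts[field] for field in sums}


def latest_n_average(rows, term, day, n=10):
    """Aggregation 1: average the n most recent records for one term+day."""
    matching = [(key, doc) for key, doc in rows if key[0] == term and key[1] == day]
    matching.sort(key=lambda row: row[0][2], reverse=True)  # newest first
    return average_numeric_fields(doc for _, doc in matching[:n])


def range_average(rows, term, days, n=10):
    """Aggregation 2: average the per-day results over a list of days."""
    per_day = [latest_n_average(rows, term, day, n) for day in days]
    # Days with no records produce an empty dict and are skipped
    return average_numeric_fields(d for d in per_day if d)
```

Note that aggregation 2 as written weights every day equally, regardless of how many records each day contributed.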
Taking some records and averaging their fields certainly sounds like a job for map-reduce, but I am not sure in this case because of the "N most recent" part.
Currently, I use map-only views to pull back a limited number of recent records and then aggregate them client-side. The performance is sub-optimal, so I added the ability to write the aggregated results back to Couchbase under a different kind of key (with a short expiration), and to always check for the pre-calculated version first. This intermediate caching strategy is great for frequently repeated calls, but useless for novel or sparse calls.
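The caching layer follows roughly this shape. The sketch below uses an in-memory dict as a stand-in for the real store, and the `agg::term::day` key scheme, the `TTLCache` class, and the 300-second expiry are all hypothetical.

```python
import time


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def cached_aggregate(cache, term, day, compute, ttl_seconds=300):
    """Return the cached aggregate for term+day, computing and storing it on a miss."""
    cache_key = f"agg::{term}::{day}"  # hypothetical key scheme
    hit = cache.get(cache_key)
    if hit is not None:
        return hit
    result = compute(term, day)
    cache.set(cache_key, result, ttl_seconds)
    return result
```

Every first request for a given term+day still pays the full cost of `compute`, which is why this helps repeated calls but not novel or sparse ones.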
I am considering two alternatives. I'd honestly like to know what people consider to be the "right" way to approach this, whether it is one of these alternatives or not.
Alternative 1: write a custom reduce function which understands the records and correctly averages the numbers. Unfortunately, I'm not sure how this works with only the N most recent records (N does not have to be a variable; we could fix it at 10). It seems that as new records get written, the incremental reduce would see only the aggregated prior results, with no way to tease out the recency of each component. However, I have never written a custom reduce and I may be misunderstanding what's available.
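The concern can be illustrated outside of Couchbase with two plain Python merge functions (both names are hypothetical). A (sum, count) partial is enough to recombine a plain average, but it has discarded the timestamps. A partial that carries the N newest (timestamp, value) pairs can still be merged in any order, at the cost of a larger intermediate value. Whether a real view engine's reduce accepts values of that size is not something this sketch addresses.

```python
import heapq


def combine_sum_count(partials):
    """Merge (sum, count) partials: fine for a plain average, but recency is lost."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)


def combine_top_n(partials, n=10):
    """Merge lists of (timestamp, value) pairs, keeping only the n newest."""
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(n, merged, key=lambda pair: pair[0])
```

Because `combine_top_n` gives the same answer whether partials are merged all at once or in stages, it tolerates incremental re-reduction; the average is then taken over the values in the final list.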
Alternative 2: pipe the data to Elasticsearch and let it do the heavy lifting. I like this approach because I have previously piped Couchbase to Elasticsearch for geospatial indexing and was really impressed with the speed.