Is this the right technology? Alternative for chaining views?
Hi,
I am modeling the architecture for a large implementation and I am exploring the possibility of using Couchbase. However, I find that I need either to chain views (map/reduce) to merge content, or to find a viable alternative.
This is the (simplified) scenario:
I have 10^7 docs. Each doc contains:
- product id
- Info on the product (~3 KB)
- Category (of the info, not the product. Information is divided into 10 main categories)
- Source (who provided that info)
Docs with new or updated info come from different sources continuously.
I have a source hierarchy, and based on that, when the client system requires info on product X, I need to provide the info of all available categories, but for each category I need to select the source based on the hierarchy.
Ideally, if reduce sizes were not a problem, we could create a map function that emits each doc keyed by product, and a reduce function that aggregates the info of each category and filters out information when a better source is available.
This would be great because we would just insert documents as info arrives from the sources, and the client would always receive the best merged/filtered content. However, since the merged content could reach 30 KB per product, that is too much for a reduce function to return and it would hit performance hard.
We would feed the system:
(Product:1, Category: A, Source: X, Info: A1)
(Product:1, Category: A, Source: Y, Info: A2)
(Product:1, Category: B, Source: X, Info: B1)
And we would retrieve:
(Product:1, [ Cat A : A2, Cat B: B1])
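To make the intended logic concrete, here is a minimal sketch in plain Python (not an actual Couchbase view, which would be written differently). The names `SOURCE_RANK`, `map_doc`, `reduce_best` and `run_view` are hypothetical, and the ranking of source Y above source X is an assumption inferred from the example, where A2 wins for category A.

```python
from collections import defaultdict

# Assumed hierarchy: a lower rank wins. Y outranking X matches the example.
SOURCE_RANK = {"Y": 0, "X": 1}

def map_doc(doc):
    """Emit the doc keyed by product, like the map step of a view."""
    yield doc["product"], (doc["category"], doc["source"], doc["info"])

def reduce_best(values):
    """For each category, keep the info from the best-ranked source."""
    best = {}
    for category, source, info in values:
        current = best.get(category)
        if current is None or SOURCE_RANK[source] < SOURCE_RANK[current[0]]:
            best[category] = (source, info)
    return {category: info for category, (_, info) in sorted(best.items())}

def run_view(docs):
    """Group the mapped rows by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_best(values) for key, values in grouped.items()}

docs = [
    {"product": 1, "category": "A", "source": "X", "info": "A1"},
    {"product": 1, "category": "A", "source": "Y", "info": "A2"},
    {"product": 1, "category": "B", "source": "X", "info": "B1"},
]
merged = run_view(docs)
print(merged)  # {1: {'A': 'A2', 'B': 'B1'}}
```

The reduce output here carries the full info for every category, which is exactly what makes it too large in the real scenario.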
So, I could only think of one alternative:
Create a view with a map function that emits each doc keyed by product, and a reduce function that keeps only the ids of the (up to) 10 docs holding the selected info, one per category, based on the hierarchy.
We would feed the system:
(Id 1, Product:1, Category: A, Source: X, Info: A1)
(Id 2, Product:1, Category: A, Source: Y, Info: A2)
(Id 3, Product:1, Category: B, Source: X, Info: B1)
And we would retrieve:
(Product:1, [ Id2, Id3])
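A rough sketch of that id-only reduce, again in plain Python with hypothetical names (`SOURCE_RANK`, `select_ids`, `merge_partials`, `selected_ids`) and the same assumed hierarchy where Y outranks X. Because a Couchbase reduce function can be called again on its own partial results (rereduce), the sketch keeps the winning source next to each id so that partial results can still be compared.

```python
# Assumed hierarchy: a lower rank wins. Y outranking X matches the example.
SOURCE_RANK = {"Y": 0, "X": 1}

def select_ids(rows):
    """rows: (doc_id, category, source) tuples for one product.

    Returns category -> (source, doc_id) for the best-ranked source.
    """
    best = {}
    for doc_id, category, source in rows:
        current = best.get(category)
        if current is None or SOURCE_RANK[source] < SOURCE_RANK[current[0]]:
            best[category] = (source, doc_id)
    return best

def merge_partials(partials):
    """Combine partial results, as a rereduce step would have to."""
    rows = [
        (doc_id, category, source)
        for partial in partials
        for category, (source, doc_id) in partial.items()
    ]
    return select_ids(rows)

def selected_ids(best):
    """Strip the bookkeeping and return just the chosen doc ids."""
    return sorted(doc_id for _, doc_id in best.values())

rows = [(1, "A", "X"), (2, "A", "Y"), (3, "B", "X")]
print(selected_ids(select_ids(rows)))  # [2, 3]

# Reducing in two batches and merging gives the same answer.
partials = [select_ids(rows[:1]), select_ids(rows[1:])]
print(selected_ids(merge_partials(partials)))  # [2, 3]
```

The value returned per product stays small (at most one id and one source per category), regardless of how large the info payloads are.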
We would insert the docs with info as they arrive, then query that view for the products that have changed, get the ids of the docs we should use for each product, retrieve those docs, merge them, and insert the result in another bucket. The client system can then consume it from there (a sort of offline/online database).
This is much more manual work and more error-prone.
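The manual workflow described above could look roughly like this. It is only a sketch: plain dictionaries stand in for the two buckets and for the view result, no real Couchbase SDK calls are used, and `publish_merged` and the `product::<id>` key format are made up for illustration.

```python
def publish_merged(changed_products, ids_by_product, source_bucket, serving_bucket):
    """Rebuild the merged doc for each changed product.

    ids_by_product stands in for the result of querying the id-only view.
    """
    for product in changed_products:
        docs = [source_bucket[doc_id] for doc_id in ids_by_product[product]]
        serving_bucket[f"product::{product}"] = {
            "product": product,
            "categories": {doc["category"]: doc["info"] for doc in docs},
        }

# Stand-in for the bucket that receives raw docs from the sources.
source_bucket = {
    1: {"product": 1, "category": "A", "source": "X", "info": "A1"},
    2: {"product": 1, "category": "A", "source": "Y", "info": "A2"},
    3: {"product": 1, "category": "B", "source": "X", "info": "B1"},
}
# Stand-in for the view output: product -> ids of the selected docs.
ids_by_product = {1: [2, 3]}
# Stand-in for the bucket the client system reads from.
serving_bucket = {}

publish_merged([1], ids_by_product, source_bucket, serving_bucket)
print(serving_bucket["product::1"]["categories"])  # {'A': 'A2', 'B': 'B1'}
```

Keeping track of which products have changed, and keeping the second bucket in sync when a run fails halfway, are the parts that this sketch leaves out.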
It would be better if we could chain the views, meaning that the result of one view is used to generate another. Then we could have a second view feeding from the first one and producing the actual "merging" of the selected docs. However, I hear this is not possible.
Can you think of a better alternative? Or is this only possible with all that work around the technology (syncing 2 DBs manually as we insert more docs)?
Thanks,
Serge
Hi Serge,
The tradeoff is between incremental updates and chaining. An online system such as ours is built for incremental updates, meaning we can rerun the map and reduce logic for just a well-defined subset of the original data.
If there were chains involved, it'd be necessary to rerun everything from the output of the first MapReduce job. That doesn't mean we couldn't do the incremental piece, but using a view as the input for another view is not something currently implemented.
I'll see if some of my colleagues can offer a better alternative.
Thanks,
Matt