LIMIT and OFFSET are the recommended way to paginate query results, and you can of course implement that with the SDK.
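For illustration, here is a minimal sketch of LIMIT/OFFSET paging. The statement builder, bucket name, and page size are hypothetical, not part of any SDK; the point is just that each page is a fresh query with a shifted OFFSET:

```java
// Sketch of LIMIT/OFFSET pagination. The bucket name and page size are
// made up for the example; only the LIMIT/OFFSET pattern matters.
public class PaginationSketch {
    static final int PAGE_SIZE = 100; // hypothetical page size

    // Builds the statement for a given zero-based page number.
    static String pageQuery(int page) {
        return "SELECT * FROM `travel-sample` "
             + "LIMIT " + PAGE_SIZE + " OFFSET " + (page * PAGE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(pageQuery(0)); // OFFSET 0
        System.out.println(pageQuery(1)); // OFFSET 100
    }
}
```

Note that deep OFFSETs get progressively more expensive, since the server still has to produce and skip the earlier rows, which is part of why the streaming approach below can be preferable.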
Are you processing any of the results before writing to a text file? If so, the most memory-efficient way to do this is probably to use the async API to react to rows as they come in.
You can see one (slightly complicated) example of that here:
The basic idea is to query and chain in a doOnNext() to handle rows as they come back. Your code is likely to be much simpler than the example, but the concept is approximately the same. No need to paginate, and items can be processed and written to a file as needed.
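The push-based idea can be sketched independently of the SDK. This is not Couchbase API; the row source below is a stand-in for the observable of rows, and the handler plays the role of the function you would chain in with doOnNext(): each row is written out as it arrives and can then be garbage-collected, so the full result set is never buffered.

```java
import java.io.StringWriter;
import java.util.List;
import java.util.function.Consumer;

// Sketch of push-based row handling. forEachRow() simulates a streaming
// query result by invoking the handler once per row as it "arrives".
public class StreamingSketch {
    static void forEachRow(List<String> rows, Consumer<String> onNext) {
        for (String row : rows) {
            onNext.accept(row); // like doOnNext(): called per row
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter(); // would be a FileWriter in practice
        forEachRow(List.of("{\"id\":1}", "{\"id\":2}"),
                   row -> out.write(row + "\n")); // write each row, then drop it
        System.out.print(out);
    }
}
```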
Also, note that if all you need to do is execute the query and write it to a file without processing, there may be an easier way to handle the raw results. Let me know if you want that.
The AsyncBucket is interesting. If I use this method, will I get functionality similar to that of "cursor"-based databases?
Meaning, I wouldn't need to pull all the data into memory before starting to write to the file?
Another setback I'm having with this is that I still need it to act as a synchronous process (e.g., calling next() explicitly to get the next result from the database).
So to achieve that, I would have to "mutilate" the async API above and implement some nasty blocking mechanism that blocks until a notification arrives…
Not exactly. A cursor on a traditional RDBMS is additional state held on the server side that the client can then move around, iterate forward, and so on. For your use case here, unless there is an ORDER BY clause, this solution is better than a cursor.
In this solution, items are processed as they come in, and the memory is freed as soon as you're done processing each item. The reason it's better than a cursor is that the query processor doesn't need to manage additional state either; it can more efficiently just pass items along.
In this case, you would not need to call next(). That would be a Java Iterable approach. With observables, you instead chain in code to be called when data arrives. It's generally more efficient and expressive than working with iteration.
The RxJava example there actually does the same thing. It's not really a mutilation, though; it's natural to need to block until everything is complete, handle errors, and so on. The async API we expose, built on the streaming parser for responses, lets you chain in methods for blocking, error handling, etc. Have a look at an introduction to reactive extensions, and then look at our docs and blogs for more info.
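If it helps, the "block until everything is complete" part can be sketched with plain JDK concurrency, no RxJava involved. The row source here is a simulated background thread, and a CountDownLatch plays the role that a blocking subscribe (or RxJava's toBlocking()) would play: the caller waits until the stream signals completion.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Sketch of bridging an async stream to a synchronous caller.
// The background thread simulates rows arriving asynchronously;
// nothing here is SDK API.
public class BlockingBridgeSketch {
    public static List<String> collectBlocking(List<String> rows)
            throws InterruptedException {
        List<String> collected = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);
        new Thread(() -> {                 // simulates the async producer
            rows.forEach(collected::add);  // "onNext": handle each row
            done.countDown();              // "onCompleted": release the waiter
        }).start();
        done.await();                      // block the caller until the stream ends
        return collected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(collectBlocking(List.of("row1", "row2")));
    }
}
```

In practice you would also latch on the error path, so a failure in the producer doesn't leave the caller waiting forever.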
Yes, if you're forced into that kind of API, that would be a solution. Note that you could fetch the actual documents lazily, so you'd only need to hold the queue of document IDs in memory. That way you don't need to hold the actual documents or do all of the processing up front.
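The ID-queue idea looks roughly like this sketch. The Map stands in for the bucket and get() for a per-document fetch; in real code the fetch would be a key-value lookup against the bucket, which is an assumption of this example, not shown here:

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;

// Sketch of holding only document IDs and fetching each document lazily.
// The Map simulates the bucket; only the queue-of-IDs pattern matters.
public class LazyFetchSketch {
    public static void main(String[] args) {
        Map<String, String> store = Map.of("u::1", "{\"name\":\"a\"}",
                                           "u::2", "{\"name\":\"b\"}");
        // Query phase: keep only the (small) IDs in memory.
        Queue<String> ids = new ArrayDeque<>(store.keySet());
        // Processing phase: fetch one document at a time, then drop it.
        while (!ids.isEmpty()) {
            String doc = store.get(ids.poll()); // lazy per-ID fetch
            System.out.println(doc);            // process/write here
        }
    }
}
```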
Do be aware that there can be some consistency issues with this approach, but you may have to control for concurrent modifications anyway. Consider using a query that returns the meta().id and the CAS value, perhaps.
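A statement along these lines would return both; the bucket name is hypothetical, and whether the CAS is useful depends on how you later detect concurrent modification:

```java
// Sketch of a query statement returning each document's id and CAS via
// the META() function; the bucket name is a placeholder.
public class MetaQuerySketch {
    static String statement(String bucket) {
        return "SELECT META(b).id AS id, META(b).cas AS cas "
             + "FROM `" + bucket + "` b";
    }

    public static void main(String[] args) {
        System.out.println(statement("mybucket"));
    }
}
```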