I’m trying to stream a CSV file built from data in Couchbase. I’ve done everything I can think of to keep memory usage down, but it keeps growing, and I suspect that the Couchbase client is playing a role.
I’m using a view query to fetch the list of document IDs that I need, then iterating over that list. For each document ID, I get the document, process it into the desired format, and send it to the output stream as a line of CSV text. The view query is paginated and I flush the output buffer regularly, so I cannot see any reason why memory usage should grow with each iteration.
This gist contains the essential code.
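In outline, the loop has the shape sketched below. This is only an illustration in Python, not the script from the gist: `FAKE_BUCKET`, `iter_id_pages`, and `stream_csv` are hypothetical names, and an in-memory dictionary stands in for the bucket and the view query so that the sketch runs without any SDK.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO

# Hypothetical stand-in for the bucket: document ID -> document body.
FAKE_BUCKET: Dict[str, dict] = {
    f"doc::{i}": {"id": i, "name": f"item {i}"} for i in range(7)
}


def iter_id_pages(page_size: int) -> Iterator[List[str]]:
    """Yield document IDs one page at a time, mimicking a paginated view query."""
    all_ids = sorted(FAKE_BUCKET)
    for start in range(0, len(all_ids), page_size):
        yield all_ids[start:start + page_size]


def stream_csv(out: TextIO, page_size: int = 3) -> int:
    """Write one CSV line per document and flush after every page of IDs."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name"])
    rows = 0
    for page in iter_id_pages(page_size):
        for doc_id in page:
            doc = FAKE_BUCKET[doc_id]  # the "get the document" step
            writer.writerow([doc["id"], doc["name"]])
            rows += 1
        out.flush()  # only one page of IDs is held at any time
    return rows


if __name__ == "__main__":
    buffer = io.StringIO()
    count = stream_csv(buffer)
    print(f"wrote {count} rows")
    print(buffer.getvalue(), end="")
```

With this structure, nothing except the current page of IDs and the current document should need to stay in memory, which is why the growth I am seeing surprises me.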
This code works fine for a small data set (even when the data set spans multiple “pages” of the view query). With a larger data set, the script fails after a certain number of records because it exceeds the memory limit. Given how I am streaming the data, I wasn’t expecting any memory issues. I tried the following two fixes with no success:
- Explicitly unsetting variables at the end of each iteration
- Sleeping for a second at the end of each pagination iteration to give the garbage collector a chance to run
So I started profiling the script to see where it goes wrong. This gist contains the updated script. Some output from running it is posted below:
mem_row_start,delta_mem_row_fetched,delta_mem_row_composed,ref_bucket
3130224,3704,30872,5
3164064,4144,21104,7
3187680,4144,21104,9
3211296,4144,21104,11
3234912,4144,21104,13
3268976,4144,21128,15
3292616,4208,21104,17
3316296,4144,21104,19
3339912,4144,21168,21
3363592,4144,21104,23
3397696,4144,21104,25
3421312,4144,21104,27
3444928,4144,21104,29
3468544,4144,21128,31
3492184,4272,21128,33
3526440,4144,21104,35
3550056,4144,21240,37
3573808,4144,21104,39
I’ve manually stripped away the row data and left only the profiling data. The memory deltas look fairly sane, but since I am unsetting all of the variables at the end of each iteration, I do not expect the first column (the total memory usage at the start of each iteration) to increase. Instead, it grows by roughly 23 to 34 KB per row.
The ref_bucket column shows the reference count of the Couchbase bucket object. It increases by 2 with every iteration, and this is the basis of my theory that the Couchbase SDK is creating references to objects in a way that makes it hard for me to unset them. I do not understand enough about the inner workings of the SDK to back up this theory.
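To make the trend in the profiling output easier to check, here is a small Python sketch that computes the row-to-row change in the first and last columns. It only post-processes the numbers posted above; `load_profile` and `steps` are hypothetical helper names and have nothing to do with the SDK.

```python
import csv
import io
from typing import Dict, List

# Profiling output copied from the run above.
PROFILE = """\
mem_row_start,delta_mem_row_fetched,delta_mem_row_composed,ref_bucket
3130224,3704,30872,5
3164064,4144,21104,7
3187680,4144,21104,9
3211296,4144,21104,11
3234912,4144,21104,13
3268976,4144,21128,15
3292616,4208,21104,17
3316296,4144,21104,19
3339912,4144,21168,21
3363592,4144,21104,23
3397696,4144,21104,25
3421312,4144,21104,27
3444928,4144,21104,29
3468544,4144,21128,31
3492184,4272,21128,33
3526440,4144,21104,35
3550056,4144,21240,37
3573808,4144,21104,39
"""


def load_profile(text: str) -> List[Dict[str, int]]:
    """Parse the profiling CSV into a list of rows with integer values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def steps(values: List[int]) -> List[int]:
    """Return the differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    rows = load_profile(PROFILE)
    mem_steps = steps([row["mem_row_start"] for row in rows])
    ref_steps = steps([row["ref_bucket"] for row in rows])
    print("memory growth per row (bytes): min", min(mem_steps), "max", max(mem_steps))
    print("total memory growth (bytes):", sum(mem_steps))
    print("distinct ref_bucket increments:", sorted(set(ref_steps)))
```

Memory at the start of a row never drops back, and ref_bucket rises by the same amount on every row, which is what I would expect if something keeps holding a reference created during each iteration.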
Any help appreciated!