[MB-6219] items are not marked as deleted/expired in couchstore after they expire (View query results with stale=false include expired items) Created: 14/Aug/12  Updated: 31/Jan/14  Resolved: 05/Sep/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket, view-engine
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Deepkaran Salooja Assignee: Peter Wansch (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: build#1580 on Ubuntu 64bit

Attachments: GZip Archive 10.1.3.73-8091-diag.txt.gz    

 Description   

View query results with stale=false include expired items.

Steps to reproduce (build#1580):
1. Create default bucket
2. Load 10 json docs with expiry set to 30 seconds.
3. Create a view (default map function) and query with stale=false.
4. Wait for 2-3 minutes.
5. Query view again with stale=false.

Some of the items are still returned in the query results even when the index is rebuilt.
I observed that the number of rows returned by the view query is always the same as curr_items.

Diagnostics are attached.



 Comments   
Comment by Filipe Manana [ 14/Aug/12 ]
That's not unexpected.
Items are lazily expired by ep-engine, meaning that it will not perform document deletes in the database after the 30 seconds.

There's no way to control that or know that from the view-engine.
Comment by Farshid Ghods (Inactive) [ 14/Aug/12 ]
Seems like something could be modified in ep-engine so that when items expire we don't see them in views anymore.
Comment by Peter Wansch (Inactive) [ 14/Aug/12 ]
Chiyoung, is this something Jin or Mike can help out with if it's in ep_engine? If not, it may need to be passed to Aaron. Thank you.
Comment by Filipe Manana [ 14/Aug/12 ]
This was discussed internally a few times, but I don't think any decision was made.

Mike gave some info in the forum to a user about this:

http://www.couchbase.com/forums/thread/expiration-time-docs-dp4

Comment by Chiyoung Seo [ 14/Aug/12 ]
The item or expiry pager wasn't scheduled yet to clear up all expired items from memory hashtable and disk. That's why you still see those expired items in the view query.

The item pager will be scheduled if the current memory usage is above high water mark. The expiry pager will be scheduled once every hour by default, but you can change the expiry pager's interval to a shorter period (e.g., 5 minutes) at runtime.
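
The behaviour described above can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions: `ToyBucket`, `is_live`, `tick` and `view_rows` are hypothetical names, not ep-engine or view-engine APIs, and time is passed in explicitly as seconds. It shows that an expired item stays physically stored (and therefore visible to an index built from stored documents) until the pager sweep runs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBucket:
    """Toy model of lazy expiry; NOT ep-engine's real data structures."""
    pager_interval: int = 3600  # seconds between pager runs (hourly default, per the comment above)
    items: dict = field(default_factory=dict)  # key -> absolute expiry time in seconds
    last_pager_run: int = 0

    def set(self, key, now, ttl):
        self.items[key] = now + ttl

    def is_live(self, key, now):
        # Access path: an expired item is treated as missing even if it is still stored.
        expires_at = self.items.get(key)
        return expires_at is not None and expires_at > now

    def tick(self, now):
        # Pager path: expired items are physically removed only when the pager runs.
        if now - self.last_pager_run >= self.pager_interval:
            self.items = {k: t for k, t in self.items.items() if t > now}
            self.last_pager_run = now

    def view_rows(self):
        # An index built from stored documents sees everything not yet deleted.
        return sorted(self.items)


bucket = ToyBucket()
for i in range(10):
    bucket.set(f"doc{i}", now=0, ttl=30)

bucket.tick(now=180)                          # 3 minutes in: pager has not run yet
rows_after_3_min = len(bucket.view_rows())
live_after_3_min = bucket.is_live("doc0", now=180)

bucket.tick(now=3600)                         # one hour in: pager sweeps expired items
rows_after_pager = len(bucket.view_rows())
```

Constructing the model with a smaller `pager_interval` (for example 300) makes the sweep, and so the disappearance of the rows, happen sooner, at the cost of running the sweep more often.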
Comment by Farshid Ghods (Inactive) [ 14/Aug/12 ]
Dipti,

This means that users will see expired items in the index for up to an hour, which is the default interval for the expiry pager.
Comment by Dipti Borkar [ 14/Aug/12 ]
Peter, as discussed, this is something we should be able to do at query time. We do need to fix this for 2.0. Can you please help me understand the options with the view engine team?
Let me know if you need additional feedback from me.
Comment by Filipe Manana [ 14/Aug/12 ]
There's no efficient way to do this in the view engine. It would mean that each stale=false request has to scan all documents in every vbucket and check whether they have expired, not to mention other smaller issues.
Comment by Peter Wansch (Inactive) [ 15/Aug/12 ]
Deep, can you confirm that after an hour, once the expiry pager has run and the next time the indexes are updated, they disappear from the view? If so, then we don't have a bug. There is still a valid discussion going on about how the situation around queries can be improved but I want to find out if things are working as designed for now.
Comment by Deepkaran Salooja [ 16/Aug/12 ]
Yes, that's correct. Once the expiry pager has run and indexes have been updated, the queries do not return the expired items.
Comment by Perry Krug [ 22/Aug/12 ]
Just a thought as I came across this bug. What if, for each query result, the query engine contacted memcached to see if each doc was still valid before including it in the query response? That way, the view engine wouldn't have to keep track of all documents in all vbuckets, only the ones that it is sending out. This would not only take care of expiration (since memcached would return "not_found") but also deleted documents that have not yet been removed from disk. Rather than doing a 'get' (which would fetch it from disk in DGM), we could use the "stats key" operation to just check whether the key is still valid within memcached. Since there would be a small amount of overhead on the query response, this could be an optional check?

The rows would eventually get cleaned from the index, this is just preventing the client from getting a massive amount of already expired items at the minute 59 mark before the hourly process is run.
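
A rough Python sketch of the query-time check suggested above. `filter_live_rows` and `key_is_valid` are hypothetical names; the callable stands in for whatever lookup would report whether a key is still valid, and no real memcached or view-engine API is used here. The rows are assumed to carry the document ID in an `id` field.

```python
from typing import Callable, Iterable, Iterator


def filter_live_rows(
    rows: Iterable[dict],
    key_is_valid: Callable[[str], bool],
) -> Iterator[dict]:
    """Yield only the rows whose source document is still valid.

    Only the rows about to be sent out are checked, not every document
    in every vbucket.
    """
    for row in rows:
        if key_is_valid(row["id"]):
            yield row


# Stand-in for the validity lookup: a set of keys that are still live.
live_keys = {"doc1", "doc3"}
index_rows = [
    {"id": "doc1", "key": "doc1", "value": None},
    {"id": "doc2", "key": "doc2", "value": None},  # expired, not yet removed from the index
    {"id": "doc3", "key": "doc3", "value": None},
]
filtered_ids = [row["id"] for row in filter_live_rows(index_rows, live_keys.__contains__)]
```

Each row costs one extra lookup, which is the overhead the comment proposes making optional.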
Comment by Filipe Manana [ 22/Aug/12 ]
Thanks for the suggestion Perry.
Unfortunately it wouldn't work for several reasons.

First, the view-engine currently has no way to communicate with memcached.

Second, it would slow things down significantly.

Third, how could that work for reduces? For precomputed reduce values, which are the strength of couchdb's btrees + mapreduce, how do you "unreduce", exclude values produced for expired documents, and re-compute reductions? Not only would you need to know the map values produced by expired documents, you would also need to know the map values for the non-expired documents. Not to mention the big performance penalty here.
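
The reduce problem can be seen in a few lines of Python. This is a simplified sketch, not the view engine's btree code: `map_values`, `reduce_max` and the document IDs are made up for illustration, and a plain dict stands in for the per-document map output.

```python
def reduce_max(values):
    """A reduce function whose result cannot be 'unreduced'."""
    return max(values)


# Hypothetical map output: one emitted value per document.
map_values = {"doc1": 5, "doc2": 7, "doc3": 11}

# What a precomputed reduction would hold: a single value, 11.
stored_max = reduce_max(map_values.values())

# doc3 has expired but is still in the index. Knowing only stored_max (11)
# and doc3's own value (11) does not reveal the new maximum; the map values
# of the non-expired documents are needed to recompute it.
expired = {"doc3"}
corrected_max = reduce_max(
    value for doc_id, value in map_values.items() if doc_id not in expired
)
```

Correcting the result means going back to the map values of every remaining document under that reduction, which defeats the purpose of storing the reduction in the first place.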

There are a lot of other technical issues that would impact correctness, the incremental view update approach, or performance. The three listed above are just the ones that people not familiar with the implementation/design would grasp quickly.
Comment by Peter Wansch (Inactive) [ 05/Sep/12 ]
Filipe explained in the comments how this works. To speed up deletion from the indexes, the expiry pager interval can be shortened, which may have an adverse effect on performance.
Generated at Fri Aug 29 06:49:25 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.