[MB-6860] [system test] Index file descriptor leaks Created: 09/Oct/12  Updated: 10/Jan/13  Resolved: 23/Oct/12

Status: Closed
Project: Couchbase Server
Component/s: view-engine
Affects Version/s: 2.0
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Thuan Nguyen Assignee: Filipe Manana
Resolution: Fixed Votes: 0
Labels: system-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: centos 6.2 64bit build 2.0.0-1808


 Description   
Create an 8-node cluster with Couchbase Server 2.0.0-1808 installed. Consistent view is disabled.
Each node has 14 GB RAM and 2 EBS volumes, one for /data and another for /view.
Create 2 buckets and load 9 million items into each bucket.
Create 3 docs: one for the default bucket and 2 for saslbucket.

Let the cluster run with a load of ~18K ops on each bucket for more than one day.
Checking the view directory, I see 2 nodes with disk usage at or above 20%:

Thuans-MacBook-Pro:testrunner thuan$ python scripts/ssh.py -i ../ini/8-ec2-orange.ini "df -kh /view"
ec2-50-112-210-248.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 60G 175G 26% /view

ec2-50-112-46-220.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 4.4G 230G 2% /view

ec2-54-245-38-16.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 28G 207G 12% /view

ec2-50-112-52-162.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 45G 189G 20% /view

ec2-50-112-17-129.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 27G 208G 12% /view

ec2-54-245-55-107.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 7.4G 227G 4% /view

ec2-54-245-24-204.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 9.4G 225G 4% /view

ec2-50-112-86-218.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 13G 221G 6% /view

** On node ec2-50-112-210-248.us-west-2.compute.amazonaws.com, the actual size of all index files on disk is around 2.8GB, far less than the 60G that df reports as used

[root@ip-10-249-0-36 view]# du -hs
3.0G .

[root@ip-10-249-0-36 view]# df -kh | grep view
/dev/xvdj 247G 60G 175G 26% /view
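
The gap between the du and df figures above can also be measured programmatically. The following is a minimal Python sketch, not part of the original report: the function names are hypothetical, and it assumes a POSIX system where os.statvfs is available and st_blocks is counted in 512-byte units. Hard-linked files are counted once per link, so treat the result as an estimate.

```python
import os


def filesystem_used_bytes(path):
    """Bytes in use on the filesystem containing `path` (roughly df's "Used")."""
    st = os.statvfs(path)
    return (st.f_blocks - st.f_bfree) * st.f_frsize


def visible_bytes(root):
    """Allocated bytes of everything still reachable under `root` (roughly `du -s`)."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames + dirnames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished between listing and stat
            total += st.st_blocks * 512  # st_blocks is in 512-byte units
    return total


def hidden_bytes(mount_point):
    """Used space that no visible file accounts for, e.g. deleted-but-open files."""
    return filesystem_used_bytes(mount_point) - visible_bytes(mount_point)
```

Calling hidden_bytes("/view") on the affected node would be expected to return a value in the tens of gigabytes, assuming /view is the mount point of its own filesystem as it is in this setup.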

** Running lsof +L1 shows that beam.smp is holding many deleted files open, which prevents their disk space from being freed.

[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 18926 couchbase 53u REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 55u REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 56r REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 57w REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 59r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 64u REG 202,144 152048674 0 14417947 /view/.delete/85177d34a8fbdd8e851ca37329356a72 (deleted)
beam.smp 18926 couchbase 66r REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 79w REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 88r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 95r REG 202,144 25548094584 0 14417926 /view/.delete/2d5d22317781e50b64dde53e74ca8a01 (deleted)
beam.smp 18926 couchbase 105w REG 202,144 0 0 14417929 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 113u REG 202,144 152048674 0 14417947 /view/.delete/85177d34a8fbdd8e851ca37329356a72 (deleted)
beam.smp 18926 couchbase 121r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 136r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 138r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 144r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 155r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 164r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 178r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 187r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 194r REG 202,144 7679604520 0 14417924 /view/.delete/fa9cd11ed6b0f873c825fba96ee44c94 (deleted)
beam.smp 18926 couchbase 195r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 196r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 205r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 213r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 231r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 263w REG 202,144 0 0 14417935 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 278r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 334w REG 202,144 0 0 14417936 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 374r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 384r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
[root@ip-10-249-0-36 view]#
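
Because the same deleted file shows up once per open descriptor, adding the SIZE/OFF column directly would overcount. Below is a hedged Python sketch (the function names are hypothetical) that counts each (device, inode) pair once. It assumes the 10-column layout seen in the listing above (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME); other lsof versions or options may add columns.

```python
from collections import namedtuple

DeletedFile = namedtuple("DeletedFile", "device inode size name")


def parse_deleted_files(lsof_output):
    """Return one DeletedFile per unique (device, inode) found in `lsof +L1` output."""
    unique = {}
    for line in lsof_output.splitlines():
        fields = line.split(None, 9)  # NAME may contain spaces, keep it whole
        if len(fields) < 10:
            continue  # shell prompts and blank lines
        device, size, nlink, inode, name = fields[5:10]
        if not (size.isdigit() and inode.isdigit() and nlink == "0"):
            continue  # header row, or a file that still has a directory entry
        # Several descriptors can point at one file; keep a single record for it.
        unique[(device, inode)] = DeletedFile(device, int(inode), int(size), name)
    return list(unique.values())


def leaked_bytes(lsof_output):
    """Total size of the deleted-but-open files, each file counted once."""
    return sum(f.size for f in parse_deleted_files(lsof_output))
```

Feeding the listing above into leaked_bytes gives about 60.3 billion bytes (roughly 56 GiB) across 10 distinct inodes, which is in line with the difference between the df and du figures.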





 Comments   
Comment by Filipe Manana [ 09/Oct/12 ]
The information you gave doesn't necessarily indicate a problem.
It's common for couchdb (both the core database and all the view engines) to delete files and keep them open for a while.

You need to tell me how long the same files stay open after being deleted, and provide all server logs. Otherwise I can't help much.
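
One way to answer the "how long" question is to take two snapshots some time apart and see which deleted files are still held. This is a rough Python sketch, not something used in this ticket: the function names are hypothetical, and it relies on the Linux-specific behaviour that the symlinks under /proc/<pid>/fd end in " (deleted)" once the file has been unlinked.

```python
import os


def deleted_open_files(pid):
    """Map fd number -> link target for files `pid` still holds after deletion (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    found = {}
    for entry in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.endswith(" (deleted)"):
            found[int(entry)] = target
    return found


def still_open(earlier, later):
    """Descriptors that point at the same deleted file in both snapshots."""
    return {fd: path for fd, path in later.items() if earlier.get(fd) == path}
```

Taking a snapshot of the beam.smp pid, waiting for an interval of your choosing, taking a second snapshot and passing both to still_open lists the descriptors that were not released in between. Reading another user's /proc/<pid>/fd normally requires root.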
Comment by Thuan Nguyen [ 09/Oct/12 ]
I have seen this behaviour since yesterday afternoon, Oct 8 2012. The view directory is at around 20% usage, while the actual index file size is about 3 GB. So we lost 57GB of disk space (about 20x the current index file size).
Link to collect info from all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/ec2/20121008/8nodes-1808-ec2-colinfo-tmp-files-not-del-201009-144222.tgz

** Disk space stats before stop couchbase server on node ec2-50-112-210-248.us-west-2.compute.amazonaws.com

Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 59G 175G 26% /view

** Restart couchbase server on node ec2-50-112-210-248.us-west-2.compute.amazonaws.com
** Disk space stats after restart
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 2.9G 231G 2% /view


Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 3.7G 230G 2% /view


Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 4.2G 230G 2% /view


** lsof after restart couchbase server

[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 27788 couchbase 86u REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 87r REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 88w REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 122u REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 127r REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 128w REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
[root@ip-10-249-0-36 view]#
[root@ip-10-249-0-36 view]#


[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 27788 couchbase 42r REG 202,144 523514916 0 14417923 /view/.delete/b68bb430742efb491d21bbe9f15615c9 (deleted)
beam.smp 27788 couchbase 44r REG 202,144 151753762 0 14417940 /view/.delete/230ae91e0bffbceca514d6667c966d5f (deleted)
beam.smp 27788 couchbase 86u REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 87r REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 88w REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 122u REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 127r REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 128w REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
[root@ip-10-249-0-36 view]#

After beam.smp was killed, 50+ GB of free space was returned to the server.
Comment by Filipe Manana [ 09/Oct/12 ]
Volker,

The spatial views, based on an old couchdb view engine, leak view file descriptors once design documents are updated or deleted. This used to happen with the couchdb view engine, but it got fixed in:

https://issues.apache.org/jira/browse/COUCHDB-1309

The old view engine also leaked database file descriptors, see https://issues.apache.org/jira/browse/COUCHDB-1129 and https://issues.apache.org/jira/browse/COUCHDB-926

I've confirmed now (with testrunner and lsof) that spatial views leak old spatial view files on ddoc update/delete. For the databases, I didn't verify it.
Can you verify this?

thanks
Comment by Farshid Ghods (Inactive) [ 09/Oct/12 ]
fix : http://review.couchbase.org/#/c/21468/
Comment by Thuan Nguyen [ 10/Oct/12 ]
Integrated in github-couchdb-preview #513 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/513/])
    MB-6860 Release index file ref counter (Revision 091f7f4f08b6bf22ebb56742c078d71bcfba5b83)

     Result = SUCCESS
pwansch :
Files :
* src/couch_index_merger/src/couch_view_merger.erl
Comment by Thuan Nguyen [ 11/Oct/12 ]
Tested on build 2.0.0-1832. I could not reproduce this bug. I think this bug is fixed and will close it.
Comment by Filipe Manana [ 12/Oct/12 ]
Sorry guys, but this was not a full fix.

Read my previous comments. Geocouch also leaks file descriptors, just like old couchdb. And this is serious, as the leaks happen even if users don't use the geo/spatial features.
Comment by Farshid Ghods (Inactive) [ 13/Oct/12 ]
http://review.membase.org/#/c/21588/
Comment by Thuan Nguyen [ 15/Oct/12 ]
Integrated in github-couchdb-preview #517 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/517/])
    MB-6860: Only delete .view files (Revision 23cec9997b38ac82cab310b7560d01db529c1ae2)

     Result = SUCCESS
peter :
Files :
* src/couchdb/couch_view.erl
Comment by Farshid Ghods (Inactive) [ 15/Oct/12 ]
change is merged.
Comment by Filipe Manana [ 15/Oct/12 ]
Sorry Farshid. It's not all, there's still ongoing work to do on GeoCouch and old couchdb views that's not even on gerrit yet.

I'll close this myself (or Volker) when all changes are merged.
Comment by Thuan Nguyen [ 16/Oct/12 ]
Integrated in github-couchdb-preview #518 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/518/])
    MB-6860 Don't open dev indexes during cleanup (Revision 255e3a6d0289d654d7d0702d8eff294816c5a145)

     Result = SUCCESS
Farshid Ghods :
Files :
* src/couchdb/couch_view.erl
* src/couchdb/couch_view_group.erl
Comment by Peter Wansch (Inactive) [ 18/Oct/12 ]
Filipe, please close after the last merge.
Comment by Thuan Nguyen [ 18/Oct/12 ]
Integrated in github-couchdb-preview #520 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/520/])
    MB-6860 Fix old couchdb view cleanup when there are no ddocs (Revision 5353fd9b9078eb4dde81f3ea6d87ce112284df63)
MB-6860 Shutdown outdated dev index processes (Revision eaa98475fb43bcc5605c5d66026b932236a7fdfd)

     Result = SUCCESS
peter :
Files :
* src/couchdb/couch_view.erl

peter :
Files :
* src/couchdb/couch_view.erl
* src/couchdb/couch_db.hrl
* test/etap/Makefile.am
* test/etap/202-dev-view-group-shutdown.t
* src/couchdb/couch_view_group.erl
Comment by Filipe Manana [ 19/Oct/12 ]
Not yet ready to close. There are still changes from Volker in gerrit for geocouch, and another change for geocouch not yet in gerrit.
Comment by Filipe Manana [ 23/Oct/12 ]
All changes merged to master.
Comment by Thuan Nguyen [ 23/Oct/12 ]
Integrated in github-couchdb-preview #522 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/522/])
    MB-6860 Ensure dev index file deleted after db deletion (Revision 0ee52361b16fce383502130f7d56d56e6f427087)

     Result = SUCCESS
Farshid Ghods :
Files :
* src/couchdb/couch_view_group.erl
* src/couchdb/couch_view.erl
* test/etap/202-dev-view-group-shutdown.t
Comment by kzeller [ 09/Nov/12 ]
RN: "For past releases, after a data bucket had been deleted,
       any indexes associated with the bucket were not deleted. This
       has been fixed so that both the data bucket and associated indexes
       are deleted."
Comment by Filipe Manana [ 09/Nov/12 ]
Note Karen: different kinds of leaks were fixed, but none relates to your observation.
The leaks were related to not closing index or database file handles after compaction in some scenarios. Other leaks were related to opening (and keeping open) unnecessary/unused files.
Comment by kzeller [ 09/Nov/12 ]
So this should really read: "Memory leaks had occurred due to open, unused index files. Unused index files are now removed and the memory leaks resolved"?
Comment by Filipe Manana [ 10/Nov/12 ]
Karen:

It would read more like:

For geo/spatial indexes:

1) After updating a design document, or deleting a design document,
the old index files and erlang processes were never released (stealing
disk space and leaking file descriptors);
2) After database (vbucket) compaction, spatial/geo indexes would
never release the file handle of the pre-compaction database files
(meaning that disk space couldn't be reclaimed by the OS)

For mapreduce views:

1) In some cases, after index compaction, the pre-compaction index
files were deleted but held open for a long time (or even forever at
the extreme), preventing the OS from reclaiming the respective disk
space and leaking 1 file descriptor per index compaction.

Both for geo and mapreduce (minor issue):

1) Avoid creating unnecessary empty index files and keeping them open for
very long periods (until bucket deletion). This is a minor one, as it
didn't steal disk space - but it helps decrease the number of open
file descriptors, which is important on OSes with a small limit of max
allowed file descriptors (Windows and Mac OS X).

It's a lot of stuff, but none relates to index files never being
deleted after bucket deletion.
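
The mechanism behind "deleted but held open" can be shown in isolation. The snippet below is a minimal, generic Python illustration of POSIX unlink semantics; it is not Couchbase code and the function name is hypothetical. The file's directory entry disappears at unlink time, but its data stays allocated and readable until the last descriptor is closed.

```python
import os
import tempfile


def unlink_while_open(payload=b"x" * 4096):
    """Delete a file that is still open and report what the open descriptor still sees."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                  # the directory entry is gone...
        st = os.fstat(fd)                # ...but the inode lives while fd is open
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, len(payload))
        return {
            "path_exists": os.path.exists(path),
            "nlink": st.st_nlink,
            "size": st.st_size,
            "readable": data == payload,
        }
    finally:
        os.close(fd)                     # only now can the OS reclaim the space
```

A process that never reaches the equivalent of the final os.close, as in the leaks described above, keeps the space allocated until the process exits. That matches the disk space returning once beam.smp was killed.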
Comment by kzeller [ 12/Nov/12 ]
Ok added: <para>
For geo/spatial indexes, after updating a design document, or deleting a design document,
the old index files and erlang processes were not released. This
unnecessarily took disk space and resulted in leaking file descriptors.
After database shard compaction, spatial/geo indexes would
never release the file handle of the pre-compaction database files.
This meant that disk space couldn't be reclaimed by the OS. This has
now been fixed.
</para>
<para>
For general indexes, after index compaction the pre-compaction index
files were deleted but were sometimes held open for a long time.
This prevented the OS from reclaiming the respective disk
space and leaked one file descriptor per index compaction.
This has been fixed.
</para>
<para>
For both geo/spatial and general indexes,
we now avoid creating unnecessary empty index files and
avoid keeping them open for
very long periods, such as waiting until bucket deletion.
This is a more minor fix which helps decrease the number of open
file descriptors, which is important if you
are working on an operating system with a small limit of max
allowed file descriptors, such as Windows and Mac OS X.
 </para>
Generated at Tue Sep 16 15:59:29 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.