[MB-6860] [system test] Index file descriptor leaks Created: 09/Oct/12 Updated: 10/Jan/13 Resolved: 23/Oct/12 |
|
| Status: | Closed |
| Project: | Couchbase Server |
| Component/s: | view-engine |
| Affects Version/s: | 2.0 |
| Fix Version/s: | 2.0 |
| Security Level: | Public |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Thuan Nguyen | Assignee: | Filipe Manana |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | system-test | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | centos 6.2 64bit build 2.0.0-1808 | ||
| Description |
|
Create an 8-node cluster running Couchbase Server 2.0.0-1808. Consistent view is disabled.
Each node has 14 GB RAM and 2 EBS volumes, one for /data and another for /view.
Create 2 buckets and load 9 million items into each bucket. Create 3 docs, one for the default bucket and 2 for saslbucket. Let the cluster run with a load of ~18K ops per bucket for more than one day.
Check the view directory: 2 nodes show disk usage of 20% or more.

Thuans-MacBook-Pro:testrunner thuan$ python scripts/ssh.py -i ../ini/8-ec2-orange.ini "df -kh /view"
ec2-50-112-210-248.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 60G 175G 26% /view
ec2-50-112-46-220.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 4.4G 230G 2% /view
ec2-54-245-38-16.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 28G 207G 12% /view
ec2-50-112-52-162.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 45G 189G 20% /view
ec2-50-112-17-129.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 27G 208G 12% /view
ec2-54-245-55-107.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 7.4G 227G 4% /view
ec2-54-245-24-204.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 9.4G 225G 4% /view
ec2-50-112-86-218.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 13G 221G 6% /view

** On node ec2-50-112-210-248.us-west-2.compute.amazonaws.com, the actual size of all index files is around 2.8GB
[root@ip-10-249-0-36 view]# du -hs
3.0G .
[root@ip-10-249-0-36 view]# df -kh | grep view
/dev/xvdj 247G 60G 175G 26% /view

** Running lsof +L1 shows that beam.smp is holding many deleted files open, which prevents their disk space from being freed.
[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 18926 couchbase 53u REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 55u REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 56r REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 57w REG 202,144 39 0 15073283 /view/.delete/2d6e9162017b08fa0cb8d5aadaef4311 (deleted)
beam.smp 18926 couchbase 59r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 64u REG 202,144 152048674 0 14417947 /view/.delete/85177d34a8fbdd8e851ca37329356a72 (deleted)
beam.smp 18926 couchbase 66r REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 79w REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 88r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 95r REG 202,144 25548094584 0 14417926 /view/.delete/2d5d22317781e50b64dde53e74ca8a01 (deleted)
beam.smp 18926 couchbase 105w REG 202,144 0 0 14417929 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 113u REG 202,144 152048674 0 14417947 /view/.delete/85177d34a8fbdd8e851ca37329356a72 (deleted)
beam.smp 18926 couchbase 121r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 136r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 138r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 144r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 155r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 164r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 178r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 187r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 194r REG 202,144 7679604520 0 14417924 /view/.delete/fa9cd11ed6b0f873c825fba96ee44c94 (deleted)
beam.smp 18926 couchbase 195r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 196r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 205r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 213r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 231r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 263w REG 202,144 0 0 14417935 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 278r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 334w REG 202,144 0 0 14417936 /view/@indexes/default/replica_87d0cc9a8fffc2e1e434f6ddbb0c168d.view.log (deleted)
beam.smp 18926 couchbase 374r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
beam.smp 18926 couchbase 384r REG 202,144 22818230272 0 14417927 /view/.delete/319125a97816c48c70500af867ddae5b (deleted)
[root@ip-10-249-0-36 view]# |
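The gap between du (3.0G) and df (60G) in the description can be estimated directly from an lsof +L1 listing. Below is a minimal Python sketch, assuming the column order seen in the output (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME). The function name held_bytes is hypothetical and not part of testrunner. Each unlinked inode is counted once, since the same file shows up under several descriptors:

```python
def held_bytes(lsof_lines):
    """Sum SIZE/OFF once per unlinked (device, inode) pair in `lsof +L1` output."""
    sizes = {}
    for line in lsof_lines:
        fields = line.split()
        # Need at least COMMAND..NAME; this also skips the header and non-regular files.
        if len(fields) < 10 or fields[4] != "REG":
            continue
        device, size, nlink, inode = fields[5], fields[6], fields[7], fields[8]
        # NLINK 0 means unlinked. SIZE/OFF can be an offset such as 0t1234, which is skipped.
        if nlink != "0" or not size.isdigit():
            continue
        sizes[(device, inode)] = int(size)
    return sum(sizes.values())


# A few records copied from the description (two of the inodes appear twice).
SAMPLE = """\
beam.smp 18926 couchbase 53u REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 66r REG 202,144 39 0 15859715 /view/.delete/0ed16b72a6e2e1d043b59ba006f32828 (deleted)
beam.smp 18926 couchbase 59r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 88r REG 202,144 4087670002 0 14417931 /view/.delete/a7418018f1977c4a4c614ad801ac8add (deleted)
beam.smp 18926 couchbase 95r REG 202,144 25548094584 0 14417926 /view/.delete/2d5d22317781e50b64dde53e74ca8a01 (deleted)
""".splitlines()

if __name__ == "__main__":
    total = held_bytes(SAMPLE)
    print("%.1f GiB held by deleted-but-open files" % (total / 2**30))
```

Feeding the real lsof +L1 output of a node into held_bytes gives a rough figure for the space that closing those descriptors (for example by restarting beam.smp) would release.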
| Comments |
| Comment by Filipe Manana [ 09/Oct/12 ] |
|
The information you give doesn't necessarily indicate a problem. It's common to delete files and keep them open for a while in couchdb (both the core database and all the view engines). You need to tell me for how long you see the same files open after they were deleted, and provide all server logs. Otherwise I can't help that much. |
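The delete-while-open behaviour mentioned in this comment is standard POSIX semantics: unlinking a file removes its name, but its blocks are only freed once the last open descriptor is closed. A small Python sketch of that mechanism follows. It is not Couchbase code, the function name is made up for illustration, and it assumes a POSIX filesystem (it will not work on Windows):

```python
import os
import tempfile


def unlinked_but_open_demo(payload=b"x" * 4096):
    """Unlink a file while it is open and report what the open descriptor still sees."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                  # the directory entry is removed
        exists = os.path.exists(path)    # the name can no longer be found
        info = os.fstat(fd)              # the inode is still reachable through fd
        return exists, info.st_nlink, info.st_size
    finally:
        os.close(fd)                     # last reference dropped: space can be reclaimed


if __name__ == "__main__":
    print(unlinked_but_open_demo())
```

While the descriptor stays open, du no longer counts the file (there is no name left to walk) but df still does, which is the pattern reported in this ticket.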
| Comment by Thuan Nguyen [ 09/Oct/12 ] |
|
I have seen this behaviour since yesterday afternoon, Oct 8 2012. Usage of the view directory is around 20%. The actual index file size is about 3 GB. So we lost 57GB of disk space (about 20x the current index file size).
Link to collect info from all nodes: https://s3.amazonaws.com/packages.couchbase/collect_info/ec2/20121008/8nodes-1808-ec2-colinfo-tmp-files-not-del-201009-144222.tgz

** Disk space stats before stopping couchbase server on node ec2-50-112-210-248.us-west-2.compute.amazonaws.com
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 59G 175G 26% /view

** Restart couchbase server on node ec2-50-112-210-248.us-west-2.compute.amazonaws.com

** Disk space stats after restart
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 2.9G 231G 2% /view
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 3.7G 230G 2% /view
Filesystem Size Used Avail Use% Mounted on
/dev/xvdj 247G 4.2G 230G 2% /view

** lsof after restarting couchbase server
[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 27788 couchbase 86u REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 87r REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 88w REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 122u REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 127r REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 128w REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
[root@ip-10-249-0-36 view]#
[root@ip-10-249-0-36 view]#
[root@ip-10-249-0-36 view]# lsof +L1 | grep view
beam.smp 27788 couchbase 42r REG 202,144 523514916 0 14417923 /view/.delete/b68bb430742efb491d21bbe9f15615c9 (deleted)
beam.smp 27788 couchbase 44r REG 202,144 151753762 0 14417940 /view/.delete/230ae91e0bffbceca514d6667c966d5f (deleted)
beam.smp 27788 couchbase 86u REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 87r REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 88w REG 202,144 39 0 15073283 /view/.delete/ca7d4c1fd9d1438949e8da3e787d37b3 (deleted)
beam.smp 27788 couchbase 122u REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 127r REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
beam.smp 27788 couchbase 128w REG 202,144 39 0 15859715 /view/.delete/1ee36838873600759cc33861e05d0727 (deleted)
[root@ip-10-249-0-36 view]#

After beam.smp was killed, 50+ GB of free space came back to the server. |
| Comment by Filipe Manana [ 09/Oct/12 ] |
|
Volker,
The spatial views, based on an old couchdb view engine, leak view file descriptors once design documents are updated or deleted. This used to happen with the couchdb view engine, but it got fixed in:
https://issues.apache.org/jira/browse/COUCHDB-1309
The old view engine also leaked database file descriptors, see
https://issues.apache.org/jira/browse/COUCHDB-1129 and
https://issues.apache.org/jira/browse/COUCHDB-926
I've now confirmed (with testrunner and lsof) that spatial views leak old spatial view files on ddoc update/delete. For the databases, I didn't verify it. Can you verify this? Thanks |
| Comment by Farshid Ghods [ 09/Oct/12 ] |
| fix : http://review.couchbase.org/#/c/21468/ |
| Comment by Thuan Nguyen [ 10/Oct/12 ] |
|
Integrated in github-couchdb-preview #513 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/513/]) Result = SUCCESS pwansch : Files : * src/couch_index_merger/src/couch_view_merger.erl |
| Comment by Thuan Nguyen [ 11/Oct/12 ] |
| Tested on build 2.0.0-1832. I could not reproduce this bug. I think this bug is fixed and will close it. |
| Comment by Filipe Manana [ 12/Oct/12 ] |
|
Sorry guys, but this was not a full fix. Read my previous comments. Geocouch also leaks file descriptors, just like old couchdb. And this is serious, as the leaks happen even if users don't use the geo/spatial features. |
| Comment by Farshid Ghods [ 13/Oct/12 ] |
|
http://review.membase.org/#/c/21588/ |
| Comment by Thuan Nguyen [ 15/Oct/12 ] |
|
Integrated in github-couchdb-preview #517 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/517/]) Result = SUCCESS peter : Files : * src/couchdb/couch_view.erl |
| Comment by Farshid Ghods [ 15/Oct/12 ] |
| change is merged. |
| Comment by Filipe Manana [ 15/Oct/12 ] |
|
Sorry Farshid. It's not all, there's still ongoing work to do on GeoCouch and old couchdb views that's not even on gerrit yet. I'll close this myself (or Volker) when all changes are merged. |
| Comment by Thuan Nguyen [ 16/Oct/12 ] |
|
Integrated in github-couchdb-preview #518 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/518/]) Result = SUCCESS Farshid Ghods : Files : * src/couchdb/couch_view.erl * src/couchdb/couch_view_group.erl |
| Comment by Peter Wansch [ 18/Oct/12 ] |
| Filipe, please close after the last merge. |
| Comment by Thuan Nguyen [ 18/Oct/12 ] |
|
Integrated in github-couchdb-preview #520 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/520/]) Result = SUCCESS peter : Files : * src/couchdb/couch_view.erl peter : Files : * src/couchdb/couch_view.erl * src/couchdb/couch_db.hrl * test/etap/Makefile.am * test/etap/202-dev-view-group-shutdown.t * src/couchdb/couch_view_group.erl |
| Comment by Filipe Manana [ 19/Oct/12 ] |
| Not yet ready to close. There are still changes from Volker in gerrit for geocouch, and another change for geocouch not yet in gerrit. |
| Comment by Filipe Manana [ 23/Oct/12 ] |
| All changes merged to master. |
| Comment by Thuan Nguyen [ 23/Oct/12 ] |
|
Integrated in github-couchdb-preview #522 (See [http://qa.hq.northscale.net/job/github-couchdb-preview/522/]) Result = SUCCESS Farshid Ghods : Files : * src/couchdb/couch_view_group.erl * src/couchdb/couch_view.erl * test/etap/202-dev-view-group-shutdown.t |
| Comment by Karen Zeller [ 09/Nov/12 ] |
|
RN: "For past releases, after a data bucket had been deleted, any indexes associated with the bucket were not deleted. This has been fixed so that both the data bucket and associated indexes are deleted." |
| Comment by Filipe Manana [ 09/Nov/12 ] |
|
Note Karen: different kinds of leaks were fixed, but none relates to your observation.
The leaks were related to not closing index or database file handles after compaction in some scenarios. Other leaks were related to opening (and keeping open) unnecessary/unused files. |
| Comment by Karen Zeller [ 09/Nov/12 ] |
| So this should really read: "Memory leaks had occurred due to open, unused index files. Unused index files are now removed and the memory leaks resolved"? |
| Comment by Filipe Manana [ 10/Nov/12 ] |
|
Karen:
It would read more like:

For geo/spatial indexes:
1) After updating a design document, or deleting a design document, the old index files and erlang processes were never released (stealing disk space and leaking file descriptors);
2) After database (vbucket) compaction, spatial/geo indexes would never release the file handle of the pre-compaction database files (meaning that disk space couldn't be reclaimed by the OS)

For mapreduce views:
1) In some cases, after index compaction, the pre-compaction index files were deleted but held open for a long time (or even forever at the extreme), preventing the OS from reclaiming the respective disk space and leaking 1 file descriptor per index compaction.

Both for geo and mapreduce (minor issue):
1) Avoid creating unnecessary empty index files and keeping them open for very long periods (until bucket deletion). This is a minor one, as it didn't steal disk space, but it helps decrease the number of open file descriptors, which is important on OSes with a small limit of max allowed file descriptors (Windows and Mac OS X).

It's a lot of stuff, but none relates to index files never being deleted after bucket deletion. |
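To watch for descriptor leaks like the ones listed in this comment during a long-running test, one option is to sample the number of open descriptors against the per-process limit. Below is a rough Python sketch, assuming a Unix-like OS that exposes /proc/self/fd or /dev/fd. fd_usage is a hypothetical helper and it inspects the current process only. For beam.smp itself, the equivalent count would have to come from /proc/<pid>/fd or lsof -p <pid>.

```python
import os
import resource


def fd_usage():
    """Return (open descriptor count, soft RLIMIT_NOFILE) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Linux exposes /proc/self/fd; /dev/fd is the fallback on macOS and the BSDs.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # Listing the directory briefly opens one descriptor, so the count includes it.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft


if __name__ == "__main__":
    used, limit = fd_usage()
    print("open descriptors: %d, soft limit: %d" % (used, limit))
```

Sampling this before and after a batch of index compactions would show a steady climb if one descriptor is leaked per compaction.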
| Comment by Karen Zeller [ 12/Nov/12 ] |
|
Ok added:
<para> For geo/spatial indexes, after updating a design document, or deleting a design document, the old index files and erlang processes were not released. This unnecessarily took disk space and resulted in leaking file descriptors. After database shard compaction, spatial/geo indexes would never release the file handle of the pre-compaction database files. This meant that disk space couldn't be reclaimed by the OS. This has now been fixed. </para>
<para> For general indexes, after index compaction the pre-compaction index files were deleted but were sometimes held open for a long time. This prevented the OS from reclaiming the respective disk space and leaked one file descriptor per index compaction. This has been fixed. </para>
<para> For both geo/spatial and general indexes, we now avoid creating unnecessary empty index files and avoid keeping them open for very long periods, such as waiting until bucket deletion. This is a more minor fix which helps decrease the number of open file descriptors, which is important if you are working on an operating system with a small limit of max allowed file descriptors, such as Windows and Mac OS X. </para> |