CouchDB database files are purely append-only. All document updates and inserts are written to the end of the file. This allows for efficient I/O on updates and a very reliable design.
When written in bulk, the documents are written sequentially on disk. The btrees (the by_seq and by_id indexes) that point to them are then updated: each node that needs updating is rewritten, with the leaf nodes located and rewritten first, and any parent nodes rewritten to point to the new node(s).
Because all writes occur at the end of the file, this results in very fast updates: provided all the btree nodes being updated are already in memory (otherwise they must be fetched from physical disk), the updates require the minimum number of seeks possible.
A tradeoff of this scheme is wasted space: previous document versions are simply forgotten, leaving holes in the file, and old btree nodes leave gaps as well. It also tends to give the live btree nodes poor locality for scans, as the btree goes from an optimized layout with good locality to child and sibling nodes, to one where btree jumps start to look random in the worst case. However, the most recent updates are all near the end of the file, so lookups on the most recent documents are the most efficient.
- A single tail-append storage file contains the following things:
- by_id btree index
- by_seq btree index
- local_docs btree index: This is a btree of non-replicating, non-indexed documents
- Header: This contains pointers to the roots of the btrees and other metadata about the database file. It can also be called a trailer, since it always occurs at the end of the file, after the data and the indexes it points to (a rough sketch of these fields follows this list).
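A minimal sketch of what such a header/trailer might carry. The field names here are assumptions for illustration, not the actual on-disk layout:

```python
from dataclasses import dataclass

@dataclass
class Header:
    """Illustrative trailer contents; field names are assumptions, not the real format."""
    by_id_root: int       # file offset of the by_id btree root node
    by_seq_root: int      # file offset of the by_seq btree root node
    local_docs_root: int  # file offset of the local_docs btree root node
    update_seq: int       # highest sequence number written so far
    doc_count: int        # other database metadata kept alongside the roots
```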
To reliably differentiate between headers and data that merely looks like a header, a single byte at every 4k boundary in the file takes the value 0x00 or 0x01. 0x01 means that what follows the byte is a database header; non-header (data) blocks have 0x00.
When opening the file, it is scanned backward to find the first intact header. It's assumed that if an intact header is found, all data preceding it is intact, though a CRC32 is checked on all items and btree nodes when they are read from the filesystem.
For data blocks (documents, btree nodes, or anything that is not a header), any data that spans a 4k marker has the 0x00 byte added when written and stripped out when read. Otherwise Couchbase is not aware of any other block boundaries.
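A rough sketch of that marker scheme, with Python byte strings standing in for file reads and writes (the helper names are made up for illustration):

```python
BLOCK = 4096

def add_block_prefixes(data: bytes, file_pos: int) -> bytes:
    """Insert a 0x00 marker byte at every 4k boundary the write crosses;
    a header write would use 0x01 instead and starts on a boundary."""
    out, pos, i = bytearray(), file_pos, 0
    while i < len(data):
        if pos % BLOCK == 0:
            out.append(0x00)                       # data-block marker
            pos += 1
        chunk = min(len(data) - i, BLOCK - (pos % BLOCK))
        out += data[i:i + chunk]
        i += chunk
        pos += chunk
    return bytes(out)

def strip_block_prefixes(raw: bytes, file_pos: int) -> bytes:
    """Remove the marker bytes again when reading the data back."""
    out, pos = bytearray(), file_pos
    for b in raw:
        if pos % BLOCK == 0:
            pos += 1                               # skip the boundary marker byte
            continue
        out.append(b)
        pos += 1
    return bytes(out)

# A 10,000-byte write starting at offset 0 crosses two further 4k boundaries.
payload = bytes(10000)
assert strip_block_prefixes(add_block_prefixes(payload, 0), 0) == payload
```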
A new scheme for storage files is to make them "generational", with a series of aging files, each exponentially larger than the younger files. All updates and inserts happen synchronously at the youngest generation.
As files grow, they either exceed their live data threshold (the amount of non-garbage data they contain is too large) or exceed their garbage data threshold (the amount of dark data, or garbage, they contain exceeds the threshold).
Example storage file generations and their live data/garbage data maximums:
Gen 0: 10 MB
Gen 1: 100 MB
Gen 2: 1 GB
Gen 3: 10 GB
For example, the largest the Gen 1 file could get before triggering async compaction is 200 MB, with it reaching both its live data threshold and its garbage threshold at the same time.
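A minimal sketch of those triggers, assuming (as in the example above) that each generation's garbage threshold equals its live-data threshold; the names and the 10x progression are illustrative:

```python
# Illustrative thresholds matching the generations listed above; in this
# sketch each generation's garbage threshold equals its live-data threshold.
GEN_THRESHOLDS = [10 * 2**20, 100 * 2**20, 1 * 2**30, 10 * 2**30]   # Gen 0..3

def needs_copy_up(live_bytes: int, gen: int) -> bool:
    """Too much live data: copy this file's contents to the next generation."""
    return live_bytes > GEN_THRESHOLDS[gen]

def needs_compaction(garbage_bytes: int, gen: int) -> bool:
    """Too much garbage: copy-compact this file."""
    return garbage_bytes > GEN_THRESHOLDS[gen]

# Gen 1 at both limits at once: 100 MB live + 100 MB garbage = a 200 MB file.
```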
All inserts happen at the youngest generation file.
Before an insert occurs, revision information is generated, a new sequence number is assigned (locally unique and monotonically increasing across the generation files), and the document is written to disk at the end of the file. New by_seq and by_id entries are created and stored in the btrees, pointing to the written document, again with all node rewrites happening at the end of the file.
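A small, self-contained sketch of that insert path, with Python dicts standing in for the btrees and a bytearray standing in for the tail-append file (all names are illustrative):

```python
import hashlib

class Generation:
    """Stand-in for one storage file: a tail-append log plus dicts
    standing in for the by_id and by_seq btrees."""
    def __init__(self):
        self.log = bytearray()   # the append-only file body
        self.by_id = {}          # doc id -> file offset of the latest version
        self.by_seq = {}         # seq    -> file offset

_seq = 0
def next_seq() -> int:
    """Locally unique and monotonically increasing across the generation files."""
    global _seq
    _seq += 1
    return _seq

def insert(youngest: Generation, doc_id: str, body: bytes):
    seq = next_seq()
    rev = hashlib.md5(body).hexdigest()         # stand-in for revision information
    record = repr((doc_id, rev, seq, body)).encode()
    offset = len(youngest.log)                  # the write lands at the end of the file
    youngest.log += record
    youngest.by_id[doc_id] = offset             # new by_id entry pointing at the document
    youngest.by_seq[seq] = offset               # new by_seq entry pointing at the document
    return seq, rev

youngest = Generation()
insert(youngest, "doc-1", b'{"value": 1}')
```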
All updates of existing items happen at the youngest generation file. Deletions also cause a tombstone value to be written, along with new by_id and by_seq entries, so that indexers and replicators know to remove or delete the same document as well.
As we keep document metadata in memory, we'll know the previous seq number of the document when we write it.
If the existing item is already in the youngest file, the previous by_seq entry for the item is deleted, the new by_seq value is added, and the old by_id entry is overwritten. If the item isn't in the youngest file, a tombstone value is written to the by_seq index in place of the old seq entry, to later remove the existing value in an older file. In this case it is an error for there to be no older file where the live document exists.
When we do not keep document metadata in memory, we do not generate a tombstone for the previous by_seq value. The write efficiency remains the same, but indexing and replication might encounter old, obsolete values; index, copy, or replicate them; and only later encounter the newer updated value that obsoletes the older value in the replica or index.
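A sketch of the index maintenance for an update when the previous seq is known from in-memory metadata; the dict-backed indexes and names are stand-ins, not the real structures:

```python
from types import SimpleNamespace

TOMBSTONE = object()   # sentinel standing in for an on-disk by_seq tombstone

def update_indexes(youngest, doc_id, new_offset, new_seq, prev_seq, prev_in_youngest):
    """Index maintenance for an update; dicts stand in for the btrees."""
    youngest.by_id[doc_id] = new_offset          # the old by_id entry is overwritten
    youngest.by_seq[new_seq] = new_offset        # new by_seq entry for the new version
    if prev_in_youngest:
        youngest.by_seq.pop(prev_seq, None)      # previous version lives here: delete its entry
    else:
        # previous version lives in an older file: leave a tombstone so the
        # copy/compaction process can remove the old entry when they meet
        youngest.by_seq[prev_seq] = TOMBSTONE

youngest = SimpleNamespace(by_id={}, by_seq={})
update_indexes(youngest, "doc-1", new_offset=4096, new_seq=7, prev_seq=3, prev_in_youngest=False)
```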
When an update or insert occurs, it's always written to the youngest file. If a deletion occurs and the document doesn't exist in the youngest generation, a tombstone value is inserted.
Once the young file write is complete, the caller gets back an acknowledgement that the write completed.
When the live data in the file exceeds its threshold, a copy process is invoked asynchronously and the young file contents are copied into the next older generation file. If there is no next generation file, the storage file is simply moved/renamed to the next generation. If there is already an older generation file, the contents of the file are copied in bulk into the older generation file and the older file's btrees are updated.
Any by_seq tombstone values that encounter their old values are dropped along with the old value, and the by_id value is left alone (the by_id values will be dropped later during compaction).
Once all data is copied, the original file is deleted and a new file is started in its place.
If the older generation is pushed over its live data threshold, its contents are also moved/copied up to the next level, in a recursive manner.
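A sketch of that recursive promotion, with each generation modeled as a dict of live contents; the rename-versus-copy distinction follows the description above, and tombstone/garbage accounting is omitted for brevity:

```python
def live_size(contents: dict) -> int:
    """Bytes of live data in a file, with dict values standing in for documents."""
    return sum(len(v) for v in contents.values())

def copy_up(generations, thresholds, gen=0):
    """Recursive promotion: push a file's contents to the next older generation
    whenever its live data exceeds its threshold."""
    if live_size(generations[gen]) <= thresholds[gen]:
        return
    if gen + 1 == len(generations):
        # no older file yet: the storage file is simply moved/renamed up a level
        generations.append(generations[gen])
        thresholds.append(thresholds[-1] * 10)     # illustrative 10x progression
    else:
        # bulk-copy the contents into the older file, updating its indexes;
        # younger values win over any older copies of the same keys
        generations[gen + 1].update(generations[gen])
    generations[gen] = {}                          # a new, empty file takes its place
    copy_up(generations, thresholds, gen + 1)      # the older file may now be over threshold
```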
When a file reaches its garbage data threshold (the garbage on disk exceeds the live data threshold), it's copy compacted: the live values are recopied to a new file and the btrees are rebuilt for optimized access, then the original file is discarded and the new file is put in its place.
Compaction, even at the lowest levels, happens asynchronously and optimistically, using a snapshot of the old file. New data can still be written to the old file; the compactor will notice this once it has finished copying the snapshot and will recopy the missing values into the new file. It repeats this until it has caught up with the live writes, then performs the file switchover.
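A sketch of that optimistic catch-up loop, assuming hypothetical helpers for reading anything written after a given seq and for writing into the new file:

```python
def copy_compact(snapshot_entries, read_new_writes, write_to_new_file):
    """Optimistic catch-up copy. `snapshot_entries` is a list of (seq, value)
    pairs from the initial snapshot; `read_new_writes(since)` returns whatever
    was written to the old file after that seq; both are hypothetical helpers."""
    copied_through = 0
    batch = snapshot_entries
    while batch:
        for seq, value in batch:
            write_to_new_file(seq, value)
            copied_through = max(copied_through, seq)
        # anything written to the old file while we were copying?
        batch = read_new_writes(copied_through)
    # caught up with the live writes: the file switchover can happen now
```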
During this compaction, some percentage of the oldest values can be copied to the next older generation, if it exists. A heuristic based on how close the file is to its live-data threshold can be used, so some percentage of the oldest live data from the file is copied to the next level.
Any tombstone values it contains are copied up to the next generation during this time (not recopied to the same level). If the matching live value is found in the next generation, both are deleted; otherwise the tombstone is written intact to the older file (in both cases, more garbage is noted in the older file).
When tombstone values are written to the next oldest generation, they must be fully committed to that file before the current generation can switch to the newly compacted file.
Any tombstone value that previously wiped out a by_seq entry will also cause the corresponding by_id entry to be dropped. When copy compacting, the by_id btree of the original file isn't copied; instead the by_seq key values are recopied, sorted on disk by id, and rewritten into a new, optimized by_id btree. So any by_id values not eliminated by the original tombstone will be dropped during this phase.
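A sketch of rebuilding the by_id index from the surviving by_seq entries; the real process sorts on disk, while this illustration sorts in memory:

```python
def rebuild_by_id(surviving_by_seq):
    """Rebuild the by_id index from by_seq entries that survived tombstone
    elimination. `surviving_by_seq` is an iterable of (seq, doc_id, offset)."""
    by_id = {}
    for seq, doc_id, offset in sorted(surviving_by_seq, key=lambda e: e[1]):
        # if a doc id somehow appears at several seqs, keep the newest one
        if doc_id not in by_id or seq > by_id[doc_id][0]:
            by_id[doc_id] = (seq, offset)
    return by_id
```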
All live updates go to the youngest generation and should come from a single thread/process, though compaction and copying values to older generations can occur concurrently in a compactor thread. Any reader who has a file open gets a consistent snapshot of the file and its btrees, regardless of whether it's concurrently updated, copied to the next generation, or compacted.
If a reader has a snapshot of a file that's being deleted by compaction, the file is inaccessible for new snapshots, but the deleted file is still completely readable for as long as the snapshot reader keeps it open. Once closed by all readers, the OS will finally free the space consumed by the now-deleted file.
When performing a single or batched database lookup (it's possible to look up by_seq, but in practice this will be done by_id), the key is first looked for in the youngest file, then in the next oldest, until the key is found or the oldest file has been searched. As long as the files are opened youngest to oldest (not all need to be opened at once), consistent results are guaranteed.
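A sketch of that youngest-to-oldest lookup, with dicts standing in for each file's by_id btree:

```python
def lookup(doc_id, by_id_indexes_young_to_old):
    """Look the key up in the youngest file first, then each older file,
    returning the first hit."""
    for by_id in by_id_indexes_young_to_old:
        if doc_id in by_id:
            return by_id[doc_id]       # a deletion tombstone would also be found here first
    return None                        # not present in any generation

gens = [{"doc-1": "newest version"}, {"doc-1": "older version", "doc-2": "only copy"}]
assert lookup("doc-1", gens) == "newest version"
assert lookup("doc-2", gens) == "only copy"
```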
A snapshot scan of the logical database is as simple as doing a snapshot btree read in each storage file in each generation, then merging the results: as the next value from each file is read, the value emitted comes from the file whose value is smallest for an ascending scan, or largest for a descending scan. Matching values from younger files override the same values found in older files.
Storage files should be opened and the header read in youngest-to-oldest file order. This order ensures that all values are read and nothing is missed. Opening in reverse order could miss items that have been copied to an older file whose MVCC version is later than the one currently being scanned.
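A sketch of the merged snapshot scan using a heap merge, where each per-file snapshot yields (key, value) pairs in ascending key order and the files are listed youngest first so younger files win ties:

```python
import heapq

def merged_scan(snapshots_young_to_old):
    """Merge per-file snapshot scans; on equal keys the youngest file's value
    is emitted and older duplicates are skipped."""
    tagged = (
        ((key, age, value) for key, value in snap)
        for age, snap in enumerate(snapshots_young_to_old)
    )
    last_key = object()
    for key, age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                   # an older copy of a key already emitted
        last_key = key
        yield key, value

young = [("a", "a@2"), ("c", "c@5")]
old = [("a", "a@1"), ("b", "b@1")]
assert list(merged_scan([young, old])) == [("a", "a@2"), ("b", "b@1"), ("c", "c@5")]
```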
Occasionally, during off-peak hours for instance, a full database compaction can occur. This will recover all wasted space (items in older generations are removed if they occur in younger generations) and it lays out the files and btrees for optimal by_seq scanning and by_id lookups.
Starting from the youngest file, a full compaction occurs: its contents are re-laid out and its btrees optimized, and its tombstones are copied to the next oldest generation. Tombstones stay around until they encounter their corresponding "live" key. Then the next file is compacted, and the process continues until the oldest/largest generation is reached. It is an error to encounter a tombstone in the oldest generation; the tombstone should have encountered its live key and both should have been deleted from the by_seq index.
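A sketch of that full-compaction pass over the by_seq contents, youngest to oldest, with dicts standing in for each file's by_seq index and a string sentinel standing in for tombstones:

```python
TOMBSTONE = "tombstone"   # stand-in for an on-disk tombstone value

def full_compaction(by_seq_young_to_old):
    """Full compaction pass: each generation is a dict standing in for its
    by_seq index (seq -> value or TOMBSTONE), listed youngest first."""
    pushed = {}                                    # tombstones travelling toward older files
    for gen in by_seq_young_to_old:
        # this file's own tombstones join those pushed down from younger files
        for seq, value in list(gen.items()):
            if value == TOMBSTONE:
                pushed[seq] = TOMBSTONE
                del gen[seq]
        # a tombstone that meets its live key deletes it and stops travelling
        for seq in list(pushed):
            if seq in gen:
                del gen[seq]
                del pushed[seq]
    # reaching the oldest generation with tombstones still in hand is an error
    assert not pushed, "a tombstone never met its live key"

gens = [{5: "doc-a@5", 3: TOMBSTONE}, {3: "doc-a@3", 2: "doc-b@2"}]
full_compaction(gens)
assert gens == [{5: "doc-a@5"}, {2: "doc-b@2"}]
```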
If the oldest generation file's live data shrinks to where it can fit inside the next youngest generation, it will be copied into that generation instead of compacted, and the oldest generation removed.
TODO: Formulas to compute average and amortized costs of common operations. Explain how this works very well for large datasets where a randomly distributed subset of the data is updated, and how, when the update frequency obeys a power-law curve, this scheme reduces the I/Os for lookups, writes, and compactions.