CouchDB database files are purely append-only. All document updates and inserts are written to the end of the file. This allows for efficient io on updates and very reliable design.
When written in bulk, the documents are written sequentially on disk. The btrees by_seq and by_id index) that point to them are updated, each node that needs updating rewritten, with the leaf nodes being located and rewritten, and any parents being rewritten to point to the new node(s). Once all nodes are rewritten up to the root, the database header is written to contain the information about the new root nodes.
As all writes occur at the end of the file, this results in very fast updates, provided all the btree nodes that are being updated are already in memory (otherwise they must be fetched from physical disk), the updates require minimal seeks possible.
A tradeoff of this scheme is the wasted space as previous documents are simply forgotten, leaving a hole in the file, and also wasted btree nodes as old btree nodes leave gaps as well. It also tends to give the live btree nodes poor locality for scans, as the btree nodes go from an optimized layout with good locality to child and sibling nodes, to one where btree jumps start to look random in the worst case. However, the most recent updates are all near the end of the file, so lookups on the most recent documents are the most efficient.
- A single tail append storage file contains the following things.
- by_id btree index
- by_seq btree index
- local_docs btree index: This is a btree of non-replicating, non-indexed documents
- Header: This contains pointers to the roots of the btrees, and other metadata about the database file. It can also be called a trailer, since it always occurs at the end of the file, after the data and the indexes it points to.
To differentiate between headers and data, especially data that might look like a header, every 4k in the file, a single byte takes the value of either 0x00 or 0x01. 0x01 means that what follows the byte is a database header. Non-header, or data blocks, have 0x00.
When opening the file, the file scanned backward to find the first intact header. It's assumed if an intact header is found, all data preceding it is intact, though crc32 is used on all items and btree nodes when read from the filesystem.
For datablocks (documents, btrees nodes or anything not a header), any data that spans across the 4k marker will have the 0x00 added when written, and stripped out when read. Otherwise Couchbase is not aware of any other block boundaries.
The tail append design ensures data integrity of committed data regardless of clean shutdown. Should a write fail do to any kind of crash, even power loss, previous data written and indexes written to the file are still intact and untouched. And incomplete writes and index updates simply appear as garbage data at the end of the file, and are ignored on database open. This is due to the pure tail append design, which means special recovery code, such as consistency checks or fixups, are never needed.
A new scheme for storage files is to make them "generational", with a series of aging files that is exponentially larger than the younger files. All updates and inserts happen synchronously at the youngest generation.
As files grow, they either exceed their live data threshold (the amount of non-garbage data they contain is too large), or exceed their garbage data threshold (the amount of dark data, or garbage they contain exceeds the threshold).
When a level exceeds live data threshold, some (or all) of the data is copied to the next generation. When it exceeds it's garbage threshold, it's copy compacted to recover wasted space.
Example storage file generations and live data and garbage max:
Gen 0: 10 mb
Gen 1: 100 mb
Gen 2: 1 gb
Gen 3: 10 gb
For example, the largest the file Gen 1 file could get, before triggering async compaction is 200 mb, with it reaching both its live data threshold and garbage threshold at the same time.
All inserts happen at the youngest generation file.
Before the insert occurs, a new update sequence is given to the document (locally unique to all documents and monotonically increasing to the generation files), and the document is written to disk at the end of the file. A new by_seq and by_id entry is created and stored in the btrees and points to the written document, again with all node rewrites happening at the end of the file.
All updates, including deletes, happen at the youngest generation file.
Doc updates and deletes generate and are assigned new update sequence values, and remove the old by_seq entries of the document. A document deletion stores by_id "tombstone", and a new by_seq entry, so that indexers and replicators know to delete the same document as well.
If we keep the documents previous update sequence in memory, we'll write a tombstone value for that by_seq entry, so that it's deleted when the tombstone is copied to the older file.
If the existing item is already in the youngest file, the previous by_seq entry for the item is deleted, the new by_seq value is added and the old by_id entry is replaced. If it doesn't exist, a tombstone value is updated to the by_seq index in place of the old seq entry, to note to later remove the existing value in a older file.
When we do not keep document metadata in memory, we do not generate a tombstone for the previous by_seq value. The write efficiency remains the same, but indexing and replication might encounter, old obsolete values, index, copy or replicate them, and later encounter the newer updated value that obsoletes the older value in the replica or index.
When an update or insert occurs, it's always written the youngest file. If a deletion occurs and the document doesn't exist in the youngest generation, a tombstone value is inserted.
Once the young file write is complete, the caller gets back an acknowledgement that the write completed.
When the live data in the file exceeds its threshold, a copy process invokes asynchronously, and the young file contents are copied into the next older generation file. If there is no next generation file, the storage file is simply moved/renamed to the next generation. If there is already older generation file, the contents of the file are copied in bulk into the older generation file and the older file's btrees are updated.
Any by_seq tombstone values that encounter their old values are now dropped along with the old value, and the by_id value left alone (The by_id values will be dropped later during compaction)
Once all data is copied, the original file is deleted and a new file is started in its place.
If the older generation is pushed over its Live Data Threshold, its contents also moved/copied up to the next level, in a recursive manner.
If we don't write tombstones for old by_seq values at the youngest generation, then we must delete it when writing the new document. We must read the old by_id metadata get the old seq, which is something that can be done at the same time the newer by_it metadata is written. Then we delete outdated by_seq entry at the same time we insert the new by_seq enty.
When a file reaches its garbage data threshold – the garbage on disk exceeds the live data threshold – it's copy compacted, recopying the values to a new file and rebuilding the btrees for optimized access, then the new file is put into place and the original file is discarded.
Compaction will happen asynchronously and optimistically, using a snapshop of the pre-compacted file. The file can still be written too in the same tail append way concurrent with the compaction, and any changes written the compactor will notice once done copying the snapshot and will recopy the changes into the new file. It will do this until it's caught up with the live writes, then it will do the file switch-over, adding the new file and deleting the old storage.
Any tombstone values it contains are copied up to the next generation during this time (not recopied to the same level), and if found in the next generation, are deleted along with the old value, or the tombstone written intact to the older file (and both cases more garbage is noted to be in the older file).
When tombstone values or other values are written to the next oldest generation, they must be fully committed to that file before the current generation can switch to the newly compacted file.
Any tombstone values that it previously deleted a by_seq entry, will also cause the corresponding by_id entry to be dropped. When copy compacting, the by_id btree of the original file isn't copied directly, instead the by_seq key values first written to the compacted file are recopied, sorted on-disk by id, and finally written into a new optimized by_id btree. So any by_id values not eliminated by the original tombstone will be dropped during this phase.
All live updates go to the youngest generation and should be from a single thread/process, though compaction and copying values to older generations can occur concurrently with a compactor thread. Any reader who has a file open will get a consistent snapshot of the file and its btrees, regardless if it's concurrently updated, copied to the next generation, or compacted.
If a reader has a snapshot of file that's being deleted by compaction, the file is inaccessible for new snapshots, but the deleted file is still completely readable for as long as the snapshot reader has the file open. Once closed by all readers, the OS will finally free the space consumed by the now deleted.
When performing a single or batched database lookup (it's possible to lookup by_seq, but in practice will be done by_id). First a key is looked for in the youngest file, then next oldest, until the key is found or the old file has been looked up. As long the files are opened youngest to oldest (not all need to opened at once), consistent results are guaranteed.
A snapshot scan of the logical database is as simple a doing a snapshot btree read in each storage file in each generation, then merging the results as the next value from each file is read, reading the next value from the file whose value is smallest for ascending scan, largest for descending scans. Equivalent values from the youngest files override the same values found in later.
Storage files should be opened and the header read in youngest to oldest file order. This order will ensure that all values are read, nothing missed. Opening in reverse order could cause missing items that are copied to an older file that's a later MVCC version than currently being scanned.
An actively writing system always has garbage space and fragmentation, with the smallest files experiencing more compactions. The cost of compaction is ONlog(N), so the larger the # of documents, the longer, and more expensive it is per document.
With most workloads, the oldest files will need compaction far less frequently than the younger files. It might be advantageous to perform full database compaction during off-peak periods.
This is will recover all wasted space (items in older generations removed if they occur in younger generations) and it lays out the documents and btrees for optimal by_seq scanning and by_id lookups.
Starting from the youngest file, a full compaction occurs, its contents re-laid out and btree optimized, and its tombstones are copied to the next oldest generation. Tombstones stay around until they encounter their corresponding "live" key. Then the next file is compacted, and the process continues until the oldest/largest generation is reached. It is an error to encounter a tombstone in the oldest generation, the tombstone should have encountered its live key and they both should been deleted from the by_seq index.
If the oldest generation file shrinks down its live data size to where it can fit inside its next younger generation, it will be copied to that generation instead of compacted, and the oldest generation removed.
TODO: Formulas to compute average and amortized IO costs of common operations. Explain how this works for large datasets where randomly distributed amount of the data is updated and update patterns when the update frequency takes a zipfian distribution (.