The migration of data to disk is generally much slower and has far lower throughput than writing to memory. When an application is setting or otherwise mutating data faster than it can be migrated out of memory to make space available for incoming data, the server's behavior may differ from what a memcached client expects. memcached simply evicts items from memory and stores the newly mutated item. Couchbase, by contrast, is expected to migrate items to disk first.
When Couchbase Server determines that RAM is at 90% of the
bucket quota, the server will return a temporary out of memory
error to clients when storing data. This indicates that the out of
memory condition is temporary and the operation can be retried.
The reason for the response is that there are still outstanding
items in the disk write queue that must be persisted to disk
before they can safely be ejected from memory. The situation is
rare, and is seen only when very large volumes of writes arrive
in a short period of time. Clients will still be able to read
data from memory.
When Couchbase Server determines that there is not enough memory
to store information immediately, the server will return
TMP_OOM, the temporary out of memory error.
This is designed to indicate that the inability to store the
requested information is only a temporary, not a permanent, lack
of memory. When the client receives this error, the storage
operation can either be retried later or failed, depending on
the client and application requirements.
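As a rough illustration of the retry option described above, here is a minimal sketch of client-side handling of the temporary out of memory response. The `store_with_retry` helper and the `TMP_OOM` sentinel are made up for the example; a real client library would expose its own error type.

```python
import time

# Hypothetical sketch: retry a store that fails with the temporary
# out-of-memory response, backing off so the disk write queue can drain.
TMP_OOM = "TMP_OOM"

def store_with_retry(store, key, value, retries=5, delay=0.0):
    """Call store() until it returns something other than TMP_OOM."""
    for _ in range(retries):
        result = store(key, value)
        if result != TMP_OOM:
            return result          # stored, or a permanent failure
        time.sleep(delay)          # temporary condition: wait and retry
    raise RuntimeError("still out of memory after %d attempts" % retries)

# Simulated server that rejects the first two attempts.
responses = iter([TMP_OOM, TMP_OOM, "STORED"])
print(store_with_retry(lambda k, v: next(responses), "doc1", b"x"))
```

The alternative, per the text, is to fail immediately and surface the error to the application instead of retrying.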
The actual process of eviction is relatively simple. When we need memory, we walk the hash tables looking for items we can get rid of (i.e. items that are already persisted on disk) and start dropping their values. If we are above our low watermark for memory, we will also eject data as soon as it is persisted, but only if it belongs to an inactive (e.g. replica) vBucket. If we have plenty of memory, we keep everything loaded.
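The ejection rule above can be sketched as follows. This is a toy model, not ep-engine code: `Item` and `evict_until_below` are illustrative names, and the key point is that only values already persisted to disk are candidates for ejection, and metadata is kept.

```python
# Toy sketch of value ejection: walk the hash table and drop only
# values that are safely persisted, until memory falls below the
# low watermark. Dirty (unpersisted) items are never ejected.

class Item:
    def __init__(self, key, value, persisted):
        self.key, self.value = key, value
        self.persisted = persisted      # already written to disk?

def evict_until_below(hash_table, mem_used, low_watermark, item_size):
    """Eject persisted values until mem_used is at or below the watermark."""
    for item in hash_table.values():
        if mem_used <= low_watermark:
            break
        if item.persisted and item.value is not None:
            item.value = None           # eject the value, keep the metadata
            mem_used -= item_size
    return mem_used

table = {k: Item(k, b"v", persisted=(k != "dirty")) for k in ("a", "b", "dirty")}
print(evict_until_below(table, mem_used=300, low_watermark=150, item_size=100))
```

Note that the "dirty" item keeps its value: it must reach disk before it can be dropped, which is exactly why the temporary out of memory response exists.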
The bulk of this page is about what happens when we encounter values that are not resident.
In the current flow, a get request against a given document ID will first look up the value in the hash table. For any item we know about, the document ID and its metadata are always available in the hash table. In the case of an "ejected" record, the value is missing, with the value pointer effectively set to NULL. This approach works well for larger objects, but is not particularly efficient for small objects; that is being addressed in future versions.
When fetching a value, we will first look in the hash table. If we don't find it, we don't have it. MISS.
If we do have it and it's resident, we return it. HIT.
If we have it and it's not resident, we schedule a background fetch and let the dispatcher pull the object from the DB and reattach it to the stored value in memory. The connection is then placed into a blocking state so the client will wait until the item has returned from slower storage.
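The three-way lookup above can be sketched like this. The names (`get`, the `MISS`/`HIT`/`BLOCKED` outcomes, and the dict-based hash table) are invented for the example; the real engine blocks the connection rather than returning a status.

```python
# Illustrative lookup flow: MISS if the ID is unknown, HIT if the
# value is resident, otherwise schedule a background fetch and block.

MISS, HIT, BLOCKED = "MISS", "HIT", "BLOCKED"

def get(hash_table, key, schedule_bg_fetch):
    entry = hash_table.get(key)
    if entry is None:
        return MISS, None                  # no metadata at all: a miss
    if entry.get("value") is not None:
        return HIT, entry["value"]         # resident in memory: a hit
    schedule_bg_fetch(key)                 # dispatcher will pull from disk
    return BLOCKED, None                   # connection waits for the fetch

fetches = []
table = {"hot": {"value": b"v"}, "cold": {"value": None}}
print(get(table, "hot", fetches.append))   # ('HIT', b'v')
print(get(table, "cold", fetches.append))  # ('BLOCKED', None)
print(get(table, "gone", fetches.append))  # ('MISS', None)
```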
The background fetch happens at some point in the future via an asynchronous job dispatcher.
When the job runs, the item is read from disk; the in-memory item is then looked up and, if and only if it is still not resident, its value is set with the result of the disk fetch.
Once the process is complete, whether the item was reattached from the disk value or not, the connection is reawakened so the core server will replay the request from the beginning.
It's possible (though very unlikely) for another eject to occur before this process runs in which case the entire fetch process will begin again. The client has no particular action to take after the get request until the server is able to satisfy it.
An item may already be resident after a background fetch either because another background fetch for the same document ID completed before this one, or because another client modified the value since we last looked in memory. In either case, we assume the disk value is older and discard it.
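The completion rule reduces to a small conditional, sketched below with invented names (`complete_bg_fetch`, dict entries): attach the disk value only if the item is still ejected, otherwise keep the in-memory value, which is assumed newer.

```python
# Sketch of background-fetch completion: the disk value is attached
# only if the item is *still* not resident; an already-resident value
# is assumed newer, and the disk copy is discarded.

def complete_bg_fetch(entry, disk_value):
    """Reattach the disk value only if the item is still ejected."""
    if entry.get("value") is None:
        entry["value"] = disk_value    # still ejected: attach disk value
    return entry["value"]              # otherwise the memory copy wins

still_ejected = {"value": None}
print(complete_bg_fetch(still_ejected, b"from-disk"))    # b'from-disk'

modified_meanwhile = {"value": b"newer"}
print(complete_bg_fetch(modified_meanwhile, b"stale"))   # b'newer'
```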
Concurrent reads and writes are possible under the right conditions. When those conditions are met, reads are executed by a separate dispatcher that exists solely for read-only database requests; otherwise, the read-write dispatcher is used.
The underlying storage layer reports the level of concurrency it supports at startup time (specifically, after init-script evaluation). For stock SQLite, concurrent reads are allowed only if the journal mode is WAL and read_uncommitted is enabled.
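For concreteness, here is what that SQLite configuration looks like, using Python's stdlib sqlite3 module against a throwaway database file (the file path and connection setup are just for the example).

```python
import os
import sqlite3
import tempfile

# Configure a SQLite database the way the text describes: WAL journal
# mode plus read_uncommitted, the condition under which concurrent
# reads are permitted. WAL requires an on-disk (not in-memory) database.
path = os.path.join(tempfile.mkdtemp(), "items.db")
conn = sqlite3.connect(path)

mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("PRAGMA read_uncommitted=1")
uncommitted = conn.execute("PRAGMA read_uncommitted").fetchone()[0]

print(mode, uncommitted)  # wal 1
```

Note that read_uncommitted only has an observable effect between connections sharing a cache; the pragma itself is per-connection.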
Future storage mechanisms may allow for concurrent execution under different conditions and will indicate this by reporting their level of concurrency differently.
The concurrentDB engine parameter allows the user to disable concurrent DB access even when the DB reports it's possible.
The possible concurrency levels are reported via the
ep_store_max_readwrite stats. The
dispatcher stats will show the read-only dispatcher when
it is in use.