How Cassandra Write Works (Commit Log, Memtable, SSTable)

All the nodes in a Cassandra cluster are treated as equals: a client can connect to any node in the cluster, and that node acts as the coordinator between the client application and the rest of the cluster. The coordinator inspects the client's request and determines which nodes in the cluster should fulfil it, based on how the cluster is configured.
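
As a minimal sketch of "connect to any node", here is what a connection could look like with the DataStax Java driver 3.x; the node addresses are hypothetical, and the driver discovers the remaining nodes on its own:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ConnectExample {
    public static void main(String[] args) {
        // Any node can serve as a contact point; whichever node receives a
        // request acts as the coordinator for that request.
        Cluster cluster = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2")   // hypothetical node addresses
                .build();
        Session session = cluster.connect();
        System.out.println("Connected to: " + cluster.getMetadata().getClusterName());
        cluster.close();
    }
}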

How does a Cassandra write work?

When a write request arrives at any node in a Cassandra cluster, that node is called the coordinator for the request and acts as a proxy between the client and the replicas throughout the request cycle.

For replica nodes in the coordinator's local data center, the coordinator sends the write request to all replicas that own the row being written. The write consistency level determines how many replica nodes must respond with a success acknowledgment for the write to be considered successful.

If the cluster spans multiple data centers, the local coordinator node selects a remote coordinator in each of the other data centers to coordinate the write to the replicas in that data center. Each of the remote replicas responds directly to the original coordinator node.

As soon as a sufficient number of replicas have responded to satisfy the consistency level, the original coordinator node acknowledges the write to the client. If a replica doesn't respond within the timeout, it is presumed to be down, and a hint is stored for the write.
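
The acknowledgment-counting logic described above can be pictured with a minimal sketch. This is not Cassandra's actual implementation (the real coordinator also stores hints for replicas that merely time out, routes by token ownership, and much more); the names CoordinatorSketch, coordinateWrite and storeHint are invented for illustration:

import java.util.List;
import java.util.concurrent.*;

// Simplified illustration of a coordinator waiting for enough replica
// acknowledgments before answering the client.
public class CoordinatorSketch {
    static boolean coordinateWrite(List<Callable<Boolean>> replicaWrites,
                                   int consistencyLevel,
                                   long timeoutMs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(replicaWrites.size());
        CountDownLatch acks = new CountDownLatch(consistencyLevel);
        for (Callable<Boolean> write : replicaWrites) {
            pool.submit(() -> {
                try {
                    if (write.call()) acks.countDown();   // replica acknowledged
                } catch (Exception e) {
                    storeHint(write);                     // replica presumed down: store a hint
                }
                return null;
            });
        }
        // Success as soon as the consistency level is met; otherwise time out.
        boolean ok = acks.await(timeoutMs, TimeUnit.MILLISECONDS);
        pool.shutdown();
        return ok;
    }

    static void storeHint(Callable<Boolean> failedWrite) {
        // Placeholder: a real coordinator persists the mutation in its hints directory.
    }
}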

Write within a node

On each replica node, the write request is first appended sequentially to the commit log on disk to ensure data durability: the written data survives even if the node fails. If a crash occurs before the Memtable is flushed to disk, the commit log is replayed to rebuild the Memtable.
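
As a rough sketch of the idea (not Cassandra's actual on-disk format), a commit log is simply a file that is appended to sequentially and replayed after a crash; the file name and the key=value encoding below are made up for illustration:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Toy illustration: append every write sequentially to a log file, and on
// restart replay the log to rebuild the in-memory state.
public class CommitLogSketch {
    static final Path LOG = Paths.get("commitlog.txt");   // hypothetical location

    static void append(String key, String value) throws IOException {
        // Sequential append: the write is durable once it reaches disk.
        Files.write(LOG, (key + "=" + value + System.lineSeparator()).getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    static Map<String, String> replay() throws IOException {
        // After a crash, rebuild the memtable by re-applying every logged write.
        Map<String, String> memtable = new TreeMap<>();
        if (Files.exists(LOG)) {
            for (String line : Files.readAllLines(LOG)) {
                String[] kv = line.split("=", 2);
                memtable.put(kv[0], kv[1]);
            }
        }
        return memtable;
    }
}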

Memtable

The write request data is then indexed and written to an in-memory (RAM) structure called the Memtable. When the Memtable reaches a certain threshold (governed by commitlog_total_space_in_mb or memtable_cleanup_threshold), its data is flushed to an immutable data file on disk called an SSTable (a fresh new SSTable each time) using sequential I/O, and the corresponding data in the commit log is purged (deleted).
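
A minimal sketch of the Memtable-and-flush idea, assuming a toy entry-count threshold instead of Cassandra's real size-based settings; the class name, file names and encoding are invented for illustration:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Toy illustration: an in-memory sorted map standing in for the Memtable,
// flushed to a brand new "SSTable" file once a threshold is reached.
public class MemtableSketch {
    private final NavigableMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold;                 // hypothetical entry-count threshold
    private int sstableCounter = 0;

    MemtableSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) throws IOException {
        memtable.put(key, value);                     // several writes to one key merge here
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() throws IOException {
        // Keys come out of the TreeMap already sorted, so the file is written
        // in one sequential pass; the memtable is then cleared.
        Path sstable = Paths.get("sstable-" + (++sstableCounter) + ".txt");
        List<String> rows = new ArrayList<>();
        memtable.forEach((k, v) -> rows.add(k + "=" + v));
        Files.write(sstable, rows);
        memtable.clear();
    }
}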

The Memtable allows several writes to the same key to be merged so that, when flushed, only a single write is made to the SSTable. Increase the Memtable thresholds when the write load includes either a high volume of updates on a small set of data or a steady stream of continuous writes. This leads to more efficient compaction: fewer flushes produce fewer, larger SSTables, so less compaction is needed.

Memtables can be located on heap or off heap; however, in the general case they are not moved entirely off heap. Two off-heap options are provided through the memtable_allocation_type setting: offheap_buffers and offheap_objects.

offheap_buffers : This has the lowest impact on reads because the values are still "live" in Java buffers; only the cell name and value are moved to DirectBuffer objects.
offheap_objects : Moves the entire cell off heap; only a pointer remains on heap, and the cell must be temporarily moved back on heap when a read request arrives.
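
For example, switching to off-heap objects is a one-line change in cassandra.yaml, memtable_allocation_type: offheap_objects (offheap_buffers is set the same way); check the exact setting against the Cassandra version in use.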

Writes are about 5% faster with offheap_objects enabled, primarily because Cassandra doesn't need to flush as frequently; bigger SSTables mean less compaction is needed.

memtable_cleanup_threshold: Used together with memtable_heap_space_in_mb and memtable_offheap_space_in_mb. When the combined space used by on-heap and off-heap memtables reaches memtable_cleanup_threshold multiplied by the total of those two settings, the largest memtable (on heap or off heap) is flushed to disk.
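
As a worked example with purely hypothetical values: with memtable_heap_space_in_mb = 2048, memtable_offheap_space_in_mb = 2048 and memtable_cleanup_threshold = 0.25, the largest memtable is flushed once all memtables together occupy about 0.25 * (2048 + 2048) = 1024 MB.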

SSTable

SSTables are append-only, immutable data files stored sequentially on disk and maintained per Cassandra table. Every time a Memtable reaches its threshold, the data is flushed to a fresh new SSTable.

SSTables (Sorted String Tables) are periodically consolidated by a process called compaction, which also discards obsolete data marked for deletion with a tombstone. To ensure all data across the cluster stays consistent, various repair mechanisms are used.

By default, SSTables are stored compressed using the LZ4 algorithm. Compression is about reducing data size (less space, better I/O), while compaction is about how data gets reorganised.
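
Both of these are per-table options. As a hedged example of setting them when creating a table (the keyspace and table names are hypothetical, the keyspace is assumed to exist, and the map syntax shown is the modern CQL form, so verify it against your Cassandra version):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TableOptionsExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");   // hypothetical keyspace

        // Compression (how SSTable data is encoded on disk) and compaction
        // (how SSTables are reorganised over time) are declared per table.
        session.execute("CREATE TABLE IF NOT EXISTS events ("
                + " id uuid PRIMARY KEY, payload text)"
                + " WITH compression = {'class': 'LZ4Compressor'}"
                + " AND compaction = {'class': 'SizeTieredCompactionStrategy'}");

        cluster.close();
    }
}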

Writes to a given key might be spread across several SSTables. Cassandra stores data on disk in regular files and folders of the filesystem, maintaining a separate folder per table and storing the data in the SSTable format, which consists of multiple files per table. Depending on the compaction strategy, some SSTable files can be relatively large.

Why are writes so fast?

Writing data to Cassandra is very fast because all writes are append-only: each write request results in one sequential disk write (the commit log) plus one in-memory write (the Memtable), both of which are extremely fast.

Consistency Level

The Cassandra consistency level is defined as the minimum number of Cassandra nodes that must acknowledge a read or write operation before the operation can be considered successful.

Consistency levels are passed to the Cassandra cluster by the client for a particular session; for example, the CONSISTENCY command sets the consistency level for all queries in the current cqlsh session.

Client applications can set the consistency level using an appropriate driver. For example, using the Java driver, call QueryBuilder.insertInto with setConsistencyLevel to set a QUORUM consistency level.
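
A minimal sketch of that call using the DataStax Java driver 3.x API, assuming a keyspace demo_ks and a table users that are purely illustrative (in cqlsh the equivalent session-wide setting would be CONSISTENCY QUORUM):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;

public class QuorumWriteExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo_ks");    // hypothetical keyspace

        // Build the INSERT and attach a per-statement consistency level.
        Insert insert = QueryBuilder.insertInto("users") // hypothetical table
                .value("user_id", 42)
                .value("name", "alice");
        insert.setConsistencyLevel(ConsistencyLevel.QUORUM);

        session.execute(insert);    // the coordinator waits for a quorum of acknowledgments
        cluster.close();
    }
}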

1) ANY : a write must succeed on at least one node; even a hint stored on the coordinator is enough if all replicas for the row are down
2) ONE : a write must succeed on at least one replica node responsible for that row
3) QUORUM : a write must succeed on a quorum of replica nodes (replication_factor / 2 + 1; see the example after this list)
4) LOCAL_QUORUM : a write must succeed on a quorum of replica nodes in the same data center as the coordinator node
5) EACH_QUORUM : a write must succeed on a quorum of replica nodes in each data center
6) ALL : a write must succeed on all replica nodes for a row key
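
For example, with a replication factor of 3, quorum = 3 / 2 + 1 = 2 (integer division), so a QUORUM write needs acknowledgments from two of the three replicas; with a replication factor of 5, quorum = 5 / 2 + 1 = 3.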

Hinted Handoff

Let's assume the REPLICATION_FACTOR of a column family (table) is 3 and the WRITE_CONSISTENCY_LEVEL for a request is TWO, so there are three replicas for any piece of data. If only two of the three replicas are available, the coordinator still behaves normally and returns a success acknowledgment to the client: with a consistency level of 2, the condition is satisfied even with one replica down.

When a replica is down, the coordinator performs one extra step: it writes a hint (the write request blob along with some metadata) locally to disk (the hints directory) and keeps it there for 3 hours (by default), waiting for the replica node to become available again.

If the replica node stays offline for more than 3 hours, the hint is no longer replayed and a repair (for example, an anti-entropy repair run with nodetool repair) is needed to bring the replica back in sync. This process of storing and replaying hints is referred to as Hinted Handoff.