How Cassandra Reads Work (Partitioner, Snitches)

All nodes in the cluster are treated the same: a client can connect to any node, and that node acts as the coordinator between the client application and the cluster. The coordinator examines the client's request and determines which nodes in the cluster should fulfil it, based on how the cluster was configured.

How do Cassandra reads work?

In Cassandra, a client can connect to any node in the cluster to perform reads, without having to know whether that particular node acts as a replica for the requested data or not.

If a client connects to a node that does not hold the data it is trying to read, that node acts as the coordinator and reads the data from the nodes that do hold it, identified by their token ranges.

Reads in Cassandra are somewhat slower than writes because, to satisfy a read, Cassandra must combine results from the active memtable and potentially multiple SSTables.

Cassandra read operations depend directly on two mechanisms: 1) the partitioner, and 2) snitches.

Partitioner

In a Cassandra table each row of data is uniquely identified by a primary key, which may be the same as its partition key, but which may also include other clustering columns.
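
For illustration, consider a hypothetical table (the keyspace, table and column names here are invented for this example) whose primary key combines a partition key with a clustering column:

CREATE TABLE shop.orders (
    customer_id uuid,       -- partition key: determines which nodes store the row
    order_date  timestamp,  -- clustering column: orders rows within a partition
    amount      decimal,
    PRIMARY KEY (customer_id, order_date)
);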

Murmur3Partitioner is the default partitioner for a Cassandra cluster. A partitioner determines which node receives the first replica of a piece of data and how the remaining replicas are distributed across the other nodes in the cluster.

A partitioner uses a hash function to derive a token from the partition key of a row; this token determines which nodes in the cluster receive the replicas of that row.
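
You can see the token the partitioner derives for a partition key with the built-in token() function (using the hypothetical shop.orders table from above):

SELECT token(customer_id), customer_id, order_date
FROM shop.orders
LIMIT 5;

-- With Murmur3Partitioner, each token is a 64-bit signed integer
-- in the range -2^63 to 2^63 - 1.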

Snitches

In Cassandra, a snitch's job is to determine which data centers and racks nodes should read data from and write data to; all snitches are dynamic by default. A snitch also helps Cassandra avoid storing multiple replicas of the same data on one rack.

The job of a snitch is simply to determine relative host proximity: if a node has three hosts it could copy data to, which should it select? Which host should it prefer to read data from? Snitches gather information about the network topology so that nodes can see which hosts are relatively nearer and route requests efficiently.

Snitches also help Cassandra ensure that there is no more than one replica on the same "rack" by grouping machines into "datacenters" and "racks."
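
The datacenter names reported by the snitch are what replication strategies refer to. As a sketch, a keyspace spanning two hypothetical datacenters named DC1 and DC2 could be defined like this:

CREATE KEYSPACE analytics
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'DC1': 3,   -- three replicas spread across racks in DC1
    'DC2': 2    -- two replicas in DC2
};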

1) SimpleSnitch

SimpleSnitch is the default snitch and is suited to development environments. It does not read the cassandra-topology.properties file and is therefore unaware of datacenters or racks (making it unusable for multi-datacenter deployments).

With this snitch, copies of a row are placed on the next available nodes walking clockwise around the ring.
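
SimpleSnitch is usually paired with the SimpleStrategy replication strategy, which likewise ignores datacenters and racks. A minimal development keyspace (the keyspace name is hypothetical) might look like this:

CREATE KEYSPACE dev_ks
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3   -- copies placed on the next nodes clockwise around the ring
};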

2) GossipingPropertyFileSnitch

GossipingPropertyFileSnitch is recommended by DataStax for production use. The rack and datacenter of the local node are defined in cassandra-rackdc.properties and propagated to other nodes via gossip.
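
To check which datacenter and rack a node is advertising, and what it has learned about its peers via gossip, you can query the standard system tables from cqlsh:

SELECT data_center, rack FROM system.local;          -- this node's DC and rack
SELECT peer, data_center, rack FROM system.peers;    -- topology learned via gossip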

Several other snitches are also available, such as Ec2Snitch, Ec2MultiRegionSnitch, GoogleCloudSnitch, RackInferringSnitch, PropertyFileSnitch and CloudstackSnitch.

Read path

The coordinator node uses the partitioner to determine the replicas and checks that there are enough replicas up to satisfy the requested consistency level. The coordinator then sends a read request to the fastest replica and a digest request to the other replicas. The fastest replica is determined by the dynamic snitch.

A digest request is similar to a standard read request, except that the replicas return a digest (hash) of the requested data to the coordinator rather than the complete data set.

The coordinator calculates the digest hash for data returned from the fastest replica and compares it to the digests returned from the other replicas. If the digests are consistent, and the desired consistency level has been met, then the data from the fastest replica can be returned. If the digests are not consistent, then the coordinator must perform a read repair.
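
You can observe this read path from cqlsh by setting a consistency level and turning on tracing (CONSISTENCY and TRACING are cqlsh commands; the table and key are the hypothetical ones used earlier):

CONSISTENCY QUORUM;
TRACING ON;
SELECT * FROM shop.orders
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;
-- The trace shows which replica the coordinator read the full data from,
-- which replicas only returned digests, and whether a read repair ran.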

Optimizing Reads

A full SSTable scan on every read would be too expensive, so Cassandra provides several mechanisms to optimise reads: Bloom filters, the row cache and the partition key cache. All three use RAM but store their data off-heap.

Cassandra includes integrated caching and distributes cache data around the cluster. When a node goes down, the client can read from another cached replica of the data.

The integrated cache alleviates the cold start problem by saving the cache to disk periodically. Cassandra reads contents back into the cache and distributes the data when it restarts. The cluster does not start with a cold cache.

Bloom filters

A Bloom filter is a probabilistic data structure that allows Cassandra to determine one of two possible states: the data definitely does not exist in the given file, or the data probably exists in the given file.

To avoid checking every SSTable data file for the requested partition, Cassandra consults the Bloom filters to discover which SSTables are likely to contain that partition's data. Each SSTable has a Bloom filter associated with it.

A Bloom filter cannot guarantee that the data exists in a given SSTable, but it can be made more accurate by adjusting bloom_filter_fp_chance to a float between 0 and 1. This value is the allowed false-positive probability: the lower the bloom_filter_fp_chance, the more accurate the prediction and the more RAM the Bloom filters require.

bloom_filter_fp_chance is a table-level property, and its value can be changed with an ALTER TABLE statement:


ALTER TABLE keyspace.table WITH bloom_filter_fp_chance = 0.01;
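
The current value for a table can be verified through the system schema (available in Cassandra 3.0 and later; the keyspace and table names below are the hypothetical ones used earlier):

SELECT bloom_filter_fp_chance
FROM system_schema.tables
WHERE keyspace_name = 'shop' AND table_name = 'orders';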

However, because the Bloom filter is a probabilistic function, it can result in false positives. Not all SSTables identified by the Bloom filter will have data. If the Bloom filter does not rule out an SSTable, Cassandra checks the partition key cache.

Partition Key Cache

In Cassandra, the partition index records where a particular partition begins within each SSTable. The partition key cache is a cache of the partition index for a Cassandra table.

Enabling just the key cache still requires disk activity to actually read the requested data rows, but not enabling the key cache results in even more reads from disk.

If a partition key is not found in the key cache, then the partition summary is searched. The partition key cache size is configurable, as is the number of partition keys to store in the key cache.
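
Key caching is controlled per table through the caching option. For example, to cache partition keys for all reads while keeping rows out of the cache (again using the hypothetical shop.orders table):

ALTER TABLE shop.orders
WITH caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'};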

Row Cache

The row cache, if enabled, stores in memory a subset of the partition data held on disk in the SSTables. The subset stored in the row cache uses a configurable amount of memory for a specified period of time.

Desired partition data is read from the row cache whenever possible. The rows stored there are frequently accessed rows, merged and saved to the row cache from the SSTables as they are accessed.

The row cache uses LRU (least-recently-used) eviction to reclaim memory when the cache has filled up.

If a write comes in for the row, the cache for that row is invalidated and is not cached again until the row is read. When the desired partition data is not found in the row cache, then the Bloom filter is checked.

Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. Then, whenever growth in the read load begins to impact your hit rates, you can add capacity to quickly restore optimal levels of caching.

To enable row caching, define row_cache_class_name (org.apache.cassandra.cache.OHCProvider, fully off-heap) and row_cache_size_in_mb in the cassandra.yaml file, and activate row caching for each table with something like this:


USE tb_ks;
ALTER TABLE tb_data
WITH caching = {'keys': 'ALL', 'rows_per_partition': '10'};

Enable the row cache only when reads greatly outnumber writes (a common rule of thumb is at least 95% reads). Consider using the operating system page cache instead of the row cache, because writes to a partition invalidate the whole partition in the cache.

Linux page caching

Cassandra stores data on disks in regular files and folders of the filesystem. Cassandra maintains separate folders per table and stores its data in SSTable format consisting of multiple files per table. Depending on compaction strategy, some of the SSTable files can be relatively large.

When reading its files, Cassandra's Java code uses regular file I/O and benefits from the Linux page cache via Java's memory-mapping feature. This means that after parts of a file have been read once, the content is stored in the OS page cache, so subsequent reads of the same data are much faster.

This also means that when the same data is read again, the second and subsequent reads appear much faster than the original read, which had to access the data on disk.