Secondary Indexes in Cassandra

In Cassandra, stored data can be retrieved by the partition key or by the entire primary key. Cassandra is not designed to retrieve data by columns that are not part of the primary key. A query that filters on such a column is rejected with an error, as the following example shows.

Example: 

CREATE TABLE ratings_by_title (
email TEXT,
title TEXT,
year INT STATIC,
rating INT,
PRIMARY KEY((title),email)
);

SELECT * FROM ratings_by_title WHERE rating = 8;

InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

This is where ALLOW FILTERING comes in. Appending it lets the query run and return the requested rows, but because Cassandra does not know which partitions hold the matching rows, the query has to hit each and every node. For that reason, ALLOW FILTERING is generally a poor choice.

SELECT * FROM ratings_by_title WHERE rating = 8 ALLOW FILTERING;
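If filtering cannot be avoided, one way to limit its cost is to restrict it to a single partition by also supplying the partition key. The sketch below assumes the ratings_by_title table defined above and a Cassandra version that permits filtering on regular columns, as the error message above implies. The title value is made up for illustration.

```cql
-- 'Inception' is a hypothetical title value used only for illustration.
-- The partition key (title) pins the query to one partition, so the
-- filter on rating only examines rows inside that partition.
SELECT email, rating
FROM ratings_by_title
WHERE title = 'Inception' AND rating = 8
ALLOW FILTERING;
```

Because the scan stays inside one partition, the cost depends on the size of that partition rather than on the size of the whole table.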

Creating secondary indexes is common practice in a traditional RDBMS, but it is generally not recommended in Cassandra. When an index is created, Cassandra builds a hidden table for it in a background process. Each node indexes only the data it stores, so a query on the indexed column alone still has to consult every node. For a secondary index query to perform well, the WHERE clause should include both the partition key and the indexed column. With both present, only the node that owns that partition needs to be queried.
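A minimal sketch of what this looks like for the ratings_by_title table defined earlier. The index name ratings_by_title_rating_idx and the title value are assumptions made for illustration, not names from the original example.

```cql
-- Hypothetical index name; the table is the ratings_by_title example.
CREATE INDEX IF NOT EXISTS ratings_by_title_rating_idx
    ON ratings_by_title (rating);

-- Partition key plus indexed column: the query is routed to the
-- replicas that own the 'Inception' partition (hypothetical value).
SELECT email, rating
FROM ratings_by_title
WHERE title = 'Inception' AND rating = 8;

-- Indexed column only: accepted without ALLOW FILTERING, but every
-- node has to check its own local index.
SELECT title, email
FROM ratings_by_title
WHERE rating = 8;
```

The second query is the pattern to prefer. The third works, but it spreads the read across the cluster in much the same way as the filtering query did.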

Using a secondary index well is not straightforward. To reduce some of the difficulties, a secondary index is best used together with the partition key. Cardinality also matters a great deal: a secondary index suits columns with relatively low cardinality. Secondary indexes should not be used on columns with high cardinality, that is, a large number of unique values. At the other extreme, columns with extremely low cardinality, such as a column storing booleans, are not going to be particularly useful either.

Secondary indexes should not be used on tables that are frequently updated, because changes to indexed values leave tombstones behind in the index. By default, Cassandra aborts a read that scans more than 100,000 tombstones (the tombstone_failure_threshold setting), so once that limit is reached a query on the indexed value will fail. Secondary indexes should also be avoided when looking for values contained in a large partition, unless the query is very narrow.

Secondary indexes do not support range queries (WHERE rating > 8); they can only be used for equality queries. Also, because indexes are maintained as hidden tables, they go through a separate compaction process. Compacting SSTables and indexes independently means the location of the data and the index information are completely decoupled. If the data is compacted, a new SSTable is written, and any disk location recorded in the index would no longer be correct. This means the index can’t simply point to a location on disk, because the location of the data can change.
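As a rough illustration of the equality-only restriction, the statements below assume the ratings_by_title table from earlier and a hypothetical index named ratings_by_title_rating_idx on rating. The behaviour described in the comments is what the built-in index type is expected to do; the exact error text may vary by version.

```cql
-- Hypothetical index name; skipped if the index already exists.
CREATE INDEX IF NOT EXISTS ratings_by_title_rating_idx
    ON ratings_by_title (rating);

-- Equality on the indexed column: served by the index.
SELECT title, email, rating
FROM ratings_by_title
WHERE rating = 8;

-- Range on the indexed column: the index cannot serve this, so the
-- query is expected to be rejected unless ALLOW FILTERING is added,
-- which falls back to scanning.
SELECT title, email, rating
FROM ratings_by_title
WHERE rating > 8;
```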