Cassandra Tombstones – DEMYSTIFIED

One of Cassandra's defining characteristics is that it is optimized for writes.

In Cassandra everything is a write, including the logical deletion of data, which results in tombstones – special records that mark data as deleted. Indeed, a lack of understanding of tombstones is often the root cause of the production issues people experience with Cassandra.

While working on production clusters we have seen that tombstones can be tricky: they are not associated only with delete operations, and several other cases can generate them.
In itself a tombstone is nothing to worry about – it is simply how data is deleted in an append-only structure. However, tombstones can affect performance, so you should be aware of when they are generated while designing your data model and queries.

In one recent scenario, we deployed to production a distributed system that uses Cassandra as its persistent storage. Not long after, we noticed many warnings about tombstones in the Cassandra logs:

Read 500 live rows and 4771 tombstone cells for query SELECT * FROM demokeyspacename.demotable WHERE token(id) <= -9199254245259770878 LIMIT 100 (see tombstone_warn_threshold)

Read 0 live rows and 1045 tombstone cells for query SELECT * FROM demokeyspacename.demotable WHERE token(id) > -7664392309359282798 AND token(id) <= -7642153335644958030 LIMIT 100 (see tombstone_warn_threshold)

Also, these thresholds are documented in the cassandra.yaml file –

#####################
# SAFETY THRESHOLDS #
#####################

# When executing a scan, within or across a partition, we need to keep the
# tombstones seen in memory so we can return them to the coordinator, which
# will use them to make sure other replicas also know about the deleted rows.
# With workloads that generate a lot of tombstones, this can cause performance
# problems and even exhaust the server heap.
# (http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets)
# Adjust the thresholds here if you understand the dangers and want to
# scan more tombstones anyway. These thresholds may also be adjusted at
# runtime using the StorageService mbean.

tombstone_warn_threshold: 1000

tombstone_failure_threshold: 100000

So, in short: a large number of tombstones causes latency and heap pressure.

Let us first see how and under what scenarios tombstones are created, and then how to deal with them to make our applications more performant.

1) Tombstones from DELETE

Let's create a dummy table and see how tombstones get created –

CREATE TABLE demokeyspace.demotable (
    id int,
    clust1 text,
    clust2 text,
    val1 text,
    PRIMARY KEY (id, clust1, clust2)
) WITH CLUSTERING ORDER BY (clust1 ASC, clust2 ASC);

Let's now insert some records –

cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (1, 'clust1', 'clust2', 'val1');
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (2, 'clust2', 'clust3', 'val2');
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (3, 'clust3', 'clust4', 'val3');
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (4, 'clust4', 'clust5', 'val4');

cqlsh> SELECT * FROM demokeyspace.demotable ;

 id | clust1 | clust2 | val1
----+--------+--------+------
  1 | clust1 | clust2 | val1
  2 | clust2 | clust3 | val2
  4 | clust4 | clust5 | val4
  3 | clust3 | clust4 | val3

(4 rows)

A partition-level tombstone is created by deleting an entire partition:


cqlsh> DELETE FROM demokeyspace.demotable WHERE id = 1;

A multi-row range tombstone, covering every row under one clustering prefix, is created by the following query:

cqlsh> DELETE FROM demokeyspace.demotable WHERE id = 2 AND clust1 = 'clust2';

You can view the tombstones scanned by a query using the tracing feature –

cqlsh> tracing on
Now Tracing is enabled

cqlsh> SELECT * FROM demokeyspace.demotable ;

 id | clust1 | clust2 | val1
----+--------+--------+------
  4 | clust4 | clust5 | val4
  3 | clust3 | clust4 | val3

(2 rows)

Tracing session: 59dab000-7484-11e9-bde8-21235adfa6d8

activity | timestamp | source | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
Execute CQL3 query | 2019-05-12 12:35:40.352000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing SELECT * FROM demokeyspace.demotable ; [Native-Transport-Requests-1] | 2019-05-12 12:35:40.352000 | 127.0.0.1 | 346 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2019-05-12 12:35:40.352000 | 127.0.0.1 | 699 | 127.0.0.1
Computing ranges to query [Native-Transport-Requests-1] | 2019-05-12 12:35:40.353000 | 127.0.0.1 | 1277 | 127.0.0.1
Submitting range requests on 257 ranges with a concurrency of 1 (0.0 rows per range expected) [Native-Transport-Requests-1] | 2019-05-12 12:35:40.354000 | 127.0.0.1 | 2004 | 127.0.0.1
Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 2019-05-12 12:35:40.355000 | 127.0.0.1 | 3309 | 127.0.0.1
Executing seq scan across 0 sstables for (min(-9223372036854775808), min(-9223372036854775808)] [ReadStage-2] | 2019-05-12 12:35:40.355000 | 127.0.0.1 | 3553 | 127.0.0.1
Read 2 live rows and 2 tombstone cells [ReadStage-2] | 2019-05-12 12:35:40.357000 | 127.0.0.1 | 4755 | 127.0.0.1
Request complete | 2019-05-12 12:35:40.357770 | 127.0.0.1 | 5770 | 127.0.0.1

A single-row tombstone is created by deleting one fully specified row:

cqlsh> DELETE FROM demokeyspace.demotable WHERE id = 3 AND clust1 = 'clust3' AND clust2 = 'clust4';

cqlsh> SELECT * FROM demokeyspace.demotable ;

 id | clust1 | clust2 | val1
----+--------+--------+------
  4 | clust4 | clust5 | val4

(1 rows)

Tracing session: eb60de50-7484-11e9-bde8-21235adfa6d8

activity | timestamp | source | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
Execute CQL3 query | 2019-05-12 12:39:44.501000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing SELECT * FROM demokeyspace.demotable ; [Native-Transport-Requests-1] | 2019-05-12 12:39:44.502000 | 127.0.0.1 | 369 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2019-05-12 12:39:44.502000 | 127.0.0.1 | 1091 | 127.0.0.1
Computing ranges to query [Native-Transport-Requests-1] | 2019-05-12 12:39:44.503000 | 127.0.0.1 | 1864 | 127.0.0.1
Submitting range requests on 257 ranges with a concurrency of 1 (0.0 rows per range expected) [Native-Transport-Requests-1] | 2019-05-12 12:39:44.504000 | 127.0.0.1 | 2500 | 127.0.0.1
Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 2019-05-12 12:39:44.505000 | 127.0.0.1 | 3612 | 127.0.0.1
Executing seq scan across 0 sstables for (min(-9223372036854775808), min(-9223372036854775808)] [ReadStage-2] | 2019-05-12 12:39:44.505000 | 127.0.0.1 | 3736 | 127.0.0.1
Read 1 live rows and 3 tombstone cells [ReadStage-2] | 2019-05-12 12:39:44.506000 | 127.0.0.1 | 4710 | 127.0.0.1
Request complete | 2019-05-12 12:39:44.506910 | 127.0.0.1 | 5910 | 127.0.0.1

2) Updating a value to NULL –

Tombstones are also created if you update a column's value to NULL.

cqlsh> UPDATE demokeyspace.demotable SET val1 = null WHERE id = 4 AND clust1 = 'clust4' AND clust2 = 'clust5';

cqlsh> SELECT * FROM demokeyspace.demotable ;

 id | clust1 | clust2 | val1
----+--------+--------+------
  4 | clust4 | clust5 | null

(1 rows)

Tracing session: 953fa780-7485-11e9-bde8-21235adfa6d8

activity | timestamp | source | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
Execute CQL3 query | 2019-05-12 12:44:29.496000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing SELECT * FROM demokeyspace.demotable ; [Native-Transport-Requests-1] | 2019-05-12 12:44:29.497000 | 127.0.0.1 | 606 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2019-05-12 12:44:29.497000 | 127.0.0.1 | 1193 | 127.0.0.1
Computing ranges to query [Native-Transport-Requests-1] | 2019-05-12 12:44:29.498000 | 127.0.0.1 | 2032 | 127.0.0.1
Submitting range requests on 257 ranges with a concurrency of 1 (0.0 rows per range expected) [Native-Transport-Requests-1] | 2019-05-12 12:44:29.499000 | 127.0.0.1 | 3228 | 127.0.0.1
Submitted 1 concurrent range requests [Native-Transport-Requests-1] | 2019-05-12 12:44:29.501000 | 127.0.0.1 | 5398 | 127.0.0.1
Executing seq scan across 0 sstables for (min(-9223372036854775808), min(-9223372036854775808)] [ReadStage-2] | 2019-05-12 12:44:29.502000 | 127.0.0.1 | 6012 | 127.0.0.1
Read 1 live rows and 4 tombstone cells [ReadStage-2] | 2019-05-12 12:44:29.503000 | 127.0.0.1 | 6766 | 127.0.0.1
Request complete | 2019-05-12 12:44:29.503904 | 127.0.0.1 | 7904 | 127.0.0.1
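These tombstones are avoidable: an INSERT that simply omits a column leaves any existing value for that column untouched and writes no tombstone, whereas binding NULL explicitly does. Below is a minimal sketch against the same demo table (rows 5 and 6 are illustrative). If you use prepared statements, leaving a parameter unset (native protocol v4 and later) likewise avoids the tombstone:

-- No tombstone: val1 is not mentioned at all
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2) VALUES (5, 'clust5', 'clust6');

-- Cell tombstone: val1 is explicitly bound to null
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (6, 'clust6', 'clust7', null);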

3) Expiring Data with TTL

Expiring data by setting a TTL (Time To Live) is an alternative to deleting data explicitly, but once the TTL elapses the expired cells become the same tombstones Cassandra records for deletes, and they require the same level of attention as other types of tombstones.
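As a quick illustration (the 60-second TTL and row id 7 below are arbitrary, illustrative values), a TTL can be attached to an individual write, and the remaining lifetime of a cell can be inspected with the TTL() function; a table-wide default can also be set via the default_time_to_live table option:

-- Insert a row whose cells expire 60 seconds after the write
cqlsh> INSERT INTO demokeyspace.demotable (id, clust1, clust2, val1) VALUES (7, 'clust7', 'clust8', 'val7') USING TTL 60;

-- Check how many seconds remain before val1 expires and becomes a tombstone
cqlsh> SELECT TTL(val1) FROM demokeyspace.demotable WHERE id = 7;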

4) Inserting data into collection-type columns –

Let’s see with an example –

CREATE TABLE demokeyspace.democoll_table (
    key int PRIMARY KEY,
    col_1 list<text>,
    col_2 map<int, text>
);

Insert some data  – 

cqlsh> INSERT INTO demokeyspace.democoll_table (key, col_1, col_2) VALUES (1, ['one', 'two'], {3 : 'three', 4 : 'four'});
cqlsh> INSERT INTO demokeyspace.democoll_table (key, col_1, col_2) VALUES (2, ['one', 'two'], {3 : 'three', 4 : 'four'});
cqlsh> INSERT INTO demokeyspace.democoll_table (key, col_1, col_2) VALUES (3, ['one', 'two'], {3 : 'three', 4 : 'four'});

Cassandra optimizes for writes and does not check whether the collection has changed (or even existed); instead, each insert first deletes the existing collection with a range tombstone and then writes the new one. Be aware of this when choosing collections as column types.
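Where the access pattern allows it, these tombstones can be avoided by adding elements instead of overwriting the whole collection – element-wise additions do not delete the existing collection first. A minimal sketch against the table above:

-- Overwriting the whole list: the old list is deleted first (range tombstone)
cqlsh> UPDATE demokeyspace.democoll_table SET col_1 = ['five'] WHERE key = 1;

-- Appending elements: existing entries stay live, no tombstone is written
cqlsh> UPDATE demokeyspace.democoll_table SET col_1 = col_1 + ['five'] WHERE key = 1;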

To see these collection tombstones we have to look into the SSTables. First, flush the data to disk so that the SSTables are written:
./nodetool flush

Then we can inspect the resulting SSTable with the sstabledump tool:
bin/sstabledump /var/software/apache-cassandra-3.11.2/data/data/demokeyspace/democoll_table-cb575a12748611e9bde821235adfa6d8/mc-1-big-Data.db > ~/Downloads/tabledump.json

Here is the first partition record from the file (tabledump.json); the deletion markers are clearly visible –

[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 95,
        "liveness_info" : { "tstamp" : "2019-05-12T07:24:42.545726Z" },
        "cells" : [
          { "name" : "col_1", "deletion_info" : { "marked_deleted" : "2019-05-12T07:24:42.545725Z", "local_delete_time" : "2019-05-12T07:24:42Z" } },
          { "name" : "col_1", "path" : [ "02a86c70-7487-11e9-bde8-21235adfa6d8" ], "value" : "one" },
          { "name" : "col_1", "path" : [ "02a86c71-7487-11e9-bde8-21235adfa6d8" ], "value" : "two" },
          { "name" : "col_2", "deletion_info" : { "marked_deleted" : "2019-05-12T07:24:42.545725Z", "local_delete_time" : "2019-05-12T07:24:42Z" } },
          { "name" : "col_2", "path" : [ "3" ], "value" : "three" },
          { "name" : "col_2", "path" : [ "4" ], "value" : "four" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 96
    },

5) Using materialized views –

A materialized view is a table that is maintained by Cassandra itself. A major advantage of a view is that we can define a different primary key than the one in the base table: you can re-order the primary key fields of the base table, and you can also add one extra field into the primary key of the view.

This is great, as it allows us to define a different partitioning or clustering, but it also generates more tombstones in the view: whenever the view's key columns change in the base table, the old view row must be shadowed by a tombstone.
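For example, a view over the demo table keyed by val1 (the view name demotable_by_val1 is purely illustrative) could be defined as below. Every update that changes val1 in the base table then forces Cassandra to tombstone the old view row before writing the new one:

cqlsh> CREATE MATERIALIZED VIEW demokeyspace.demotable_by_val1 AS
           SELECT * FROM demokeyspace.demotable
           WHERE val1 IS NOT NULL AND id IS NOT NULL AND clust1 IS NOT NULL AND clust2 IS NOT NULL
           PRIMARY KEY (val1, id, clust1, clust2);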

Possible Solutions And Tunings –

Set an appropriate gc_grace_seconds on tables and perform routine repairs

Routine repairs must be run on clusters where deletions occur (and, as shown above, tombstones may be generated even if you never explicitly delete anything) to avoid, among other things, deleted data coming back to life. You must run repairs more often than the smallest gc_grace_seconds chosen across all your tables, so make sure your cluster can sustain that repair frequency.
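gc_grace_seconds defaults to 864000 (ten days) and is set per table. For example (the four-day value below is purely illustrative – choose one your repair schedule can comfortably beat):

-- Tombstones on this table become eligible for purge at compaction after 4 days
cqlsh> ALTER TABLE demokeyspace.demotable WITH gc_grace_seconds = 345600;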

For more info – cassandra repairs

Changing the tombstone warning/failure threshold

Invisible column range tombstones aside, there are two tombstone threshold settings in cassandra.yaml that help detect large numbers of tombstones affecting performance:
tombstone_warn_threshold (default: 1000): if the number of tombstones scanned by a query exceeds this number, Cassandra logs a warning.
tombstone_failure_threshold (default: 100000): if the number of tombstones scanned by a query exceeds this number, Cassandra aborts the query. This is a mechanism to prevent one or more nodes from running out of memory and crashing.

These values should only be changed upwards if you are really confident about the memory use pattern of your cluster.

Conclusion –

Tombstones are among the most misunderstood features of Cassandra and can cause significant performance problems if they are not investigated, monitored and dealt with in a timely manner. Here at Knoldus, we have acquired expertise in detecting and solving tombstone-related problems on various projects by successfully using the techniques shared above.


Written by 

Piyush Rana is a Senior Software Consultant with more than 6 years of experience. He is familiar with object-oriented programming paradigms and .NET-based technologies. For the past 2 years he has been working on Big Data technologies such as Hadoop, Hive, Pig and HBase. His hobbies include gaming (strategy-based, FPS and role-playing), watching series, and listening to songs.

