Dealing with a corrupt SSTable in Cassandra

Gracefully recover corrupted data


Corruption. It happens. And when it happens to Cassandra’s data files, one form it can take is a corrupt SSTable file. This is exactly what happened to us last week, and I wanted to share the steps we took to fix the corrupted data in a safe way, without losing any data.

Important context

Before we start, there are a few important things to note:

Firstly, we’re running Cassandra 1.2.8, so the output, commands and steps below were performed using that version. If you’re running a different version of Cassandra, particularly <= v1.1 or >= v2.0, then it’s very possible that things will be different for you. In fact, hopefully the problem that caused the corruption in the first place has been fixed in versions after v1.2.8 and you won’t encounter any corruption at all!

Secondly, we’re running Cassandra with a Replication Factor (RF) of 3, which ensures there are at least 3 separate nodes in the cluster with a copy of every piece of data. This is a recommended RF for Cassandra clusters and ensures that if you lose one node, you’ll still have a copy of all your data available from the remaining nodes. This is how we are able to recover the corrupted data gracefully. If your RF is less than 3, or you don’t have data redundancy available in some other way, then you may still lose data in the event of corruption. You may also lose data if more than one of your nodes has corruption in the same data; in that case you’d probably need to restore from a snapshot, which is a very different subject from what we cover in this post. If you’re running Cassandra but you aren’t sure about the implications of the Replication Factor, read up on it.

Thirdly, actual keyspace and column family names have been replaced with keyspace and cf respectively.

With that said, let’s begin.

Detecting corruption

Cassandra regularly performs housekeeping on its data files, taking care of compaction, compression, writing new data to disk, and recording various database activities. If something’s awry with one of these files and it doesn’t work as normal, Cassandra will shout out about it in its log files:

==> /var/log/cassandra/system.log <==
ERROR [CompactionExecutor:7] 2014-03-20 12:00:07,454 (line 192) Exception in thread Thread[CompactionExecutor:7,1,main] (/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db): corruption detected, chunk at 41674041 of length 47596.

While investigating high load on this node, I spotted this scary-looking exception in Cassandra’s main log file. It announces that, in this particular case, Cassandra had trouble reading the keyspace-cf-ic-4698-Data.db file due to a corruption error. This file belongs to an SSTable, which stores the data for a column family. Cassandra isn’t recovering from this problem by itself, so what can we do?
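It's worth sweeping the logs for any other affected files before going further. A quick grep can pull the SSTable paths out of these errors (the log location and message format are assumed from the output above; adjust them for your setup):

```shell
# Extract the paths of SSTables mentioned in corruption errors from the
# system log (message format taken from the error shown above):
grep -i 'corruption detected' /var/log/cassandra/system.log \
  | grep -o '/[^ )]*-Data\.db' \
  | sort -u
```

Each unique path that comes back is an SSTable generation you may need to deal with.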

Take the node offline

At this point we’ve identified a problem with the node, so it’d probably be a good idea to deactivate it from the live cluster, both as a precautionary measure and to give us a bit more leeway for our repair work. Do this ONLY if you have sufficient redundancy measures in place in your cluster (see important context above). It’s also a good idea to check the status of your apps (connection pools, reconnection handlers, etc.) and other nodes in the Cassandra cluster (load, logs, etc.) to make sure they’re able to handle this node going offline.

Gracefully shut down Cassandra on the affected server:

service cassandra stop

Check that Cassandra has shut down cleanly.
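One quick way to double-check is to look for the JVM process (the process name `CassandraDaemon` is what a standard install runs as; adjust the pattern if yours differs):

```shell
# The [C] bracket trick stops grep from matching its own command line:
ps aux | grep '[C]assandraDaemon' > /dev/null \
  && echo "Cassandra is still running" \
  || echo "Cassandra has stopped"
```

Don't touch any data files until this reports the process has gone.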

Scrub the SSTable

Cassandra ships with a tool called sstablescrub. Its description states you should “Use this tool to fix (throw away) corrupted tables” and that before using it you should “try rebuild[ing] the tables using nodetool scrub”. I had tried a nodetool scrub, but that failed with an SSTable corruption error. The offline sstablescrub wasn’t much different, also giving me a table corruption error, but you could try running it to see if it deletes the corrupted files for you:

Note: Be careful which system user you run this command as. It rewrites the SSTables with permissions for that user so you may have to chown afterwards.

sstablescrub keyspace cf

Pre-scrub sstables snapshotted into snapshot pre-scrub-1395327387317
Scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-5273-Data.db')
Scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db')
WARNING: Non-fatal error reading row (stacktrace follows)
WARNING: Row at 85207395 is unreadable; skipping to next
WARNING: Non-fatal error reading row (stacktrace follows)
WARNING: Row at 106721044 is unreadable; skipping to next
Error scrubbing SSTableReader(path='/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db'): (/raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-Data.db): corruption detected, chunk at 52001433 of length 23873.

This command rewrote all of the other valid SSTables to new files, leaving only the corrupted one in its original, untouched state and making it stick out like a sore thumb in an ls -alh of the column family’s directory (all the other SSTable files got new consecutive -ic-xxxx suffixes).
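For example, the leftover generation is easy to spot by filtering the listing on its number (path and generation taken from the log output above):

```shell
# Only the corrupted generation keeps its old ic-4698 number; everything
# else was rewritten under new generation numbers by the scrub:
ls -alh /raid0/cassandra/data/keyspace/cf/ | grep -- '-ic-4698-'
```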

Since the command didn’t delete the corrupted SSTable files, we’ve not got much choice but to clear them up ourselves.

Remove the corrupted SSTable

Grab the prefix of your SSTable files, in this case /raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-, and move the files to a backup folder somewhere, just in case we need them later (we should already have a snapshot created by sstablescrub anyway):

mkdir -p /raid0/backups/corrupt-sstables
mv /raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-* /raid0/backups/corrupt-sstables/
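Before restarting, it’s worth confirming that nothing from the corrupted generation is left behind in the data directory (same paths as above):

```shell
# Count any files still matching the corrupted generation's prefix:
remaining=$(ls /raid0/cassandra/data/keyspace/cf/keyspace-cf-ic-4698-* 2>/dev/null | wc -l)
if [ "$remaining" -eq 0 ]; then
    echo "data directory clean"
else
    echo "files still present: $remaining"
fi
```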


Now that we’ve effectively deleted a portion of the column family’s data on this node by removing the SSTable, we must start Cassandra back up on this server and run a repair on the column family. The repair should check the integrity of the data on the node and recover missing data from the replicas stored on other nodes. The repair process takes a while (depending on the size of your data etc.), so perhaps you should run it in a terminal multiplexer like tmux or screen, in case you need to close your connection to the server while it runs:

service cassandra start
nodetool repair keyspace cf

Wrapping up

Once the repair process completes, verify that all of your logs are clear of corruption exceptions and that things are looking normal.

To clean up, you might want to remove the snapshot created by sstablescrub and delete the backup of the corrupted files:

nodetool clearsnapshot keyspace
rm -r /raid0/backups/corrupt-sstables


  • Gourav Nayyar

    Very nicely written.

  • J.B. Langston

    I’m from DataStax tech support. This is a great article, thanks for writing it! I noticed you ran into a little trouble getting scrub to work so I thought I could shed some light on that.

    Contrary to what the documentation says, scrub actually removes individual corrupted records from an sstable; it doesn’t remove the entire sstable. I’ll ask the documentation team to fix that.

    Based on the exception you got, it looks like the compression metadata got corrupted, which makes the entire sstable unreadable. If the entire table is corrupted, scrub won’t be able to fix it and it will skip over the file without doing anything to it. In this case, as you discovered, the only recourse is to remove the entire file.

    Also worth mentioning–corruption could be an indication of a hardware problem. If you have a lot of corrupted sstables or find repeated corruption on the same node after you’ve already repaired it, you should definitely be suspicious of your hardware. If this happens, it’s advisable to check your kernel messages (syslog/dmesg) for errors, and run a S.M.A.R.T. status check on your drive to make sure everything’s OK. Even if the drive checks out, it’s also possible that bad memory could corrupt the data in memory before it’s written to disk, so that’s worth checking too.

    • Really appreciate your input on this – that’s some excellent proactive support!

      That all makes sense. As for hardware, we’re running the nodes on EC2, and this node was writing to instance-store volumes in dmraid 0 configuration as set up by the DSC AMI at the time.

      I’m not sure how commonly C* experiences corruption on virtualised platforms like EC2, but we’ve since moved the nodes to new r3.xlarge instances writing to gp2 SSD EBS volumes and haven’t seen any corruption issues since.

    • Sivaji Kota

Just wondering if we can follow a similar process for one of the system tables, “sstable_activity”? We noticed it never gets compacted in our 18-node cluster (v2.1.8), although the repair job is successful. Its data files on the file system are ever-growing.

      We see lots of java related errors in system.log for the above table:

      java.lang.AssertionError & java.lang.IllegalStateException.null



  • Shyam Salim Kumar

Nice article, but what should one do if you encounter a corrupted sstable while running a rolling restart? From what I understand, we are not supposed to run a `nodetool repair` while in the partial upgrade state. I just threw out the corrupted sstables (was trying it out on a test cluster) for the various secondary indices (hoping they would be rebuilt once the cluster was stabilized). But I had to halt the upgrade once a cf sstable was corrupted. I was going from 1.2.6 -> 1.2.9 -> 2.0.0 -> 2.0.7 (halted) and I wanted to finally upgrade to 2.1.9.

  • Hema

Very nice article. Could you please throw some light on how to handle corrupted commit log files?
We see errors like: “org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /a/abc.log”