Wednesday, April 16, 2014

File System Reliability: NTFS, journaling, CHKDSK etc.

Reliability of file systems and disks can be a very confusing topic with a lot of nuances. I'd like to address at least some of these topics here and hopefully make them a little easier to understand.

Journaling File System

In the old days, file systems were very vulnerable to abrupt system shutdowns, whether caused by a power failure, a hardware failure, or a software bug that crashed the system. If such a failure happened while the file system was in the middle of updating its metadata, things could get really ugly. You could end up with disappearing files/directories, orphaned files/directories (files that belong to no directory), files/directories that appear in more than one directory, and many similar problems. You could even lose an entire file system or a big chunk of it.

Modern file systems like NTFS addressed this by introducing journaling, which makes updates to file system metadata transactional. By implementing transactional logging, a file system can guarantee that metadata updates are atomic, meaning they either happen in their entirety or do not happen at all. This made file systems far more resilient than their predecessors.
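To make the all-or-nothing idea concrete, here is a minimal sketch of a metadata journal in Python. It is a toy model, not NTFS's actual log format: the `ToyJournal` class, its record layout, and the file and directory names are all made up for illustration. It also assumes the journal itself survives the crash, and for simplicity it rebuilds the metadata from the journal instead of patching on-disk structures.

```python
class ToyJournal:
    """Toy model of a metadata journal (hypothetical; not NTFS's real format)."""

    def __init__(self):
        self.records = []   # append-only journal, assumed to survive a crash
        self.metadata = {}  # file name -> directory; rebuilt by recover()

    def write(self, txid, ops, commit=True):
        """Log each intended operation, then (optionally) a commit record."""
        for op in ops:
            self.records.append(("op", txid, op))
        if commit:
            self.records.append(("commit", txid, None))

    def recover(self):
        """Replay the journal, applying only operations of committed transactions."""
        committed = {txid for kind, txid, _ in self.records if kind == "commit"}
        self.metadata = {}
        for kind, txid, op in self.records:
            if kind == "op" and txid in committed:
                action, name, directory = op
                if action == "add":
                    self.metadata[name] = directory
                elif action == "remove":
                    self.metadata.pop(name, None)
        return self.metadata


def demo():
    journal = ToyJournal()
    # Transaction 1: create a file. Fully logged and committed.
    journal.write(1, [("add", "report.docx", "/docs")])
    # Transaction 2: move the file (remove + add), but "crash" before the commit.
    journal.write(2, [("remove", "report.docx", "/docs"),
                      ("add", "report.docx", "/archive")], commit=False)
    # Recovery ignores the uncommitted move, so the file is neither lost
    # nor listed in two directories.
    return journal.recover()
```

Calling `demo()` shows the file still sitting in its original directory: the half-finished move leaves no trace, which is exactly the kind of orphaned or duplicated entry that journaling is meant to prevent.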

Journaling vs. File Content Corruption

In a journaling file system, the file system's metadata is updated atomically. File system metadata and file content are different things. The list of files in a directory is part of the metadata and is updated via journaling. The content of your Word file is user data, and it doesn't get the same guarantees. So the content of your Word file can be left in an undefined state if the system crashes, even with journaling.

Having Transactional Guarantees for File Content

For that guarantee you either need to use a database system that supports transactions, or use the TxF API (Transactional NTFS) on Windows to modify file contents within a transaction. TxF never gained much popularity in the developer community, though. I guess people just used databases when they needed transactions. Microsoft's latest server file system, ReFS, does not support TxF at all.
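As a rough illustration of the database route, the sketch below uses SQLite through Python's standard `sqlite3` module. The `docs` table, the `update_both` helper, and the simulated failure are assumptions made up for this example. It uses an in-memory database, so it demonstrates rollback on an error and not recovery after a real power loss.

```python
import sqlite3


def update_both(conn, new_a, new_b, fail_midway=False):
    """Update two documents inside one transaction; roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE docs SET body = ? WHERE name = 'a'", (new_a,))
            if fail_midway:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute("UPDATE docs SET body = ? WHERE name = 'b'", (new_b,))
    except RuntimeError:
        pass  # the context manager has already rolled the transaction back


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (name TEXT PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO docs VALUES (?, ?)",
                     [("a", "old A"), ("b", "old B")])
    conn.commit()

    # A failure halfway through leaves both rows untouched.
    update_both(conn, "new A", "new B", fail_midway=True)
    after_failure = dict(conn.execute("SELECT name, body FROM docs"))

    # Without a failure, both rows change together.
    update_both(conn, "new A", "new B")
    after_success = dict(conn.execute("SELECT name, body FROM docs"))

    conn.close()
    return after_failure, after_success
```

The point of the sketch is that the two updates are never observed half-applied: either both documents keep their old content or both get the new content.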

CHKDSK (Check Disk)

Chkdsk is a tool on the Windows platform that verifies the health of a file system and makes repairs if necessary. Chkdsk works with both FAT and NTFS file systems. The fact that NTFS is a journaling file system confuses a lot of people regarding chkdsk. After all, why would you need such a tool if the file system protects itself from metadata inconsistencies via journaling?

Well, it's a little more complicated than that. First of all, chkdsk addresses different problems for FAT and NTFS. FAT can end up with inconsistent metadata whenever there is an abrupt shutdown. NTFS is resilient to that, but bad sectors complicate life here. Because of bad sectors, even NTFS sometimes needs chkdsk to make things right.

Software and hardware both work on preventing data loss caused by bad sectors, as explained here. However, there can still be unrecoverable bad sector errors. If this happens to a sector that contained file system metadata, the file system needs to be repaired by bringing it back to a consistent state. This might still result in metadata or data loss, though.

I have seen people on the internet claim that CHKDSK is responsible for processing journal records to bring the file system to a consistent state, and that this is its main job on NTFS volumes. This is incorrect. Scanning journal records during volume mount is routine NTFS activity and has nothing to do with CHKDSK.

  
