Thursday, June 26, 2014

What Constitutes an OS Kernel?


The question in the title might seem very trivial to most. But you'd be surprised at the extent to which the kernel is a misunderstood concept. I have heard the following sentences several times in my life, and to a person familiar with OS kernels they are really absurd.

  • They changed the kernel in this version
  • They used the same kernel in this version
  • Until Windows 8 they have been using the same kernel since Windows 98
    • This is wrong on so many levels that I have to specifically address it later in this post.
I'll stick to Windows NT as an example here, but mainstream OSes like Linux and Mac OS X are similar in this respect.

Kernel Mode

Before diving deep into the discussion we should define the 'kernel mode' concept. Processors have different privilege levels. At the most privileged level anything is possible; at lesser privilege levels certain actions are restricted. Although most modern processors support several (e.g. 4) privilege levels (rings), mainstream operating systems like Windows and Linux only make use of 2 of them for practical reasons. These two privilege levels are called user mode and kernel mode.

User mode is what all user programs run in. In this mode the processor can't execute certain instructions, can't access certain registers, etc. This is to keep user mode programs from damaging and destabilizing the entire system. Kernel mode is the opposite: everything is allowed.

Potential Definitions of the 'Kernel'

Most user mode programmers think of the kernel as an atomic, indivisible piece of software that magically does everything user mode programs need, from file system management to hardware bus communication. Funnily enough, if you stick to the stricter definitions of a kernel, these two jobs fall outside the kernel boundary.

First of all you should understand that there isn't even a single definition of the kernel. When people say kernel they might be referring to many significantly different things. Please take a moment to look at the architecture diagram below for Windows NT. As you can see the boxes below the 'kernel' line are numerous and complicated in their interactions.

Now let's start with potential definitions of the kernel from most narrow to most inclusive.

Core Kernel

'Core kernel' would refer to the piece of software that runs in kernel mode, is written by the OS vendor (e.g. Microsoft, Apple), and contains only the components that are minimal in the services they provide. These services are mainly:
  • Synchronization primitives
  • Thread dispatcher, or simply the dispatcher
  • Interrupt dispatcher/handler
This definition of the kernel corresponds, somewhat incorrectly, to the 'microkernel' in the above diagram. Kernel components that seem very core and low level, such as the virtual memory manager, are left out of this definition. This may seem really weird, but from an architectural standpoint it is correct, as this core part of the kernel doesn't depend on the rest of the kernel for what it does.

Core Kernel + Executive Components

You can view this definition of the kernel as the 'static kernel'. By static I'm referring to the fact that the components in this definition aren't dynamically loaded; they are always present and they can't be replaced. For example, the kernel wouldn't dynamically load or unload a virtual memory manager. Same for the I/O manager. These 'static' parts of the kernel are also written by the OS vendor. Some of these static kernel components are:
  • Virtual memory manager
  • I/O manager
  • Security subsystem
  • Cache manager
Note that crucial parts of the system, such as file systems or the TCP driver (yes, it's a driver in Windows), aren't included in this definition as they are dynamic and can be loaded/unloaded.

Core Kernel + Executive + Drivers Shipped by the Vendor

This definition of the kernel extends the previous one by including the drivers that are developed by the vendor. I'm not saying the drivers that come out of the box, because some drivers that come out of the box are still 3rd party software. These drivers are considered crucial parts of the OS, yet they're architecturally dynamic and different from the rest of the kernel. I think a great example of these drivers is the NTFS driver. You can't imagine a Windows kernel without it, yet technically it's an outer part of the kernel and it's dynamic by nature.

Anything That Runs in Kernel Mode Including 3rd Party Drivers

This is also a very legitimate definition of the kernel. Running in kernel mode gives unlimited access to the processor and the hardware, and it presents similar challenges to both vendor shipped code and 3rd party code. These are challenges and concepts that simply don't exist in user mode. A few examples are:
  • More complicated synchronization primitives (like fast mutexes and spinlocks; you might think you used spinlocks in user mode but they're substantially different, as the sketch after this list shows)
  • Responding to interrupts
  • Dealing with different processor interrupt request levels
  • Being able to use non-paged memory, as opposed to all memory being paged. This is not just a matter of ability but is sometimes a requirement
Most importantly, any crash in a kernel mode driver is as fatal as a crash in the core kernel.
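To make the spinlock point concrete, here's a minimal sketch of how a kernel mode driver uses one through the WDK API. This isn't from any real driver; the shared list and the InsertItem routine are made up for illustration. Note that acquiring the lock raises the processor's IRQL, so code holding it must not touch pageable memory.

#include <ntddk.h>

KSPIN_LOCK g_Lock;
LIST_ENTRY g_List;

VOID InitShared(VOID)
{
    KeInitializeSpinLock(&g_Lock);
    InitializeListHead(&g_List);
}

VOID InsertItem(PLIST_ENTRY Item)
{
    KIRQL oldIrql;
    // Raises IRQL to DISPATCH_LEVEL and spins until the lock is free
    KeAcquireSpinLock(&g_Lock, &oldIrql);
    InsertTailList(&g_List, Item);
    // Releases the lock and restores the previous IRQL
    KeReleaseSpinLock(&g_Lock, oldIrql);
}

Contrast this with a user mode spinlock, which is just an atomic flag spun on in a loop: it never touches IRQLs, and a thread holding it can be preempted or paged out at any moment.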


So going back to where we started: when you hear someone talking about the kernel changing or not changing, it should now be clear why that discussion doesn't make much sense. First of all, what do you mean by kernel?

With every OS release there are definitely some changes to what can be defined as the kernel, at every one of these levels.

  • OS vendors ship new drivers all the time (like the ReFS file system driver that shipped with Windows Server 2012)
  • They extend existing drivers all the time (e.g. the TCP driver adding support for IPv6)
  • There are new APIs added to executive level components such as the I/O manager and the object manager
  • Even core kernels change with every release, sometimes in big ways, sometimes in small ways. For example Microsoft made the dispatcher lock more granular in Windows 7, which resulted in better multiprocessor support with less lock contention
My impression is that when people see their existing drivers don't work in a new release of the OS, they think the kernel has changed. Most of the time this is simply because of a breaking change in a kernel mode API, most likely in the I/O manager, as kernel drivers interact with it extensively. This tells very little about the overall changes made in the kernel mode components.

Appendix - Windows 98 to Windows 8

This one is really funny. Windows 98 and Windows 8 don't even belong to the same family of OSes. 

Until Windows XP, Microsoft built two different families of OSes: Windows and Windows NT. Windows was a descendant of DOS and it was an inferior OS in many ways. Windows NT, on the other hand, was an OS designed under David Cutler's lead at Microsoft, and it had a much more modern and robust architecture compared to Windows. These two lines coexisted until 2000: following Windows 95 and Windows 98, the Windows line had its final release in Windows ME. The Windows NT line, which continued through Windows NT 4.0, Windows 2000, and Windows XP, is still alive, and its latest release has been Windows 8.1 in 2013.

So saying Windows has had the same kernel from Windows 98 until Windows 8 doesn't make any sense.

Saturday, May 31, 2014

Volatile, Memory Barriers, and Load/Store Reordering


Volatile is a memory access keyword in several languages that is much more complicated and confusing in practice than it appears to be. There are pieces of information about volatile and related concepts everywhere, but it's hard to build a complete understanding. Let's first understand the underlying concepts.

Some Memory Concepts


Load/Store

This is the more formal terminology for memory reads and writes. A load means bringing something from main memory (RAM) into processor registers, and a store is the same thing in the other direction.

Load/Store Omission by the Compiler

Not every memory read/write in a program's source code has to result in a processor instruction that does an actual load/store. Compilers can decide to cache values instead of actually loading them from memory when they think it doesn't change the results of the program. This can result in bugs in multithreaded programs and in programs working with 'unusual memory'.

'Unusual Memory'

Unusual memory refers to a memory location whose value can change without the compiler having any way to know it. Hardware mapped memory is such a case. If a sensor device is updating a memory location on its own, the compiler can't know about this and may incorrectly omit loads from that memory location, assuming the value will never change.
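Here's a minimal C/C++ sketch of that sensor scenario; the register address is completely made up. Without volatile, the compiler would be free to load the status once and spin on the cached value forever.

#include <stdint.h>

// Hypothetical memory mapped status register updated by a sensor device
volatile uint32_t *sensor_status = (volatile uint32_t *)0x40001000;

void wait_for_sensor(void)
{
    // volatile forces a fresh load from the memory location on every iteration
    while (*sensor_status == 0) {
        // spin until the device writes a non-zero status
    }
}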

Load/Store Reordering by the Compiler

Compilers can reorder loads and stores in a program for optimization reasons, if the reordering won't change the program's results. This logic can break down in multithreaded programs and cause buggy behavior.

Out of Order Execution and Cache Coherency

Similar to compilers reordering loads/stores, CPUs can do the same via 'out of order execution'. This means that a CPU might execute instructions in a slightly different order than they appear in memory, which includes instructions getting reordered and the interleaved execution of two or more instructions. Interleaved execution means instructions aren't strictly ordered in their execution, i.e. one instruction might begin executing before an already-in-progress instruction has finished.

Also, when a CPU stores a value to a memory location, not all CPUs may see the stores in the same order because of cache coherency issues, i.e. each processor having its own cache of main memory.

These issues create buggy behavior in multithreaded programs. To prevent this, programmers can use memory barriers, which are special instructions telling the CPU not to do out of order execution across the barrier instruction. Memory barriers also ensure cache consistency, which means we won't get hurt by cache coherency problems.

Atomic Load/Store

Loading or storing a primitive value from main memory may or may not be atomic. Don't confuse this with 'read, modify, write' atomicity, which can only be achieved by proper locking or by special 'interlocked' instructions. On certain hardware platforms loading an integer may not be atomic, i.e. some bytes of the integer may be modified while we're reading it, which means we can end up with a value that the integer never held at any point in time. Pointer-aligned, word-sized (32-bit) load/store operations on the x86 architecture are atomic.
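To illustrate, here's a hedged sketch of how a torn (non-atomic) load can happen with a 64-bit variable on a 32-bit CPU; the values are chosen to make the tearing visible.

#include <stdint.h>

uint64_t shared = 0x00000000FFFFFFFFull; // initial value

void writer(void)
{
    // On a 32-bit CPU this is typically two separate 32-bit stores
    shared = 0x0000000100000000ull;
}

void reader(void)
{
    // A concurrent reader can load between the two stores and observe
    // 0x00000001FFFFFFFF, a value the variable never logically held
    uint64_t seen = shared;
    (void)seen;
}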

Volatile in C/C++ vs. Java and C#

Volatile has different meanings in different programming languages. Let's see these differences.


Volatile in C/C++

Volatile provides:
  • When a volatile variable is loaded/stored, it will actually be loaded/stored; the load/store won't be cached or omitted. Code reading a volatile variable will always see the latest value.
  • Loads/stores of volatile variables won't be reordered with respect to each other, i.e. they'll appear in the same order as in the source code.
Volatile doesn't provide:
  • Atomic load/store of the volatile variable. For example, if the volatile variable is bigger than the word size, or it's not pointer aligned on x86 hardware, load/store operations on the variable may not be atomic. Again, this is different from read, modify, write atomicity, which is an altogether different problem.
  • A guarantee of not reordering non-volatile load/store operations around the volatile loads/stores. The compiler can place a non-volatile store after a volatile store even if the non-volatile store happens before it in the source code. This is the 'happens before' relationship, and the volatile keyword doesn't provide it in C/C++.
  • Memory barriers around volatile loads/stores. The CPU can still reorder instructions around loads/stores of volatile variables, and there can still be CPU cache inconsistencies.
Because of these issues volatile is only recommended for memory mapped I/O (i.e. unusual memory) in C/C++. If our only concern is seeing the latest value set by a hardware device or by some other process we're sharing memory with, volatile does the job.


Volatile in Java and C#

Volatile provides:
  • A guarantee that load/store operations won't be cached/omitted. Code reading a volatile variable will always see the latest value.
  • A guarantee that load/store operations of volatile variables won't be reordered, including the non-volatile loads/stores around them. This guarantee only came to Java starting with Java 5. We'll see why this is important in the case study.
  • Atomic load/store of the volatile variable, i.e. loads/stores of a volatile variable won't be interleaved.
  • Memory barriers. The CPU won't do out of order execution of volatile loads/stores (or the other loads/stores around them), and they won't suffer from cache inconsistencies.
Volatile doesn't provide:
  • A solution to all your multithreading problems. There are countless scenarios where you'd need proper locking via mutexes etc., or interlocked APIs that operate on data atomically; see the sketch below for one such case.
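As a quick illustration of why volatile alone doesn't cover read, modify, write, here's a hedged C++11 sketch comparing a volatile counter with an interlocked-style std::atomic counter (the same point holds for volatile in Java and C#):

#include <atomic>
#include <thread>
#include <cstdio>

volatile int v_counter = 0;    // ++ compiles to load, add, store: updates can be lost
std::atomic<int> a_counter{0}; // fetch_add is a single atomic read-modify-write

int main()
{
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            v_counter = v_counter + 1; // racy: both threads can read the same old value
            a_counter.fetch_add(1);    // interlocked: never loses an update
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    // v_counter typically prints less than 200000; a_counter always prints 200000
    printf("volatile: %d, atomic: %d\n", v_counter, a_counter.load());
    return 0;
}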

A Case Study

Let's look at a small piece of pseudocode which will provide us with the basis for discussion.

MyType my_global;
volatile bool f_my_global_ready = false;

void thread1() {
    my_global = setup_my_object();
    f_my_global_ready = true;
}

void thread2() {
    while (!f_my_global_ready)
        ;  // busy wait until the flag is set
    use_my_object(my_global);
}

The code is fairly simple. One thread initializes a global object and sets the flag indicating that it's ready; the other busy waits (polls) until the flag is set and then uses the object prepared by the first thread.

I should mention that while loads/stores of f_my_global_ready are volatile load/store operations, the initialization and consumption of my_global consist of non-volatile loads/stores. This will be important in terms of reordering guarantees.

Now let's think about how using the volatile keyword in different languages would work out for the above piece of code.

Analysis of the Case Study


C/C++

The code specified in the case study won't work as expected in C/C++, and there are several reasons.

If the compiler optimizes away the read of f_my_global_ready in thread2, we may never see the flag getting set and the while loop will keep going forever. To prevent the compiler from doing this optimization, we declare f_my_global_ready as volatile.

However, even if we make the flag volatile there are other issues making the code buggy. First, there's the atomicity of the load. In this specific example we're loading a bool, which is a single byte, so I doubt this would be a problem on any hardware platform. But if we were working with a multi-byte primitive such as a 32 bit integer, we could be vulnerable to race conditions, depending on the hardware platform and on how the integer is aligned in memory. Volatile doesn't fix these issues in C/C++.

Second, the compiler won't optimize or reorder loads/stores of the volatile flag, but it can reorder other non-volatile loads and stores around it. This means that part of the instructions setting up my_global might be placed by the compiler after the setting of the flag. This would result in thread2 breaking out of the while loop and using my_global before my_global was fully initialized. That would be a serious bug.

Third, similar to the compiler reordering, the CPU might do out of order execution, which again would result in thread2 accessing a not fully initialized my_global. Cache coherency issues can also cause the same problem, as other CPUs might see the loads/stores in a different order.


Java and C#

For Java (5 and later) and C#, the code in the case study will work just fine. Thanks to the guarantees given by volatile in these languages, everything works as expected. Let's go over them one more time:

  • After the flag is set, thread2 will see the new value the next time it polls the flag, because the flag is declared volatile and the compiler won't cache its value.
  • Load/store of the flag is atomic (again, not read, modify, write), so we're not concerned with accidentally reading an interleaved value.
  • Initialization of my_global will be done strictly before the f_my_global_ready flag is set, because the compiler doesn't reorder non-volatile load/store operations across volatile load/store operations. The volatile load/store acts as a strict barrier for all other loads/stores.
  • The CPU won't do out of order execution or have cache inconsistencies for the volatile loads/stores, thanks to the compiler-inserted memory barriers around them.

Solution to the C/C++ Case

Volatile doesn't help with the code we've been working on in C/C++, so what other options do we have?

  • If we're using C++11, we can use the std::atomic<T> types, which act much like volatile does in C#/Java; see the sketch after this list.
  • We can use the Interlocked* API on Windows or the similar APIs present on other platforms.
  • Use proper locking techniques like spinlocks, mutexes, semaphores etc.
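Here's a hedged sketch of the first option: the case study rewritten with C++11 std::atomic. setup_my_object and use_my_object are placeholders carried over from the pseudocode. With the default sequentially consistent ordering, the store to the flag publishes the fully initialized my_global, which is essentially the guarantee volatile gives in Java/C#.

#include <atomic>

MyType my_global;
std::atomic<bool> f_my_global_ready{false};

void thread1()
{
    my_global = setup_my_object();
    f_my_global_ready.store(true);  // everything written above is visible before this
}

void thread2()
{
    while (!f_my_global_ready.load())
        ;                           // busy wait until the flag is published
    use_my_object(my_global);       // guaranteed to see a fully initialized object
}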


Friday, May 16, 2014

How Do File Recovery Tools Work?


If you are a software person you must have wondered how file recovery tools are able to find and recover deleted files on disk. Sure, it's easy to guess that the sectors used by the file are potentially untouched and in that case the tool recovers the file, but the problem is that each sector on disk is simply an array of bytes. How does the recovery tool know which file a sector used to belong to when it looks at it? Here's how.

The 'Master' File

Any serious file system has a 'master' file which keeps a list of all files that exist on a volume. This is called the master file table in NTFS and the catalog file in HFS+. This is a special file, and the file system knows how to locate it in a predefined way when it mounts the volume. For example, HFS+ keeps the location of the catalog file in the volume header, which itself is at a fixed location.

You can think of the master file as a giant table which has an entry for each file on the volume. These entries keep certain important information about the file, one piece of which is the list of disk sectors occupied by the file. In fact, when you read from a file this is how the file system knows which disk sectors to read from.

Adding/Removing a File

It's pretty easy to see that the master file can't be a fully compacted table where every entry is occupied by a current file record and a new file simply gets a new entry at the end. This is not impossible, but it's infeasible: keeping such a compact table would be very costly and it'd kill performance.

When a file is deleted from the volume, the file system does as little work as possible on the master file, for performance. Most of the time it simply marks the corresponding entry in the master file as 'deleted', so in the future it knows the file pointed to by that entry was removed.

When a new file is added it has to have a corresponding entry in the master file. To do this the file system can reuse the entry of a previously deleted file, or it can allocate new space on disk for a brand new entry. Obviously the file system has to do a lot of housekeeping to be able to do these operations efficiently.

Finding Deleted Files

Conceptually, finding the deleted files is pretty straightforward. You just need to walk over all entries in the master file and see which ones are marked as deleted. An obvious problem is that if an entry has been reused after the file referred to by it was deleted, it's not possible to recover that file anymore, as we simply don't know about it. So all entries in the master file that are marked as deleted are potentially recoverable files.
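Here's a minimal sketch of that walk. The structures are hypothetical simplifications, not NTFS's or HFS+'s real on-disk formats:

#include <stdint.h>
#include <vector>

struct Extent { uint64_t firstSector; uint64_t sectorCount; };

struct MasterFileEntry {
    bool inUse;                  // false once the file system marks the entry deleted
    char name[256];
    std::vector<Extent> extents; // the sectors the file occupied
};

// Entries still marked deleted are recovery candidates. Reused entries are
// gone for good, along with any knowledge of the files they once described.
std::vector<const MasterFileEntry *>
findRecoveryCandidates(const std::vector<MasterFileEntry> &table)
{
    std::vector<const MasterFileEntry *> candidates;
    for (const MasterFileEntry &e : table)
        if (!e.inUse)
            candidates.push_back(&e);
    return candidates;
}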

Recovering the Data For a File

Now that we have a list of deleted files, the problem is how to find the sectors that made up each deleted file. We'll use the entry in the master file for this. A key point is that when a file system marks a file entry as deleted, that's pretty much all it does; the sector information for that file will still be there. This is great, but we're far from done.

Let's say a file occupied two sectors on disk and we know which sectors they are. Now the question is how do we know whether these sectors are now being used by another file. If that's the case we won't be able to recover the file. To determine this we need to use the 'sector map', which is a data structure on disk used by the file system to keep track of used/unused sectors. If both sectors of the file show up as unused then we may be able to recover the file.
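Continuing the sketch above (same hypothetical MasterFileEntry, with the sector map assumed to be a plain bitmap, one bit per sector, 1 meaning in use):

#include <stdint.h>
#include <vector>

bool sectorInUse(const std::vector<uint8_t> &bitmap, uint64_t sector)
{
    return (bitmap[sector / 8] >> (sector % 8)) & 1u;
}

bool fileLooksRecoverable(const MasterFileEntry &e, const std::vector<uint8_t> &bitmap)
{
    for (const Extent &ext : e.extents)
        for (uint64_t s = ext.firstSector; s < ext.firstSector + ext.sectorCount; ++s)
            if (sectorInUse(bitmap, s))
                return false; // a sector was reallocated: at least part of the data is gone
    return true;              // all sectors still free: recovery may succeed
}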

Complications Around Recovery

We still can't be certain, because if we're unlucky those sectors might have been used by some other file at some point and then got freed. In this case we'd think the sectors still contain data from the deleted file we're looking for, but in fact they'd contain data from another deleted file. There isn't a bulletproof way to know whether the data we recovered is truly intact. One way to extract further information would be to see if the deleted files have any colliding sectors. If there are N collisions for a specific sector, at least N - 1 of those files have lost their data in that sector. There are still no guarantees around this, because the master file entries of some files using that sector might have been overwritten by entries of newer files, which means we have no idea those files ever existed.

Also, I should note that we may only be able to recover certain sectors of a file but not others. If some sectors of the deleted file are in use by other files, it's certain that we lost at least part of the file's data. If all are currently unused we have some hope of recovering the file in its entirety, but we can't be sure for the reasons I explained above. If the file has some internal verification information (like an MD5 hash of the content) we can determine whether the data was really recovered entirely or not.

Wednesday, April 30, 2014

Paged Memory vs. Non Paged Memory


Most computer science students learn during their freshman year that memory is paged in and out between main memory and the hard disk depending on need. They learn that the OS does it for them and that it is the only way. This is a mostly accurate view of the computing universe, but there are exceptions. Since most software engineers will only write user mode programs (as opposed to kernel mode programs) in their entire professional life, this is accurate information for them. But when you look at kernel mode code, things get more complicated.

A Short Introduction to Kernel Mode

A typical piece of computer software is a user mode program. It's a binary file that lives on the hard disk, is loaded into memory by the OS, and is given a virtual memory space of its own and a bunch of crucial computing resources like CPU time. Word processors, web browsers, web servers, and software development tools are all examples of user mode programs.

Then there is the kernel space, which is actually much more heterogeneous than the average software engineer imagines. Core OS components, file systems, device drivers, and functional drivers (drivers that don't manage physical hardware) all run in kernel space. Code running in kernel space is more privileged in terms of what it can do on the processor compared to user mode code. One of these several differences is working with non-paged memory.

Why Does Non-Paged Memory Even Exist - An Intuition

Imagine the memory manager component of the OS. It is the component responsible for paging memory pages in and out. When a memory page is paged out, it is written to a part of something called the 'page file', which is a file on disk, obviously. The memory manager needs to maintain some metadata to be able to locate page data in this file, right? What if this metadata itself were paged out, and the memory manager had to page in that data to be able to do page-ins? You must see where I'm going: it's a contradictory situation.

So even theoretically it's not possible to make all memory pageable. Some memory pages need to be resident at all times.

Reasons for Using Non-Paged Memory


Performance

If a driver or a core OS component is doing something performance sensitive, using pageable memory to read/write data can obviously be problematic. Waiting for a memory page to be brought in by the memory manager can take forever, which would kill a performance sensitive driver/OS component.

Avoiding Never Ending Cycles

Imagine a disk driver that is trying to read a sector from the disk. Obviously it needs some memory to write the data read from the device (the actual architecture is more complicated, with DMA etc., but I'm skipping that for simplicity). What if that memory page is paged out? We need to go back to the disk to bring that page in. In the worst case this could go on forever. If the disk driver instead writes the data to a non-paged memory page, that page is guaranteed to be resident.
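As a hedged WDK sketch (the routine and the pool tag are made up), this is how a driver would allocate such a buffer from non-paged pool so it's guaranteed to be resident during the I/O:

#include <ntddk.h>

#define MY_TAG 'fuBT' // pool tag, shows up in pool tracking tools

NTSTATUS AllocateTransferBuffer(SIZE_T size, PVOID *buffer)
{
    // NonPagedPool memory is never paged out. A PagedPool buffer could be
    // on disk at the exact moment the disk driver needs it, recreating the
    // very I/O cycle the driver is trying to complete.
    *buffer = ExAllocatePoolWithTag(NonPagedPool, size, MY_TAG);
    return (*buffer != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
}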

Code Running At High Interrupt Masking Levels

At times the processor will be running at a high interrupt masking level. This happens either when an interrupt handling routine is called by the processor as a result of an interrupt, or when a piece of kernel code explicitly raises the interrupt level of the processor. Whatever the reason might be, if code running at a high interrupt level accesses a pageable memory page, that's simply a bug, and sooner or later it will crash the entire OS. The reason this is so disastrous is the fact that the memory manager itself takes action as a result of a bunch of interrupts, and if the current interrupt masking level of the processor is high enough, those interrupts will be masked, meaning they won't be served until the code running at the high interrupt level is done. But that code won't be done until it can bring the memory page it needs back into memory. Well, you see the problem: that piece of code is accessing an inaccessible piece of memory, and that's a fatal issue.

So any code that's running at a high enough interrupt level (DISPATCH_LEVEL or above on Windows) needs to access only non-paged memory. Any other behavior is a fatal bug that'll surely surface.

One Exception to the User Mode Case

I previously said that user mode programs don't get to work with non-paged memory, but that's not entirely true; they sometimes do without knowing it. Drivers running in kernel mode can 'pin' some memory pages in a process' address space, which makes those pages non-paged for a temporary duration (until the driver unpins them). In these cases user mode programs are accessing non-paged memory for a while.

Why Even Bother With Paged Memory?

It sounds like using paged memory is nothing but trouble. If all memory were non-paged, none of the complications mentioned above would exist. So why are we insisting on having pageable memory?

Well, obviously if all memory were non-paged then we could never go over the physically available memory of a computer, which is a very big constraint. And if we had mostly non-paged memory and some paged memory, then code working with the pageable memory would incur a huge performance penalty, because the rate of page faults would go through the roof.

So operating systems try to limit the amount of non-paged memory to a minimum to avoid these problems. Non-paged memory is a scarce resource and kernel mode code should use it very wisely.

Tuesday, April 29, 2014

Storing Image Data in Transactional Databases


You might have noticed that a lot of websites serve their images as links to regular files instead of storing them as a field in database rows. It's generally not recommended to store image files directly in the database (e.g. MySQL), but people don't say why. I'll talk about two reasons for not doing so.

Serving Media Through Content Distribution Networks

For websites with users all around the world and high traffic, it makes a lot of sense to store media content on content distribution networks (CDNs). Media content is generally immutable once created, and it consumes a lot of bandwidth to download. By having the content stored as plain files on a CDN, the website makes use of the CDN's bandwidth and global points of presence (POPs). Having several POPs around the globe also makes the roundtrip time to reach the data much smaller, which results in more prompt downloads.

Keeping media files separate from the database fits the CDN strategy better. You could actually still keep the primary copy of the data in your database and put the corresponding file on a CDN, but to me that's a weird design. This also brings us to our next point.

Cost of Having Transaction Properties

Almost all production quality database engines provide the so-called ACID transaction properties. The A (atomicity) and D (durability) of transactions require a design that employs 'transaction logging'. There can be numerous ways to implement transaction logging, but two popular ones are 'write ahead logging' and 'undo/redo logging'. I won't go into the details of these techniques, but in a nutshell they involve writing database mutations to the transaction logs AND to the actual database image itself. Depending on the logging technique used, this means image data will be written to disk at least 2 or 3 times. A single image can occupy as much space as thousands of regular database rows (imagine tables that have just a few integer and string columns in their definitions): a 2 MB image logged and then written to the database pages costs at least 4 MB of disk writes, where a typical row costs a few hundred bytes. So on a busy database server these extra writes are a non-trivial performance penalty. For most large scale websites this is unacceptable.


I've picked image files as a case study because it's a fairly common case, but this applies to all BLOB (binary large object) types. The purpose of this article isn't to say 'thou shalt not store BLOBs in databases'. I'm simply saying: know the implications if you want to do so.

Thursday, April 17, 2014

When and How Files Get Deleted by a File System


We had an interesting discussion with a colleague today. We were discussing what would happen if you rename a file onto an existing file while there are already open file descriptors (FDs, or handles if you're coming from the Windows world) against the existing file. Here's the pseudo code for the scenario:
  1. Open an FD to file a/b/c
  2. Read 5 bytes from the FD
  3. Rename an existing file a/b/d to a/b/c
  4. Using the already open FD, read another 5 bytes.

So what would be the content of the 5 bytes we read in step 4? Would it be the content of the original a/b/c file, or would it be the content of a/b/d, which was later renamed to a/b/c? I thought we would read the content from the file that was originally a/b/d. I was dead wrong. My colleague wrote some experiment code and we were clearly reading data from the old file. This didn't make any sense to me. I was really bothered by two questions:
  • How can we be seeing two different contents for the same file at the same time?
  • Since we can still read from the original file, the sectors occupied by that file are still accessible. But who's accounting for those sectors now?

Understanding Rename System Call

If you look at the man page for the rename system call, you'll see that if the destination file name already exists, it'll be unlinked first and then the source file will be renamed. So this clears up the first question. Obviously we're not seeing two different contents for the same file; the original file is deleted and some other file takes its place.

But it's still weird that we're reading from a deleted file, and god knows what happened to those poor disk sectors it occupied.

Global File List vs. Directories

Most modern file systems keep a global list of files per volume. This is called master file table in NTFS and catalog file in HFS+. A file on a volume is mainly represented by its presence in this table.

When we think of directories we're generally misled by the idea that directories 'own' files. Quite the contrary: directories simply 'link' to files that are accounted for in the master file table. One file in the master file table can be linked to by one or more directories. So in this sense the Unix terminology of link/unlink is an extremely good fit.

What Happens When a File is Deleted?

When a file is to be deleted, the file system marks the corresponding entry in the master file table as deleted. It also marks the sectors previously occupied by the file as free, which means they can be allocated to other files from that point on.

When is a File Deleted?

If you look at the man page of the unlink system call, it very clearly lists the conditions for a file to be deleted:
  • No directory has a link against the file
  • No process has an open file descriptor against the file
Now this explains the behavior we saw in the experiment described in the introduction. We were unlinking the file from the directory, but the open FD against it was keeping it from deletion. So the data we read was coming from a perfectly alive file, and the sectors of the file were still accounted for by its entry in the master file table. Life is good again. When the final FD is closed, the file will really be deleted.
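Here's a minimal POSIX sketch of the same behavior (assuming the file a/b/c exists):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("a/b/c", O_RDONLY);
    if (fd == -1) return 1;

    unlink("a/b/c");            // removes the last directory link; the file lives on
    char buf[5];
    read(fd, buf, sizeof buf);  // still returns the original file's bytes
    close(fd);                  // last FD closed: now the file is actually deleted
    return 0;
}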

Final Missing Piece

Then yet another thing started bothering me. Sure, it's nice that the file stays around until the closing of the last FD. But what happens if the system crashes while the FD is still open? When the system reboots, we'll end up with a zombie file in the master file table, and there's no way to reach it anymore because it doesn't exist in the namespace.

My gut feeling was that when the last link to the file is removed, this information is persisted somewhere, so that even if the system crashes before the file is really deleted, the file system will know about the situation the next time the volume is mounted. Sadly I couldn't find concrete information regarding this for NTFS or HFS+, but one EXT3 related paper clearly mentions it:

"In order to handle these deferred delete/truncate requests in a crash-safe manner, the inodes to be unlinked/truncated are added into the ext3 orphan list. This is an already existing mechanism by which ext3 handles file unlink/truncates that might be interrupted by a crash. A persistent singly-linked list of inode numbers is linked from the superblock and, if this list is not empty at filesystem mount time, the ext3 code will first walk the list and delete/truncate all of the files on it before the mount is completed."

I'm sure other file systems employ a similar mechanism to keep track of such files.


At this point I should thank my colleague, Slobodan Predolac, whom I had the discussion with and who wrote the experiment code and pointed me to related man pages.

Appendix - Experiment Source Code

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>   // for open() and O_RDONLY
#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSString *tmpfname = [NSTemporaryDirectory() stringByAppendingPathComponent:@"MyFile"];
        NSString *str = @"1234567890ABCDEFG";
        [str writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];

        int fd = open([tmpfname UTF8String], O_RDONLY);
        if (-1 == fd) { perror("Oh sh"); exit(1); }

        char buf[64] = {0};
        if (-1 == read(fd, buf, 5)) {
            perror("Oh sh"); exit(1);
        } else {
            NSLog(@"First 5 = '%s'", buf);
        }

        // Writing atomically creates a new file and renames it over the old
        // one, i.e. the rename-onto-an-open-file scenario described above.
        NSString *str1 = @"qwertyqwertyqwerty";
        BOOL succ = [str1 writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        if (!succ) { NSLog(@"Write failed"); exit(1); }

        // Read again through the old FD: we still get the original file's data.
        char buf1[64] = {0};
        if (-1 == read(fd, buf1, 5)) {
            perror("Oh sh"); exit(1);
        } else {
            NSLog(@"Next 5 = '%s'", buf1);
        }
        close(fd);
    }
    return 0;
}

Wednesday, April 16, 2014

File System Reliability: NTFS, journaling, CHKDSK etc.

Reliability of file systems and disks can be a very confusing topic with a lot of nuances. I'd like to address at least some of these topics here and hopefully make them a little easier to understand.

Journaling File System

In the old days, file systems were very vulnerable to abrupt system shutdowns, whether from a power failure, a hardware failure, or a software bug causing the system to crash. When such a failure happened while the file system was in the middle of updating its metadata, things could get really ugly. You could end up with disappearing files/directories, orphaned files/directories (files that belong to no directory), files/directories that appear in more than one directory, and many more problems like these. You could even lose an entire file system or a big chunk of it.

Modern file systems like NTFS addressed this by introducing journaling, which made updates to file system metadata transactional. By implementing transactional logging, file systems could guarantee that metadata updates were atomic, meaning they would either happen in their entirety or not happen at all. This made file systems very resilient compared to their predecessors.

Journaling vs. File Content Corruption

In a journaling file system, the file system's METADATA is updated atomically. File system metadata and file content are different things. The list of files in a directory is part of the metadata and is updated via journaling. The content of your Word file is user data, and it doesn't have the same guarantees. So the content of your Word file can be left in an undefined state if the system crashes, even with journaling.

Having Transactional Guarantees for File Content

For that guarantee you either need to use a database system that supports transactions, or use the TxF (Transactional NTFS) API on Windows to modify file contents within a transaction. TxF didn't gain any popularity in the developer community though; I guess people just used databases when they needed transactions. And Microsoft dropped this support from its latest server file system, ReFS.

CHKDSK (Check Disk)

Chkdsk is a tool on the Windows platform that verifies the health of a file system and makes repairs if necessary. Chkdsk works with both FAT and NTFS file systems. The fact that NTFS is a journaling file system confuses a lot of people regarding chkdsk. After all, why do you need such a tool if the file system protects itself from metadata inconsistencies via journaling, right?

Well, it's a little more complicated than that. First of all, chkdsk addresses different problems for FAT and NTFS. FAT can end up with inconsistent metadata whenever there is an abrupt shutdown. NTFS is resilient to that, but bad sectors complicate life here. So because of bad sectors, even NTFS needs chkdsk to make things right when needed.

Software and hardware both work on preventing data loss caused by bad sectors, as explained below. However, there can still be unrecoverable bad sector errors. If this happens to a sector that used to contain file system metadata, the file system needs to be repaired by bringing it back to a consistent state. This might still result in metadata/data loss though.

I have seen people on the internet mention that CHKDSK is responsible for working with journal records to bring the file system to a consistent state, and that this is its main job on NTFS volumes. This is incorrect. Scanning journal records during volume mount is routine NTFS activity and has nothing to do with CHKDSK.


Bad Sector Problems and How They're Addressed

'Bad sector' is a term frequently heard by tech people, but most don't understand it deeply. Handling of bad sectors involves cooperation between the storage hardware, volume drivers, and file system drivers, so understanding how such problems are handled and mitigated is pretty complicated. Let's address the subtleties of the topic one by one.

Sector Remapping by the Disk Hardware Itself

Most modern disks keep some error correction code (ECC) per sector. So if the disk hardware notices that the data read from a sector doesn't add up to its checksum, it corrects the sector data using the error correction code for that sector. The disk then remaps this sector to a spare sector as soon as possible to prevent permanent data loss. Even disk drives that have this functionality can run out of spare sectors and may not be able to do remapping anymore.

Another case: even if a sector has gone bad beyond error correction, the sector can still be remapped the next time it's written to. This is because the old data was going to be overwritten by the new write anyway, so the unrecoverable data in the sector becomes irrelevant.

Sector Remapping by the Volume Driver Software

SCSI disks provide an interface to explicitly map a bad sector to a spare sector. Imagine a sector has gone bad beyond the point of error correction (or maybe the disk sectors didn't have ECC in the first place). In this case the disk hardware can't remap the sector without data loss. But if there's redundancy in place (like a mirrored RAID system), the volume driver can get the redundant sector data, remap the sector to a spare one on the disk drive, and re-write the remapped sector with the redundant data. This way the sector is saved without involving the file system driver at all.

If the sector data can be saved because there's redundancy, but the bad sector can't be remapped because there are no spare sectors remaining, then there's still a solution, but it requires cooperation from the file system.

Sector Remapping by the File System

First I should say that file systems don't really do sector remapping. All they do is mark a sector as bad so they won't use it in the future. Depending on whether the current operation is a read or a write, and on whether the sector data had some redundancy to it, the flow of events will be different.

  • Reading the sector
    • The sector has gone bad but the volume driver has recovered the data via redundancy (e.g. mirrored RAID)
      • In this case the file system marks the sector as unusable, finds a free sector to replace it, and writes the sector data to this new sector. Note that this is not a physical remapping; the file system is just ignoring this bad sector from now on. It's like a black hole in the sector bitmap.
    • The sector has gone bad and the data is lost
      • Any read of the affected part of the related file will return an I/O error to the caller.
      • If the failed sector was part of the file system's metadata, then a repair action on the file system is necessary, but the file system will in all likelihood still lose some metadata.
  • Writing to the sector
    • The file system will mark the bad sector as unusable in its sector map, use a free sector in its place, and write the new data to that sector.

A Case Study

Recently a colleague of mine had checksum errors in his git repository. As the issue was being discussed, some people suggested that his hard drive was probably failing and he was having bad sectors. As you can see from the notes above, when there's a bad sector, file systems don't simply return garbage data from it. If the sector data can be saved by something in the I/O stack (FS driver, volume driver, disk driver, disk hardware etc.) it will be. Otherwise the caller of the file read operation will get an error code saying the data couldn't be read. Of course this may not be the case for every file system in the world, but any production quality file system has to behave this way.

So any checksum problems like my colleague had would most likely be the result of application level bugs rather than the file system returning gibberish data for bad sectors.


Most of the information here is based on NTFS and Windows' out-of-the-box volume driver, but it should still be mostly applicable to other modern production quality software.

Saturday, April 12, 2014

Directory Scalability of File Systems


Recently I've been hearing from several people that having too many files in a directory is bad for performance and that you need to shard the files across multiple directories instead. If you're familiar with file system internals, you'll notice this claim doesn't sound right at all.

Most mainstream production quality file systems keep a B+ tree (or a variant) index per directory. B+ trees are very scalable data structures that self-balance when the fill rate of a node goes below a certain threshold. And since each node is a large consecutive block containing hundreds of entries, they work very well with sector based persistent storage.

Similarly, most database servers use B+ trees for their indexes, and they handle billions of records per table without trouble.

When you combine these, it should be intuitive to expect file systems to handle numerous files in a directory without too much trouble.

Edit: Turns out HFS+ doesn't have a per-directory B+ tree index after all. It has one global index, called the catalog file, and directory traversal happens through this index. This makes sharding files into several directories for performance reasons pretty much irrelevant.


This information should convince most software engineers, but partly out of curiosity and partly to convince the remaining people, I did a small experiment to see how a file system directory scales with respect to the number of files.

Experiment Environment

For a while now I've been working with Apple development tools, so the experiment setup consists of Apple technologies.

  • Operating System: Mac OS X 10.9.2
  • Hardware: MacBook Pro Retina, 3 GHz Intel Core i7, 8 GB 1600 MHz DDR3, Apple SSD SM256E
  • File System: HFS+
  • Programming Language: Objective-C

Experiment Setup

In the experiment I set up different directories and populated them with 1K, 10K, 100K, and 1M very small files (4 bytes each) respectively. I measured the time to populate the directories, the time to enumerate all files in the directories, and the time to look up 1K files in each directory.

I'm sure someone will be troubled by the small file sizes. I kept the file sizes as small as possible because file size is irrelevant to directory metadata: file data and directory metadata are kept totally separate by the file system. By keeping the file size small, I kept the file data write overhead to a minimum, so most of the time taken to populate the directories is spent updating directory metadata rather than the files themselves.

Experiment Results

All measurements are in seconds, in wall clock time. For the enumeration and lookup timings I rebooted the computer to make sure the file system caches got purged. Indeed, without a cache purge I observed that both lookups and enumerations are much faster.

[Results table: time to create the files, time to lookup 1K files, and time to enumerate all files in the directory, measured for directories of 1K, 10K, 100K, and 1M files.]


As you can see from the results, the HFS+ file system handled up to 1 million files in a directory extremely well. Total file creation and enumeration times increased roughly linearly with the number of files. Looking up 1 thousand files took pretty much the same time in all directories.

The total time to enumerate 1 million files diverges slightly from the trend of the other directories, but this is probably about memory locality: my absurdly simple experiment code reads all file names into a single array, and an array with 1 million elements is big enough to put a strain on that.

I did this experiment on HFS+ but I'm fairly sure all mainstream file systems like NTFS, EXT3, EXT4 etc. would produce similar results.

Appendix A

Why Do People Think File Systems Can't Handle Too Many Files in a Directory

I feel like this is a residue in software engineers' collective memory. In the early 90s file systems were much more primitive than today's. Most didn't even have journaling, so you could lose file system metadata when the system shut down abruptly. But starting from the mid 90s, most major operating system platforms started shipping robust file systems that also scaled much better than their predecessors.

Appendix B

Why Does 'ls -l' Take Much Longer Than Plain 'ls'

ls -l returns a bunch of file attributes along with the file names, whereas ls just returns the file names. File attributes are part of the file's own data, so they don't really concern the directory and aren't kept with the directory metadata. So in effect doing ls -l requires looking up the files one by one and reading their attributes. This is why it's wildly slower than plain ls.
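A minimal sketch of the difference (POSIX APIs, error handling mostly omitted): plain ls only has to walk the directory entries, while ls -l must additionally stat() every single file.

#include <dirent.h>
#include <sys/stat.h>
#include <stdio.h>

int main(void)
{
    DIR *dir = opendir(".");
    if (!dir) return 1;
    for (struct dirent *e; (e = readdir(dir)) != NULL; ) {
        // 'ls': the name comes straight out of the directory metadata
        printf("%s", e->d_name);

        // 'ls -l': one extra per-file lookup for size, mode, timestamps, ...
        struct stat st;
        if (stat(e->d_name, &st) == 0)
            printf("\t%lld bytes", (long long)st.st_size);
        printf("\n");
    }
    closedir(dir);
    return 0;
}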

Appendix C

Should I Rule Out Directory Sharding Forever

For 99.999% of software engineers, yes, at least for performance reasons. If you're building a file server that'll host millions of files for thousands of very active users, then it's a big maybe. Here's why: file systems keep in-memory objects representing the entities on disk, like files and directories; in the Windows world this object is called an FCB (file control block). An FCB is created per file on disk, not per open (handle) of that file. So the FCB of a very large directory can become a hot spot under heavy load: per-FCB resources like locks will be under contention and can potentially hurt performance.

I'm personally skeptical even of this, because per-FCB resources mostly matter at file open time, and that's only one part of the I/O performance picture, the other parts being read, write, flush, sysctl(ioctl) etc. times.

Appendix D

Experiment Code

Sorry for the crudity of the code. For the different directory setups I manually modified parts of the code, so you can't really use it out of the box.

#import <Foundation/Foundation.h>

#include <QuartzCore/QuartzCore.h>

NSString *path = @"your path here";

void lookup1KFiles(void)
{
  int found = 0;
  double start = CACurrentMediaTime();
  for (int i = 0; i < 1000; i++) {
    if ([[NSFileManager defaultManager] fileExistsAtPath:[NSString stringWithFormat:@"%@/%d.txt", path, i]
                                             isDirectory:NULL]) {
      found++;
    }
  }
  double end = CACurrentMediaTime();
  NSLog(@"lookup time %lf", end - start);
  NSLog(@"Lookup found %d", found);
}

void populate(void)
{
  NSData *data = [NSData dataWithBytes:"abcd" length:4];
  double start = CACurrentMediaTime();
  for (int i = 0; i < 1000; i++) { // manually changed to 10K/100K/1M for the other setups
    if (i % 10000 == 0) {
      NSLog(@"%d", i); // progress indicator
    }
    [data writeToFile:[NSString stringWithFormat:@"%@/%d.txt", path, i]
           atomically:NO];
  }
  double end = CACurrentMediaTime();
  NSLog(@"populate time %lf", end - start);
}

void enumerate(void)
{
  NSDirectoryEnumerationOptions options = NSDirectoryEnumerationSkipsSubdirectoryDescendants |
                                          NSDirectoryEnumerationSkipsPackageDescendants;
  NSURL *url = [NSURL fileURLWithPath:path];
  double start = CACurrentMediaTime();
  NSArray *array = [[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:nil];
  [[NSFileManager defaultManager] contentsOfDirectoryAtURL:url
                                includingPropertiesForKeys:[[NSArray alloc] initWithObjects:NSURLIsRegularFileKey, nil]
                                                   options:options
                                                     error:nil];
  double end = CACurrentMediaTime();
  NSLog(@"enumerate time %lf", end - start);
  NSLog(@"count %d", (int)[array count]);
}