Monday, February 13, 2017

Notes on C++ Exception Handling and Stack Unwinding


Introduction

If you take a look at the LLVM documentation, you might be surprised to see that different processor architectures have their own exception handling ABI, and that there's an entire section on how things work on Windows. Why would processors care about C++ exceptions, and why would things work differently on Windows? What does the OS care about what a C++ program does in user space (as opposed to kernel space)?

There are interesting answers to these questions, but first let's take a closer look at C++ exceptions.

Exceptions Demystified

C++ exceptions give the look and feel of magic. Based on throw and catch statements, the control flow seems to jump over functions, doing proper cleanup and knowing when to stop. When you look under the hood it's no surprise that it feels like magic - catching an exception in particular is really complicated. Breaking the process down into more concrete, albeit still high level, steps helps us understand it better, so let's walk through the life of an exception.

Everything starts with the compiler's transformation of the keywords throw and catch. Throw is replaced with a pair of __cxa_allocate_exception and __cxa_throw calls; catch is replaced with a pair of __cxa_begin_catch and __cxa_end_catch calls. These are functions in the C++ runtime, so at least we know that those keywords are just function calls under the hood. Here is the high level flow of events when an exception is thrown during program execution (a small code sketch follows the list):

  1. __cxa_allocate_exception allocates the exception to be thrown.
  2. __cxa_throw starts the chain of events involved in throwing and handling the exception.
    1. This call never returns. Either the exception is caught by some call frame on the stack and execution continues from that point, or no one catches it and std::terminate is called.
  3. _Unwind_RaiseException (which comes with the C++ runtime, as part of the unwind library) walks back the function call frames on the stack one by one.
    1. To be able to walk back the stack, _Unwind_RaiseException uses the CFI (Call Frame Information) found in the .eh_frame section of the object/executable file (e.g. ELF).
  4. As _Unwind_RaiseException walks back the stack, for each frame it calls something called the personality function (__gxx_personality_v0 for GCC) for that frame. The personality function is a language specific function which
    1. Checks whether the current frame will be able to catch the exception.
    2. Does the proper cleanup as the current frame is unwound, like calling destructors for objects going out of scope.
  5. When the personality function finds a frame that can catch the exception, control is transferred to the corresponding landing pad, from where execution continues.
  6. If no call frame on the stack can catch the exception then std::terminate is called.
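
To make these steps concrete, here's a minimal C++ sketch (the function and type names are made up for illustration). Compiling and running it exercises the whole chain: the throw is lowered to __cxa_allocate_exception/__cxa_throw, the unwinder walks from thrower() back to main(), the frame cleanup runs Resource's destructor as thrower()'s frame is unwound, and control resumes at the landing pad for the catch block.

#include <cstdio>
#include <stdexcept>

struct Resource {
    ~Resource() { std::puts("Resource destroyed during unwinding"); }
};

void thrower() {
    Resource r;                          // destroyed by frame cleanup (step 4.2)
    throw std::runtime_error("boom");    // lowered to __cxa_allocate_exception + __cxa_throw
}

int main() {
    try {
        thrower();
    } catch (const std::exception& e) {  // landing pad: execution resumes here (step 5)
        std::printf("caught: %s\n", e.what());
    }
    return 0;
}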

Unwinding the Stack - C++ ABI

In the previous section it might have caught your eye that we need some sort of metadata in the object file (e.g. ELF) to be able to find the previous stack frames. If you have looked at how the x86 function call stack works this probably sounds weird to you: why can't we just follow the EBP register (which keeps the frame pointer) to find the previous stack frames?


Yes, on x86 processors you don't need any metadata to be able to walk back the stack, because the frame pointers (EBP) saved on the stack practically form a linked list that you can traverse. However x86 is an old architecture at this point, and newer architectures like x86-64, Itanium, and ARM don't necessarily save the frame pointer on the stack - they may not even have a dedicated register for the frame pointer. [1] The compiler generates code in a way that it can always refer to local variables via varying offsets from the stack pointer (ESP) [2].

This is where the C++ ABI and eh_frame/CFI (Call Frame Information) come in. The C++ ABI for a specific platform (e.g. [3]) specifies what kind of metadata needs to be generated by the compiler to enable stack unwinding. This metadata is used by
  1. The runtime, for exception handling
  2. Stack back trace calls in the program itself
  3. Debuggers
It's not surprising that this ABI is usually specified per processor architecture - it directly relates to processor registers.

You can stop your compiler from generating this metadata with the -fno-unwind-tables flag, in which case I'd imagine exceptions won't work and you won't be able to see stack traces in a debugger.
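
As a small illustration of item 2 above (back traces from within the program itself), here's a sketch using the backtrace facility found in glibc and on macOS, which on common platforms walks the stack with the help of this same unwind metadata; print_trace and the frame depth are of course made up for the example.

#include <execinfo.h>
#include <cstdio>
#include <cstdlib>

void print_trace() {
    void* frames[32];
    int count = backtrace(frames, 32);               // walks the call stack
    char** symbols = backtrace_symbols(frames, count);
    for (int i = 0; i < count; ++i)
        std::printf("%s\n", symbols[i]);
    std::free(symbols);                              // backtrace_symbols mallocs the array
}

void level2() { print_trace(); }
void level1() { level2(); }

int main() { level1(); return 0; }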

Stack unwinding forms the language independent part of the exception handling. There's also a language specific part, which we'll look at next.

Personality Function and Call Frame Cleanup

So the unwinding process is able to walk back the stack without knowing anything about the programming language or the compiler - thanks to the ABI - which is great. But someone needs to handle two things:
  1. Understanding whether the current call frame (during the unwind process) can catch this exception
  2. Cleaning up the current frame, e.g. calling the destructors for the proper objects on the stack
These are very language specific, in fact compiler specific, actions. This is why the compiler generates something called the personality function, which is simply one function per compiler, __gxx_personality_v0 for GCC. This function is called by the unwinder [4] for each frame on the call stack and does the two things mentioned. To do so, the personality function uses the LSDA (language specific data area), which is metadata generated for each function and placed at the end of the function's code. The LSDA contains information to determine what exceptions the function can catch [5] and, if it doesn't catch the exception, how to clean up this frame (e.g. destructors).

The cleanup metadata in the LSDA is called the exception table; GCC calls it gcc_except_table. The -fno-exceptions compiler flag tells the compiler not to generate this metadata. An exception thrown by such a program would result in a call to std::terminate.

Nothrow

After reading the last section it should be clear why a C++ program terminates when an exception escapes a nothrow (noexcept) function. For such a function the compiler wouldn't generate the exception table, which makes the cleanup of the call frames for that function impossible during unwinding. The runtime has no option but to terminate.
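
As a minimal illustration (using the C++11 noexcept specifier, the modern spelling of a nothrow guarantee): letting the exception escape f() below ends in std::terminate rather than in the catch block.

#include <cstdio>
#include <stdexcept>

void f() noexcept {                 // unwind/cleanup info may be omitted for this frame
    throw std::runtime_error("escaping a noexcept function");
}

int main() {
    try {
        f();
    } catch (...) {                 // never reached: std::terminate is called instead
        std::puts("caught");
    }
    return 0;
}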

One Pass vs. Two Pass Exception Handling

The high level exception handling flow given in the second section is missing one aspect. Some runtimes decide not to do call frame cleanup (e.g. destructors) until they make sure someone will handle the exception. They take a first pass at stack unwinding, trying to locate a frame that will catch the exception. If none is found, the runtime directly calls std::terminate without bothering with frame cleanups. If a frame catches the exception then the runtime does another pass on the stack, this time actually calling the personality function for frame cleanup. Apparently GCC works this way.

Exceptions in Windows

This is interesting: if you have spent your life in the UNIX universe you must be thinking, what the hell does an OS have to do with exceptions? You're right, UNIX systems don't have the notion of exceptions. They know about signals, and exceptions are an entirely application-level concept.

Windows is different - it has the concept of exceptions and it manages them closely. The architects of Windows made the interesting choice to unify hardware exceptions (e.g. page fault) with software exceptions. The OS defines what an exception looks like, how the exception handling frames should be set up, etc., and this is called Structured Exception Handling [6]. This has some cool implications: you can catch hardware exceptions like a page fault or divide by zero in your C++ catch block. You just need to compile with the asynchronous exception model enabled (/EHa in Microsoft's compiler). This is not possible on UNIX systems, where hardware exceptions are delivered as UNIX signals.

On Windows there's a language agnostic system API to raise exceptions - RaiseException. Thanks to the extensive OS support, Microsoft's C compiler also supports exceptions, through the __try, __except, and __finally keywords.
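
For illustration, here's a minimal SEH sketch for Microsoft's compiler (the __try/__except keywords are compiler extensions, not standard C/C++); the filter below catches the access violation raised by the hardware.

#include <windows.h>
#include <stdio.h>

int main(void) {
    __try {
        int* p = NULL;
        *p = 42;                                     // hardware exception: access violation
    } __except (EXCEPTION_EXECUTE_HANDLER) {         // SEH filter: handle the exception here
        printf("caught SEH exception 0x%lx\n", GetExceptionCode());
    }
    return 0;
}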

First Chance / Second Chance Exceptions

If you have debugged a Windows application that throws and catches exceptions you might have noticed something peculiar in the debugger. It'll print messages about 'first chance exceptions' while executing. This is basically an OS provided debug port feature: the OS gives debuggers insight into the exceptions that are happening in the program while the program is running and the debugger hasn't broken in. A first chance exception means an exception has been thrown but the stack unwind hasn't happened yet, so the program code might still catch the exception. A second chance exception means the stack unwind has finished but no call frame has caught the exception, so the program is about to crash [7].

As far as I know this is a pretty unique Windows feature and it is possible because the OS is involved in the exception throwing/handling process.


Notes

[1] Some ARM compilers seem to reserve a register as the frame pointer, but this is more of a configuration/implementation detail for that given compiler - unlike the universal use of the EBP register on x86. So runtimes and tools still need to rely on eh_frame and CFI to be able to work for all compilers on the same ABI.
[2] "In common with the IA-64 runtime architecture, the ARM Exception ABI specifies separate, per-function unwinding tables indexed by program counter."
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0038b/IHI0038B_ehabi.pdf
[3] http://infocenter.arm.com/help/topic/com.arm.doc.ihi0038b/IHI0038B_ehabi.pdf
[4] Unwinding runtime code knows where to find the personality function thanks to the CFI (Call Frame Information)
[5] Obviously this depends on where the program counter (PC) was when that frame made the call to another function
[6] Apparently this was later complemented by Vectored Exception Handling, but the OS is still a key player in exception throwing and handling.
[7] https://support.microsoft.com/en-us/help/105675/first-and-second-chance-exception-handling


Thursday, June 26, 2014

What Constitutes an OS Kernel?


Introduction

The question in the title might seem very trivial to most. But you'd be surprised by the extent to which the kernel is a misunderstood concept. I have heard the following sentences several times in my life, and to a person familiar with OS kernels they are really absurd.

  • They changed the kernel in this version
  • They used the same kernel in this version
  • Until Windows 8 they have been using the same kernel since Windows 98
    • This is wrong on so many levels, so much so that I have to talk about it specifically later in this post.
I'll stick to Windows NT as an example here but mainstream OSes like Linux and Mac OS X are similar in this respect.

Kernel Mode

Before diving deep into the discussion we should define the 'kernel mode' concept. Processors have different privilege levels. At the most privileged level anything is possible. At lesser privilege levels certain actions are restricted. Although most modern processors support several (e.g. 4) privilege levels (rings), mainstream operating systems like Windows and Linux only make use of two of them for practical reasons. These two privilege levels are called user mode and kernel mode.

User mode is what all user programs run in. In this mode the processor can't execute certain instructions, can't access certain registers, etc. This is to keep user mode programs from damaging and destabilizing the entire system. Kernel mode is the opposite: everything is allowed.

Potential Definitions of the 'Kernel'

Most user mode programmers think of the kernel as an atomic, indivisible piece of software that magically does everything user mode programs need, from file system management to hardware bus communication. Funnily enough, if you stick to the stricter definitions of a kernel, these two jobs fall outside the kernel boundary.

First of all you should understand that there isn't even a single definition of the kernel. When people say kernel they might be referring to many significantly different things. Please take a moment to look at the architecture diagram below for Windows NT. As you can see the boxes below the 'kernel' line are numerous and complicated in their interactions.


Now let's start with potential definitions of the kernel from most narrow to most inclusive.

Core Kernel

'Core kernel' refers to the piece of software that runs in kernel mode, is written by the OS vendor (e.g. Microsoft, Apple), and contains only the pieces that are minimal in the services they provide. These services are mainly:
  • Synchronization primitives
  • Thread dispatcher, or simply the dispatcher
  • Interrupt dispatcher/handler
This definition of the kernel corresponds, somewhat inaccurately, to the microkernel in the above diagram. Kernel components that seem very core and low level, such as the virtual memory manager, are left out of this definition. This may seem really weird but from an architectural standpoint it is correct, as this core part of the kernel doesn't depend on the rest of the kernel in what it does.

Core Kernel + Executive Components

You can view this definition of the kernel as the 'static kernel'. By static I'm referring to the fact that the components that are part of this definition aren't dynamically loaded; they are always present and they can't be replaced. For example, the kernel wouldn't dynamically load or unload a virtual memory manager. Same for the I/O manager. These 'static' parts of the kernel are also written by the OS vendor. Some of these static kernel components are:
  • Virtual memory manager
  • I/O manager
  • Security subsystem
  • Cache manager
Note that crucial parts of the system such as file systems or the TCP driver (yes, it's a driver in Windows) aren't included in this definition, as they are dynamic and can be loaded/unloaded.

Core Kernel + Executive + Drivers Shipped by the Vendor

This definition of the kernel extends the previous one by including the drivers that are developed by the vendor. I'm not saying the drivers that come out of the box, because some drivers that come out of the box are still 3rd party software. These drivers are considered crucial parts of the OS, yet they're architecturally dynamic and different from the rest of the kernel. I think a great example of these drivers is the NTFS driver. You can't imagine a Windows kernel without it, yet technically it's an outer part of the kernel and it's dynamic by nature.

Anything That Runs in Kernel Mode Including 3rd Party Drivers

This is also a very legitimate definition of the kernel. Running in the kernel mode gives unlimited access to the processor and the hardware and presents similar challenges to both vendor shipped code and 3rd party code. These are challenges and concepts that simply don't exist in the user mode. A few examples are:
  • More complicated synchronization primitives (like fast mutexes and spinlocks - you might think you used spinlocks in user mode, but they're substantially different)
  • Responding to interrupts
  • Dealing with different processor interrupt request levels
  • Being able to use non-paged memory as opposed to all memory being paged. This is not just a matter of ability but is sometimes a requirement
Most importantly, any crash in a kernel mode driver is as fatal as a crash in the core kernel.

Wrap-up

So going back to where we started, when you hear someone talking about the kernel changing or not changing, it should now be clear why that discussion doesn't make much sense. First of all, what do you mean by kernel?

With every OS release there are definitely some changes to what can be defined as kernel at any level. 

  • OS vendors ship new drivers all the time (like the ReFS file system driver that shipped with Windows Server 2012)
  • They extend existing drivers all the time (e.g. the TCP driver adding support for IPv6)
  • There are new APIs added to executive level components such as the I/O manager and the object manager
  • Even core kernels change with every release, sometimes in big ways, sometimes in small ways. For example Microsoft made the dispatcher lock more granular in Windows 7, which resulted in better multiprocessor support with less lock contention
My impression is that when people see their existing drivers don't work in a new release of the OS, they think the kernel has changed. Most of the time this is simply because of a breaking change in a kernel mode API, most likely in the I/O manager, as kernel drivers interact with it extensively. This tells very little about the overall changes made in the kernel mode components.

Appendix - Windows 98 to Windows 8

This one is really funny. Windows 98 and Windows 8 don't even belong to the same family of OSes. 

Until Windows XP, Microsoft built two different families of OSes: Windows and Windows NT. Windows was a descendant of DOS and it was an inferior OS in many ways. Windows NT, on the other hand, was designed by David Cutler after he joined Microsoft, and it had a much more modern and robust architecture compared to Windows. These two OS lines coexisted until around 2000. Following Windows 95 and Windows 98, the Windows line had its final release in Windows ME. Following the Windows NT 4.0, Windows 2000, Windows XP line, Windows NT is still alive and its latest release has been Windows 8.1 in 2013.

So saying Windows has had the same kernel from Windows 98 to Windows 8 doesn't make any sense.

Saturday, May 31, 2014

Volatile, Memory Barriers, and Load/Store Reordering

Introduction

Volatile is a memory access keyword in several languages which is much more complicated and confusing in practice than it appears to be. There are pieces of information about volatile and related concepts everywhere, but it's hard to build a complete understanding. Let's first understand the underlying concepts.

Some Memory Concepts

Load/Store

This is the more formal terminology for memory read/write. Load means bringing something from main memory (RAM) into processor registers, and store is the same thing in the other direction.

Load/Store Omission by the Compiler

Not every memory read/write in a program's source code has to result in a processor instruction that does an actual load/store. Compilers can decide to cache values instead of actually loading them from memory when they think it doesn't change the results of the program. This can result in bugs in multithreaded programs and in programs working with 'unusual memory'.

'Unusual Memory'

Unusual memory refers to a memory location that can change in value without the compiler having any way to know about it. Hardware mapped memory is such a case. If a sensor device is updating a memory location on its own, the compiler can't know about this and may incorrectly omit loads of that memory location, assuming the value will never change.

Load/Store Reordering by the Compiler

Compilers can reorder loads and stores in a program for optimization reasons, if the reordering won't change the program's results. This logic can break down in multithreaded programs and cause buggy behavior.

Out of Order Execution and Cache Coherency

Similar to compilers reordering loads/stores, CPUs can do the same via 'out of order execution'. This means that a CPU might execute instructions in a slightly different order than they appear in memory, which includes instructions getting reordered and interleaved execution of two or more instructions. Interleaved execution means instructions aren't strictly ordered in their execution, i.e. one instruction might begin its execution before an already-in-progress instruction has finished.

Also, when a CPU stores a value in a memory location, other CPUs may not see stores in the same order because of cache coherency issues, i.e. each processor having its own cache of main memory.

These issues create buggy behavior in multithreaded programs. To prevent this, programmers can use memory barriers, which are special instructions telling the CPU not to do out of order execution across the barrier instruction. Memory barriers also ensure cache consistency, which means we won't get hurt by cache coherency problems.

Atomic Load/Store

Loading or storing a primitive value from main memory may or may not be atomic. Don't confuse this with 'read, modify, write' atomicity, which can only be achieved either by proper locking or by special 'interlocked' instructions. On certain hardware platforms loading an integer may not be atomic, i.e. some bytes of the integer may be modified while we're reading it, which means we'll end up with a value the integer never actually held at any point in time. Aligned, word sized (32 bit) load/store operations on the x86 architecture are atomic.

Volatile in C/C++ vs. Java and C#

Volatile has different meanings in different programming languages. Let's see these differences.

C/C++

Volatile provides:
  • When a volatile variable is loaded/stored, it will actually be loaded/stored and the load/store won't be cached or omitted. Code reading a volatile variable will always see the latest value.
  • Loads/stores of volatile variables won't be reordered relative to each other, i.e. they'll appear in the same order they appear in the source code.
Volatile doesn't provide:
  • Atomic load/store of the volatile variable. For example, if the volatile variable is bigger than the word size or isn't properly aligned on x86 hardware, load/store operations on the variable may not be atomic. Again, this is different from read, modify, write atomicity, which is an altogether different problem.
  • A guarantee that non-volatile load/store operations won't be reordered around the volatile loads/stores. The compiler can place a non-volatile store after a volatile store even if the non-volatile store happens before it in the source code. This is what the 'happens before' relationship is about, and the volatile keyword doesn't provide it in C/C++.
  • Memory barriers around volatile loads/stores. The CPU can still reorder instructions around loads/stores of volatile variables and there can still be CPU cache inconsistencies.
Because of these issues, volatile is only recommended for memory mapped I/O (i.e. unusual memory) in C/C++. If our only concern is seeing the latest value set by a hardware device or some other process we're sharing memory with, volatile does the job.
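
As an illustration of that memory mapped I/O case, here's a hedged C++ sketch; the device addresses and the 'data ready' bit are entirely made up, in real code they'd come from the hardware documentation. Volatile is exactly what keeps the compiler from hoisting the status read out of the loop.

#include <cstdint>

// Hypothetical device register block; the addresses are illustrative only.
volatile std::uint32_t* const kSensorStatus =
    reinterpret_cast<volatile std::uint32_t*>(0x40021000);
volatile std::uint32_t* const kSensorData =
    reinterpret_cast<volatile std::uint32_t*>(0x40021004);

std::uint32_t read_sensor() {
    while ((*kSensorStatus & 0x1) == 0) {   // re-read every iteration, never cached
        // busy wait until the hardware sets the 'data ready' bit
    }
    return *kSensorData;                     // actual load, not optimized away
}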

Java/C#

Volatile provides:
  • A guarantee that load/store operations won't be cached/omitted. Code reading a volatile variable will always see the latest value.
  • A guarantee that load/store operations of volatile variables won't be reordered, including the non-volatile loads/stores around them. This guarantee came to Java only starting with Java 5. We'll see why this is important in the case study.
  • Atomic load/store of the volatile variable, i.e. loads/stores to a volatile variable won't be interleaved.
  • Memory barriers. The CPU won't do out of order execution of volatile loads/stores (and other loads/stores around them) and it won't cause cache inconsistencies.
Volatile doesn't provide:
  • A solution to all your multithreading problems. There are countless scenarios where you'd need proper locking via mutexes etc. or interlocked APIs that operate on data atomically.

A Case Study

Let's look at a small piece of pseudocode which will provide us with the basis for discussion.

MyType my_global;
volatile bool f_my_global_ready = false;

void thread1()
{
    my_global = setup_my_object();
    f_my_global_ready = true;
}

void thread2()
{
    while (!f_my_global_ready);
    use_my_global_object();
}

The code is fairly simple. One thread initializes a global object and sets the flag indicating that it's ready; the other busy waits (polls) until the flag is set and then uses the object prepared by the first thread.

I should mention that while loads/stores of f_my_global_ready are volatile load/store operations, the initialization and consumption of my_global consist of non-volatile loads/stores. This will be important in terms of reordering guarantees.

Now let's think about what using the volatile keyword in different languages means for the above piece of code.

Analysis of the Case Study

C/C++

The code specified in the case study won't work as expected in C/C++, and there are several reasons.

If the compiler optimizes away the read of f_my_global_ready in Thread2, we may never see the flag getting set and the while loop will keep going forever. To prevent the compiler from doing this optimization we can declare f_my_global_ready as volatile.

However, even if we make the flag volatile there are other issues making the code buggy. First there's the atomicity of the load. In this specific example we're loading a bool, which is a single byte, so I doubt this will be a problem on any h/w platform. But if we were working with a multi-byte primitive such as a 32 bit integer, we could be vulnerable to race conditions, depending on the h/w platform and on how the integer is aligned in memory. Volatile doesn't fix these issues in C/C++.

Second, the compiler won't optimize or reorder loads/stores of the volatile flag, but it can reorder other 'non-volatile' loads and stores around it. This means that part of the instructions setting up my_global might get placed by the compiler after the setting of the flag. This would result in Thread2 breaking out of the while loop and using my_global before it was fully initialized. This would be a serious bug.

Third, similar to the compiler reordering, the CPU might do out of order execution, which again would result in Thread2 accessing a not yet fully initialized my_global. Cache coherency issues can also result in the same problem, as other CPUs might see loads/stores in a different order.

Java/C#

For Java (5 and later) and C#, the code in the case study will work just fine. Thanks to the guarantees given by volatile in these languages, everything works as expected. Let's go over these one more time:

  • After the flag is set, Thread2 will see the new value the next time it polls the flag, because the flag is declared as volatile and the compiler won't cache its value.
  • Load/store of the variable is atomic (again, not read, modify, and write) so we're not concerned with accidentally reading an interleaved value.
  • Initialization of my_global will be done strictly before the f_my_global_ready flag is set, because the compiler doesn't reorder non-volatile load/store operations across volatile load/store operations. The volatile load/store acts as a strict barrier for all other loads/stores.
  • The CPU won't do out of order execution or have cache inconsistency for the volatile loads/stores thanks to the compiler inserted memory barrier before the volatile loads/stores.

Solution to the C/C++ Case

Volatile doesn't help with the code we've been working on in C/C++, so what other options do we have?

  • If we're using C++11, we can use the std::atomic<T> types, which act much the same way volatile does in C#/Java (see the sketch below).
  • We can use the Interlocked* APIs on Windows or the similar APIs present on other platforms.
  • Use proper locking techniques like spinlocks, mutexes, semaphores, etc.
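
For reference, here's a minimal C++11 sketch of the first option applied to the case study (MyType and the value it carries are placeholders): the release store on the flag guarantees that the initialization of my_global is visible to the thread that observes the flag becoming true with an acquire load.

#include <atomic>
#include <thread>
#include <cstdio>

struct MyType { int value = 0; };            // placeholder for the real type

MyType my_global;
std::atomic<bool> f_my_global_ready{false};

void thread1() {
    my_global.value = 42;                                        // plain initialization
    f_my_global_ready.store(true, std::memory_order_release);    // publish
}

void thread2() {
    while (!f_my_global_ready.load(std::memory_order_acquire))
        ;                                                        // busy wait (poll)
    std::printf("my_global.value = %d\n", my_global.value);      // guaranteed to see 42
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    return 0;
}

Using the default (sequentially consistent) ordering would also work; acquire/release is just the minimal ordering this scenario needs.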


Friday, May 16, 2014

How Do File Recovery Tools Work?

Introduction

If you are a software person you must have wondered how file recovery tools are able to find and recover deleted files on disk. Sure, it's easy to guess that the sectors used by the file are potentially untouched, and in that case the tool recovers the file, but the problem is that each sector on disk is simply an array of bytes. How does the recovery tool know what file a sector used to belong to when it looks at it? Here's how.

The 'Master' File

Any serious file system has a 'master' file which keeps a list of all files that exist on a volume. This is called the master file table in NTFS and the catalog file in HFS+. This is a special file, and the file system knows how to locate it in a predefined way when it mounts the volume. For example HFS+ keeps the location of the catalog file in the volume header, which itself is at a fixed location.

You can think of the master file as a giant table which has an entry for each file on the volume. These entries keep certain important pieces of information about the file, one of which is the list of disk sectors occupied by the file. In fact when you read from a file this is how the file system knows which disk sectors to read from.

Adding/Removing a File

It's pretty easy to see that the master file can't be a fully packed table where every entry is occupied by a current file record and where adding a new file just means appending a new entry at the end. This is not impossible, but it's infeasible: keeping such a compact table would be very costly and it'd kill performance.

When a file is deleted from the volume, the file system does as little work as possible on the master file in order to keep performance high. Most of the time it simply marks the corresponding entry in the master file as 'deleted', so in the future it knows the file pointed to by that entry was removed.

When a new file is added it has to have a corresponding entry in the master file. To do this the file system can reuse the entry of a previously deleted file, or it can allocate new space on disk for a brand new entry. Obviously the file system has to do a lot of housekeeping to be able to do these operations efficiently.

Finding Deleted Files

Conceptually, finding the deleted files is pretty straightforward. You just need to walk over all entries in the master file and see which ones are marked as deleted. An obvious problem is that if an entry is reused after the file referred to by it is deleted, it's not possible to recover that file anymore, as we simply don't know about it. So all file entries in the master file that are marked as deleted are potentially recoverable files.

Recovering the Data For a File

Now that we have a list of deleted files, the problem is how to find the sectors that were making up each deleted file. We'll use the entry in the master file for this. A key point is that when a file system marks a file entry as deleted, that's pretty much all it does. The sector information for that file will still be there. This is great, but we're far from done.

Let's say a file occupied two sectors on disk and we know which sectors they are. Now the question is how we know whether these sectors are currently being used by another file. If that's the case we won't be able to recover the file. To determine this we need to use the 'sector map', which is a data structure on disk used by the file system to keep track of used/unused sectors. If both sectors of the file show up as unused, then we may be able to recover the file.
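
Putting the pieces together, here's a rough, purely illustrative C++ sketch; MasterFileEntry, the sector map representation, and all field names are invented for the example and don't correspond to any real file system's on-disk format.

#include <cstdint>
#include <string>
#include <vector>

// Invented, simplified view of a master-file entry.
struct MasterFileEntry {
    std::string name;
    bool deleted = false;
    std::vector<std::uint64_t> sectors;   // sectors the file occupied
};

// Returns deleted entries whose sectors are all currently unused, i.e.
// candidates whose data may still be fully intact.
std::vector<MasterFileEntry> find_recoverable(
        const std::vector<MasterFileEntry>& master_file,
        const std::vector<bool>& sector_in_use) {      // the 'sector map'
    std::vector<MasterFileEntry> candidates;
    for (const auto& entry : master_file) {
        if (!entry.deleted)
            continue;                                  // still a live file
        bool all_free = true;
        for (auto s : entry.sectors)
            if (sector_in_use[s]) { all_free = false; break; }
        if (all_free)
            candidates.push_back(entry);               // may be recoverable
    }
    return candidates;
}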

Complications Around Recovery

We still can't be certain, because if we're unlucky those sectors might have been used by some other file at some point and then got freed again. In this case we'd think the sectors still contain data from the deleted file we're looking for, but in fact they would contain data from another deleted file. There isn't a bulletproof way to know whether the data we recovered is truly intact. One way to extract further information would be to see if deleted files have any colliding sectors. If there are N collisions for a specific sector, at least N - 1 of those files have lost their data in that sector. There are still no guarantees around this, because master file entries of some files using that sector might have been overwritten by entries of newer files, which means we have no idea those files ever existed.

Also I should note that we may only recover certain sectors of a file but not others. If some sectors of the deleted file are in use by other files, it's certain that we lost at least part of the file's data. If all are currently unused we have some hope of recovering the file in its entirety, but we can't be sure, for the reasons I explained above. If the file has some internal verification information (like an MD5 hash of the content) we can determine whether the data was really recovered entirely or not.


Wednesday, April 30, 2014

Paged Memory vs. Non Paged Memory

Introduction

Most computer science students learn during their freshman year that memory is paged in and out between main memory and the hard disk depending on need. They learn that the OS does it for them and that this is the only way. This is a mostly accurate view of the computing universe, but there are exceptions. Since most software engineers will only write user mode programs (as opposed to kernel mode programs) in their entire professional life, this is accurate enough information. But when you think about kernel mode code, things get more complicated.

A Short Introduction to Kernel Mode

Typical computer software is a user mode program. It's a binary file that lives on the hard disk, is loaded into memory by the OS, and is given a virtual memory space of its own and a bunch of crucial computing resources like CPU time. Word processors, web browsers, web servers, and software development tools are all examples of user mode programs.

Then there is the kernel space, which is actually much more heterogeneous than the average software engineer imagines. Core OS components, file systems, device drivers, functional drivers (drivers that don't manage physical hardware) all run in the kernel space. Code running in kernel space is more privileged in terms of what it can do on the processor compared to user mode code. One of these several differences is working with non-paged memory.

Why Does Non-Paged Memory Even Exist - An Intuition

Imagine the memory manager component of the OS. It is the component responsible for paging memory pages in and out. When a memory page is paged out it is written to a part of something called the 'page file', which is a file on disk, obviously. The memory manager needs to maintain some metadata to be able to locate page data in this file, right? What if this metadata itself is paged out and the memory manager has to page in that data to be able to do page-ins? You must see where I'm going: it's a contradictory situation.

So even theoretically it's not possible to make all memory pageable. Some memory pages need to be resident at all times.

Reasons for Using Non-Paged Memory

Performance

If a driver or a core OS component is doing something performance sensitive, using pageable memory to read/write data can obviously be problematic. Waiting for a memory page to be brought in by the memory manager can take forever, which would kill a performance sensitive driver or OS component.

Avoiding Never Ending Cycles

Imagine a disk driver that is trying to read a sector from the disk. Obviously it needs some memory to write the data read from the device (the actual architecture is more complicated, with DMA etc., but I'm skipping that for simplicity). What if that memory page is paged out? We need to go back to the disk to bring that page in. In the worst case this could go on forever. If the disk driver writes the data to a non-paged memory page, then that page is guaranteed to be resident.

Code Running At High Interrupt Masking Levels

At times the processor will be running at a high interrupt masking level. This happens either when an interrupt handling routine is called by the processor as a result of an interrupt, or when a piece of kernel code explicitly raises the interrupt level of the processor. Whatever the reason might be, if code running at a high interrupt level accesses a pageable memory page, that's simply a bug and sooner or later it will cause a crash of the entire OS. The reason this is so disastrous is the fact that the memory manager itself takes action as a result of a bunch of interrupts, and if the current interrupt masking level of the processor is high enough, those interrupts will be masked, meaning they won't be served until the code running at the high interrupt level is done. But that code won't be done until it can bring the memory page it needs back into memory. Well, you see the problem. That piece of code is accessing an inaccessible piece of memory and that's a fatal issue.

So any code that's running at a high enough interrupt level (DISPATCH_LEVEL, the DPC level, or higher on Windows) needs to access only non-paged memory. Any other behavior is a fatal bug that'll surely surface.
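
To illustrate, here's a minimal Windows driver flavored sketch (the tag and the buffer's purpose are made up for the example): memory that a DPC routine will touch at DISPATCH_LEVEL is allocated from the non-paged pool.

#include <ntddk.h>

// Illustrative pool tag; real drivers pick their own 4-character tag.
#define MYDRIVER_TAG 'lpmS'

// Buffer that a DPC (running at DISPATCH_LEVEL) will read/write,
// so it must never be pageable.
static PVOID g_DpcBuffer;

NTSTATUS AllocateDpcBuffer(SIZE_T size)
{
    g_DpcBuffer = ExAllocatePoolWithTag(NonPagedPool, size, MYDRIVER_TAG);
    return (g_DpcBuffer != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
}

VOID FreeDpcBuffer(VOID)
{
    if (g_DpcBuffer != NULL) {
        ExFreePoolWithTag(g_DpcBuffer, MYDRIVER_TAG);
        g_DpcBuffer = NULL;
    }
}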

One Exception to the User Mode Case

I previously said that user mode programs don't get to work with non-paged memory, but that's not entirely true; they sometimes do without knowing it. Drivers running in kernel mode can 'pin' some memory pages in a process' address space, which makes those pages non-paged for a temporary duration (until the driver unpins them). In these cases user mode programs would be accessing non-paged memory for a while.

Why Even Bother With Paged Memory?

It sounds like using paged memory is nothing but trouble. If all memory was non-paged none of the complications mentioned above would be there. So why are we insisting on having pageable memory?

Well, obviously if all memory were non-paged then we could never go over the physically available memory of a computer, which is a very big constraint. If we had mostly non-paged memory and some paged memory, then code working with the pageable memory would incur a huge performance penalty because the rate of page faults would go through the roof.

So operating systems try to limit the amount of non-paged memory to a minimum to avoid these problems. Non-paged memory is a scarce resource and kernel mode code should use it very wisely.

Tuesday, April 29, 2014

Storing Image Data in Transactional Databases

Introduction

You might have noticed that a lot of websites serve their images as links to regular files, instead of storing them as a field in database rows. Generally it's not recommended to store image files directly in the database (e.g. MySQL), but people don't say why. I'll talk about two reasons for not doing so.

Serving Media Through Content Distribution Networks

For websites with users all around the world and high traffic, it makes a lot of sense to store media content on content distribution networks (CDNs). Media content is generally immutable once it's created and it consumes a lot of bandwidth to download. By having the content stored as plain files on a CDN, the website makes use of the CDN's bandwidth and global points of presence (POPs). Having several POPs around the globe also makes the roundtrip time to reach the data much smaller, which results in more prompt downloads.

Keeping media files separate from the database fits better with the CDN strategy. You can actually still keep the primary copy of the data in your database and put the corresponding file on a CDN, but to me that's a weird design. Also, this brings us to our next point.

Cost of Having Transaction Properties

Almost all production quality database engines provide the so called ACID transaction properties. The A (atomicity) and D (durability) of transactions require a design that employs 'transaction logging'. There can be numerous ways to implement transaction logging, but two popular ones are 'write ahead logging' and 'undo/redo logging'. I won't go into the details of these techniques, but in a nutshell they involve writing the database mutations to transaction logs AND to the actual database image itself. Depending on the logging technique used, this means image data will be written to disk at least 2 or 3 times. A single image can occupy as much space as thousands of regular database rows would (imagine tables that have just a few integer and string columns in their definitions), so on a busy database server incurring these extra writes is a non-trivial performance penalty. For most large scale websites this is unacceptable.

Conclusion

I've picked image files as a case study because it's a fairly common case, but this applies to all BLOB (binary large object) types. The purpose of this article isn't to say 'thou shalt not store BLOBs in databases'. I'm simply saying: know the implications if you want to do so.

Thursday, April 17, 2014

When and How Files Get Deleted by a File System

Introduction

We had an interesting discussion with a colleague today. We were discussing what would happen if you renamed a file onto an existing file while there were already open file descriptors (FDs, or handles if you're coming from the Windows world) against the existing file. Here's the pseudocode for the scenario:
  1. Open an FD to file a/b/c
  2. Read 5 bytes from the FD
  3. Rename an existing file a/b/d to a/b/c
  4. Using the already open FD read another 5 bytes.

So what would be the content of the 5 bytes we read in step 4? Would it be the content of the original a/b/c file, or would it be the content of a/b/d, which was later renamed to a/b/c? I thought we would read the content from the file that was originally a/b/d. I was dead wrong. My colleague wrote some experiment code and we were clearly reading data from the old file. This didn't make any sense to me. I was really bothered by two questions:
  • How can we be seeing two different file contents for the same file at the same time?
  • Since we can still read from the original file, the sectors occupied by that file are still accessible. But who's accounting for those sectors now?

Understanding Rename System Call

If you look at the man page for the rename system call you'll see that if the destination file name already exists, it'll be unlinked first and then the source file will be renamed. So this clears up the first question. Obviously we're not seeing two different contents for the same file: the original file is deleted and some other file takes its place.

But it's still weird that we're reading from a deleted file, and god knows what happened to those poor disk sectors occupied by that file.

Global File List vs. Directories

Most modern file systems keep a global list of files per volume. This is called master file table in NTFS and catalog file in HFS+. A file on a volume is mainly represented by its presence in this table.

When we think of directories we're generally misled by the idea that directories 'own' files. Quite the contrary: directories simply 'link' to files that are accounted for in the master file table. One file in the master file table can be linked to by one or more directories. So in this sense the Unix terminology of link/unlink is an extremely good fit.

What Happens When a File is Deleted?

When a file is to be deleted, the file system marks the corresponding entry in the master file table as deleted. It also marks the sectors previously occupied by the file as free, which means they can be allocated to other files from that point on.

When is a File Deleted?

If you look at the man page of the unlink system call, it very clearly lists the conditions for a file to be deleted:
  • No directory has a link against the file
  • No process has an open file descriptor against the file
Now this explains the behavior we saw in the experiment described in the introduction. We were unlinking the file from the directory, but the open FD against it was keeping it from deletion. So the data we read was coming from a perfectly alive file. And the sectors of the file are accounted for by the file's entry in the master file table. Life is good again. When the final FD is closed the file will really be deleted this time.

Final Missing Piece

Then yet another thing started bothering me. Sure, it's nice that the file stays around until the last FD is closed. But what happens if the system crashes while the FD is still open? When the system reboots we'll end up with a zombie file in the master file table and there's no way to reach it anymore, because it doesn't exist in the namespace.

My gut feeling was that when the last link to the file is removed, this information is persisted somewhere, so even if the system crashes before the file is really deleted, the file system will know about the situation the next time the volume is mounted. Sadly I couldn't find concrete information regarding this for NTFS or HFS+, but one EXT3 related paper clearly mentions it:

"In order to handle these deferred delete/truncate requests in a crash-safe manner, the inodes to be unlinked/truncated are added into the ext3 orphan list. This is an already existing mechanism by which ext3 handles file unlink/truncates that might be interrupted by a crash. A persistent singly-linked list of inode numbers is linked from the superblock and, if this list is not empty at filesystem mount time, the ext3 code will first walk the list and delete/truncate all of the files on it before the mount is completed."

I'm sure other file systems employ a similar mechanism to keep track of such files.

Credits

At this point I should thank my colleague, Slobodan Predolac, whom I had the discussion with and who wrote the experiment code and pointed me to related man pages.

Appendix - Experiment Source Code

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSString *tmpfname = [NSTemporaryDirectory() stringByAppendingPathComponent:@"MyFile"];

        // Create the original file and open an FD against it.
        NSString *str = @"1234567890ABCDEFG";
        [str writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        int fd = open([tmpfname UTF8String], O_RDONLY);
        if (-1 == fd) { perror("Oh sh"); exit(1); }

        char buf[64] = {0};
        if (-1 == read(fd, buf, 5)) { perror("Oh sh"); exit(1); }
        else { NSLog(@"First 5 = '%s'", buf); }

        // Atomic write: a new file is written and renamed over the original,
        // i.e. the original file is unlinked from the directory.
        NSString *str1 = @"qwertyqwertyqwerty";
        BOOL succ = [str1 writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        if (!succ) { NSLog(@"Write failed"); exit(1); }
        else { int fd1 = open([tmpfname UTF8String], O_RDONLY); (void)fd1; /* FD of the new file, not used further */ }

        // Keep reading from the original (now unlinked) file through the old FD;
        // we still see the old content.
        char buf1[64] = {0};
        if (-1 == read(fd, buf1, 5)) { perror("Oh sh"); exit(1); }
        else { NSLog(@"Next 5 = '%s'", buf1); }

        close(fd);
    }
    return 0;
}