Wednesday, April 30, 2014

Paged Memory vs. Non Paged Memory

Introduction

Most computer science students learn during their freshman year that memory is paged in and out between main memory and the hard disk depending on need. They learn that the OS does it for them and that it is the only way. This is a mostly accurate view of the computing universe, but there are exceptions. Since most software engineers will only write user mode programs (as opposed to kernel mode programs) in their entire professional life, this is accurate information for them. But when you look at kernel mode code, things get more complicated.

A Short Introduction to Kernel Mode

Typical computer software is a user mode program. It's a binary file that lives on the hard disk, is loaded into memory by the OS, and is given a virtual memory space of its own along with a bunch of crucial computing resources like CPU time. Word processors, web browsers, web servers, and software development tools are all examples of user mode programs.

Then there is the kernel space, which is actually much more heterogeneous than the average software engineer imagines. Core OS components, file systems, device drivers, and functional drivers (drivers that don't manage physical hardware) all run in kernel space. Code running in kernel space is more privileged in terms of what it can do on the processor compared to user mode code. One of these several differences is the ability to work with non-paged memory.

Why Does Non-Paged Memory Even Exist - An Intuition

Imagine the memory manager component of the OS. It is the component responsible for paging memory pages in and out. When a memory page is paged out it is written to part of something called the 'page file', which is, obviously, a file on disk. The memory manager needs to maintain some metadata to be able to locate page data in this file, right? What if this metadata itself were paged out and the memory manager had to page in that data to be able to do page-ins? You can see where I'm going: it's a circular dependency.

So even theoretically it's not possible to make all memory pageable. Some memory pages need to be resident at all times.

Reasons for Using Non-Paged Memory

Performance

If a driver or a core OS component is doing something performance sensitive, using pageable memory to read/write data can obviously be problematic. Waiting for a memory page to be brought in by the memory manager can take forever, which would kill a performance sensitive driver or OS component.

Avoiding Never Ending Cycles

Imagine a disk driver that is trying to read a sector from the disk. Obviously it needs some memory to write the data read from the device into (the actual architecture is more complicated, with DMA etc., but let's skip that for simplicity). What if that memory page is paged out? We need to go back to the disk to bring that page in. In the worst case this could go on forever. If the disk driver instead writes the data into a non-paged memory page, then that page is guaranteed to be resident.
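
As a concrete illustration, here is a minimal sketch of how a Windows driver might allocate such a buffer from the non-paged pool. It assumes the WDK headers; the function names, buffer size handling and pool tag are made up for the example:

#include <ntddk.h>

#define READ_BUFFER_TAG 'fuBR'   // hypothetical pool tag for this example

// Allocate a buffer that is guaranteed to stay resident, so a storage
// driver can write data into it without ever triggering a page fault.
PVOID AllocateResidentReadBuffer(SIZE_T Size)
{
    // NonPagedPoolNx = non-paged, non-executable pool memory.
    return ExAllocatePoolWithTag(NonPagedPoolNx, Size, READ_BUFFER_TAG);
}

VOID FreeResidentReadBuffer(PVOID Buffer)
{
    ExFreePoolWithTag(Buffer, READ_BUFFER_TAG);
}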

Code Running At High Interrupt Masking Levels

At times the processor will be running at a high interrupt masking level. This happens either when an interrupt handling routine is called by the processor as a result of an interrupt, or when a piece of kernel code explicitly raises the interrupt level of the processor. Whatever the reason, if code running at a high interrupt level accesses a pageable memory page, that's simply a bug, and sooner or later it will crash the entire OS. The reason this is so disastrous is that the memory manager itself takes action as a result of a number of interrupts, and if the current interrupt masking level of the processor is high enough, those interrupts will be masked, meaning they won't be served until the code running at the high interrupt level is done. But that code won't be done until the memory page it needs is brought back into memory. You see the problem: that piece of code is accessing an inaccessible piece of memory, and that's a fatal issue.

So any code that's running at a high enough interrupt level (DISPATCH_LEVEL or above on Windows, also known as DPC level) must access only non-paged memory. Anything else is a fatal bug that will surely surface.
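
On Windows the WDK even provides helpers to catch this class of bug early. The sketch below assumes the WDK headers; the routine name and buffer parameter are hypothetical. It shows the two common idioms: asserting that a routine runs at an IRQL where paging works, and checking the current IRQL before touching pageable data:

#include <ntddk.h>

// This routine touches pageable data, so it must be called below
// DISPATCH_LEVEL. PAGED_CODE() asserts that on checked builds.
VOID TouchPageableBuffer(PUCHAR PageableBuffer)
{
    PAGED_CODE();

    if (KeGetCurrentIrql() >= DISPATCH_LEVEL) {
        // At DISPATCH_LEVEL or above a page fault cannot be serviced,
        // so touching pageable memory here would crash the OS.
        return;
    }

    PageableBuffer[0] = 0;  // safe: a page fault can be serviced here
}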

One Exception to the User Mode Case

I previously said that user mode programs don't get to work with non-paged memory, but that's not entirely true; they sometimes do without knowing it. Drivers running in kernel mode can 'pin' some memory pages in a process' address space, which makes those pages non-paged for a temporary duration (until the driver unpins them). In these cases user mode programs are accessing non-paged memory for a while.
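
On Windows this pinning is typically done with MDLs. Here is a minimal sketch, assuming the WDK headers; the function names and parameters are made up for illustration:

#include <ntddk.h>

// Pin a user mode buffer so it stays resident while the driver works on it.
// Returns the MDL describing the locked pages, or NULL on failure.
PMDL PinUserBuffer(PVOID UserAddress, ULONG Length)
{
    PMDL mdl = IoAllocateMdl(UserAddress, Length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return NULL;
    }

    __try {
        // Probes the pages for write access and locks them in physical memory.
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    }
    __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return NULL;
    }

    return mdl;
}

VOID UnpinUserBuffer(PMDL Mdl)
{
    MmUnlockPages(Mdl);   // pages become pageable again
    IoFreeMdl(Mdl);
}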

Why Even Bother With Paged Memory?

It sounds like using paged memory is nothing but trouble. If all memory were non-paged, none of the complications mentioned above would exist. So why do we insist on having pageable memory?

Well, obviously, if all memory were non-paged then we could never go beyond the physically available memory of a computer, which is a very big constraint. And if we had mostly non-paged memory and only some paged memory, then code working with pageable memory would incur a huge performance penalty because the rate of page faults would go through the roof.

So operating systems try to limit the amount of non-paged memory to a minimum to avoid these problems. Non-paged memory is a scarce resource and kernel mode code should use it very wisely.

Tuesday, April 29, 2014

Storing Image Data in Transactional Databases

Introduction

You might have noticed that a lot of websites serve their images as links to regular files, instead of storing them as a field in database rows. It's generally not recommended to store image files directly in the database (e.g. MySQL), but people don't say why. I'll talk about two reasons for not doing so.

Serving Media Through Content Distribution Networks

For websites with users all around the world and high traffic, it makes a lot of sense to store media content on content distribution networks (CDNs). Media content is generally immutable once created, and it consumes a lot of bandwidth to download. By having the content stored as plain files on a CDN, the website makes use of the CDN's bandwidth and global points of presence (POPs). Having several POPs around the globe also makes the roundtrip time to reach the data much shorter, which results in faster downloads.

Keeping media files separate from the database fits the CDN strategy better. You could actually still keep the primary copy of the data in your database and put the corresponding file on a CDN, but to me that's a weird design. It also brings us to the next point.

Cost of Having Transaction Properties

Almost all production quality database engines provide the so-called ACID transaction properties. The A (atomicity) and D (durability) of transactions require a design that employs transaction logging. There are numerous ways to implement transaction logging, but two popular ones are 'write ahead logging' and 'undo/redo logging'. I won't go into the details of these techniques, but in a nutshell they involve writing the database mutations to transaction logs AND to the actual database image itself. Depending on the logging technique used, this means image data will be written to disk at least 2 or 3 times. A single image can occupy as much space as thousands of regular database rows (imagine tables that have just a few integer and string columns in their definitions), so on a busy database server these extra writes are a non-trivial performance penalty. For most large scale websites this is unacceptable.
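
To put rough, made-up numbers on it: a 2 MB photo stored in a BLOB column costs at least 4 MB of write traffic under write-ahead logging (once to the log, once to the database image), while the same row storing only a URL or file path plus a few scalar columns costs well under a kilobyte, and the image bytes never touch the transaction log at all.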

Conclusion

I've picked image files as a case study because it's a fairly common case, but this applies to all BLOB (binary large object) types. The purpose of this article isn't to say 'thou shalt not store BLOBs in databases'. I'm simply saying: know the implications if you want to do so.

Thursday, April 17, 2014

When and How Files Get Deleted by a File System

Introduction

We had an interesting discussion with a colleague today. We were discussing what would happen if you rename a file over an existing file while there are already open file descriptors (FDs, or handles if you're coming from the Windows world) against the existing file. Here's the pseudo code for the scenario:
  1. Open an FD to file a/b/c
  2. Read 5 bytes from the FD
  3. Rename an existing file a/b/d to a/b/c
  4. Using the already open FD read another 5 bytes.

So what would be the content of the 5 bytes we read in step 4? Would it be the content of the original a/b/c file, or would it be the content of a/b/d, which was later renamed to a/b/c? I thought we would read the content of the file that was originally a/b/d. I was dead wrong. My colleague wrote some experiment code and we were clearly reading data from the old file. This didn't make any sense to me. I was really bothered by two questions:
  • How can we be seeing two different contents for the same file at the same time?
  • Since we can still read from the original file, the sectors occupied by that file are still accessible. But who's accounting for those sectors now?

Understanding Rename System Call

If you look at the man page for the rename system call you'll see that if the destination file name already exists, it will be unlinked first and then the source file will be renamed. So this clears up the first question. Obviously we're not seeing two different contents for the same file; the original file is deleted and another file takes its place.

But it's still weird that we're reading from a deleted file, and god knows what happened to those poor disk sectors occupied by that file.
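
For reference, here is a minimal C sketch of the same scenario using rename() directly (file names and contents are made up, and error handling is trimmed for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // Step 0: create the two files, /tmp/c with old content and /tmp/d with new content.
    int c = open("/tmp/c", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    write(c, "OLDOLDOLDOLD", 12);
    close(c);
    int d = open("/tmp/d", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    write(d, "NEWNEWNEWNEW", 12);
    close(d);

    char buf[6] = {0};

    // Steps 1 & 2: open an FD to /tmp/c and read 5 bytes.
    int fd = open("/tmp/c", O_RDONLY);
    read(fd, buf, 5);
    printf("First 5:  %s\n", buf);   // prints "OLDOL"

    // Step 3: rename /tmp/d over /tmp/c; the old /tmp/c is unlinked.
    rename("/tmp/d", "/tmp/c");

    // Step 4: read through the already open FD; still the old content.
    read(fd, buf, 5);
    printf("Second 5: %s\n", buf);   // prints "DOLDO", not "EWNEW"

    close(fd);
    return 0;
}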

Global File List vs. Directories

Most modern file systems keep a global list of files per volume. This is called the master file table in NTFS and the catalog file in HFS+. A file on a volume is primarily represented by its entry in this table.

When we think of directories we are generally misled by the idea that directories 'own' files. Quite the contrary: directories simply 'link' to files that are accounted for in the master file table. One file in the master file table can be linked from one or more directories. In this sense the Unix terminology of link/unlink fits extremely well.
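
You can observe this directly with hard links. A minimal sketch (paths made up, error handling omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    struct stat st;

    // One file, soon to have two directory entries pointing at the same table entry.
    int fd = open("/tmp/f", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    close(fd);

    link("/tmp/f", "/tmp/f_alias");                  // add a second link
    stat("/tmp/f", &st);
    printf("link count: %d\n", (int)st.st_nlink);    // prints 2

    unlink("/tmp/f_alias");                          // remove one link
    stat("/tmp/f", &st);
    printf("link count: %d\n", (int)st.st_nlink);    // back to 1
    return 0;
}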

What Happens When a File is Deleted?

When a file is to be deleted, the file system marks the corresponding entry in the master file table as deleted. It also marks the sectors previously occupied by the file as free, which means they can be allocated to other files from that point on.

When is a File Deleted?

If you look at the man page of the unlink system call it very clearly lists the conditions for a file to be deleted:
  • No directory has a link to the file
  • No process has an open file descriptor against the file
Now this explains the behavior we saw in the experiment described in the introduction. We were unlinking the file from the directory, but the open FD against it was keeping it from deletion. So the data we read was coming from a perfectly alive file, and the sectors of the file are still accounted for by the file's entry in the master file table. Life is good again. When the final FD is closed the file will really be deleted this time.
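
The same property can be demonstrated even more directly by unlinking a file while an FD against it is still open. A minimal sketch (path and content made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/doomed", O_CREAT | O_TRUNC | O_RDWR, 0644);
    write(fd, "still here", 10);

    // Remove the only directory link. The file no longer has a name,
    // but it is not deleted yet because this process holds an open FD.
    unlink("/tmp/doomed");

    char buf[11] = {0};
    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 10);
    printf("read after unlink: %s\n", buf);   // prints "still here"

    // Only now, when the last FD is closed, is the file really deleted
    // and its sectors reclaimed.
    close(fd);
    return 0;
}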

Final Missing Piece

Then yet another thing started bothering me. Sure, it's nice that the file stays around until the last FD is closed. But what happens if the system crashes while the FD is still open? When the system reboots we'd end up with a zombie file in the master file table, and there's no way to reach it anymore because it doesn't exist in the namespace.

My gut feeling was that when the last link to a file is removed, this information is persisted somewhere, so even if the system crashes before the file is really deleted, the file system will know about the situation the next time the volume is mounted. Sadly I couldn't find concrete information regarding this for NTFS or HFS+, but one EXT3 related paper clearly mentions it:

"In order to handle these deferred delete/truncate requests in a crash-safe manner, the inodes to be unlinked/truncated are added into the ext3 orphan list. This is an already existing mechanism by which ext3 handles file unlink/truncates that might be interrupted by a crash. A persistent singly-linked list of inode numbers is linked from the superblock and, if this list is not empty at filesystem mount time, the ext3 code will first walk the list and delete/truncate all of the files on it before the mount is completed."

I'm sure other file systems employ a similar mechanism to keep track of such files.

Credits

At this point I should thank my colleague, Slobodan Predolac, whom I had the discussion with and who wrote the experiment code and pointed me to related man pages.

Appendix - Experiment Source Code

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>    // needed for open() and O_RDONLY
#include <stdlib.h>   // needed for exit()

#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSString *tmpfname = [NSTemporaryDirectory() stringByAppendingPathComponent:@"MyFile"];
        NSString *str = @"1234567890ABCDEFG";

        // Create the original file and open an FD against it.
        [str writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        int fd = open([tmpfname UTF8String], O_RDONLY);
        if (-1 == fd) { perror("Oh sh"); exit(1); }

        // Read the first 5 bytes through the FD.
        char buf[64] = {0};
        if (-1 == read(fd, buf, 5)) { perror("Oh sh"); exit(1); }
        NSLog(@"First 5 = '%s'", buf);

        // Atomic write: the new content goes to a temporary file which is then
        // renamed over the original path, i.e. the rename scenario above.
        NSString *str1 = @"qwertyqwertyqwerty";
        BOOL succ = [str1 writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        if (!succ) { NSLog(@"Write failed"); exit(1); }

        // Read another 5 bytes through the original FD. This still returns the
        // old content: the original file is unlinked but not yet deleted.
        char buf1[64] = {0};
        if (-1 == read(fd, buf1, 5)) { perror("Oh sh"); exit(1); }
        NSLog(@"Next 5 = '%s'", buf1);

        close(fd);
    }
    return 0;
}

Wednesday, April 16, 2014

File System Reliability: NTFS, journaling, CHKDSK etc.

Reliability of file systems and disks can be a very confusing topic with a lot of nuances. I'd like to address at least some of these topics here and hopefully make them a little easier to understand.

Journaling File System

In the old days file systems were very vulnerable to abrupt system shutdowns, whether from a power failure, a hardware failure, or a software bug causing the system to crash. If such a failure happened while the file system was in the middle of updating its metadata, things could get really ugly. You could end up with disappearing files/directories, orphaned files/directories (files that belong to no directory), files/directories that appear in more than one directory, and many more problems like these. You could even lose an entire file system or a big chunk of it.

Modern file systems like NTFS addressed this by introducing journaling, which made updates to file system metadata transactional. By implementing transactional logging, file systems could guarantee that metadata updates were atomic, meaning they would either happen in their entirety or not happen at all. This made file systems very resilient compared to their predecessors.

Journaling vs. File Content Corruption

In a journaling file system, the file system's METADATA is updated atomically. File system metadata and file content are different things. The list of files in a directory is part of the metadata and is updated via journaling. The content of your Word file is user data and it doesn't have the same guarantees. So the content of your Word file can be left in an undefined state if the system crashes, even with journaling.

Having Transactional Guarantees for File Content

For that guarantee you either need to use a database system that supports transactions, or use the TxF API (Transactional NTFS) on Windows to modify file contents within a transaction. TxF never gained much popularity in the developer community though; I guess people just used databases when they needed transactions. And Microsoft dropped this support in its latest server file system, ReFS.
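
For completeness, here is a minimal, heavily simplified sketch of what using TxF looked like. Error handling is omitted, the file path is made up, and since the API is deprecated this is for illustration only:

#include <windows.h>
#include <ktmw32.h>   // link against KtmW32.lib

int main(void)
{
    // Create a kernel transaction object.
    HANDLE tx = CreateTransaction(NULL, NULL, 0, 0, 0, 0, NULL);

    // Open (or create) the file as part of the transaction.
    HANDLE file = CreateFileTransactedW(L"C:\\temp\\doc.bin", GENERIC_WRITE, 0, NULL,
                                        CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL,
                                        tx, NULL, NULL);

    DWORD written = 0;
    WriteFile(file, "hello", 5, &written, NULL);
    CloseHandle(file);

    // Until this commit, other readers see the old content; if the process
    // or the system dies before the commit, the write is rolled back.
    CommitTransaction(tx);
    CloseHandle(tx);
    return 0;
}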

CHKDSK (Check Disk)

Chkdsk is a tool on the Windows platform that verifies the health of a file system and makes repairs if necessary. Chkdsk works with both FAT and NTFS file systems. The fact that NTFS is a journaling file system confuses a lot of people regarding chkdsk. After all, why do you need such a tool if the file system protects itself from metadata inconsistencies via journaling, right?

Well, it's a little more complicated than that. First of all, chkdsk addresses different problems for FAT and NTFS. FAT can end up with inconsistent metadata whenever there is an abrupt shutdown. NTFS is resilient to that, but bad sectors complicate life here. So because of bad sectors, even NTFS needs chkdsk to make things right when necessary.

Software and hardware both work on preventing data loss caused by bad sectors, as explained in the next section. However, there can still be unrecoverable bad sector errors. If this happens to a sector that used to contain file system metadata, the file system needs to be repaired by bringing it to a consistent state. This might still result in metadata/data loss though.

I have seen people on the internet claim that CHKDSK is responsible for working with journal records to bring the file system to a consistent state, and that this is its main job on NTFS volumes. This is incorrect. Scanning journal records during volume mount is routine NTFS activity and has nothing to do with CHKDSK.


Bad Sector Problems and How They're Addressed

'Bad sector' is a term tech people hear frequently, but most don't understand it deeply. Handling of bad sectors involves cooperation between storage hardware, volume drivers, and file system drivers, so understanding how such problems are handled and mitigated is pretty complicated. Let's address the subtleties of the topic one by one.

Sector Remapping by the Disk Hardware Itself

Most modern disks keep some error correction code (ECC) per sector. So if the disk hardware notices that the data read from a sector doesn't add up to its checksum, it corrects the sector data using the error correction code for that sector. The disk then remaps this sector to a spare sector as soon as possible to prevent permanent data loss. Even disk drives that have this functionality can run out of spare sectors and may not be able to do remapping anymore.

Another case: even if the sector has gone bad beyond error correction, it can still be remapped the next time it's written to. This is because the old data was going to be overwritten by the new write anyway, so the old data in the sector has become irrelevant.

Sector Remapping by the Volume Driver Software

SCSI disks provide an interface to explicitly map a bad sector to a spare sector. Imagine a sector has gone bad beyond the point of error correction (or maybe the disk sectors didn't have ECC in the first place). In this case the disk hardware can't remap the sector without data loss. But if there's redundancy in place (like a mirrored RAID system), the volume driver can get the redundant sector data, remap the sector to a spare one on the disk drive, and rewrite the remapped sector with the redundant data. This way the sector is saved without involving the file system driver at all.

If the sector data can be saved because there's redundancy but the bad sector can't be remapped because there aren't any remaining spare sectors, then there's still a solution but it requires cooperation from the file system.

Sector Remapping by the File System

First I should say that file systems don't really do sector remapping. All they do is mark a sector as bad so they won't use it in the future. Depending on whether the current operation is a read or a write, and whether the sector data had some redundancy to it, the flow of events differs.

  • Reading the sector
    • The sector has gone bad but the volume driver has recovered the data via redundancy (e.g. mirrored RAID)
      • In this case the file system marks the sector as unusable, finds a free sector to replace it, and writes the sector data to this new sector. Note that this is not a physical remapping; the file system is just ignoring this bad sector from now on. It's like a black hole in the sector bitmap.
    • The sector has gone bad and the data is lost.
      • Any read of the related part of the related file will return an I/O error to the caller.
      • If the failed sector was part of the file system's metadata, then a repair action on the file system is necessary, but the file system will in all likelihood still lose some metadata.
  • Writing to the sector
    • The file system will mark the bad sector as unusable in its sector map, use a free sector in its place, and write the new data to that sector.

A Case Study

Recently a colleague of mine had checksum errors with his git repository. As the issue was being discussed, some people suggested his hard drive was probably failing and he was getting bad sectors. As you can see from the notes above, when there's a bad sector file systems don't simply return garbage data from the sector. If the sector data can be saved by something in the I/O stack (FS driver, volume driver, disk driver, disk H/W etc.) it will be. Otherwise the caller of the file read operation will get an error code saying the data couldn't be read. Of course this may not be the case for every file system in the world, but any production quality file system has to behave this way.

So any checksum problems like my colleague had would most likely be the result of application level bugs rather than the file system returning gibberish data for bad sectors.


Note

Most of the information here is based on NTFS and Windows' out-of-the-box volume driver, but it should still be mostly applicable to other modern, production quality software.

Saturday, April 12, 2014

Directory Scalability of File Systems

Introduction

Recently I've been hearing from several people that having too many files in a directory is bad for performance and that you need to shard the files across multiple directories instead. If you're familiar with file system internals you'll notice this claim doesn't sound right at all.

Most mainstream production quality file systems keep a B+ tree (or a variant) index per directory. B+ trees are very scalable data structures that self-balance when a node's fill rate goes below a certain threshold. And since each node is a large consecutive block containing hundreds of entries, they work very well with sector based persistent storage.

Similarly, most database servers use B+ trees for their indexes, and they handle billions of records per table without trouble.

When you combine these, it should be intuitive to expect file systems to handle numerous files in a directory without too much trouble.

Edit: Turns out HFS+ doesn't have a per-directory B+ tree index after all. It has one global index, called the catalog file, and directory traversal happens through this index. This means that on HFS+ sharding files into several directories for performance reasons is pretty much irrelevant.

Experiment

This information should convince most software engineers, but partly out of curiosity and partly to convince the rest, I did a small experiment to see how a file system directory scales with respect to the number of files.

Experiment Environment

For a while now I've been working with Apple development tools, so the experiment setup consists of Apple technologies.


Operating System: Mac OS X 10.9.2
Computer: MacBook Pro Retina, 3 GHz Intel Core i7, 8 GB 1600 MHz DDR3, APPLE SSD SM256E Media
File System: HFS+
Programming Language: Objective-C

Experiment Setup

In the experiment I set up different directories and populated them with 1K, 10K, 100K, and 1M very small files (4 bytes each) respectively. I measured the time to populate the directories, the time to enumerate all files in each directory, and the time to look up 1K files in each directory.

I'm sure someone will be troubled by the small file sizes. I kept the files as small as possible because file size is irrelevant to directory metadata; file data and directory metadata are kept totally separate by the file system. By keeping the files small I kept the file data write overhead to a minimum, so most of the time taken to populate the directories is spent updating directory metadata rather than writing the files themselves.

Experiment Results

All measurements are in seconds of wall clock time. For the enumeration and lookup timings I rebooted the computer to make sure the file system caches were purged. Indeed, without a cache purge I observed that both lookups and enumerations are much faster.



                                           1K Files    10K Files    100K Files    1M Files
Time to create the files                   0.22s       2.07s        21.48s        252.59s
Time to look up 1K files                   0.039s      0.14s        0.30s         0.34s
Time to enumerate all files                0.032s      0.15s        1.21s         19.98s


Conclusion

As you can see from the results, the HFS+ file system handled up to 1 million files in a directory extremely well. Total file creation and enumeration times increased roughly linearly with the number of files. Looking up 1 thousand files took pretty much the same time for all directories.

The total time to enumerate 1 million files diverges slightly from the trend of the other directories, but this is probably about memory locality: my absurdly simple experiment code reads all file names into a single array, and an array with 1 million elements is big enough to put a strain on that.

I did this experiment on HFS+ but I'm fairly sure all mainstream file systems like NTFS, EXT3, EXT4 etc. would produce similar results.

Appendix A

Why Do People Think File Systems Can't Handle Too Many Files in a Directory

I feel like this is a residue in software engineers' collective memory. In the early 90s file systems were much more primitive than today's. Most didn't even have journaling, so you could lose file system metadata when the system shut down abruptly. But starting from the mid 90s, most major operating system platforms started shipping robust file systems that also scaled much better than their predecessors.

Appendix B

Why Does 'ls -l' Take Much Longer Than Plain 'ls'

ls -l returns a bunch of file attributes along with the file names, whereas plain ls just returns the names. File attributes belong to the file itself, not to the directory, so they aren't kept with the directory metadata. In effect, doing ls -l requires looking up the files one by one and reading their attributes. This is why it's wildly slower than plain ls.
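
A minimal sketch of the difference (directory path made up, error handling omitted): plain ls only needs the directory entries, while ls -l also has to stat every entry.

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    const char *dirpath = "/tmp/bigdir";   // hypothetical large directory
    DIR *dir = opendir(dirpath);
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL) {
        // 'ls' stops here: the name comes straight out of the directory index.
        printf("%s", entry->d_name);

        // 'ls -l' additionally needs per-file attributes, which means one
        // extra lookup (stat) per entry against the file's own metadata.
        char full[4096];
        snprintf(full, sizeof(full), "%s/%s", dirpath, entry->d_name);
        struct stat st;
        if (stat(full, &st) == 0) {
            printf("  %lld bytes", (long long)st.st_size);
        }
        printf("\n");
    }

    closedir(dir);
    return 0;
}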

Appendix C

Should I Rule Out Directory Sharding Forever

For 99.999% of software engineers, yes, at least for performance reasons. If you're building a file server that will host millions of files for thousands of very active users, then a big maybe. Here's why: file systems keep in-memory objects representing the on-disk entities like files and directories, called FCBs (file control blocks) in the Windows world. An FCB is created per file on disk, not per open handle to that file. So the FCB for a very large directory can become a hot spot under heavy load: per-FCB resources like locks will be under contention and can potentially hurt performance.

I'm personally skeptical even about this, because per-FCB resources mostly matter for file open time, and that's only part of the I/O performance picture, where the other parts are read, write, flush, ioctl etc. times.

Appendix D

Experiment Code

Sorry for the crudity of the code. For the different directory setups I manually modified parts of the code, so you can't really use it out of the box.

#import <Foundation/Foundation.h>

#include <QuartzCore/QuartzCore.h>


NSString *path = @"your path here";

void lookup1KFiles(void)
{
  int found = 0;
  double start = CACurrentMediaTime();
  
  for (int i = 0; i < 1000; i++) {
    
    if ([[NSFileManager defaultManager] fileExistsAtPath:[NSString stringWithFormat:@"%@/%d.txt", path, i]
                                             isDirectory:NULL]) { // isDirectory takes a BOOL *, pass NULL since we don't need it
      found++;
    }
  }
  
  double end = CACurrentMediaTime();
  
  NSLog(@"lookup time %lf", end - start);
  
  NSLog( @"Lookup found %d", found);
  
}

void populate(void)
{
  NSData *data = [NSData dataWithBytes:"abcd" length:4];
  
  double start = CACurrentMediaTime();
  
  //NSString *uuid = nil;
  
  for (int i=0; i < 1000; i++) {
    
    if (i % 10000 == 0) {
      //uuid = [[NSUUID UUID] UUIDString];
      NSLog(@"%d", i);
    }
    
    [data writeToFile:[NSString stringWithFormat:@"%@/%d.txt", path, i]
           atomically:NO];
    
    
  }
  
  double end = CACurrentMediaTime();
  
  NSLog(@"populate time %lf", end - start);
}

void enumerate(void)
{
  
  
  NSDirectoryEnumerationOptions options = NSDirectoryEnumerationSkipsSubdirectoryDescendants |
  NSDirectoryEnumerationSkipsPackageDescendants |
  NSDirectoryEnumerationSkipsHiddenFiles;
  
  NSURL *url = [NSURL fileURLWithPath:path];
  
  
  double start = CACurrentMediaTime();
  
  __unused NSArray *array = [[NSFileManager defaultManager] contentsOfDirectoryAtPath:path error:nil];
  
  [[NSFileManager defaultManager] contentsOfDirectoryAtURL:url
                                includingPropertiesForKeys:[[NSArray alloc] initWithObjects:NSURLIsRegularFileKey, nil]
                                                   options:options
                                                     error:nil];
  
  double end = CACurrentMediaTime();
  
  NSLog(@"enumerate time %lf", end - start);
  NSLog(@"count %d", (int)[array count]);
  
  return;
}