Thursday, April 17, 2014

When and How Files Get Deleted by a File System

Introduction

We had an interesting discussion with a colleague today: what happens if you rename a file onto an existing file while there are already open file descriptors (FD, or handle if you're coming from the Windows world) against that existing file? Here's the pseudo code for the scenario, followed by a C sketch:
  1. Open an FD to file a/b/c
  2. Read 5 bytes from the FD
  3. Rename an existing file a/b/d to a/b/c
  4. Using the already open FD, read another 5 bytes.
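
For concreteness, here's a minimal sketch of the scenario in plain C. It assumes a/b/c and a/b/d already exist with known contents; the full Objective-C experiment we actually ran is in the appendix.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    char buf[6] = {0};

    int fd = open("a/b/c", O_RDONLY);          /* 1. open an FD to a/b/c   */
    if (-1 == fd) { perror("open"); exit(1); }

    if (-1 == read(fd, buf, 5)) { perror("read"); exit(1); }
    printf("first 5: %s\n", buf);              /* 2. read 5 bytes          */

    if (-1 == rename("a/b/d", "a/b/c")) {      /* 3. rename a/b/d to a/b/c */
        perror("rename");
        exit(1);
    }

    if (-1 == read(fd, buf, 5)) { perror("read"); exit(1); }
    printf("next 5:  %s\n", buf);              /* 4. read via the old FD   */

    close(fd);
    return 0;
}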

So what would be the content of the 5 bytes we read in step 4? Would it be the content of the original a/b/c file, or would it be the content of a/b/d, which was later renamed to a/b/c? I thought we would read the content of the file that was originally a/b/d. I was dead wrong. My colleague wrote some experiment code and we were clearly reading data from the old file. This didn't make any sense to me. I was really bothered by two questions:
  • How can we be seeing two different contents for the same file at the same time?
  • Since we can still read from the original file, the sectors occupied by that file are still accessible. But who's accounting for those sectors now?

Understanding the Rename System Call

If you look at the man page for the rename system call, you'll see that if the destination name already exists, it is unlinked first and then the source file is renamed. So this clears up the first question. Obviously we're not seeing two different contents for the same file: the original file is deleted and another file takes its place.
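
One way to convince yourself of this is to compare inode numbers: after the rename, fstat on the still-open FD reports the old file's inode, while stat on the path reports the inode of what used to be a/b/d. A minimal sketch, assuming the same setup as above (error handling omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat by_fd, by_path;

    int fd = open("a/b/c", O_RDONLY);
    rename("a/b/d", "a/b/c");

    fstat(fd, &by_fd);       /* the file our FD still points at */
    stat("a/b/c", &by_path); /* the file the path points at now */

    /* The two inode numbers differ: same name, two different files. */
    printf("fd inode:   %llu\n", (unsigned long long)by_fd.st_ino);
    printf("path inode: %llu\n", (unsigned long long)by_path.st_ino);

    close(fd);
    return 0;
}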

But it's still weird that we're reading from a deleted file, and god knows what happened to those poor disk sectors it occupied.

Global File List vs. Directories

Most modern file systems keep a global list of files per volume. This is called the master file table in NTFS and the catalog file in HFS+. A file on a volume is mainly represented by its presence in this table.

When we think of directories, we're generally misled by the idea that directories 'own' files. Quite the contrary: directories simply 'link' to files that are accounted for in the master file table. One file in the master file table can be linked from one or more directories. In this sense the Unix terminology of link/unlink is an extremely good fit.
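
This is easy to observe with hard links. A minimal sketch, assuming the current directory is writable (the file names are made up for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat st;

    /* Create a file, then give it a second directory entry. */
    close(open("original", O_WRONLY | O_CREAT, 0644));
    link("original", "alias");

    /* Both names lead to the same entry in the file table;
       neither directory "owns" the file. */
    stat("original", &st);
    printf("link count: %u\n", (unsigned)st.st_nlink); /* prints 2 */

    unlink("alias");    /* drops the count back to 1 */
    unlink("original");
    return 0;
}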

What Happens When a File is Deleted?

When a file is to be deleted, the file system marks the corresponding entry in the master file table as deleted. It also marks the sectors previously occupied by the file as free, which means they can be allocated to other files from that point on.

When is a File Deleted?

If you look at the man page of the unlink system call, it very clearly lists the conditions for a file to be deleted:
  • No directory has a link to the file
  • No process has an open file descriptor against the file

Now this explains the behavior we saw in the experiment described in the introduction. We were unlinking the file from the directory, but the open FD against it was keeping it from deletion. So the data we read was coming from a perfectly alive file, and the sectors of the file are still accounted for by its entry in the master file table. Life is good again. When the final FD is closed, the file will really be deleted.
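
This is easy to reproduce with unlink alone. A minimal sketch, assuming the current directory is writable (names invented for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat st;
    char buf[6] = {0};

    int fd = open("scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);
    write(fd, "hello", 5);

    unlink("scratch");  /* remove the only directory link */

    fstat(fd, &st);
    printf("links: %u\n", (unsigned)st.st_nlink); /* prints 0 */

    lseek(fd, 0, SEEK_SET);
    read(fd, buf, 5);
    printf("data:  %s\n", buf); /* still "hello": the file is alive */

    close(fd);          /* now the file is really deleted */
    return 0;
}

Incidentally, this is the classic POSIX temporary-file idiom: open a file, unlink it immediately, and keep using the FD; the file disappears on its own when the process closes it or exits.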

Final Missing Piece

Then yet another thing started bothering me. Sure, it's nice that the file stays around until the last FD is closed. But what happens if the system crashes while the FD is still open? When the system reboots, we'll end up with a zombie file in the master file table, and there's no way to reach it anymore because it doesn't exist in the namespace.

My gut feeling was that when the last link to a file is removed, this information is persisted somewhere, so that even if the system crashes before the file is really deleted, the file system will know about the situation the next time the volume is mounted. Sadly, I couldn't find concrete information regarding this for NTFS or HFS+, but one ext3-related paper clearly mentions it:

"In order to handle these deferred delete/truncate requests in a crash-safe manner, the inodes to be unlinked/truncated are added into the ext3 orphan list. This is an already existing mechanism by which ext3 handles file unlink/truncates that might be interrupted by a crash. A persistent singly-linked list of inode numbers is linked from the superblock and, if this list is not empty at filesystem mount time, the ext3 code will first walk the list and delete/truncate all of the files on it before the mount is completed."

I'm sure other file systems employ a similar mechanism to keep track of such files.

Credits

At this point I should thank my colleague, Slobodan Predolac, with whom I had the discussion, and who wrote the experiment code and pointed me to the related man pages.

Appendix - Experiment Source Code

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSString *tmpfname = [NSTemporaryDirectory() stringByAppendingPathComponent:@"MyFile"];
        NSString *str = @"1234567890ABCDEFG";
        [str writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];

        // Open an FD against the freshly written file.
        int fd = open([tmpfname UTF8String], O_RDONLY);
        if (-1 == fd) {
            perror("Oh sh");
            exit(1);
        }

        char buf[64] = {0};
        if (-1 == read(fd, buf, 5)) {
            perror("Oh sh");
            exit(1);
        }
        NSLog(@"First 5 = '%s'", buf);

        // Writing atomically replaces the file via a rename, unlinking
        // the old file while fd still refers to it.
        NSString *str1 = @"qwertyqwertyqwerty";
        BOOL succ = [str1 writeToFile:tmpfname atomically:YES encoding:NSUTF8StringEncoding error:nil];
        if (!succ) {
            NSLog(@"Write failed");
            exit(1);
        }

        // An FD opened now refers to the new file that took over the name...
        int fd1 = open([tmpfname UTF8String], O_RDONLY);
        if (-1 == fd1) {
            perror("Oh sh");
            exit(1);
        }

        // ...but reading from the original FD still returns the old content.
        char buf1[64] = {0};
        if (-1 == read(fd, buf1, 5)) {
            perror("Oh sh");
            exit(1);
        }
        NSLog(@"Next 5 = '%s'", buf1);

        close(fd1);
        close(fd);
    }
    return 0;
}
