The Evolution of the File

Copyright © Karl Dahlke, 2023

In a typical home or business, the information on a computer is gathered into files. A song, a video, a picture of your cat, the homework paper you are working on, the spreadsheet that tallies your monthly expenses - each of these is stored in a file. This is the norm. An exception to this rule is a commercial database, but even there, data is stored in tables, which have some similarities to files. Each table has a name, size, creation date, access / update permissions, etc., looking a bit like a file. Let's set databases to the side, and say that all the information on your computer is stored in files.

In 1961, as per the MIT Compatible Time-Sharing System, files were very simple. Each file had a name and not much more. It's amazing how files have evolved since then. I'm not talking about the content of a file, although that has certainly changed dramatically too. At the outset, a file holding a song was unthinkable; in fact a song is larger than all the memory and the attendant floppy disk of the Apple IIe. So yes, there has been an explosion in file content, but there has also been an evolution in the attributes of a file, represented by metadata.

I use the word evolution loosely, because technology is not driven by undirected mutations and natural selection. It is in fact driven by an intelligent designer, or perhaps competing committees of intelligent designers. The result isn't as jumbled as our DNA, but it is nonetheless complex, and perhaps not as straightforward as one might wish.

File attributes are stored in a separate block in your computer's file system, apart from the content of the file. I'll call this block of metadata an inode. This is the Unix / Linux convention; the inode is called something else on other operating systems, but it serves the same purpose.

The most important attribute of a file is its name. Thus the name and the content are separate, not just in concept but in implementation. Edit your file, by adding a sentence to your letter, and you have changed the content of that file. Rename the file, from Hello.doc to HiThere.doc, and you are actually updating the directory that contains this file, writing the new name in place of the old, so that the new name points to the inode that describes the file. Renaming a file is a quick and easy operation inside your computer. No need to look at the content of the file; just update the name in the directory. This means you can lie about your files. You can rename a Word document from foobar.doc to foobar.pdf, and the computer is perfectly happy, but don't do it, because many applications assume the filename is consistent with what's inside. Your pdf reader may cheerfully open, and try to render, foobar.pdf, but if the content was generated by MS Word it's not going to work.
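Here is a small Python sketch of that separation. It assumes a Unix-like system, where os.stat reports an inode number; the helper name rename_keeps_inode and the sample text are mine, made up for illustration. The file is created in a scratch directory, renamed, and the inode number is compared before and after.

```python
import os
import tempfile

def rename_keeps_inode(old_name, new_name, text="Dear Sir,\n"):
    """Create a file, rename it, and report the inode number before and after."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, old_name)
        new_path = os.path.join(folder, new_name)
        with open(old_path, "w") as f:
            f.write(text)
        inode_before = os.stat(old_path).st_ino
        os.rename(old_path, new_path)   # only the directory entry changes
        inode_after = os.stat(new_path).st_ino
        with open(new_path) as f:
            content_after = f.read()    # the content was never touched
    return inode_before, inode_after, content_after

if __name__ == "__main__":
    before, after, content = rename_keeps_inode("Hello.doc", "HiThere.doc")
    print("same inode:", before == after)
    print("content:", repr(content))
```

The same inode number under a new name is the whole story of a rename within one directory: the name moved, the file did not.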

As mentioned above, a file is described by its inode. Some attributes that have been included in the inode since day 1 are location and length. When you call up TermPaper.doc, the computer must somehow find the content of that file. The inode is linked to a chain of blocks on the disk. The computer puts these blocks together to reproduce your file. If you add a paragraph to your paper, the file grows, and may require another block on the disk. The inode keeps track of these blocks.

Each block is a certain size on the disk, say 4096 bytes. If your term paper consumes ten blocks, then the file is 4096 × 10 = 40960 bytes long, but it isn't really, because the last block is only partially used. The length field in the inode tells the computer how long the file really is, perhaps 40923 bytes.
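The arithmetic above can be sketched in a few lines of Python. The block size of 4096 and the length of 40923 are the figures from the text; the function names are my own. On a real system the length would come from the inode, e.g. os.stat(path).st_size.

```python
BLOCK_SIZE = 4096  # bytes per block, the figure used in the text

def blocks_needed(length, block_size=BLOCK_SIZE):
    """Number of whole blocks required to hold `length` bytes (rounded up)."""
    return (length + block_size - 1) // block_size

def unused_tail(length, block_size=BLOCK_SIZE):
    """Bytes allocated in the last block but not part of the file."""
    return blocks_needed(length, block_size) * block_size - length

if __name__ == "__main__":
    length = 40923
    print(blocks_needed(length), "blocks,", unused_tail(length), "bytes unused")
```

For the term paper, that is ten blocks with 37 bytes of the last block going unused, which is why the inode must record the true length rather than infer it from the block count.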

The next thing a file needs is a modification time: when did you last edit or update this file? It is called "modification time" in the computer manuals, and in casual conversation, but it is really date and time together, e.g. you last updated your term paper on August 17, 2009, at 4:36 PM. This is important for backups. If you backed up your files on August 16, and one of them was updated on August 17, then that file needs to be copied over to your secondary drive the next time you perform a system backup. A simple script looks like this.

RunBackup()
{
	NowTime = current date and time;
	LastTime = recorded date and time of last backup;
	foreach (file in your area, or on your drive) {
		if (modtime(file) >= LastTime)
			copy file to the backup drive;
	}
	set date and time of backup = NowTime;
}

I capture the now time first to avoid a race condition. If it takes a long time to do the backups, and a file changes in the meantime, just after it was copied, you want to schedule it for copy again at the next backup. Since the new backup time is at the start of the procedure, not the end, it will all work out. These subtle timing issues are part of the discipline of software engineering, and beyond the scope of this book.
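For readers who want something they can run, the backup script might be sketched in Python as follows. This is one possible rendering, not the only one. The function name run_backup and its parameters are my own; it assumes the backup directory is not inside the source directory, and it leaves it to the caller to record the returned time somewhere for next time.

```python
import os
import shutil
import time

def run_backup(source_dir, backup_dir, last_time):
    """Copy files modified at or after last_time.

    Returns (new backup time, sorted list of relative paths copied).
    """
    now_time = time.time()  # captured first, before any copying begins
    copied = []
    for folder, _subdirs, names in os.walk(source_dir):
        for name in names:
            src = os.path.join(folder, name)
            if os.stat(src).st_mtime >= last_time:
                rel = os.path.relpath(src, source_dir)
                dst = os.path.join(backup_dir, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 also carries the timestamps across
                copied.append(rel)
    return now_time, sorted(copied)
```

A caller would do something like new_time, files = run_backup("/home/me", "/mnt/backup/me", last_time), then save new_time for the next run. Those two paths are placeholders, of course.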

Since the modification time was so useful, file systems soon included creation time and access time as well, recording when the file was created, and when it was last accessed (not necessarily changed). Again, these are all stored in the inode. (On traditional Unix file systems the third timestamp, ctime, actually records the last change to the inode itself; a true creation time appeared later, and only on some file systems.) The operating system is responsible for maintaining all these times. It sets the creation time when the file is created, and updates the access and modification times when the file is viewed or changed, respectively.
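If you are curious what your own system records, a short Python sketch can read the timestamps out of the inode. The function name file_times is mine. Note the assumption spelled out in the comment: on Unix, Python's st_ctime is the time of the last inode change, not the creation time.

```python
import os
import time

def file_times(path):
    """Return the three classic inode timestamps as readable local-time strings."""
    info = os.stat(path)
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "accessed": time.strftime(fmt, time.localtime(info.st_atime)),
        "modified": time.strftime(fmt, time.localtime(info.st_mtime)),
        # On Unix, st_ctime is the last change to the inode, not the creation time.
        "inode changed": time.strftime(fmt, time.localtime(info.st_ctime)),
    }

if __name__ == "__main__":
    for label, when in file_times(__file__).items():
        print(f"{label:14} {when}")
```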

In a corporate setting, one computer may have many users. A file now belongs to the person who created it. This requires yet another field in the inode, the owner of the file. When a file is "given" to someone else, via the chown (change owner) command, the inode is updated, reflecting the new owner. It was my file, now it's yours. Once again the content of the file is not affected in any way.

Users naturally clump together into groups: marketing, human resources, finance, product development, sales, management, etc., and it is useful to assign files to groups as well as users. Thus the spreadsheet that holds accounts payable for the month of August is owned by a particular employee, and by the finance group.

Now that files belong to owners and groups, some security measures are possible. Each file can be read, or modified, by everyone, by members of the group, or by the owner. These permission bits live in the inode, and the operating system checks these bits, and the owner and group of the file, and who you are, and the group or groups you belong to, before you are allowed to look at or change the file. It's all part of the growing inode and the operating system's interaction with that inode.
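The owner, group, and permission bits can be read back with a few lines of Python. This is a sketch for a Unix-like system; the function name describe_access is my own invention. The permission string it returns is the same one that ls -l prints.

```python
import os
import stat

def describe_access(path):
    """Return (owner uid, group gid, permission string) as recorded in the inode."""
    info = os.stat(path)
    return info.st_uid, info.st_gid, stat.filemode(info.st_mode)

if __name__ == "__main__":
    uid, gid, perms = describe_access(__file__)
    print(f"owner {uid}, group {gid}, permissions {perms}")
```

A string such as -rw-r----- reads left to right: the owner may read and write, members of the group may read, and everyone else is shut out.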

All of the above was established by 1980, but hang on, because “My little party's just beginning.”

Someone thought it would be a clever idea to link a file into many locations in the directory tree, under possibly different names. If Lucy and Linus are both working on a book report on Peter Rabbit as a joint project, they might share one file, with one inode describing this file, linked under the two names /home/Lucy/homework/PeterRabbit.doc and /home/Linus/expositions/peter-thesis.doc. After the paper has been turned in for a grade, Lucy would just as soon forget the whole thing. She deletes it, but the computer cannot free up the disk space and use it for something else, because Linus still has the file linked into his directory. He wants to keep the paper forever. Each time a file is deleted, the computer could search the entire disk to see if anyone else is still using the file, but that is terribly inefficient. Instead, each inode has a reference count that records the number of links to this file. The aforementioned book report has a reference count of 2, until Lucy deletes it, at which point the count drops to 1. If Linus also deletes the file, the count goes to 0 and the disk space is freed, and can be used for something else.
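The reference count is easy to watch in action. The Python sketch below replays the Lucy and Linus story in a scratch directory, assuming a file system that supports hard links; both names live in one folder here purely for convenience, and the function name link_counts is mine.

```python
import os
import tempfile

def link_counts():
    """Link one file under two names, delete one name, and track the link count."""
    with tempfile.TemporaryDirectory() as folder:
        lucy = os.path.join(folder, "PeterRabbit.doc")
        linus = os.path.join(folder, "peter-thesis.doc")
        with open(lucy, "w") as f:
            f.write("Once upon a time there were four little rabbits.\n")
        counts = [os.stat(lucy).st_nlink]        # only Lucy's name so far
        os.link(lucy, linus)                     # second name, same inode
        counts.append(os.stat(lucy).st_nlink)
        same_inode = os.stat(lucy).st_ino == os.stat(linus).st_ino
        os.remove(lucy)                          # Lucy deletes her name
        counts.append(os.stat(linus).st_nlink)
        with open(linus) as f:
            survived = f.read()                  # still readable via Linus's name
    return counts, same_inode, survived

if __name__ == "__main__":
    counts, same_inode, survived = link_counts()
    print("link counts:", counts, "same inode:", same_inode)
```

Notice that "deleting" a file is really just removing one name; the operating system frees the blocks only when the last name is gone.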

Next we turn to files that aren't files. If the inode declares a file to be "character special", then it is not a file at all, but a stream of data. Such a file has a major number and a minor number. When a program opens the file for reading, the operating system sees that the file is character special, looks at the major and minor numbers, and reads data from the corresponding device, as though the data were coming in from a file. For example, suppose you are recording music from a microphone. On my computer, the audio capture program would open /dev/snd/pcmC0D0c, which has major number 116 (indicating sound card) and minor number 5 (indicating digital audio input). The inode doesn't reserve space on the disk; it only needs the major and minor numbers, and the operating system does the rest.

Another special file is "block special", which refers to a disk drive, or disk partition, or equivalent, on the computer. My primary disk is /dev/sda, major 8 minor 0, and my secondary disk is /dev/sdb, major 8 minor 16. Such "files" should not be accessed by anything other than privileged system programs. If you write to these files, you will place data in random locations on the disk, and probably trash the computer. Other special file types are fifo and socket. These are all controlled by the inode.
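The file type, and the major and minor numbers of a device, can be read from the inode without ever opening the file. Here is a Python sketch for a Unix-like system; the function name describe_file_type is my own. Reading the inode this way is harmless, unlike writing to the device.

```python
import os
import stat

def describe_file_type(path):
    """Classify a path by the file type recorded in its inode."""
    info = os.lstat(path)   # lstat, so a symbolic link is reported as itself
    mode = info.st_mode
    if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
        kind = "character special" if stat.S_ISCHR(mode) else "block special"
        major, minor = os.major(info.st_rdev), os.minor(info.st_rdev)
        return f"{kind}, major {major} minor {minor}"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    return "regular file"

if __name__ == "__main__":
    for path in ("/dev/null", "/dev/sda", "/tmp"):
        if os.path.lexists(path):
            print(path, "->", describe_file_type(path))
```

The paths in the usage lines are only examples; a given machine may not have all of them, which is why the loop checks first.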

With the advent of the second extended file system in Linux, a series of new attributes could be assigned to each file. These go beyond the scope of the preexisting commands ls -l, chmod, and chown. New commands lsattr and chattr were added to read and change these attributes respectively. These new attributes, managed by the inode, must be interpreted by the operating system, else they are just bits taking up space. Beyond this, new kernel primitives must be written to set and clear these attributes, with supporting user space commands and the attendant documentation. Whenever you enhance or otherwise tamper with the inode, many other software changes must take place in parallel, or it doesn't all fit together. Below are some of the new attributes introduced by the ext2 file system. Much of this comes from the chattr manual page. These attributes are boolean, on or off, yes or no. This is a partial list for illustration purposes; there are many more. Be aware that, according to the same manual page, a few of them, including Compress, Zero, and Undelete, are not honored by the ext2, ext3, and ext4 file systems in current mainline kernels.

Append

This file can only be opened in append mode. Data can be added to the end of the file, but data that is already present cannot be changed. This is appropriate for log files.

Atime

The file's access time is not updated when the file is viewed. Access times are often unimportant, so they can be disabled on a per-file basis.

Compress

This file is automatically compressed on disk to save space. A read from this file returns uncompressed data. A write to this file compresses the data before storing it on the disk.

Huge

This is, or at one time was, a huge file, larger than 2 terabytes. Such a file is stored on the disk in a different way.

Immutable

This file cannot be deleted or changed in any way, nor can it be linked from any other place. This is used for critical system files and programs that should not be changed. Of course a complete system update may in fact have to change some of these files, and we have to allow for that. A superuser program can clear the Immutable attribute, update the file, and set the Immutable attribute again.

Zero

When this file is deleted, zero out all its data before putting the blocks back on the free list. This is usually considered a waste of time, but if the information in the file is sensitive, you might want to zero it out before those blocks are free on the disk, where they could be allocated and read by another program, at the behest of another user in search of private information.

Undelete

When this file is deleted, save the content of the file elsewhere, so that the file can be undeleted later. Set this bit on your master's thesis, so that if you accidentally delete it you can get it back. Of course, if you accidentally overwrite it with garbage, this option won't help you at all.
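Since each of these attributes is a single bit, the whole collection fits in one flag word in the inode. The Python sketch below decodes such a word into the attribute names used above. The bit values are the ones I find in the Linux header linux/fs.h; the table and the function name decode_attributes are my own illustration. The sketch does not read the flags from a real file, which is the job of lsattr.

```python
# Bit values as defined in the Linux header linux/fs.h.
# Only the attributes described above are listed; there are many more.
ATTRIBUTES = [
    (0x00000001, "Zero"),       # secure deletion
    (0x00000002, "Undelete"),
    (0x00000004, "Compress"),
    (0x00000010, "Immutable"),
    (0x00000020, "Append"),
    (0x00000080, "Atime"),      # do not update the access time
    (0x00040000, "Huge"),
]

def decode_attributes(flags):
    """Return the names of the attributes whose bits are set in flags."""
    return [name for bit, name in ATTRIBUTES if flags & bit]

if __name__ == "__main__":
    # A hypothetical log file: append-only, with access time updates turned off.
    print(decode_attributes(0x20 | 0x80))
```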

But wait, there's more. Files can be associated with one or more capabilities. This entails yet more commands, getcap to get the capabilities of a file and setcap to set them. In 2014 I used cpio to copy an entire Linux instance from one drive to another, and virtually everything worked, except for ping. Upon further research I found that the executable file that is the ping command, /bin/ping, has to have the following capabilities.

cap_net_admin,cap_net_raw+ep

My version of cpio did not carry the capabilities across. And yet I didn't lose much. Only 6 files out of 160,000 have capabilities assigned. This makes me think the entire "capability" feature is unnecessary, and should be scrapped, or accomplished in some other way, such as a net_admin group, with ping owned by this group and having group permissions, including setgid. There must be some way to make these 6 files happy, without creating a brand new infrastructure and supporting it forevermore. That's just my opinion of course.
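For the curious: on Linux, file capabilities are kept in an extended attribute named security.capability, which may well be why an archive tool that ignores extended attributes leaves them behind. The Python sketch below checks whether a file carries that attribute. It is Linux only, since os.getxattr is not available elsewhere, and the function name file_capability_blob is mine. The value is a binary blob; decoding it into names like cap_net_raw is what getcap does.

```python
import os

def file_capability_blob(path):
    """Return the raw security.capability attribute of a file, or None."""
    try:
        return os.getxattr(path, "security.capability")
    except OSError:
        # No capabilities are set on this file, or the file system
        # does not support extended attributes at all.
        return None

if __name__ == "__main__":
    path = "/bin/ping"   # the example from the text; may differ on your system
    if os.path.exists(path):
        blob = file_capability_blob(path)
        print(path, "has capabilities" if blob else "has no capabilities set")
```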

Finally I turn to security, Security-Enhanced Linux in particular, called SELinux for short. Since its inception, Unix has had one user, denoted superuser, who can do anything, but soon a wide array of administrators needed superuser access to do their jobs, and this entailed too much risk. Under SELinux, access is compartmentalized into security contexts. Some people manage network access, some manage users and passwords, some manage disks and partitions, and so on. Personally, I find this a royal headache with little value. You can trash a computer, or access sensitive data, through any one of these subsystems. I'm not sure that all these contexts, and the management thereof, actually mitigate risk. Well I'm not a security expert, so on we go.

There are other aspects of a file that are not discussed here. I haven't addressed sticky directories, mount points, file locking, XFS attribute value pairs, or symbolic links, for example. This is merely an overview, intended to illustrate the evolving complexity of a file, a concept that was supposed to be relatively simple at the outset. Some of these features may fall out of favor, or merge with others to form a more streamlined design, but one thing's for certain: additional features will creep in. The inode will change to support new aspects of the file, and the operating system must follow along. How many bells and whistles will be attached to a file 1,000 years from now (we're gonna need a bigger inode!), and how many user commands will be needed to twiddle all these knobs? It seems overwhelming to me, but our descendants will probably take it all in stride.