Inside FAT: Data Recovery Algorithm

In 2013, there are plenty of file systems around. There are FAT, NTFS, HFS, exFAT, ext2/ext3 and many other file systems used by the many different operating systems. And yet, the oldest and simplest file system of them all is still going strong. The FAT system is aged, and has many limitations on maximum volume size and the size of a single file. This file system is rather simplistic by today’s standards. It does not offer any kind of permission management nor built-in transaction roll-back and recovery mechanisms. No built-in compression or encryption either. And yet it is very popular for many applications. The FAT system is so simple to implement, requires so little resources and imposes such a small overhead that it becomes irreplaceable for a wide range mobile applications.

The FAT is used in most digital cameras. The majority of memory cards used in media players, smartphones and tablets are formatted with the FAT. Even Android devices take memory cards formatted with the FAT system. In other words, despite its age, FAT is alive and kicking.

  Recovering  Information from FAT Volumes

If the FAT system is so popular, there must be need for data recovery tools supporting that file system. In this article we’ll be sharing experience gained during the development of a data recovery tool.

Before we go talking about the internals of the file system, let’s have a brief look at why data recovery is at all possible. As a matter of fact, the operating system (Windows, Android, or whatever system that’s used in a digital camera or media player) does not actually wipe or destroy information once a file gets deleted. Instead, the system marks a record in the file system to advertise disk space previously occupied by the file as available. The record itself is marked as deleted. This way is much faster than actually wiping disk content. It also reduces wear.

As you can see, the actual content of a file remains available somewhere on the disk. This is what allows data recovery tools to work. The question now is how to identify which sectors on the disk contain information belonging to a particular file. In order to do that, a data recovery tool could either analyze the file system or scan the content area on the disk looking for deleted files by matching the raw content against a database of pre-defined persistent signatures.

This second method is often called “signature search” or “content-aware analysis”. In forensic applications, this same approach is called “carving”. Whatever the name, the algorithms are very similar. They read the entire disk surface looking for characteristic signatures identifying files of certain supported formats. Once a known signature is encountered, the algorithm will perform a secondary check, then read and parse what appears to be the file’s header. By analyzing the header, the algorithm can determine the exact length of the file. By reading disk sectors following the beginning of the  file , the algorithm  recovers  what it assumes to be the content of a deleted  file .

If you’re following carefully, you could have already noticed several issues with this approach. It works extremely slowly, and it can only identify a finite number of known (supported) file formats. Most importantly, this approach assumes that disk sectors following the file’s header do belong to that particular file, which is not always true. Files are not always stored in a consecutive manner. Instead, the operating system can write chunks into first available clusters on the disk. As a result, the file can be fragmented into multiple pieces.  Recovering  fragmented  files  with signature search is a matter of hit or miss: short, defragmented  files  are usually recoverable without a sweat, while long, fragmented ones may not be  recovered  or may come out damaged after the recovery.

In practice, signature search does work pretty well. Most files that are of any importance to the user are documents, pictures, and other similarly small files. Granted, a lengthy video may not be  recovered , but a typical document or a JPEG image is usually sized below fragmentation threshold and  recovers  pretty well.

If, however, one needs to  recover  fragmented  files , the tool must combine information obtained from the file system and gathered during the disk scan. This, for example, allows excluding clusters that are already occupied by other files, which, as we’ll see in the next chapter, greatly improves the chance of successful recovery.

Using Information from the File System to Improve Recovery Quality

As we could see, signature search alone works great if there is no file system left on the disk, or if the file system is so badly damaged that it becomes unusable. In all other cases, information obtained from the file system can greatly improve the quality of the recovery.

Let’s take a large  file  we need to  recover . Suppose the file was fragmented (as is typical for larger files). Simply using signature search will result in only  recovering  the first fragment of the  file ; the other fragments will not  recover  correctly. It is therefore essential to determine which sectors on the disk belong to that particular file.

Windows and other operating systems determine which sectors belong to which file by enumerating records in the file system. File system records contain information about which sectors belong to which file.

Searching for a File System: the Partition System

Before analyzing the file system, we must identify and locate one first. But before we start looking for a file system, let’s look at how Windows handles partitions.

In Windows, disks are described with a partition system containing one or more tables. Each table describes a single partition. The record contains the partition’s initial address as well as its length. Partition type is also specified.

  • The hard drive is divided into three partitions with corresponding volume labels.
  • This table contains information about the type, beginning and end of each partition.

In order to locate the file system, the data recovery tool must analyze the partition table, if one is still available. But what if there is no partition table left, or what if the disk has been repartitioned, and the new partition table no longer contains information about the deleted volume? If this is the case, the tool will scan the disk in order to identify all available file systems.

When looking for a file system, the algorithm assumes that each partition contained a file system. Most file systems can be identified by looking for a certain persistent signature. For an instance, the FAT file system is identified by values recorded in the 510th and 511th bytes of the initial sectors. If the values recorded in those addresses are “0x55” and “0xaa”, the tool will start performing a secondary check.

The secondary check allows the tool ensuring that the actual file system is found as opposed to random encounters. The secondary check validates certain values used by the file system. For example, one of the records available in the FAT system identifies the number of sectors contained in the cluster. This value is always represented with a power of two. It can be 1, 2, 4, 8, 16, 32, 64 or 128. If there is any other value stored by that address, the structure is not a file system.

Now when we found the file system, we can start analyzing its records. Our goal is identifying addresses of the physical sectors on the disk that contain data belonging to a deleted file. In order to do that, a data recovery algorithm will scan the file system and enumerate its records.

In the FAT system, each file and directory has a corresponding record in the file system, a so-called directory entry. Directory entries contain information about the file including its name, attributes, initial address and length.

The content of a file or directory is stored in data blocks of equal length. These data blocks are called clusters. Each cluster contains a certain number of disk sectors. This number is a fixed value for each FAT volume. It’s recorded in the corresponding file system structure.

The tricky part is when a file or directory contains more than a single cluster. Subsequent clusters are identified with data structures called FAT (File Allocation Table). These structures are used to identify subsequent clusters that belong to a certain file, and to identify if a particular cluster is occupied or available.

Before analyzing the file system, it is essential to identify the three system areas.

  • The first area is reserved; it contains essential information about the file system. In FAT12 and FAT16, this area is one sector long. FAT32 can use more than one sector. The size of this area is specified in the boot sector.
  • The second area belongs to the FAT system, and contains primary and secondary structures of the file system. This area immediately follows the reserved area. Its size is defined by the size and number of FAT structures.
  • Finally, the last area contains the actual data. The content of files and directories is stored in this particular area.

When analyzing the file system, the FAT area will be of principal interest. It is this area that contains information on files’ physical addresses on the disk.

When analyzing the file system, it is essential co correctly determine the three system areas. The reserved area always begins at the very beginning of the file system (sector number 0). The size of this area is specified in the boot sector. In FAT12 and FAT16 the size of this area is exactly one sector. In FAT32, this area may occupy several sectors.

The FAT area immediately follows the reserved area. The FAT area contains one or more FAT structures. The size of this area is calculated by multiplying the number of FAT structures by the size of each structure. These values are also stored in the boot sector.

 Recovering   Files 

We’re finally close to  recovering  our first  file . Let’s assume the file has been just recently deleted, and no part of the file was overwritten with other data. This means that all clusters previously used by this file are now marked as available.

It is important to note that the system also can erases the corresponding FAT records. This means that we can get information about the file’s initial address, its attributes and size, but have no way to obtain data on any subsequent clusters.

At this point, we cannot  recover  the entire list of clusters that belong to the deleted  file . However, we can still try to  recover  the  file’s  content by reading the first cluster. If the file is reasonably small and fits into a single cluster, great! We’ve just  recovered  the  file . If, however, the  file  is larger than the size of a single cluster, we have to develop an algorithm to  recover  the rest of the  file .

The FAT system offers no easy way to determine which clusters belong to a deleted file, so this task is always a bit of a guessing game. The simplest way is just reading the clusters following the initial one, ignoring whether or not these clusters are occupied by other files. However silly it may sound, this is the only method available if no file system is available or if the file system is empty (e.g. after formatting the disk).

The other method is more sophisticated, only reading information from clusters that aren’t occupied with data belonging to other files. This method takes into account information on clusters occupied by other files as specified in the file system.

It is logical to assume that the second method yields better results compared to the first one (assuming that the file system is available and not empty). The second method can even  recover  some fragmented  files .

We have three different scenarios of  recovering  a  file  occupying 6 clusters of the file system. The file size is 7094 bytes; cluster size is 2048 bytes. This means that the deleted file initially occupied 4 clusters. In addition, we know the address of the initial cluster (cluster 56). Red color marks clusters occupied with other data, while empty clusters are filled white.

  • In scenario A, the file occupies 4 subsequent clusters (that is, the file is not fragmented). In this case, the  file  can be  recovered  correctly by either algorithm. Both algorithms will correctly read clusters 56 through 59.
  • In scenario B, the file was fragmented and stored in 3 fragments. Clusters 57 and 60 are used by other file. In this scenario, the first algorithm will  recover  clusters 56 through 59, which will return a corrupted  file . The second method will correctly  recover  clusters 56, 58, 59 and 61.
  • In the final scenario C, the deleted file was also fragmented (same clusters as in scenario B). However, clusters 57 and 60 are not used by any other file. In this scenario, both algorithms will  recover  clusters 56 through 59, both returning a corrupted  file .

As we can see, neither method is perfect, but the second algorithm offers a higher chance of successful recovery compared to the first one.

In our simple scenario we assumed that all parts of the file are still available and not overwritten with other data. In real life, this is not always the case. If some parts of a  file  are taken by other  files , no algorithm will be able to  recover  the  file  completely.



Source by Michael Miroshnichenko