Journaling file system

For the IBM Journaled File System, see JFS (file system).

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the intentions of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted.^[1]^[2]

Depending on the implementation, a journaling file system may keep track of stored metadata only, which improves performance at the expense of an increased possibility of data corruption. Alternatively, a journaling file system may track both stored data and related metadata, and some implementations allow this behavior to be selected.^[3]

Rationale

Updating file systems to reflect changes to files and directories usually requires many separate write operations. This makes it possible for an interruption (like a power failure or system crash) between writes to leave data structures in an invalid intermediate state.^[1]

For example, deleting a file on a Unix file system involves three steps:^[4]

1. Removing its directory entry.
2. Releasing the inode to the pool of free inodes.
3. Returning all used disk blocks to the pool of free disk blocks.

If a crash occurs after step 1 and before step 2, there will be an orphaned inode and hence a storage leak. On the other hand, if step 2 is performed first and a crash occurs before step 1, the not-yet-deleted file will be marked free and may be overwritten by something else.

Detecting and recovering from such inconsistencies normally requires a complete walk of the file system's data structures, for example by a tool such as fsck (the file system checker).^[2] This must typically be done before the file system is next mounted for read-write access. If the file system is large and there is relatively little I/O bandwidth, this can take a long time and result in longer downtime if it blocks the rest of the system from coming back online.
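The deletion example and the consistency walk can be illustrated with a small in-memory model. This is only a sketch: the dictionary layout, the function names and the crash_after parameter are hypothetical and do not correspond to any real file system or to how fsck is implemented.

```python
def new_fs():
    """Toy file system: one file 'a.txt' using inode 7 and blocks 100, 101."""
    return {
        "dir": {"a.txt": 7},          # directory: name -> inode number
        "inodes": {7: [100, 101]},    # allocated inodes: number -> data blocks
        "free_inodes": set(),
        "free_blocks": set(),
    }

def delete_file(fs, name, crash_after=None):
    """Delete `name` in three steps; stop early to simulate a crash."""
    inode = fs["dir"][name]
    blocks = fs["inodes"][inode]
    del fs["dir"][name]                  # step 1: remove the directory entry
    if crash_after == 1:
        return
    del fs["inodes"][inode]              # step 2: release the inode
    fs["free_inodes"].add(inode)
    if crash_after == 2:
        return
    fs["free_blocks"].update(blocks)     # step 3: release the data blocks

def orphaned_inodes(fs):
    """Full walk: allocated inodes that no directory entry refers to."""
    return set(fs["inodes"]) - set(fs["dir"].values())

crashed = new_fs()
delete_file(crashed, "a.txt", crash_after=1)
leaked = orphaned_inodes(crashed)        # inode 7 is allocated but unreachable

clean = new_fs()
delete_file(clean, "a.txt")
```

Note that orphaned_inodes has to look at every inode and every directory entry, which is why such a check scales with the size of the file system rather than with the amount of interrupted work.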

To prevent this, a journaled file system allocates a special area—the journal—in which it records the changes it will make ahead of time. After a crash, recovery simply involves reading the journal from the file system and replaying changes from this journal until the file system is consistent again. The changes are thus said to be atomic (not divisible) in that they either succeed (succeeded originally or are replayed completely during recovery), or are not replayed at all (are skipped because they had not yet been completely written to the journal before the crash occurred).
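The replay rule described above can be sketched as follows. This is a minimal in-memory model under stated assumptions: the JournaledStore class, its record tuples and the explicit commit record are illustrative choices, not the format of any particular file system.

```python
class JournaledStore:
    """Toy block store with a write-ahead journal (everything in memory)."""

    def __init__(self):
        self.blocks = {}    # "main file system": block number -> bytes
        self.journal = []   # ("write", txid, block, data) or ("commit", txid)

    def log_transaction(self, txid, writes, commit=True):
        """Record intended writes; commit=False simulates a crash mid-logging."""
        for block, data in writes.items():
            self.journal.append(("write", txid, block, data))
        if commit:
            self.journal.append(("commit", txid))

    def replay(self):
        """Apply only transactions whose commit record reached the journal."""
        committed = {rec[1] for rec in self.journal if rec[0] == "commit"}
        for rec in self.journal:
            if rec[0] == "write" and rec[1] in committed:
                _, _, block, data = rec
                self.blocks[block] = data
        self.journal.clear()

store = JournaledStore()
store.log_transaction(1, {10: b"dir entry removed", 11: b"inode freed"})
store.log_transaction(2, {12: b"half-logged"}, commit=False)  # crash here
store.replay()
```

After replay, transaction 1 has been applied in full and transaction 2 has left no trace, which is the all-or-nothing behavior the text describes.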

Techniques

Some file systems allow the journal to grow, shrink and be re-allocated just as a regular file, while others put the journal in a contiguous area or a hidden file that is guaranteed not to move or change size while the file system is mounted. Some file systems may also allow external journals on a separate device, such as a solid-state drive or battery-backed non-volatile RAM. Changes to the journal may themselves be journaled for additional redundancy, or the journal may be distributed across multiple physical volumes to protect against device failure.

The internal format of the journal must guard against crashes while the journal itself is being written to. Many journal implementations (such as the JBD2 layer in ext4) bracket every change logged with a checksum, on the understanding that a crash would leave a partially written change with a missing (or mismatched) checksum that can simply be ignored when replaying the journal at next remount.
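As an illustration of checksummed journal records, the sketch below frames each record with its length and a CRC32, and stops reading at the first record that is truncated or fails verification. The framing is an assumption made for this example and is not the JBD2 on-disk format.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of the payload

def encode_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def read_valid_records(journal: bytes):
    """Return payloads up to the first torn or corrupt record."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(journal):
        length, crc = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = journal[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partially written record: ignore it and everything after
        records.append(payload)
        offset = start + length
    return records

journal = encode_record(b"update inode 7") + encode_record(b"update bitmap")
torn = journal + encode_record(b"update dir")[:-3]  # crash cut the last write
recovered = read_valid_records(torn)
```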

Physical journals

A physical journal logs an advance copy of every block that will later be written to the main file system. If there is a crash when the main file system is being written to, the write can simply be replayed to completion when the file system is next mounted. If there is a crash when the write is being logged to the journal, the partial write will have a missing or mismatched checksum and can be ignored at next mount.

Physical journals impose a significant performance penalty because every changed block must be committed twice to storage, but may be acceptable when absolute fault protection is required.^[5]

Logical journals

A logical journal stores only changes to file metadata, and trades fault tolerance for substantially better write performance.^[6] A file system with a logical journal still recovers quickly after a crash, but may allow unjournaled file data and journaled metadata to fall out of sync with each other, causing data corruption.

For example, appending to a file may involve three separate writes to:

1. The file's inode, to note in the file's metadata that its size has increased.
2. The free space map, to mark out an allocation of space for the to-be-appended data.
3. The newly allocated space, to actually write the appended data.

In a metadata-only journal, step 3 would not be logged. If step 3 was not completed before the crash, but steps 1 and 2 are replayed during recovery, the file will appear to have been appended with garbage.
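The append hazard can be reproduced with a toy model. Here the disk is a byte array whose tail still holds stale bytes from an earlier file, and only the size change was journaled. The names disk, inode and replay_metadata are hypothetical and exist only for this sketch.

```python
# Five bytes of real file data followed by stale bytes left by an old file.
disk = bytearray(b"hello" + b"\xaa" * 5)
inode = {"size": 5}

# The metadata update (new size) reached the journal, but the crash happened
# before the appended data itself was written to the newly allocated space.
journal = [("set_size", 10)]

def replay_metadata(journal, inode):
    """Replay journaled metadata changes; file data is not in the journal."""
    for op, value in journal:
        if op == "set_size":
            inode["size"] = value

replay_metadata(journal, inode)
contents = bytes(disk[:inode["size"]])  # file now exposes the stale bytes
```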

Write hazards

The write cache in most operating systems sorts its writes (using the elevator algorithm or some similar scheme) to maximize throughput. To avoid an out-of-order write hazard with a metadata-only journal, writes for file data must be sorted so that they are committed to storage before their associated metadata. This can be tricky to implement because it requires coordination within the operating system kernel between the file system driver and write cache. An out-of-order write hazard can also exist if the underlying storage cannot write blocks atomically, or does not honor requests to flush its write cache.
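A simplified view of the ordering constraint is sketched below. It assumes pending writes are tagged as data or metadata and takes the most conservative approach of issuing every data write before any metadata write, while still sorting each group by block number. Real kernels track dependencies at a finer granularity; the flush_order function and the tuple layout are hypothetical.

```python
def flush_order(pending):
    """Order writes so all data reaches storage before any metadata.

    `pending` is a list of (kind, block) tuples, where kind is "data" or
    "meta". Within each kind, writes are sorted by block number as a
    stand-in for elevator sorting.
    """
    return sorted(pending, key=lambda w: (w[0] != "data", w[1]))

pending = [("meta", 3), ("data", 90), ("meta", 1), ("data", 40)]

# Pure elevator order ignores the dependency and writes metadata first.
naive = sorted(pending, key=lambda w: w[1])

# Constrained order keeps data ahead of the metadata that describes it.
safe = flush_order(pending)
```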

To complicate matters, many mass storage devices have their own write caches, in which they may aggressively reorder writes for better performance. (This is particularly common on magnetic hard drives, which have large seek latencies that can be minimized with elevator sorting.) Some journaling file systems conservatively assume such write-reordering always takes place, and sacrifice performance for correctness by forcing the device to flush its cache at certain points in the journal (called barriers in ext3 and ext4).^[7]

Alternatives

Soft updates

Some UFS implementations avoid journaling and instead implement soft updates: they order their writes in such a way that the on-disk file system is never inconsistent, or that the only inconsistency that can be created in the event of a crash is a storage leak. To recover from these leaks, the free space map is reconciled against a full walk of the file system at next mount. This garbage collection is usually done in the background.^[8]

Log-structured file systems

In log-structured file systems, the write-twice penalty does not apply because the journal itself is the file system: it occupies the entire storage device and is structured so that it can be traversed like a normal file system.

Copy-on-write file systems

Full copy-on-write file systems (such as ZFS and Btrfs) avoid in-place changes to file data by writing out the data in newly allocated blocks, followed by updated metadata that would point to the new data and disown the old, followed by metadata pointing to that, and so on up to the superblock, or the root of the file system hierarchy. This has the same correctness-preserving properties as a journal, without the write-twice overhead.
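The copy-on-write update sequence can be sketched with a two-level toy structure: data blocks, one directory block, and a root pointer standing in for the superblock. The CowStore class and its crash_before_root_update flag are assumptions for illustration and do not reflect how ZFS or Btrfs are implemented.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never modified in place."""

    def __init__(self):
        self.blocks = {}     # block id -> content
        self.next_id = 0
        self.root = None     # "superblock": id of the current directory block

    def _alloc(self, content):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = content
        return block_id

    def write_file(self, name, data, crash_before_root_update=False):
        data_id = self._alloc(data)            # 1. data goes to a new block
        old_dir = self.blocks[self.root] if self.root is not None else {}
        new_dir = dict(old_dir)
        new_dir[name] = data_id
        dir_id = self._alloc(new_dir)          # 2. new metadata points to it
        if crash_before_root_update:
            return                             # old tree is still intact
        self.root = dir_id                     # 3. single pointer switch

    def read_file(self, name):
        return self.blocks[self.blocks[self.root][name]]

store = CowStore()
store.write_file("a", b"v1")
store.write_file("a", b"v2", crash_before_root_update=True)
after_crash = store.read_file("a")   # still the old, consistent version
store.write_file("a", b"v3")
after_update = store.read_file("a")
```

Because the old blocks stay untouched until the root pointer changes, an interrupted update leaves only unreferenced blocks behind, never a half-modified tree.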

References

1. Jones, M Tim (2008-06-04), Anatomy of Linux journaling file systems, IBM DeveloperWorks, retrieved 2009-04-13
2. Arpaci-Dusseau, Remzi H.; Arpaci-Dusseau, Andrea C. (2014-01-21), Crash Consistency: FSCK and Journaling (PDF), Arpaci-Dusseau Books
3. "tune2fs(8) – Linux man page". linux.die.net. Retrieved February 20, 2015.
4. File Systems from Tanenbaum, A.S. (2008). Modern operating systems (3rd ed., p. 287). Upper Saddle River, NJ: Prentice Hall.
5. Tweedie, Stephen (2000), "Ext3, journaling filesystem", Proceedings of the Ottawa Linux Symposium: 24–29
6. Prabhakaran, Vijayan; Arpaci-Dusseau, Andrea C; Arpaci-Dusseau, Remzi H, "Analysis and Evolution of Journaling File Systems" (PDF), 2005 USENIX Annual Technical Conference, USENIX Association.
7. Corbet, Jonathan (2008-05-21), Barriers and journaling filesystems, retrieved 2010-03-06
8. Seltzer, Margo I; Ganger, Gregory R; McKusick, M Kirk, "Journaling Versus Soft Updates: Asynchronous Meta-data Protection in File Systems", 2000 USENIX Annual Technical Conference, USENIX Association.

This article is issued from Wikipedia - version of the 11/3/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.