xfs.org - User contributions [en]

Getting the latest source code

2016-10-18T23:43:06Z

Dgc:

== XFS Released/Stable source ==

Note: as of September 2016, the XFS project is moving away from oss.sgi.com infrastructure. As we move to other infrastructure the links below will be updated to point to the new locations.

* '''Mainline kernels'''
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.

* '''Vendor kernels'''
:All modern Linux distributions include support for XFS.

* '''XFS userspace'''
:[https://kernel.org/pub/linux/utils/fs/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.

== Development and bleeding edge Development ==

* [[XFS git howto]]

=== Current XFS kernel source ===

* [https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/ xfs]

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git

Note: the old kernel tree on [http://oss.sgi.com/cgi-bin/gitweb.cgi oss.sgi.com] is no longer kept up to date with the master tree on kernel.org.

=== XFS user space tools ===
* [https://git.kernel.org/cgit/fs/xfs/xfsprogs-dev.git/ xfsprogs ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git

A few packages are needed to compile <tt>xfsprogs</tt>, depending on your package manager:

apt-get install libtool automake gettext libblkid-dev uuid-dev
yum install libtool automake gettext libblkid-devel libuuid-devel
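
As a rough sketch, the two commands above can be wrapped in a small helper that prints the matching install command for the package manager found on the system. The function names <tt>xfsprogs_deps_cmd</tt> and <tt>detect_pkg_manager</tt> are hypothetical; the package lists are copied from the commands above, and only apt-get and yum are considered.

```shell
#!/bin/sh
# Hypothetical helper: print the dependency install command for a package
# manager. Package names are taken verbatim from the commands above.
xfsprogs_deps_cmd() {
    case "$1" in
        apt-get) echo "apt-get install libtool automake gettext libblkid-dev uuid-dev" ;;
        yum)     echo "yum install libtool automake gettext libblkid-devel libuuid-devel" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Report the first supported package manager found in PATH.
# Assumption: only apt-get and yum are checked.
detect_pkg_manager() {
    for pm in apt-get yum; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}
```

Running <tt>xfsprogs_deps_cmd "$(detect_pkg_manager)"</tt> prints the command to review and then run as root.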

=== XFS dump ===
* [https://git.kernel.org/cgit/fs/xfs/xfsdump-dev.git/ xfsdump ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfsdump-dev.git

=== XFS tests ===
* [https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/ xfstests ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

=== DMAPI user space tools ===
* [https://git.kernel.org/cgit/fs/xfs/dmapi-dev.git/ dmapi ]

git clone git://git.kernel.org/pub/scm/fs/xfs/dmapi-dev.git

=== git-cvsimport generated trees ===

The Git trees are automated mirrored copies of the CVS trees using [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].
Since git-cvsimport utilized the tool [http://www.cobite.com/cvsps/ cvsps] to recreate the atomic commits of ptools or "mod" it is easier to see the entire change that was committed using git.

* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]

Before building in the <tt>xfsdump</tt> or <tt>dmapi</tt> directories (after building <tt>xfsprogs</tt>), you will need to run:
# cd xfsprogs
# make install-dev
to create <tt>/usr/include/xfs</tt> and install appropriate files there.

Before building in the xfstests directory, you will need to run:
# cd xfsprogs
# make install-qa
to install a somewhat larger set of files in <tt>/usr/include/xfs</tt>.
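
The ordering described above can be sketched as a helper that prints, rather than runs, the steps for a given component. This is only an illustration: the function names are hypothetical, the directory names follow the text above (a fresh clone may be named differently, e.g. with a <tt>-dev</tt> suffix), and it assumes each tree builds with a plain <tt>make</tt>.

```shell
#!/bin/sh
# Hypothetical helper: name the xfsprogs make target that must be installed
# before building a given component, per the notes above.
xfsprogs_prereq_target() {
    case "$1" in
        xfsdump|dmapi) echo "install-dev" ;;
        xfstests)      echo "install-qa" ;;
        xfsprogs)      echo "" ;;   # xfsprogs itself has no such prerequisite
        *)             echo "unknown component: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the build steps for a component checked out next to
# xfsprogs. Assumption: a plain "make" builds each tree; the install
# targets need root, as the "#" prompts above indicate.
print_build_steps() {
    target=$(xfsprogs_prereq_target "$1") || return 1
    if [ -n "$target" ]; then
        echo "(cd xfsprogs && make && make $target)"
    fi
    echo "(cd $1 && make)"
}
```

For example, <tt>print_build_steps xfstests</tt> lists the xfsprogs build and <tt>make install-qa</tt> before the xfstests build itself.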

== XFS cvs trees ==

The cvs trees were created using a script that converted sgi's internal
ptools repository to a cvs repository, so the cvs trees were considered read only.

At this point all new development is being managed by the git trees thus the cvs trees
are no longer active in terms of current development and should only be used
for reference.

* [[XFS CVS howto]]

Getting the latest source code

2016-08-30T01:45:30Z

Dgc:

== XFS Released/Stable source ==

Note: as of September 2016, the XFS project is moving away from oss.sgi.com infrastructure. As we move to other infrastructure the links below will be updated to point to the new locations.

* '''Mainline kernels'''
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.

* '''Vendor kernels'''
:All modern Linux distributions include support for XFS.

* '''XFS userspace'''
:[ftp://oss.sgi.com/projects/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.

== Development and bleeding edge Development ==

* [[XFS git howto]]

=== Current XFS kernel source ===

* [https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/ xfs]

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git

Note: the old kernel tree on [http://oss.sgi.com/cgi-bin/gitweb.cgi oss.sgi.com] is no longer kept up to date with the master tree on kernel.org.

=== XFS user space tools ===
* [https://git.kernel.org/cgit/fs/xfs/xfsprogs-dev.git/ xfsprogs ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git

A few packages are needed to compile <tt>xfsprogs</tt>, depending on your package manager:

apt-get install libtool automake gettext libblkid-dev uuid-dev
yum install libtool automake gettext libblkid-devel libuuid-devel

=== XFS dump ===
* [https://git.kernel.org/cgit/fs/xfs/xfsdump-dev.git/ xfsdump ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfsdump-dev.git

=== XFS tests ===
* [https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/ xfstests ]

git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git

=== DMAPI user space tools ===
* [https://git.kernel.org/cgit/fs/xfs/dmapi-dev.git/ dmapi ]

git clone git://git.kernel.org/pub/scm/fs/xfs/dmapi-dev.git

=== git-cvsimport generated trees ===

The Git trees are automated mirrored copies of the CVS trees using [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].
Since git-cvsimport utilized the tool [http://www.cobite.com/cvsps/ cvsps] to recreate the atomic commits of ptools or "mod" it is easier to see the entire change that was committed using git.

* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]

Before building in the <tt>xfsdump</tt> or <tt>dmapi</tt> directories (after building <tt>xfsprogs</tt>), you will need to run:
# cd xfsprogs
# make install-dev
to create <tt>/usr/include/xfs</tt> and install appropriate files there.

Before building in the xfstests directory, you will need to run:
# cd xfsprogs
# make install-qa
to install a somewhat larger set of files in <tt>/usr/include/xfs</tt>.

== XFS cvs trees ==

The cvs trees were created using a script that converted sgi's internal
ptools repository to a cvs repository, so the cvs trees were considered read only.

At this point all new development is being managed by the git trees thus the cvs trees
are no longer active in terms of current development and should only be used
for reference.

* [[XFS CVS howto]]

XFS email list and archives

2016-08-30T01:35:55Z

Dgc: moving mailing list to vger.kernel.org

== XFS email list ==
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]

As of September 2016, the XFS mailing list will move from the long-standing address of xfs@oss.sgi.com because of the prospective shutdown of the oss.sgi.com infrastructure. See below for links to the old archives.

Current archives of the linux-xfs@vger.kernel.org list can be found at

* [http://www.spinics.net/lists/linux-xfs/ Spinics]

== Subscribing to the list ==

Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].

Subscribing is '''only''' possible by sending an email with the body:

<pre>subscribe linux-xfs</pre>

to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]

== Old XFS archives ==

The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).

Other archives include:

* [https://www.spinics.net/lists/xfs/ Spinics]
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber]

Host Aware SMR architecture

2015-03-16T05:49:12Z

Dgc:

The first proposal for optimising XFS for host aware SMR drives can be found in the upstream XFS documentation repository [https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc here].

A pdf version of this document (built from commit 1708324fdd1d37619db316d7023b7115837ae39d) can be found here: [[File:Xfs-smr-structure-0.2.pdf]].

File:Xfs-smr-structure-0.2.pdf

2015-03-16T05:45:56Z

Dgc: XFS SMR architecture proposal v0.2

XFS SMR architecture proposal v0.2

Host Aware SMR architecture

2015-03-16T05:44:38Z

Dgc:

The first proposal for optimising XFS for host aware SMR drives can be found in the upstream XFS documentation repository [https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc here].

A pdf version of this document (built from commit 1708324fdd1d37619db316d7023b7115837ae39d) is attached here.

Host Aware SMR architecture

2015-03-16T05:42:52Z

Dgc: Created page with "Host Aware SMR Architecture The first proposal for optimising XFS for host aware canbe found in the upstream XFS documentation repository [https://git.kernel.org/cgit/fs/xfs/..."

Host Aware SMR Architecture

The first proposal for optimising XFS for host aware SMR drives can be found in the upstream XFS documentation repository [https://git.kernel.org/cgit/fs/xfs/xfs-documentation.git/tree/design/xfs-smr-structure.asciidoc here].

Ideas for XFS

2015-03-16T05:37:55Z

Dgc: add link for new document

= Future Directions for XFS =

Dave Chinner's ideas from 2008:

* [[Improving inode Caching]]

* [[Improving Metadata Performance By Reducing Journal Overhead]]

* [[Reliable Detection and Repair of Metadata Corruption]]

Other ideas:

* [[Splitting project quota support from group quota support]]

* [[Assigning project quota to a linux container]]

* [[Support discarding of unused sectors]] (status: completed)

* Superblock flag for when 64-bit inodes are present (see [http://oss.sgi.com/pipermail/xfs/2009-May/041379.html xfs: regarding the inode64 mount option])

* Wishlist: Please integrate ''xfs_irecover'' or provide [http://www.who.is.free.fr/wiki/doku.php?id=recover inode recovery feature]

* [[Host Aware SMR architecture]]

Getting the latest source code

2015-01-22T04:13:41Z

Dgc: update for new master kernel source tree repo

== XFS Released/Stable source ==

* '''Mainline kernels'''
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.

* '''Vendor kernels'''
:All modern Linux distributions include support for XFS.

* '''XFS userspace'''
:[ftp://oss.sgi.com/projects/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.

== Development and bleeding edge Development ==

* [[XFS git howto]]

=== Current XFS kernel source ===
* [https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/ xfs]
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git

Note: the old kernel tree on [http://oss.sgi.com/cgi-bin/gitweb.cgi oss.sgi.com] is no longer kept up to date with the master tree on kernel.org.

=== XFS user space tools ===
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsprogs.git;a=summary xfsprogs]

git clone git://oss.sgi.com/xfs/cmds/xfsprogs

A few packages are needed to compile <tt>xfsprogs</tt>, depending on your package manager:

apt-get install libtool automake gettext libblkid-dev uuid-dev
yum install libtool automake gettext libblkid-devel libuuid-devel

=== XFS dump ===
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsdump.git;a=summary xfsdump]
$ git clone git://oss.sgi.com/xfs/cmds/xfsdump

=== XFS tests ===
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=summary xfstests]
$ git clone git://oss.sgi.com/xfs/cmds/xfstests

=== DMAPI user space tools ===
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/dmapi.git;a=summary dmapi]
$ git clone git://oss.sgi.com/xfs/cmds/dmapi

=== git-cvsimport generated trees ===

The Git trees are automated mirrored copies of the CVS trees using [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].
Since git-cvsimport utilized the tool [http://www.cobite.com/cvsps/ cvsps] to recreate the atomic commits of ptools or "mod" it is easier to see the entire change that was committed using git.

* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]

Before building in the <tt>xfsdump</tt> or <tt>dmapi</tt> directories (after building <tt>xfsprogs</tt>), you will need to run:
# cd xfsprogs
# make install-dev
to create <tt>/usr/include/xfs</tt> and install appropriate files there.

Before building in the xfstests directory, you will need to run:
# cd xfsprogs
# make install-qa
to install a somewhat larger set of files in <tt>/usr/include/xfs</tt>.

== XFS cvs trees ==

The cvs trees were created using a script that converted sgi's internal
ptools repository to a cvs repository, so the cvs trees were considered read only.

At this point all new development is being managed by the git trees thus the cvs trees
are no longer active in terms of current development and should only be used
for reference.

* [[XFS CVS howto]]

User talk:Ronan David

2014-07-08T23:47:13Z

Dgc: Welcome!

'''Welcome to ''XFS.org''!'''
We hope you will contribute much and well.
You will probably want to read the [[Help:Contents|help pages]].
Again, welcome and have fun! [[User:Dgc|Dgc]] ([[User talk:Dgc|talk]]) 23:47, 8 July 2014 (UTC)

User:Ronan David

2014-07-08T23:47:13Z

Dgc: Creating user page for new user.

I'm a project manager on a product installed on linux

User talk:Fengyongzhen

2014-04-16T20:36:43Z

Dgc: Welcome!

User:Fengyongzhen

2014-04-16T20:36:43Z

Dgc: Creating user page for new user.

try to learn how to use xfs, at least support 16T disk.

User talk:William M. Moss

2014-04-16T20:35:54Z

Dgc: Welcome!

User:William M. Moss

2014-04-16T20:35:54Z

Dgc: Creating user page for new user.

Retired
Using Unix since Lab Version 7.
Using Linux since version 1.13.

User talk:Brian Foster

2014-04-16T20:35:38Z

Dgc: Welcome!

User:Brian Foster

2014-04-16T20:35:36Z

Dgc: Creating user page for new user.

Brian is an XFS hacker at Red Hat. He wrote this second sentence to satisfy the biography length requirement.

User talk:Avishay

2013-06-15T01:14:34Z

Dgc: Welcome!

User:Avishay

2013-06-15T01:14:34Z

Dgc: Creating user page for new user.

Development Manager of Credit Cards and ATM.
Management of large scale software projects within a variety of environments, managing 18 SW engineers(3 Teams).
experience in technical analysis and evaluation, budget estimates, Project Management, recruitment and training, tool selection and implementation, configuration management, quality assurance and testing.

User talk:Benjamin Myers

2013-05-03T02:07:32Z

Dgc: Welcome!

User:Benjamin Myers

2013-05-03T02:07:32Z

Dgc: Creating user page for new user.

one two three four five six seven eight nine ten

User talk:Ben Myers

2013-05-03T02:03:54Z

Dgc: Welcome!

User:Ben Myers

2013-05-03T02:03:54Z

Dgc: Creating user page for new user.

This biography must be at least 10 words long. Ten. Eleven.

User talk:Geoffrey Wehrman

2013-05-03T02:03:16Z

Dgc: Welcome!

User:Geoffrey Wehrman

2013-05-03T02:03:16Z

Dgc: Creating user page for new user.

Filesystem developer at SGI since 1997.
Experience with XFS on IRIX and Linux.

User talk:Chandra Seetharaman

2013-05-03T02:02:50Z

Dgc: Welcome!

User:Chandra Seetharaman

2013-05-03T02:02:50Z

Dgc: Creating user page for new user.

Has been working on different part of linux kernel for more than a decade, which include filesystem too.

User talk:Mariusz Witek

2012-10-12T07:00:50Z

Dgc: Welcome!

User:Mariusz Witek

2012-10-12T07:00:50Z

Dgc: Creating user page for new user.

I am physicist working in the research institute with some computing background.

XFS FAQ

2012-08-13T23:58:57Z

Dgc: Reverted edits by Zmi (talk) to last revision by Sandeen

Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]

Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.

== Q: Where can I find documentation about XFS? ==

The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.

You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the '''<nowiki>#xfs</nowiki>''' IRC channel on ''irc.freenode.net''.

== Q: Where can I find documentation about ACLs? ==

Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/

The '''acl(5)''' manual page is also quite extensive.

== Q: Where can I find information about the internals of XFS? ==

An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.

Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.

== Q: What partition type should I use for XFS on Linux? ==

Linux native filesystem (83).

== Q: What mount options does XFS have? ==

There are a number of mount options influencing XFS filesystems - refer to the '''mount(8)''' manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])

== Q: Is there any relation between the XFS utilities and the kernel version? ==

No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.

== Q: Does it run on platforms other than i386? ==

XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It's also well tested on the IA64 platform since that's the platform SGI Linux products use.

== Q: Quota: Do quotas work on XFS? ==

Yes.

To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/ http://sourceforge.net/projects/linuxquota/] or use '''xfs_quota(8)'''.

== Q: Quota: What's project quota? ==

Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.

== Q: Quota: Can group quota and project quota be used at the same time? ==

No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.
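
As an illustration of the rule above, a mount option string can be checked before use. This is a minimal sketch: <tt>quota_opts_ok</tt> is a hypothetical name, it only looks at the plain group and project quota option spellings, and it encodes the restriction exactly as stated in this answer.

```shell
#!/bin/sh
# Hypothetical check: reject a comma-separated mount option string that
# asks for both group and project quota, which the answer above says
# cannot be combined. The "noenforce" variants are not examined here.
quota_opts_ok() {
    has_group=0
    has_project=0
    old_ifs=$IFS
    IFS=,
    for opt in $1; do
        case "$opt" in
            gquota|grpquota) has_group=1 ;;
            pquota|prjquota) has_project=1 ;;
        esac
    done
    IFS=$old_ifs
    if [ "$has_group" -eq 1 ] && [ "$has_project" -eq 1 ]; then
        return 1
    fi
    return 0
}
```

For example, <tt>quota_opts_ok "uquota,pquota"</tt> succeeds, while <tt>quota_opts_ok "gquota,pquota"</tt> fails.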

== Q: Quota: Does unmounting a filesystem with prjquota (project quota) enabled and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the filesystem (and vice versa)? ==

To be answered.

== Q: Are there any dump/restore tools for XFS? ==

'''xfsdump(8)''' and '''xfsrestore(8)''' are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.

== Q: Does LILO work with XFS? ==

This depends on where you install LILO.

Yes, for MBR (Master Boot Record) installations.

No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.

== Q: Does GRUB work with XFS? ==

There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.

== Q: Can XFS be used for a root filesystem? ==

Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the "rootflags=" kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit "logdev=" specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]

== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==

Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be "clean" when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.

== Q: Is there a way to make a XFS filesystem larger or smaller? ==

You can ''NOT'' make a XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.

An XFS filesystem may be enlarged by using '''xfs_growfs(8)'''.

If using partitions, you need to have free space after this partition to do so. Remove the partition and recreate it larger with the ''exact same'' starting point, then run '''xfs_growfs''' to make the filesystem larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.

Using XFS filesystems on top of a volume manager makes this a lot easier.
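
For illustration, assuming the volume manager is LVM, growing comes down to extending the logical volume and then growing the mounted filesystem. The sketch below only prints the two commands so they can be reviewed before being run as root; the function name and the example device, mount point and size are hypothetical.

```shell
#!/bin/sh
# Print (not run) the commands to grow an XFS filesystem that sits on an
# LVM logical volume. All three arguments are placeholders supplied by the
# caller.
print_grow_steps() {
    lv=$1       # logical volume device, e.g. /dev/vg0/data (hypothetical)
    mnt=$2      # mount point of the XFS filesystem
    extra=$3    # additional size in lvextend's -L syntax, e.g. +10G
    echo "lvextend -L $extra $lv"
    echo "xfs_growfs $mnt"
}
```

For example, <tt>print_grow_steps /dev/vg0/data /srv/data +10G</tt> prints the <tt>lvextend</tt> command followed by the <tt>xfs_growfs</tt> command for that mount point.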

== Q: What information should I include when reporting a problem? ==

What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:

* kernel version (uname -a)
* xfsprogs version (xfs_repair -V)
* number of CPUs
* contents of /proc/meminfo
* contents of /proc/mounts
* contents of /proc/partitions
* RAID layout (hardware and/or software)
* LVM configuration
* type of disks you are using
* write cache status of drives
* size of BBWC and mode it is running in
* xfs_info output on the filesystem in question
* dmesg output showing all error messages and stack traces
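
Much of the list above can be gathered automatically. The following is a minimal sketch, not an official tool: the function name and the section headers are made up for illustration, and the items that need site knowledge (RAID layout, LVM configuration, disk types, write cache status, BBWC size and mode) still have to be added by hand.

```shell
#!/bin/sh
# Sketch: gather the automatically collectable items from the list above
# into one text file. Commands that are missing or not permitted leave a
# short note instead of aborting the collection.
collect_xfs_report() {
    out=$1          # output file
    mnt=${2:-}      # mount point of the affected filesystem (optional)
    {
        echo "== uname -a =="
        uname -a
        echo "== xfs_repair -V =="
        xfs_repair -V 2>&1 || echo "(xfs_repair not available)"
        echo "== number of CPUs =="
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo "(unknown)"
        for f in /proc/meminfo /proc/mounts /proc/partitions; do
            echo "== $f =="
            cat "$f" 2>/dev/null || echo "(cannot read $f)"
        done
        if [ -n "$mnt" ]; then
            echo "== xfs_info $mnt =="
            xfs_info "$mnt" 2>&1 || echo "(xfs_info failed)"
        fi
        echo "== dmesg =="
        dmesg 2>&1 || echo "(dmesg not permitted)"
    } > "$out"
}
```

For example, <tt>collect_xfs_report /tmp/xfs-report.txt /mnt/data</tt> writes everything to one file that can be attached to the report; both paths are placeholders.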

Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:

# iostat -x -d -m 5
# vmstat 5

can give us insight into the IO and memory utilisation of your machine at the time of the problem.

If the filesystem is hanging, then capture the output of the dmesg command after running:

# echo w > /proc/sysrq-trigger
# dmesg

This output lists all the hung processes in the machine, often pointing us directly to the cause of the hang.

And for advanced users, capturing an event trace using '''trace-cmd''' (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it's a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:

# trace-cmd record -e xfs\*

before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:

# trace-cmd report > trace_report.txt

Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.

If you have a problem with '''xfs_repair(8)''', make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using '''xfs_metadump(8)''' (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.

== Q: Mounting an XFS filesystem does not work - what is wrong? ==

If mount prints an error message something like:

mount: /dev/hda5 has wrong major or minor number

you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the "-t xfs" option on mount or the "xfs" option in <tt>/etc/fstab</tt>.

If you get something like:

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
or too many mounted file systems

Refer to your system log file (<tt>/var/log/messages</tt>) for a detailed diagnostic message from the kernel.
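
A quick way to rule out the first cause (XFS not compiled in, or module not loaded) is to look for <tt>xfs</tt> in the kernel's list of registered filesystems. The helper below is a sketch with a hypothetical name; it reads <tt>/proc/filesystems</tt> by default and accepts another file only so that a saved copy can be checked.

```shell
#!/bin/sh
# Sketch: succeed if a kernel filesystem list names XFS.
# Each line of /proc/filesystems is "[nodev]<TAB>name", so the filesystem
# name is always the last field.
xfs_supported() {
    list=${1:-/proc/filesystems}
    awk '$NF == "xfs" { found = 1 } END { exit !found }' "$list"
}
```

If <tt>xfs_supported</tt> fails, loading the module (for example with <tt>modprobe xfs</tt>) and retrying the mount is the next thing to try.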

== Q: Does the filesystem have an undelete capability? ==

There is no undelete in XFS (so far).

However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance of recovering files with specialized commercial closed-source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].

Such implementations also do not re-use directory entries immediately, so there is a chance of getting recently deleted files back, even with their real names.

''xfs_irecover'' or ''xfsr'' may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.

This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded linux and XFS file system.

Anyway, the best is to always keep backups.

== Q: How can I backup a XFS filesystem and ACLs? ==

You can back up an XFS filesystem with utilities like '''xfsdump(8)''' and standard '''tar(1)''' for standard files. To back up ACLs and EAs as well, you will need to use '''xfsdump''', [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (> version 3.1.4) or [http://rsync.samba.org/ rsync] (>= version 3.0.0). '''xfsdump''' can also be integrated with [http://www.amanda.org/ amanda(8)].

== Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong? ==

The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], "Structure needs cleaning."

The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.

There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.

You can use xfs_repair to remedy the problem (with the file system unmounted).

== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==

Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.

Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the '''xfs_bmap(8)''' command).
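
From a shell script, one way to ask for the data itself to be flushed is to write through <tt>dd</tt> with <tt>conv=fsync</tt>, which calls fsync on the output file before exiting. This is a minimal sketch, assuming a dd that supports <tt>conv=fsync</tt> (GNU coreutils does); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: copy stdin to a file and have dd fsync the output before it
# exits, so the file data (not only the inode) is written out of the page
# cache to the device. Assumes a dd that supports conv=fsync.
write_and_fsync() {
    dd of="$1" conv=fsync 2>/dev/null
}
```

For example, <tt>printf 'hello\n' | write_and_fsync /mnt/data/file</tt>, where the path is a placeholder. Whether the data then survives a power failure still depends on the drive write cache behaviour discussed in the following questions.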

== Q: What is the problem with the write cache on journaled filesystems? ==

Many drives use a write back cache in order to speed up the performance of writes. However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk. Further, the drive can de-stage data from the write cache to the platters in any order that it chooses. This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk. When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.

With hard disk cache sizes currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write cache to ensure the filesystem stays intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.

== Q: How can I tell if I have the disk write cache enabled? ==

For SCSI/SATA:

* Look in dmesg(8) output for a driver line, such as: "SCSI device sda: drive cache: write back"
* <nowiki># sginfo -c /dev/sda | grep -i 'write cache' </nowiki>

For PATA/SATA (for SATA this requires at least kernel 2.6.15 because of [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):

* <nowiki># hdparm -I /dev/sda</nowiki> and look under "Enabled Supported" for "Write cache"

For RAID controllers:

* See the section about RAID controllers below
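
The <tt>hdparm -I</tt> check can be scripted. The sketch below assumes the usual hdparm layout, in which enabled features carry an asterisk in the "Enabled" column in front of the feature name; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read "hdparm -I" output on stdin and succeed if the "Write cache"
# feature line is marked with "*" (enabled). Assumes the usual hdparm
# layout: optional whitespace, "*", whitespace, then the feature name.
write_cache_enabled() {
    grep -Eq '^[[:space:]]*\*[[:space:]]+Write cache'
}
```

For example, <tt>hdparm -I /dev/sda | write_cache_enabled && echo "write cache is enabled"</tt>.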

== Q: How can I address the problem with the disk write cache? ==

=== Disabling the disk write back cache. ===

For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because of [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):

* <nowiki># hdparm -W0 /dev/sda</nowiki> (or <nowiki># hdparm -W0 /dev/hda</nowiki> for a PATA device)
* <nowiki># blktool /dev/sda wcache off</nowiki> (or <nowiki># blktool /dev/hda wcache off</nowiki> for a PATA device)

For SCSI:

* Using sginfo(8), which is a little tedious as it takes 3 steps. For example:
*# <nowiki>#sginfo -c /dev/sda</nowiki> gives a list of attribute names and values.
*# <nowiki>#sginfo -cX /dev/sda</nowiki> gives the same cache values as an array, which you must match up with the names from step 1, e.g. 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0
*# <nowiki>#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0</nowiki> writes the array back with the write cache value changed (in this example the sixth value, from 1 to 0).

For RAID controllers:

* See the section about RAID controllers below

For a SCSI disk the setting is persistent. For a SATA/PATA disk, however, it has to be reapplied after every reset, because the drive reverts to its default of write cache enabled. A reset can happen on reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.

=== Using an external log. ===

Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will '''not''' solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won't be able to guarantee that if the metadata is on a drive with the write cache enabled.

In fact using an external log will disable XFS' write barrier support.

=== Write barrier support. ===

Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with "nobarrier". Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and one of the following messages logged, if any of these 3 scenarios occurs:

* "Disabling barriers, not supported with external log device"
* "Disabling barriers, not supported by the underlying device"
* "Disabling barriers, trial barrier write failed"

If the filesystem is mounted with an external log device, flushing to both the data and log devices is currently not supported (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing while the write cache is enabled, XFS reports that the device doesn't support barriers. Finally, XFS tests a barrier write on the superblock and checks its error state afterwards, reporting if it fails.
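
A minimal sketch of the log check described above: the function below reads kernel log text on standard input and prints any line containing one of the "Disabling barriers" messages listed earlier. The function name is hypothetical, and it matches only on the common "Disabling barriers" prefix rather than on any particular log format.

```shell
# Hypothetical helper: reads kernel log text on stdin.
# Prints every line that reports barriers being disabled, or a note if none.
barrier_report() {
    found=$(sed -n '/Disabling barriers/p')
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
    else
        echo "no barrier-disabling message found"
    fi
}
```

It could be used as <tt>dmesg | barrier_report</tt> after mounting the filesystem.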

== Q. Should barriers be enabled with storage which has a persistent write cache? ==

Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with "nobarrier", assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care of the hard disk write cache, which should be off.

== Q. Which settings does my RAID controller need ? ==

It's hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:

Real RAID controllers (not those built into mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory "[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]") which is used for buffering writes to improve speed. This protected cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected against a power failure and will simply lose their contents.

If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.

* onboard RAID controllers: there are so many different types it's hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. After a power failure with RAID-1, when only parts of the disk caches have been written, the controller may not even see that the disks are out of sync: the disks can reorder cached blocks and might have saved the superblock info but lost different data contents. So, turn off disk write caches before using the RAID function.

* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86);

* Adaptec: allows setting the write cache of individual drives:
arcconf setcache <disk> wb|wt
wb=write back, which means write cache on; wt=write through, which means write cache off. So "wt" should be chosen.

* Areca: In archttp under "System Controls" -> "System Configuration" there's the option "Disk Write Cache Mode" (defaults to "Auto")

"Off": disk write cache is turned off

"On": disk write cache is enabled, this is not safe for your data but fast

"Auto": If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write cache off to protect your data. If no BBM is attached, the controller switches to "On": neither controller cache nor disk cache is safe in that case, so it assumes you don't care about your data and just want high speed (which you then get).

That's a very sensible default, so you can leave it at "Auto" or enforce "Off" to be sure.

* LSI MegaRAID: allows setting the cache of individual disks:
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL # flushes the controller cache
MegaCli -LDGetProp -Cache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the controller cache settings
MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the disk cache settings (for all phys. disks in logical disk)
MegaCli -LDSetProp -EnDskCache|DisDskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # set disk cache setting

* Xyratex: from the docs: "Write cache includes the disk drive cache and controller cache." That means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.

== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==

The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don't work any more, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system you can destroy a database within the virtual machine (client, domU, whatever you call it), even with a RAID controller with battery backed cache and the hard disk cache turned off (which is safe on a normal host).

In qemu you can specify cache=off on the line specifying the virtual
disk. For others information is missing.

== Q: What is the issue with directory corruption in Linux 2.6.17? ==

In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some "sparse" endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.

''Update: the fix is included in 2.6.17.7 and later kernels.''

To add insult to injury, '''xfs_repair(8)''' is currently not correcting these directories on detection of this corrupt state either. This '''xfs_repair''' issue is actively being worked on, and a fixed version will be available shortly.

''Update: a fixed '''xfs_repair''' is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.''

No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem like other kernels are affected.

'''xfs_repair -n''' should be able to detect any directory corruption.

Until a fixed '''xfs_repair''' binary is available, one can make use of the '''xfs_db(8)''' command to mark the problem directory for removal (see the example below). A subsequent '''xfs_repair''' invocation will remove the directory and move all contents into "lost+found", named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 3 (btree)
...
xfs_db> write core.mode 0
xfs_db> quit
</nowiki>

A subsequent '''xfs_repair''' will clear the directory, and add new entries (named by inode number) in lost+found.

The easiest way to map inode numbers to full paths is via '''xfs_ncheck(8)'''<nowiki>: </nowiki>

<nowiki>
# xfs_ncheck -i 14101 -i 14102 /dev/sdXXX
14101 full/path/mumble_fratz_foo_bar_1495
14102 full/path/mumble_fratz_foo_bar_1494
</nowiki>

Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
...
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:3628
xfs_db> fsblock 3628
xfs_db> type bmapbtd
xfs_db> print
magic = 0x424d4150
level = 0
numrecs = 19
leftsib = null
rightsib = null
recs[1-19] = [startoff,startblock,blockcount,extentflag]
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
</nowiki>

At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the '''xfs_db''' dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:

...
xfs_db> dblock 20
xfs_db> print
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0
dhdr.bestfree[0].length = 0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 13937
du[0].namelen = 25
du[0].name = "mumble_fratz_foo_bar_1595"
du[0].tag = 0x10
du[1].inumber = 13938
du[1].namelen = 25
du[1].name = "mumble_fratz_foo_bar_1594"
du[1].tag = 0x38
...

So, here we can see that inode number 13938 matches up with name "mumble_fratz_foo_bar_1594". Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at "lost+found" (once '''xfs_repair''' has removed the corrupt directory).
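
To decide which extents to walk with <tt>dblock</tt>, the record list printed by '''xfs_db''' can be sorted into the three types mechanically. The sketch below is only an illustration: the function name is made up, and the two thresholds (33554432 for leaf blocks, 67108864 for freelist blocks) are simply the start offsets seen in the example output above. They are assumptions that may differ on a filesystem with another block size, so pass your own values if the jumps in your output occur elsewhere.

```shell
# Hypothetical helper: reads xfs_db bmap records such as "5:[20,3496,8,0]"
# on stdin and labels each one as data, leaf or free.
# Arguments (optional): start offset of the leaf and freelist regions.
# The defaults are taken from the example above and are assumptions.
classify_dir_extents() {
    leaf_start=${1:-33554432}
    free_start=${2:-67108864}
    # One record per line, then keep startoff, startblock and blockcount.
    tr ' ' '\n' |
    sed -n 's/^[0-9]*:\[\([0-9]*\),\([0-9]*\),\([0-9]*\),[0-9]*]$/\1 \2 \3/p' |
    awk -v leaf="$leaf_start" -v freeoff="$free_start" '{
        kind = "data"
        if ($1 + 0 >= leaf + 0)    kind = "leaf"
        if ($1 + 0 >= freeoff + 0) kind = "free"
        print kind, "startoff=" $1, "startblock=" $2, "blockcount=" $3
    }'
}
```

Only the lines labelled <tt>data</tt> are of interest for recovering names; their <tt>startoff</tt> value is the number to give to <tt>dblock</tt>.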

== Q: Why does my > 2TB XFS partition disappear when I reboot ? ==

Strictly speaking this is not an XFS problem.

To support > 2TB partitions you need two things: a kernel that supports large block devices (<tt>CONFIG_LBD=y</tt>) and a partition table format that can hold large partitions. The default DOS partition tables can't. The best partition format for > 2TB partitions is the EFI GPT format (<tt>CONFIG_EFI_PARTITION=y</tt>).

Without <tt>CONFIG_LBD=y</tt> you can't even create the filesystem, but without <tt>CONFIG_EFI_PARTITION=y</tt> it works fine until you reboot, at which point the partition will disappear. Note that you need to enable the <tt>CONFIG_PARTITION_ADVANCED</tt> option before you can set <tt>CONFIG_EFI_PARTITION=y</tt>.
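
The 2TB boundary follows from the DOS partition table storing partition sizes as 32-bit sector counts. The sketch below works through that arithmetic; the function names are made up for this example and 512-byte sectors are assumed unless another size is passed in.

```shell
# Largest partition a 32-bit sector count can describe, in bytes.
# Assumption: 512-byte sectors unless a sector size is given as $1.
dos_partition_limit_bytes() {
    sector_size=${1:-512}
    echo $(( 4294967296 * $sector_size ))   # 2^32 sectors
}

# Prints which partition table formats can hold a partition of $1 bytes.
partition_table_for() {
    if [ "$1" -gt "$(dos_partition_limit_bytes)" ]; then
        echo "gpt"
    else
        echo "dos or gpt"
    fi
}
```

With the defaults, <tt>dos_partition_limit_bytes</tt> prints 2199023255552, i.e. 2TiB, so a 3TB partition needs GPT.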

== Q: Why do I receive <tt>No space left on device</tt> after <tt>xfs_growfs</tt>? ==

After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space while attempts to write to the filesystem fail with -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:

The only way to fix this is to move data around to free up space
below 1TB. Find your oldest data (i.e. that was around before even
the first grow) and move it off the filesystem (move, not copy).
Then if you copy it back on, the data blocks will end up above 1TB
and that should leave you with plenty of space for inodes below 1TB.

A complete dump and restore will also fix the problem ;)

Also, you can add 'inode64' to your mount options to allow inodes to live above 1TB.

Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&forum=38 No space left on device on xfs filesystem with 7.7TB free]

== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==

The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons.

Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.

== Q: How to get around a bad inode repair is unable to clean up ==

The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.

xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX

== Q: How to calculate the correct sunit,swidth values for optimal performance ==

XFS can be optimized for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.

These options can sometimes be autodetected (for example with md raid, a recent enough kernel (>= 2.6.32) and xfsprogs (>= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.

The calculation of these values is quite simple:

su = <RAID controllers stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks (don't count parity disks)>

So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use

su = 64k
sw = 6 (RAID-6 of 8 disks has 6 data disks)

A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

su = 256k
sw = 8 (RAID-10 of 16 disks has 8 data disks)

Alternatively, you can use "sunit" instead of "su" and "swidth" instead of "sw" but then sunit/swidth values need to be specified in "number of 512B sectors"!

Note that <tt>xfs_info</tt> and <tt>mkfs.xfs</tt> interpret sunit and swidth as being specified in units of 512B sectors. Unfortunately, that is not the unit they are reported in: <tt>xfs_info</tt> and <tt>mkfs.xfs</tt> report them in multiples of your basic block size (bsize), not in 512B sectors.

For example, assume swidth 1024 (specified on the mkfs.xfs command line, so 1024 sectors of 512B) and a block size of 4096 (bsize as reported in the mkfs.xfs output). You should then see swidth 128 in the mkfs.xfs output, because 128 * 4096 == 1024 * 512.
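
The unit conversions above are easy to get wrong, so here is a small sketch of the arithmetic. The function names are made up for this example; the only inputs are the stripe unit in KiB, the number of data disks and the filesystem block size.

```shell
# $1: stripe unit in KiB (the "su" value without the k), $2: data disks (sw).
# Prints "sunit swidth" in 512-byte sectors.
su_sw_to_sectors() {
    sunit=$(( $1 * 1024 / 512 ))
    swidth=$(( $sunit * $2 ))
    echo "$sunit $swidth"
}

# $1: value in 512-byte sectors, $2: filesystem block size (bsize) in bytes.
# Prints the value in filesystem blocks, the unit used in the reported output.
sectors_to_blocks() {
    echo $(( $1 * 512 / $2 ))
}
```

For the RAID-6 example (su=64k, sw=6), <tt>su_sw_to_sectors 64 6</tt> prints <tt>128 768</tt>, and <tt>sectors_to_blocks 1024 4096</tt> prints 128, matching the swidth example above.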

When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating the filesystem directly on top of the hardware RAID.

== Q: Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work? ==

The default <tt>fsid</tt> type encodes only 32 bits of the inode number for subdirectory exports. However, exporting the root of the filesystem works, and using one of the non-default <tt>fsid</tt> types (<tt>fsid=uuid</tt> in <tt>/etc/exports</tt> with recent <tt>nfs-utils</tt>) should work as well. (Thanks, Christoph!)

== Q: What is the inode64 mount option for? ==

By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like "disk full" when you still have plenty of free space, but there's no more room in the first TB to create a new inode. Also, performance sucks.

To get around this, use the inode64 mount option for filesystems >1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.
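
One way to see where a 1TB boundary could come from is the arithmetic below. It is a sketch under stated assumptions, not something taken from the text above: it assumes the inode number effectively addresses inode-sized slots from the start of the filesystem, and that inodes are 256 bytes. The function name is made up for this example.

```shell
# Highest byte offset addressable by a 32-bit inode number.
# Assumptions: inode numbers index inode-sized slots from the start of the
# filesystem, and the inode size is 256 bytes unless given as $1.
inode32_limit_bytes() {
    inode_size=${1:-256}
    echo $(( 4294967296 * $inode_size ))   # 2^32 inode slots
}
```

With 256-byte inodes this prints 1099511627776, i.e. 1TiB. Under the same assumptions a larger inode size would move the boundary accordingly.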

Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.

== Q: Can I just try the inode64 option to see if it helps me? ==

Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.

== Q: Performance: mkfs.xfs -n size=64k option ==

Asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:

Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a
directory entry is determined by the length of the name.

There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there's the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.

For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.

In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don't have any numbers on what the difference might be - I'm getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....

== Q: I want to tune my XFS filesystems for <something> ==

''Premature optimization is the root of all evil.'' - Donald Knuth

The standard answer you will get to this question is this: use the defaults.

There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.

There are a lot of "XFS tuning guides" that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don't expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.

In most cases, the only thing you need to consider for <tt>mkfs.xfs</tt> is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are the <tt>logbsize</tt> and <tt>delaylog</tt> mount options. Increasing <tt>logbsize</tt> reduces the number of journal IOs for a given workload, and <tt>delaylog</tt> will reduce them even further. The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.

As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.

== Q: Which factors influence the memory usage of xfs_repair? ==

This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).

# xfs_repair -n -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2096.
#

xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,
of which 2,097,152KB is needed for tracking free space.
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)

Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:

# xfs_repair -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2289.

That is, it now needs about another 200MB of RAM to run.

The numbers reported by xfs_repair are the absolute minimum required and approximate at that;
more RAM than this may be required to complete successfully.
Also, if you only give xfs_repair the minimum required RAM, it will be slow;
for best repair performance, the more RAM you can give it the better.
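
The two sample runs above can be reproduced with simple arithmetic, which shows how icount and dblock drive the estimate. This is a sketch, not the actual xfs_repair calculation: the divisors (256 inodes per KB of imem, 2048 blocks per KB of dmem) are read off the sample output, and the fixed 50000KB overhead is an assumption chosen so that the totals match the reported 2096 and 2289. The function name is made up for this example.

```shell
# $1: icount, $2: dblock, as printed by xfs_repair -vv.
# Prints an estimated minimum memory requirement in MB.
# The ratios and the 50000 KB constant are inferred from the example
# output above and are assumptions, not documented behaviour.
repair_min_mem_mb() {
    imem=$(( $1 / 256 ))      # KB used for tracking inodes
    dmem=$(( $2 / 2048 ))     # KB used for tracking blocks
    echo $(( ($imem + $dmem + 50000) / 1024 ))
}
```

<tt>repair_min_mem_mb 64 4294967296</tt> prints 2096 and <tt>repair_min_mem_mb 50401792 4294967296</tt> prints 2289, the same figures as in the two runs above.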

== Q: Why some files of my filesystem shows as "?????????? ? ? ? ? ? filename" ? ==

If ls -l shows you a listing as

# ?????????? ? ? ? ? ? file1
?????????? ? ? ? ? ? file2
?????????? ? ? ? ? ? file3
?????????? ? ? ? ? ? file4

and errors like:
# ls /pathtodir/
ls: cannot access /pathtodir/file1: Invalid argument
ls: cannot access /pathtodir/file2: Invalid argument
ls: cannot access /pathtodir/file3: Invalid argument
ls: cannot access /pathtodir/file4: Invalid argument

or even:
# failed to stat /pathtodir/file1

It is very likely that your filesystem needs to be mounted with inode64. Remounting with
# mount -oremount,inode64 /dev/diskpart /mnt/xfs

should make it work again.
If it does, add the option to fstab.

== Q: The xfs_db "frag" command says I'm over 50%. Is that bad? ==

It depends. It's important to know how the value is calculated. xfs_db looks at the extents in all files, and returns:

(actual extents - ideal extents) / actual extents

This means that if, for example, you have an average of 2 extents per file, you'll get an answer of 50%. 4 extents per file would give you 75%. This may or may not be a problem, especially depending on the size of the files in question. (i.e. 400GB files in four 100GB extents would hardly be considered badly fragmented). The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.

Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:
[[Image:Frag_factor.png|500px]]
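
The formula above is easy to try out with a few numbers. The function name below is made up for this example; it takes the actual and ideal extent counts and prints the fragmentation factor as a percentage.

```shell
# $1: actual extents, $2: ideal extents.
# Prints (actual - ideal) / actual as a percentage with one decimal place.
frag_factor() {
    awk -v actual="$1" -v ideal="$2" \
        'BEGIN { printf "%.1f\n", (actual - ideal) * 100 / actual }'
}
```

An average of 2 extents per file (<tt>frag_factor 2 1</tt>) gives 50.0, 4 extents per file gives 75.0, and 10 extents per file already gives 90.0.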

XFS FAQ

2012-04-27T10:12:27Z

Dgc: updated trace-cmd directions.

Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]

Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.

== Q: Where can I find documentation about XFS? ==

The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.

You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the '''<nowiki>#xfs</nowiki>''' IRC channel on ''irc.freenode.net''.

== Q: Where can I find documentation about ACLs? ==

Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/

The '''acl(5)''' manual page is also quite extensive.

== Q: Where can I find information about the internals of XFS? ==

An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.

Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS on-disk format] which is a very useful reference.

== Q: What partition type should I use for XFS on Linux? ==

Linux native filesystem (83).

== Q: What mount options does XFS have? ==

There are a number of mount options influencing XFS filesystems - refer to the '''mount(8)''' manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])

== Q: Is there any relation between the XFS utilities and the kernel version? ==

No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.

== Q: Does it run on platforms other than i386? ==

XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It's also well tested on the IA64 platform since that's the platform SGI Linux products use.

== Q: Quota: Do quotas work on XFS? ==

Yes.

To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/ http://sourceforge.net/projects/linuxquota/] or use '''xfs_quota(8)'''.

== Q: Quota: What's project quota? ==

Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.

== Q: Quota: Can group quota and project quota be used at the same time? ==

No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.
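
The rule above can be checked before mounting. The sketch below is only an illustration with a made-up function name: it takes a comma-separated mount option string and looks for the <tt>grpquota</tt> and <tt>prjquota</tt> option names; any other spelling of these options is not recognised by it.

```shell
# $1: comma-separated mount options, e.g. "noatime,usrquota,prjquota".
# Reports a conflict when both grpquota and prjquota are requested.
quota_opts_ok() {
    has_grp=0
    has_prj=0
    case ",$1," in *,grpquota,*) has_grp=1 ;; esac
    case ",$1," in *,prjquota,*) has_prj=1 ;; esac
    if [ "$has_grp" -eq 1 ] && [ "$has_prj" -eq 1 ]; then
        echo "conflict: grpquota and prjquota cannot be combined"
    else
        echo "ok"
    fi
}
```

For example, <tt>quota_opts_ok usrquota,prjquota</tt> prints <tt>ok</tt>, while <tt>quota_opts_ok grpquota,prjquota</tt> reports the conflict.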

== Q: Quota: Is umounting prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) removing prjquota limits previously set from fs (and vice versa) ? ==

To be answered.

== Q: Are there any dump/restore tools for XFS? ==

'''xfsdump(8)''' and '''xfsrestore(8)''' are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.

== Q: Does LILO work with XFS? ==

This depends on where you install LILO.

Yes, for MBR (Master Boot Record) installations.

No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.

== Q: Does GRUB work with XFS? ==

There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.

== Q: Can XFS be used for a root filesystem? ==

Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the "rootflags=" kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit "logdev=" specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]

== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==

Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be "clean" when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.

== Q: Is there a way to make a XFS filesystem larger or smaller? ==

You can ''NOT'' make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.

An XFS filesystem may be enlarged by using '''xfs_growfs(8)'''.

If using partitions, you need to have free space after the partition to do so. Remove the partition and recreate it larger with the ''exact same'' starting point, then run '''xfs_growfs''' to grow the filesystem into the enlarged partition. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.

Using XFS filesystems on top of a volume manager makes this a lot easier.

== Q: What information should I include when reporting a problem? ==

What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:

* kernel version (uname -a)
* xfsprogs version (xfs_repair -V)
* number of CPUs
* contents of /proc/meminfo
* contents of /proc/mounts
* contents of /proc/partitions
* RAID layout (hardware and/or software)
* LVM configuration
* type of disks you are using
* write cache status of drives
* size of BBWC and mode it is running in
* xfs_info output on the filesystem in question
* dmesg output showing all error messages and stack traces
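
Several items in the list above can be gathered in one go. The script below is a sketch that collects only the generic ones (kernel version, xfsprogs version, CPU count and the three /proc files) into a text file; the function name and the default output file name are made up for this example, and the RAID, LVM, disk and xfs_info details still have to be added by hand.

```shell
# Collects some of the generic information listed above into a file.
# $1: output file (defaults to a hypothetical xfs-report.txt).
collect_xfs_report() {
    out=${1:-xfs-report.txt}
    {
        echo "== kernel version =="
        uname -a
        echo "== xfsprogs version =="
        xfs_repair -V 2>&1 || echo "(xfs_repair not available)"
        echo "== number of CPUs =="
        getconf _NPROCESSORS_ONLN 2>/dev/null || echo "(unknown)"
        for f in /proc/meminfo /proc/mounts /proc/partitions; do
            echo "== $f =="
            cat "$f" 2>/dev/null || echo "(cannot read $f)"
        done
    } > "$out"
}
```

Running <tt>collect_xfs_report /tmp/xfs-report.txt</tt> writes the report to that file, ready to be attached to the problem report.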

Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:

# iostat -x -d -m 5
# vmstat 5

can give us insight into the IO and memory utilisation of your machine at the time of the problem.

If the filesystem is hanging, then capture the output of the dmesg command after running:

# echo w > /proc/sysrq-trigger
# dmesg

This will tell us about all the hung processes in the machine, often pointing us directly to the cause of the hang.

And for advanced users, capturing an event trace using '''trace-cmd''' (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it's a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:

# trace-cmd record -e xfs\*

before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:

# trace-cmd report > trace_report.txt

Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.

If you have a problem with '''xfs_repair(8)''', make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using '''xfs_metadump(8)''' (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.

== Q: Mounting an XFS filesystem does not work - what is wrong? ==

If mount prints an error message something like:

mount: /dev/hda5 has wrong major or minor number

you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the "-t xfs" option on mount or the "xfs" option in <tt>/etc/fstab</tt>.

If you get something like:

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
or too many mounted file systems

Refer to your system log file (<tt>/var/log/messages</tt>) for a detailed diagnostic message from the kernel.

== Q: Does the filesystem have an undelete capability? ==

There is no undelete in XFS (so far).

However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].

This kind of XFS driver implementation does not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.

''xfs_irecover'' or ''xfsr'' may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.

This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS file system.

In any case, the best approach is to always keep backups.

== Q: How can I backup a XFS filesystem and ACLs? ==

You can back up an XFS filesystem with utilities like '''xfsdump(8)''' and standard '''tar(1)''' for standard files. If you want to back up ACLs and EAs you will need to use '''xfsdump''' or [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (> version 3.1.4) or [http://rsync.samba.org/ rsync] (>= version 3.0.0). '''xfsdump''' can also be integrated with [http://www.amanda.org/ amanda(8)].

== Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong? ==

The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], "Structure needs cleaning."

The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.

There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. The shutdown is there to protect your data.

You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).

== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==

Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.

Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the '''xfs_bmap(8)''' command).

== Q: What is the problem with the write cache on journaled filesystems? ==

Many drives use a write back cache in order to speed up the performance of writes. However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk. Further, the drive can de-stage data from the write cache to the platters in any order that it chooses. This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk. When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.

With hard disk cache sizes of currently (Jan 2009) up to 32MB that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write cache in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.

== Q: How can I tell if I have the disk write cache enabled? ==

For SCSI/SATA:

* Look in dmesg(8) output for a driver line, such as: "SCSI device sda: drive cache: write back"
* <nowiki># sginfo -c /dev/sda | grep -i 'write cache' </nowiki>

For PATA/SATA (for SATA this requires at least kernel 2.6.15 because of [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):

* <nowiki># hdparm -I /dev/sda</nowiki> and look under "Enabled Supported" for "Write cache"

For RAID controllers:

* See the section about RAID controllers below

== Q: How can I address the problem with the disk write cache? ==

=== Disabling the disk write back cache. ===

For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because of [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):

* <nowiki># hdparm -W0 /dev/sda</nowiki> # hdparm -W0 /dev/hda
* <nowiki># blktool /dev/sda wcache off</nowiki> # blktool /dev/hda wcache off

For SCSI:

* Using sginfo(8), which is a little tedious. It takes 3 steps. For example:
*# <nowiki>#sginfo -c /dev/sda</nowiki> which gives a list of attribute names and values
*# <nowiki>#sginfo -cX /dev/sda</nowiki> which gives an array of cache values which you must match up with from step 1, e.g. 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0
*# <nowiki>#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0</nowiki> allows you to reset the value of the cache attributes.

For RAID controllers:

* See the section about RAID controllers below

This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled. 

=== Using an external log. ===

Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will '''not''' solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won't be able to guarantee that if the metadata is on a drive with the write cache enabled.

In fact using an external log will disable XFS' write barrier support.

=== Write barrier support. ===

Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with "nobarrier". Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the following 3 scenarios occurs:

* "Disabling barriers, not supported with external log device"
* "Disabling barriers, not supported by the underlying device"
* "Disabling barriers, trial barrier write failed"

If the filesystem is mounted with an external log device then we currently don't support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn't support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.
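
Checking the system log for these three messages can be automated. The sketch below searches a block of log text for the exact message strings listed above. The function name <tt>find_disabled_barriers</tt> and the sample log excerpt are made up for illustration; real kernel log lines carry additional prefixes, which is why the code matches on substrings.

```python
# The three messages quoted in the list above.
BARRIER_MESSAGES = (
    "Disabling barriers, not supported with external log device",
    "Disabling barriers, not supported by the underlying device",
    "Disabling barriers, trial barrier write failed",
)


def find_disabled_barriers(log_text):
    """Return the log lines that contain one of the barrier messages."""
    return [
        line
        for line in log_text.splitlines()
        if any(msg in line for msg in BARRIER_MESSAGES)
    ]


# Hypothetical log excerpt, made up for illustration only.
sample_log = (
    "XFS mounting filesystem sdb1\n"
    "Disabling barriers, trial barrier write failed\n"
)

for line in find_disabled_barriers(sample_log):
    print("barriers are off:", line)
```

To check a real system, pass in the captured output of dmesg instead of <tt>sample_log</tt>. An empty result only means none of these three messages was found in the text supplied.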

== Q. Should barriers be enabled with storage which has a persistent write cache? ==

Many hardware RAID controllers have a persistent write cache which preserves its contents across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with "nobarrier", assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care about the hard disk write cache, which should be off.

== Q. Which settings does my RAID controller need ? ==

It's hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:

Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory "[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]") which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.

If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.

* onboard RAID controllers: there are so many different types it's hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that after a powerfail with RAID-1, when only parts of the disk cache have been written, the controller doesn't even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info, but then lost different data contents. So, turn off disk write caches before using the RAID function.

* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86);

* Adaptec: allows setting individual drives cache
arcconf setcache <disk> wb|wt
wb=write back, which means write cache on, wt=write through, which means write cache off. So "wt" should be chosen.

* Areca: In archttp under "System Controls" -> "System Configuration" there's the option "Disk Write Cache Mode" (defaults to "Auto")

"Off": disk write cache is turned off

"On": disk write cache is enabled, this is not safe for your data but fast

"Auto": If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write cache off, to protect your data. In case no BBM is attached, the controller switches to "On", because neither controller cache nor disk cache is safe so you don't seem to care about your data and just want high speed (which you get then).

That's a very sensible default so you can leave it at "Auto" or enforce "Off" to be sure.

* LSI MegaRAID: allows setting individual disks cache:
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL # flushes the controller cache
MegaCli -LDGetProp -Cache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the controller cache settings
MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the disk cache settings (for all phys. disks in logical disk)
MegaCli -LDSetProp -EnDskCache|DisDskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # set disk cache setting

* Xyratex: from the docs: "Write cache includes the disk drive cache and controller cache.". So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly as also the controller write cache is disabled.

== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==

The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don't work any more, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system, even one with a RAID controller with battery backed cache and hard disk cache turned off (which is safe on a normal host), you can destroy a database within the virtual machine (client, domU, whatever you call it).

In qemu you can specify cache=off on the line specifying the virtual
disk. For others information is missing.

== Q: What is the issue with directory corruption in Linux 2.6.17? ==

In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some "sparse" endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.

''Update: the fix is included in 2.6.17.7 and later kernels.''

To add insult to injury, '''xfs_repair(8)''' is currently not correcting these directories on detection of this corrupt state either. This '''xfs_repair''' issue is actively being worked on, and a fixed version will be available shortly.

''Update: a fixed '''xfs_repair''' is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.''

No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.

The '''xfs_check''' tool, or '''xfs_repair -n''', should be able to detect any directory corruption.

Until a fixed '''xfs_repair''' binary is available, one can make use of the '''xfs_db(8)''' command to mark the problem directory for removal (see the example below). A subsequent '''xfs_repair''' invocation will remove the directory and move all contents into "lost+found", named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 3 (btree)
...
xfs_db> write core.mode 0
xfs_db> quit
</nowiki>

A subsequent '''xfs_repair''' will clear the directory, and add new entries (named by inode number) in lost+found.

The easiest way to map inode numbers to full paths is via '''xfs_ncheck(8)'''<nowiki>: </nowiki>

<nowiki>
# xfs_ncheck -i 14101 -i 14102 /dev/sdXXX
14101 full/path/mumble_fratz_foo_bar_1495
14102 full/path/mumble_fratz_foo_bar_1494
</nowiki>

Should this not work, we can manually map inode numbers in B-Tree format directory by taking the following steps:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
...
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:3628
xfs_db> fsblock 3628
xfs_db> type bmapbtd
xfs_db> print
magic = 0x424d4150
level = 0
numrecs = 19
leftsib = null
rightsib = null
recs[1-19] = [startoff,startblock,blockcount,extentflag]
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
</nowiki>

At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the '''xfs_db''' dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:

...
xfs_db> dblock 20
xfs_db> print
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0
dhdr.bestfree[0].length = 0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 13937
du[0].namelen = 25
du[0].name = "mumble_fratz_foo_bar_1595"
du[0].tag = 0x10
du[1].inumber = 13938
du[1].namelen = 25
du[1].name = "mumble_fratz_foo_bar_1594"
du[1].tag = 0x38
...

So, here we can see that inode number 13938 matches up with name "mumble_fratz_foo_bar_1594". Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at "lost+found" (once '''xfs_repair''' has removed the corrupt directory).
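
Sorting the extents into the three types by hand gets tedious for large directories. The sketch below parses the <tt>recs</tt> lines printed by '''xfs_db''' and groups them by start offset. The thresholds 33554432 and 67108864 are simply the jump points visible in the example output above; treat them as an assumption and read the correct values off your own output, since they may differ on other filesystems. The names <tt>classify_extents</tt> and <tt>SAMPLE</tt> are illustrative.

```python
import re

# One record looks like 5:[20,3496,8,0] = index:[startoff,startblock,blockcount,extentflag]
REC = re.compile(r"(\d+):\[(\d+),(\d+),(\d+),(\d+)\]")

# Record lines copied from the xfs_db example above.
SAMPLE = """
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
"""


def classify_extents(recs_text, leaf_start=33554432, free_start=67108864):
    """Group extent records into data, leaf and freelist extents by start offset.

    The default thresholds are the jump points seen in the example output
    (an assumption, not a value queried from the filesystem).
    """
    groups = {"data": [], "leaf": [], "free": []}
    for match in REC.finditer(recs_text):
        index, startoff, startblock, blockcount, _flag = map(int, match.groups())
        if startoff < leaf_start:
            kind = "data"
        elif startoff < free_start:
            kind = "leaf"
        else:
            kind = "free"
        groups[kind].append((index, startoff, startblock, blockcount))
    return groups


# Print the xfs_db command for the first block of each data extent.
for index, startoff, _startblock, _blockcount in classify_extents(SAMPLE)["data"]:
    print(f"extent {index}: dblock {startoff}")
```

The printed commands only cover the first block of each data extent, matching the <tt>dblock 20</tt> step shown above. Extents longer than one directory block need further offsets to be visited as well.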

== Q: Why does my > 2TB XFS partition disappear when I reboot ? ==

Strictly speaking this is not an XFS problem.

To support > 2TB partitions you need two things: a kernel that supports large block devices (<tt>CONFIG_LBD=y</tt>) and a partition table format that can hold large partitions. The default DOS partition tables don't. The best partition format for > 2TB partitions is the EFI GPT format (<tt>CONFIG_EFI_PARTITION=y</tt>).

Without CONFIG_LBD=y you can't even create the filesystem, but without <tt>CONFIG_EFI_PARTITION=y</tt> it works fine until you reboot at which point the partition will disappear. Note that you need to enable the <tt>CONFIG_PARTITION_ADVANCED</tt> option before you can set <tt>CONFIG_EFI_PARTITION=y</tt>.

== Q: Why do I receive <tt>No space left on device</tt> after <tt>xfs_growfs</tt>? ==

After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:

The only way to fix this is to move data around to free up space
below 1TB. Find your oldest data (i.e. that was around before even
the first grow) and move it off the filesystem (move, not copy).
Then if you copy it back on, the data blocks will end up above 1TB
and that should leave you with plenty of space for inodes below 1TB.

A complete dump and restore will also fix the problem ;)

Also, you can add 'inode64' to your mount options to allow inodes to live above 1TB.

Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&forum=38 No space left on device on xfs filesystem with 7.7TB free]

== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==

The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons.

Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.

== Q: How to get around a bad inode repair is unable to clean up ==

The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.

xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX

== Q: How to calculate the correct sunit,swidth values for optimal performance ==

XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.

These options can sometimes be autodetected (for example with md raid and a recent enough kernel (>= 2.6.32) and xfsprogs (>= 3.1.1) built with libblkid support) but manual calculation is needed for most hardware RAIDs.

The calculation of these values is quite simple:

su = <RAID controllers stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks (don't count parity disks)>

So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use

su = 64k
sw = 6 (RAID-6 of 8 disks has 6 data disks)

A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

su = 256k
sw = 8 (RAID-10 of 16 disks has 8 data disks)
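
The same calculation can be written down as a short sketch. The function names <tt>data_disks</tt> and <tt>stripe_options</tt> are made up for illustration. The RAID-10 case assumes two-way mirroring, and only the common levels 0, 5, 6 and 10 are handled.

```python
def data_disks(raid_level, total_disks):
    """Return the number of data-bearing disks (the sw value)."""
    if raid_level == 0:
        return total_disks            # no redundancy, every disk holds data
    if raid_level == 5:
        return total_disks - 1        # one disk's worth of parity
    if raid_level == 6:
        return total_disks - 2        # two disks' worth of parity
    if raid_level == 10:
        if total_disks % 2:
            raise ValueError("RAID-10 with two-way mirrors needs an even disk count")
        return total_disks // 2       # assumption: two-way mirrors
    raise ValueError(f"RAID level {raid_level} is not handled by this sketch")


def stripe_options(stripe_size_kib, raid_level, total_disks):
    """Format su/sw from the controller stripe size (in KiB) and the RAID layout."""
    return f"su={stripe_size_kib}k,sw={data_disks(raid_level, total_disks)}"


print(stripe_options(64, 6, 8))     # su=64k,sw=6
print(stripe_options(256, 10, 16))  # su=256k,sw=8
```

The two calls reproduce the RAID-6 and RAID-10 examples above.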

Alternatively, you can use "sunit" instead of "su" and "swidth" instead of "sw" but then sunit/swidth values need to be specified in "number of 512B sectors"!

Note that <tt>xfs_info</tt> and <tt>mkfs.xfs</tt> interpret sunit and swidth as being specified in units of 512B sectors; that's unfortunately not the unit they're reported in, however.
<tt>xfs_info</tt> and <tt>mkfs.xfs</tt> report them in multiples of your basic block size (bsize) and not in 512B sectors.

Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.
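
The unit conversions above are easy to get wrong, so here is a small sketch of the arithmetic. The function names are illustrative only. It assumes 512-byte sectors as stated above, and that swidth is sunit multiplied by the number of data disks.

```python
SECTOR_BYTES = 512


def sectors_to_blocks(sectors, bsize):
    """Convert a value given in 512B sectors to filesystem blocks of bsize bytes."""
    total_bytes = sectors * SECTOR_BYTES
    if total_bytes % bsize:
        raise ValueError("value is not a whole number of filesystem blocks")
    return total_bytes // bsize


def su_sw_to_sunit_swidth(su_bytes, sw):
    """Convert su (bytes) and sw (data disks) to sunit/swidth in 512B sectors."""
    if su_bytes % SECTOR_BYTES:
        raise ValueError("su must be a multiple of 512 bytes")
    sunit = su_bytes // SECTOR_BYTES
    return sunit, sunit * sw


# swidth 1024 given to mkfs.xfs, reported back with bsize 4096:
print(sectors_to_blocks(1024, 4096))         # 128
# su=64k,sw=6 expressed as sunit/swidth:
print(su_sw_to_sunit_swidth(64 * 1024, 6))   # (128, 768)
```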

When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.

== Q: Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work? ==

The default <tt>fsid</tt> type encodes only 32 bits of the inode number for subdirectory exports. However, exporting the root of the filesystem works, or using one of the non-default <tt>fsid</tt> types (<tt>fsid=uuid</tt> in <tt>/etc/exports</tt> with recent <tt>nfs-utils</tt>) should work as well. (Thanks, Christoph!)

== Q: What is the inode64 mount option for? ==

By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like "disk full" when you still have plenty space free, but there's no more place in the first TB to create a new inode. Also, performance sucks.

To get around this, use the inode64 mount option for filesystems >1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.

Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.

== Q: Can I just try the inode64 option to see if it helps me? ==

Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.

== Q: Performance: mkfs.xfs -n size=64k option ==

When asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:

Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.

There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there's the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.

For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.

In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don't have any numbers on what the difference might be - I'm getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....

== Q: I want to tune my XFS filesystems for <something> ==

The standard answer you will get to this question is this: use the defaults.

There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.

There are a lot of "XFS tuning guides" that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don't expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.

In most cases, the only thing you need to consider for <tt>mkfs.xfs</tt> is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the <tt>logbsize</tt> and <tt>delaylog</tt> mount options. Increasing <tt>logbsize</tt> reduces the number of journal IOs for a given workload, and <tt>delaylog</tt> will reduce them even further. The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.

As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.

== Q: Which factors influence the memory usage of xfs_repair? ==

This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).

# xfs_repair -n -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2096.
#

xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,
of which 2,097,152KB is needed for tracking free space.
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)

Now if we add some inodes (50 million) to the filesystem (look at icount again), the result is:

# xfs_repair -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2289.

That is, it now needs at least another 200MB of RAM to run.

The numbers reported by xfs_repair are the absolute minimum required and approximate at that;
more RAM than this may be required to complete successfully.
Also, if you only give xfs_repair the minimum required RAM, it will be slow;
for best repair performance, the more RAM you can give it the better.
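
To compare the figures from several runs, the "max_mem = ..." line can be pulled apart with a few lines of code. This is only a sketch: <tt>parse_repair_memory</tt> and <tt>kb_to_mb</tt> are made-up helper names, and treating imem as KB is an assumption based on dmem being described in KB above. It does not attempt to reproduce the minimum that xfs_repair itself reports, which is larger than imem plus dmem in both examples.

```python
import re


def parse_repair_memory(line):
    """Turn 'name = number' pairs from the xfs_repair -vv line into a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+) = (\d+)", line)}


def kb_to_mb(kilobytes):
    """Convert KB to MB (1 MB = 1024 KB)."""
    return kilobytes / 1024


empty_fs = parse_repair_memory(
    "- max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152"
)
full_fs = parse_repair_memory(
    "- max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152"
)

print("free space tracking, MB:", kb_to_mb(empty_fs["dmem"]))
print("extra inode tracking, MB:", kb_to_mb(full_fs["imem"] - empty_fs["imem"]))
```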

== Q: Why some files of my filesystem shows as "?????????? ? ? ? ? ? filename" ? ==

If ls -l shows you a listing as

# ?????????? ? ? ? ? ? file1
?????????? ? ? ? ? ? file2
?????????? ? ? ? ? ? file3
?????????? ? ? ? ? ? file4

and errors like:
# ls /pathtodir/
ls: cannot access /pathtodir/file1: Invalid argument
ls: cannot access /pathtodir/file2: Invalid argument
ls: cannot access /pathtodir/file3: Invalid argument
ls: cannot access /pathtodir/file4: Invalid argument

or even:
# failed to stat /pathtodir/file1

It is very probable your filesystem must be mounted with inode64
# mount -oremount,inode64 /dev/diskpart /mnt/xfs

should make it work ok again.
If it works, add the option to fstab.

== Q: The xfs_db "frag" command says I'm over 50%. Is that bad? ==

It depends. It's important to know how the value is calculated. xfs_db looks at the extents in all files, and returns:

(actual extents - ideal extents) / actual extents

This means that if, for example, you have an average of 2 extents per file, you'll get an answer of 50%. 4 extents per file would give you 75%. This may or may not be a problem, especially depending on the size of the files in question (e.g. 400GB files in four 100GB extents would hardly be considered badly fragmented). The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.

Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:
[[Image:Frag_factor.png|500px]]
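
The formula is simple enough to try out directly. The sketch below is not code from xfs_db; <tt>frag_factor</tt> is an illustrative name, and it assumes an ideal layout of one extent per file for the sample values.

```python
def frag_factor(actual_extents, ideal_extents):
    """Fragmentation factor in percent: (actual - ideal) / actual."""
    if actual_extents == 0:
        return 0.0  # no extents at all, nothing to be fragmented
    return 100 * (actual_extents - ideal_extents) / actual_extents


# Average extents per file, assuming the ideal is one extent per file.
for extents_per_file in (1, 2, 4, 10, 100):
    print(extents_per_file, frag_factor(extents_per_file, 1))
```

With 2 and 4 extents per file this gives the 50% and 75% quoted above, and the values for 10 and 100 extents per file show how quickly the factor approaches 100%.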

XFS FAQ

2012-04-23T23:57:08Z

Dgc:

Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]

Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.

== Q: Where can I find documentation about XFS? ==

The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.

You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the '''<nowiki>#xfs</nowiki>''' IRC channel on ''irc.freenode.net''.

== Q: Where can I find documentation about ACLs? ==

Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/

The '''acl(5)''' manual page is also quite extensive.

== Q: Where can I find information about the internals of XFS? ==

An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.

Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.

== Q: What partition type should I use for XFS on Linux? ==

Linux native filesystem (83).

== Q: What mount options does XFS have? ==

There are a number of mount options influencing XFS filesystems - refer to the '''mount(8)''' manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])

== Q: Is there any relation between the XFS utilities and the kernel version? ==

No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.

== Q: Does it run on platforms other than i386? ==

XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It's also well tested on the IA64 platform since that's the platform SGI Linux products use.

== Q: Quota: Do quotas work on XFS? ==

Yes.

To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/ http://sourceforge.net/projects/linuxquota/] or use '''xfs_quota(8)'''.

== Q: Quota: What's project quota? ==

Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.

== Q: Quota: Can group quota and project quota be used at the same time? ==

No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.

== Q: Quota: Does unmounting a filesystem with prjquota (project quota) enabled and mounting it again with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==

To be answered.

== Q: Are there any dump/restore tools for XFS? ==

'''xfsdump(8)''' and '''xfsrestore(8)''' are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.

== Q: Does LILO work with XFS? ==

This depends on where you install LILO.

Yes, for MBR (Master Boot Record) installations.

No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.

== Q: Does GRUB work with XFS? ==

There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.

== Q: Can XFS be used for a root filesystem? ==

Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the "rootflags=" kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit "logdev=" specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]

== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==

Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be "clean" when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.

== Q: Is there a way to make an XFS filesystem larger or smaller? ==

You can ''NOT'' make an XFS filesystem smaller online. The only way to shrink is to do a complete dump, mkfs and restore.

An XFS filesystem may be enlarged by using '''xfs_growfs(8)'''.

If using partitions, you need to have free space after this partition to do so. Remove the partition and recreate it larger with the ''exact same'' starting point, then run '''xfs_growfs''' to grow the filesystem into the new space. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.

Using XFS filesystems on top of a volume manager makes this a lot easier.

== Q: What information should I include when reporting a problem? ==

What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:

* kernel version (uname -a)
* xfsprogs version (xfs_repair -V)
* number of CPUs
* contents of /proc/meminfo
* contents of /proc/mounts
* contents of /proc/partitions
* RAID layout (hardware and/or software)
* LVM configuration
* type of disks you are using
* write cache status of drives
* size of BBWC and mode it is running in
* xfs_info output on the filesystem in question
* dmesg output showing all error messages and stack traces
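
Part of the list above can be gathered automatically. The following is a minimal sketch that covers only the kernel version and the three <tt>/proc</tt> files; the function names and the report layout are made up for illustration, and the remaining items (xfsprogs version, RAID and LVM layout, xfs_info, dmesg) still have to be added by hand.

```python
import platform
from pathlib import Path

PROC_FILES = ("/proc/meminfo", "/proc/mounts", "/proc/partitions")


def read_or_note(path):
    """Return the file's text, or a short note if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError as exc:
        return f"<could not read {path}: {exc.strerror}>\n"


def build_report(paths=PROC_FILES):
    """Assemble a plain-text report from uname data and the given files."""
    sections = ["--- uname ---\n" + " ".join(platform.uname()) + "\n"]
    for path in paths:
        sections.append(f"--- {path} ---\n" + read_or_note(path))
    return "\n".join(sections)


if __name__ == "__main__":
    print(build_report())
```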

Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:

# iostat -x -d -m 5
# vmstat 5

can give us insight into the IO and memory utilisation of your machine at the time of the problem.

If the filesystem is hanging, then the dmesg output captured after running:

# echo w > /proc/sysrq-trigger
# dmesg

will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.

And for advanced users, capturing an event trace using '''trace-cmd''' (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it's a good idea to be ready with it in advance. Start the trace with this command:

# trace-cmd record -e xfs\*

before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:

# trace-cmd report > trace_report.txt

Compress the trace_report.txt file and include that with the bug report.

If you have a problem with '''xfs_repair(8)''', make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using '''xfs_metadump(8)''' (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.

== Q: Mounting an XFS filesystem does not work - what is wrong? ==

If mount prints an error message something like:

mount: /dev/hda5 has wrong major or minor number

you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the "-t xfs" option on mount or the "xfs" option in <tt>/etc/fstab</tt>.

If you get something like:

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
or too many mounted file systems

Refer to your system log file (<tt>/var/log/messages</tt>) for a detailed diagnostic message from the kernel.

== Q: Does the filesystem have an undelete capability? ==

There is no undelete in XFS (so far).

However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].

Such implementations also do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.

''xfs_irecover'' or ''xfsr'' may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.

This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS file system.

In any case, the best approach is to always keep backups.

== Q: How can I back up an XFS filesystem and ACLs? ==

You can back up an XFS filesystem with utilities like '''xfsdump(8)''' and standard '''tar(1)''' for standard files. If you want to back up ACLs and EAs, you will need to use '''xfsdump''', [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (> version 3.1.4) or [http://rsync.samba.org/ rsync] (>= version 3.0.0). '''xfsdump''' can also be integrated with [http://www.amanda.org/ amanda(8)].

== Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong? ==

The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], "Structure needs cleaning."

The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.

There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.

You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).

== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==

Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.

Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the '''xfs_bmap(8)''' command).

== Q: What is the problem with the write cache on journaled filesystems? ==

Many drives use a write back cache in order to speed up the performance of writes. However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk. Further, the drive can de-stage data from the write cache to the platters in any order that it chooses. This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk. When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.

With hard disk cache sizes of currently (Jan 2009) up to 32MB that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A power failure "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you '''must''' disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.

== Q: How can I tell if I have the disk write cache enabled? ==

For SCSI/SATA:

* Look in dmesg(8) output for a driver line, such as: "SCSI device sda: drive cache: write back"
* <nowiki># sginfo -c /dev/sda | grep -i 'write cache' </nowiki>

For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):

* <nowiki># hdparm -I /dev/sda</nowiki> and look under "Enabled Supported" for "Write cache"

For RAID controllers:

* See the section about RAID controllers below

== Q: How can I address the problem with the disk write cache? ==

=== Disabling the disk write back cache. ===

For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough): 

* <nowiki># hdparm -W0 /dev/sda</nowiki> or <nowiki># hdparm -W0 /dev/hda</nowiki>
* <nowiki># blktool /dev/sda wcache off</nowiki> or <nowiki># blktool /dev/hda wcache off</nowiki>

For SCSI:

* Using sginfo(8), which is a little tedious as it takes 3 steps. For example:
*# <nowiki># sginfo -c /dev/sda</nowiki> which gives a list of attribute names and values
*# <nowiki># sginfo -cX /dev/sda</nowiki> which gives an array of cache values which you must match up with the attribute names from step 1, e.g. 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0
*# <nowiki># sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0</nowiki> which allows you to reset the value of the cache attributes.
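
The two arrays in steps 2 and 3 are easy to misread, so it can help to compare them mechanically before writing anything to the drive. The following sketch reports which positions differ; the helper name is made up for illustration. Which attribute a position corresponds to still has to be confirmed against the names listed in step 1 on your own drive.

```python
def changed_fields(before, after):
    """Return (position, old, new) for each differing value; positions start at 1."""
    old_values = before.split()
    new_values = after.split()
    if len(old_values) != len(new_values):
        raise ValueError("the two arrays must have the same number of values")
    return [
        (position, old, new)
        for position, (old, new) in enumerate(zip(old_values, new_values), start=1)
        if old != new
    ]


# Arrays copied from steps 2 and 3 above.
step2 = "0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0"
step3 = "0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0"
print(changed_fields(step2, step3))
```

For the example values, only the sixth value changes (from 1 to 0), so that should be the write cache attribute in the step 1 listing.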

For RAID controllers:

* See the section about RAID controllers below

This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled. 

=== Using an external log. ===

Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will '''not''' solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won't be able to guarantee that if the metadata is on a drive with the write cache enabled.

In fact using an external log will disable XFS' write barrier support.

=== Write barrier support. ===

Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with "nobarrier". Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the following 3 scenarios occurs:

* "Disabling barriers, not supported with external log device"
* "Disabling barriers, not supported by the underlying device"
* "Disabling barriers, trial barrier write failed"

If the filesystem is mounted with an external log device then we currently don't support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn't support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.
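
Checking the logs for the three messages listed above can be scripted. The following is a minimal sketch that searches a block of log text for them; the function name and the sample log excerpt (including its line prefixes) are invented for illustration.

```python
BARRIER_MESSAGES = (
    "Disabling barriers, not supported with external log device",
    "Disabling barriers, not supported by the underlying device",
    "Disabling barriers, trial barrier write failed",
)


def find_barrier_warnings(log_text):
    """Return the log lines that contain one of the three messages."""
    return [
        line
        for line in log_text.splitlines()
        if any(message in line for message in BARRIER_MESSAGES)
    ]


# Hypothetical log excerpt; real kernel log lines will look different.
sample_log = (
    "XFS mounting filesystem sdb1\n"
    "Filesystem sdb1: Disabling barriers, trial barrier write failed\n"
    "Ending clean XFS mount for filesystem: sdb1\n"
)
for line in find_barrier_warnings(sample_log):
    print(line)
```

On a real system the text would come from dmesg output or the system log file rather than from a string.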

== Q: Should barriers be enabled with storage which has a persistent write cache? ==

Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with "nobarrier", assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care of the hard disk write cache, which should be off.

== Q: Which settings does my RAID controller need? ==

It's hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:

Real RAID controllers (not those found on mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory "[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]") which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will just lose all contents.

If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.

* onboard RAID controllers: there are so many different types it's hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that after a power failure with RAID-1, when only parts of the disk cache have been written, the controller doesn't even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.

* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86);

* Adaptec: allows setting individual drives cache
arcconf setcache <disk> wb|wt
wb=write back, which means write cache on, wt=write through, which means write cache off. So "wt" should be chosen.

* Areca: In archttp under "System Controls" -> "System Configuration" there's the option "Disk Write Cache Mode" (defaults to "Auto")

"Off": disk write cache is turned off

"On": disk write cache is enabled, this is not safe for your data but fast

"Auto": If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns disk writes off, to protect your data. In case no BBM is attached, the controller switches to "On", because neither controller cache nor disk cache is safe so you don't seem to care about your data and just want high speed (which you get then).

That's a very sensible default, so you can leave it at "Auto" or enforce "Off" to be sure.

* LSI MegaRAID: allows setting individual disks cache:
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL # flushes the controller cache
MegaCli -LDGetProp -Cache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the controller cache settings
MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # shows the disk cache settings (for all phys. disks in logical disk)
MegaCli -LDSetProp -EnDskCache|DisDskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL # set disk cache setting

* Xyratex: from the docs: "Write cache includes the disk drive cache and controller cache.". So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly as the controller write cache is disabled as well.

== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==

The biggest problem is that those products seem to also virtualize disk
writes in a way that even barriers don't work any more, which means even
an fsync is not reliable. Tests confirm that by unplugging the power from
such a system, even one with a RAID controller with battery backed cache
and the hard disk cache turned off (which is safe on a normal host), you
can destroy a database within the virtual machine (client, domU, whatever
you call it).

In qemu you can specify cache=off on the line specifying the virtual
disk. For the other products, information is missing.

== Q: What is the issue with directory corruption in Linux 2.6.17? ==

In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some "sparse" endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.

''Update: the fix is included in 2.6.17.7 and later kernels.''

To add insult to injury, '''xfs_repair(8)''' is currently not correcting these directories on detection of this corrupt state either. This '''xfs_repair''' issue is actively being worked on, and a fixed version will be available shortly.

''Update: a fixed '''xfs_repair''' is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.''

No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.

The '''xfs_check''' tool, or '''xfs_repair -n''', should be able to detect any directory corruption.

Until a fixed '''xfs_repair''' binary is available, one can make use of the '''xfs_db(8)''' command to mark the problem directory for removal (see the example below). A subsequent '''xfs_repair''' invocation will remove the directory and move all contents into "lost+found", named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 3 (btree)
...
xfs_db> write core.mode 0
xfs_db> quit
</nowiki>

A subsequent '''xfs_repair''' will clear the directory, and add new entries (named by inode number) in lost+found.

The easiest way to map inode numbers to full paths is via '''xfs_ncheck(8)'''<nowiki>: </nowiki>

<nowiki>
# xfs_ncheck -i 14101 -i 14102 /dev/sdXXX
14101 full/path/mumble_fratz_foo_bar_1495
14102 full/path/mumble_fratz_foo_bar_1494
</nowiki>

Should this not work, we can manually map inode numbers in B-Tree format directory by taking the following steps:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
...
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:3628
xfs_db> fsblock 3628
xfs_db> type bmapbtd
xfs_db> print
magic = 0x424d4150
level = 0
numrecs = 19
leftsib = null
rightsib = null
recs[1-19] = [startoff,startblock,blockcount,extentflag]
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
</nowiki>

At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the '''xfs_db''' dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:

...
xfs_db> dblock 20
xfs_db> print
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0
dhdr.bestfree[0].length = 0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 13937
du[0].namelen = 25
du[0].name = "mumble_fratz_foo_bar_1595"
du[0].tag = 0x10
du[1].inumber = 13938
du[1].namelen = 25
du[1].name = "mumble_fratz_foo_bar_1594"
du[1].tag = 0x38
...

So, here we can see that inode number 13938 matches up with name "mumble_fratz_foo_bar_1594". Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at "lost+found" (once '''xfs_repair''' has removed the corrupt directory).
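
The sorting of extents into data, leaf and freelist blocks described above can be sketched in Python. This is only an illustration: the two threshold values are simply the start offsets at which the jumps occur in the example output above and may differ on filesystems with other block sizes, and the function name is made up.

```python
import re

# Start offsets at which the leaf and freelist regions begin in the example
# output above; other filesystem geometries may give different values.
LEAF_START = 33554432
FREE_START = 67108864

# Matches one record such as "5:[20,3496,8,0]".
RECORD = re.compile(r"(\d+):\[(\d+),(\d+),(\d+),(\d+)\]")


def classify_extents(bmap_text):
    """Map each record number to (kind, startoff, startblock, blockcount)."""
    extents = {}
    for match in RECORD.finditer(bmap_text):
        number, startoff, startblock, blockcount, _flag = map(int, match.groups())
        if startoff >= FREE_START:
            kind = "free"
        elif startoff >= LEAF_START:
            kind = "leaf"
        else:
            kind = "data"
        extents[number] = (kind, startoff, startblock, blockcount)
    return extents


# Records copied from the xfs_db output above.
sample = """
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
"""

# Print the dblock commands for the data extents only.
for number, (kind, startoff, _block, _count) in sorted(classify_extents(sample).items()):
    if kind == "data":
        print(f"extent {number}: dblock {startoff}")
```

The printed start offsets are the values to feed to the dblock command, one data extent at a time.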

== Q: Why does my > 2TB XFS partition disappear when I reboot ? ==

Strictly speaking this is not an XFS problem.

To support > 2TB partitions you need two things: a kernel that supports large block devices (<tt>CONFIG_LBD=y</tt>) and a partition table format that can hold large partitions. The default DOS partition tables don't. The best partition format for > 2TB partitions is the EFI GPT format (<tt>CONFIG_EFI_PARTITION=y</tt>).

Without <tt>CONFIG_LBD=y</tt> you can't even create the filesystem, but without <tt>CONFIG_EFI_PARTITION=y</tt> it works fine until you reboot, at which point the partition will disappear. Note that you need to enable the <tt>CONFIG_PARTITION_ADVANCED</tt> option before you can set <tt>CONFIG_EFI_PARTITION=y</tt>.
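
The 2TB figure follows from the DOS partition table format, which stores each partition's start and length as 32-bit sector counts. The following sketch works through the arithmetic, assuming 512-byte sectors; the helper function is made up for illustration.

```python
SECTOR_SIZE = 512      # bytes; an assumption, some disks use larger sectors
MAX_SECTORS = 2 ** 32  # values below this fit in a 32-bit field

limit_bytes = SECTOR_SIZE * MAX_SECTORS
print(f"{limit_bytes} bytes = {limit_bytes // 2**40} TiB")


def fits_in_dos_partition_table(start_sector, sector_count):
    """True if both values fit the 32-bit fields of a DOS partition entry."""
    return start_sector < MAX_SECTORS and sector_count < MAX_SECTORS
```

A partition whose sector count does not fit in 32 bits cannot be described by a DOS partition table, which is consistent with the partition disappearing once the table is re-read at boot.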

== Q: Why do I receive <tt>No space left on device</tt> after <tt>xfs_growfs</tt>? ==

After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:

The only way to fix this is to move data around to free up space
below 1TB. Find your oldest data (i.e. that was around before even
the first grow) and move it off the filesystem (move, not copy).
Then if you copy it back on, the data blocks will end up above 1TB
and that should leave you with plenty of space for inodes below 1TB.

A complete dump and restore will also fix the problem ;)

Also, you can add 'inode64' to your mount options to allow inodes to live above 1TB.

Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&forum=38 No space left on device on xfs filesystem with 7.7TB free]

== Q: Does mounting with noatime and/or nodiratime give any performance benefit on XFS (or does not using them decrease performance)? ==

The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons.

Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.

== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==

The trick is to go in with xfs_db and mark the inode as deleted, which will cause xfs_repair to clean it up and finish the removal process.

xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX

== Q: How to calculate the correct sunit,swidth values for optimal performance ==

XFS can be optimized for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs or mount options.

These options can sometimes be autodetected (for example with md raid and a recent enough kernel (>= 2.6.32) and xfsprogs (>= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.

The calculation of these values is quite simple:

su = <RAID controller's stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks (don't count parity disks)>

So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use

su = 64k
sw = 6 (RAID-6 of 8 disks has 6 data disks)

A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

su = 256k
sw = 8 (RAID-10 of 16 disks has 8 data disks)
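
The two examples above can be reproduced with a small helper. This is a sketch under stated assumptions: the RAID-0 and RAID-5 rules are not taken from this FAQ but follow the same "count only data disks" principle, the function names are made up, and the output uses the <tt>-d su=...,sw=...</tt> form of the mkfs.xfs data section options.

```python
def data_disks(raid_level, total_disks):
    """Number of disks holding data (parity and mirror copies excluded)."""
    if raid_level == "raid0":
        return total_disks
    if raid_level == "raid5":
        return total_disks - 1
    if raid_level == "raid6":
        return total_disks - 2
    if raid_level == "raid10":
        return total_disks // 2
    raise ValueError(f"unsupported RAID level: {raid_level}")


def mkfs_stripe_options(raid_level, total_disks, stripe_kib):
    """Build the su/sw part of a mkfs.xfs command line."""
    sw = data_disks(raid_level, total_disks)
    return f"-d su={stripe_kib}k,sw={sw}"


print(mkfs_stripe_options("raid6", 8, 64))
print(mkfs_stripe_options("raid10", 16, 256))
```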

Alternatively, you can use "sunit" instead of "su" and "swidth" instead of "sw" but then sunit/swidth values need to be specified in "number of 512B sectors"!

Note that <tt>xfs_info</tt> and <tt>mkfs.xfs</tt> interpret sunit and swidth as being specified in units of 512B sectors; that's unfortunately not the unit they're reported in, however.
<tt>xfs_info</tt> and <tt>mkfs.xfs</tt> report them in multiples of your basic block size (bsize) and not in 512B sectors.

Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.
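
The unit conversions in this answer are easy to get wrong, so here is a minimal sketch of them in Python. The function names are made up for illustration; the only facts used are the 512B sector unit and the block size (bsize) described above.

```python
SECTOR = 512  # bytes per sector, the unit sunit/swidth are specified in


def sectors_to_blocks(value_in_sectors, bsize):
    """Convert a sunit/swidth value from 512B sectors to filesystem blocks."""
    total_bytes = value_in_sectors * SECTOR
    if total_bytes % bsize:
        raise ValueError("value is not a whole number of filesystem blocks")
    return total_bytes // bsize


def su_sw_to_sectors(su_bytes, sw):
    """Return (sunit, swidth) in 512B sectors for a given su (bytes) and sw."""
    if su_bytes % SECTOR:
        raise ValueError("su must be a multiple of 512 bytes")
    sunit = su_bytes // SECTOR
    return sunit, sunit * sw


# swidth 1024 (sectors) with a 4096-byte block size, as in the example above.
print(sectors_to_blocks(1024, 4096))
# su=64k, sw=6 expressed as sunit/swidth in sectors.
print(su_sw_to_sectors(64 * 1024, 6))
```

The first call reproduces the reported swidth of 128 from the example; the second shows that su=64k with sw=6 corresponds to sunit=128 and swidth=768 when given in sectors.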

When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.

== Q: Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work? ==

The default <tt>fsid</tt> type encodes only 32 bits of the inode number for subdirectory exports. However, exporting the root of the filesystem works, and using one of the non-default <tt>fsid</tt> types (<tt>fsid=uuid</tt> in <tt>/etc/exports</tt> with recent <tt>nfs-utils</tt>) should work as well. (Thanks, Christoph!)

== Q: What is the inode64 mount option for? ==

By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like "disk full" when you still have plenty of space free, but there's no more room in the first TB to create a new inode. Also, performance sucks.

To get around this, use the inode64 mount option for filesystems >1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.

Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.

== Q: Can I just try the inode64 option to see if it helps me? ==

Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.

== Q: Performance: mkfs.xfs -n size=64k option ==

When asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:

Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.

There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there's the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.

For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.

In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don't have any numbers on what the difference might be - I'm getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....

== Q: I want to tune my XFS filesystems for <something> ==

The standard answer you will get to this question is this: use the defaults.

There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.

There are a lot of "XFS tuning guides" that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don't expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.

In most cases, the only thing you need to consider for <tt>mkfs.xfs</tt> is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the <tt>logbsize</tt> and <tt>delaylog</tt> mount options. Increasing <tt>logbsize</tt> reduces the number of journal IOs for a given workload, and <tt>delaylog</tt> will reduce them even further. The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.

As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.

== Q: Which factors influence the memory usage of xfs_repair? ==

This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).

# xfs_repair -n -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2096.
#

xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,
of which 2,097,152KB is needed for tracking free space.
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)

Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:

# xfs_repair -vv -m 1 /dev/vda
Phase 1 - find and verify superblock...
- max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152
Required memory for repair is greater that the maximum specified
with the -m option. Please increase it to at least 2289.

That is, it now needs roughly another 200MB of RAM to run.

The numbers reported by xfs_repair are the absolute minimum required and approximate at that;
more RAM than this may be required to complete successfully.
Also, if you only give xfs_repair the minimum required RAM, it will be slow;
for best repair performance, the more RAM you can give it the better.
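
The figures in the two runs above can be reproduced with a small Python sketch. The shifts and the fixed 50000KB overhead are assumptions chosen because they match the printed imem, dmem and "at least" values; they are not taken from the xfs_repair documentation, and the function name <tt>repair_memory_estimate</tt> is made up for this example.

```python
def repair_memory_estimate(icount, dblocks, overhead_kb=50000):
    """Estimate xfs_repair's minimum memory from icount and dblock.

    Returns (imem_kb, dmem_kb, minimum_mb). The formula is an assumption
    fitted to the example output above, not a documented interface.
    """
    imem_kb = icount >> 8    # about 4 bytes per inode, expressed in KB
    dmem_kb = dblocks >> 11  # about half a byte per block, expressed in KB
    total_kb = imem_kb + dmem_kb + overhead_kb
    return imem_kb, dmem_kb, total_kb // 1024


# The nearly empty 16TB filesystem from the first run.
print(repair_memory_estimate(64, 4294967296))
# The same filesystem after adding about 50 million inodes.
print(repair_memory_estimate(50401792, 4294967296))
```

Under these assumptions the free space tracking (dmem) depends only on the size of the filesystem, while the inode tracking (imem) grows with the number of inodes.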

== Q: Why do some files on my filesystem show as "?????????? ? ? ? ? ? filename"? ==

If ls -l shows you a listing as

# ?????????? ? ? ? ? ? file1
?????????? ? ? ? ? ? file2
?????????? ? ? ? ? ? file3
?????????? ? ? ? ? ? file4

and errors like:
# ls /pathtodir/
ls: cannot access /pathtodir/file1: Invalid argument
ls: cannot access /pathtodir/file2: Invalid argument
ls: cannot access /pathtodir/file3: Invalid argument
ls: cannot access /pathtodir/file4: Invalid argument

or even:
# failed to stat /pathtodir/file1

It is very likely that your filesystem needs to be mounted with inode64. Remounting with
# mount -oremount,inode64 /dev/diskpart /mnt/xfs

should make it work again.
If it works, add the option to fstab.

== Q: The xfs_db "frag" command says I'm over 50%. Is that bad? ==

It depends. It's important to know how the value is calculated. xfs_db looks at the extents in all files, and returns:

(actual extents - ideal extents) / actual extents

This means that if, for example, you have an average of 2 extents per file, you'll get an answer of 50%. 4 extents per file would give you 75%. This may or may not be a problem, especially depending on the size of the files in question. (i.e. 400GB files in four 100GB extents would hardly be considered badly fragmented). The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.

Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:
[[Image:Frag_factor.png|500px]]
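
The calculation can be sketched in Python to show how quickly the factor climbs. The function name <tt>frag_factor</tt> is made up for this example, and the ideal case is assumed to be one extent per file.

```python
def frag_factor(actual_extents, ideal_extents):
    """Fragmentation factor in percent: (actual - ideal) / actual."""
    if actual_extents <= 0:
        raise ValueError("actual_extents must be positive")
    return (actual_extents - ideal_extents) / actual_extents * 100


# Average extents per file, assuming one ideal extent per file.
for avg_extents in (1, 2, 4, 10, 100):
    print(avg_extents, frag_factor(avg_extents, 1))
```

Going from 10 to 100 extents per file only moves the factor from 90% to 99%, so a high percentage on its own says little about how bad the fragmentation is.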

XFS Papers and Documentation

2012-01-30T03:43:31Z

Dgc:

=== Primary XFS Documentation ===

The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format. The material is suitable for experienced users as well as developers and support staff. The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-xml-dev.git;a=summary git repository] and builds of the documentation are available here:

* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]

* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]
** [http://sites.google.com/site/kandamotohiro/xfs Japanese translation] is also available.

* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]

* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)])

The format of <tt>/proc/fs/xfs/stat</tt> also has been documented:
* [[Runtime_Stats|Runtime_Stats]]

=== Papers, Presentations, Etc ===

At the linux.conf.au 2012 event, Dave Chinner presented a talk on filesystem metadata scalability:

* ''XFS - Recent and Future Adventures in Filesystem Scalability'' [[http://www.youtube.com/watch?v=FegjLbCnoBw Video]] [[http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf Presentation Slides]]

The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:

* ''XFS: The big storage file system for Linux'' [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]

At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:

* ''High Bandwidth Filesystems on Large Systems'' (July 2006) [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf presentation]]

At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:

* Fixing XFS Filesystems Faster [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]

In July 2006, SGI storage marketing updated the XFS datasheet:

* ''Open Source XFS for Linux'' [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]

At UKUUG 2003, Christoph Hellwig presented a talk on XFS:

* ''XFS for Linux'' (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]

Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:

* ''Filesystem Performance and Scalability in Linux 2.4.17'' (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]

At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:

* ''Porting XFS to Linux'' (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]

At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:

* ''Porting the SGI XFS File System to Linux'' (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]

At the 6th Linux Kongress & the Linux Storage Management Workshop (LSMW) in Germany in September, 1999, SGI had a few presentations including the following:

* ''SGI's port of XFS to Linux'' (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]
* ''Overview of DMF'' (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]

At the LinuxWorld Conference & Expo in August 1999, SGI published:

* ''An Open Source XFS data sheet'' (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]

From the 1996 USENIX conference:

* ''An XFS white paper'' [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]

=== Other historical articles, press-releases, etc ===

* IBM's ''Advanced Filesystem Implementor's Guide'' has a chapter ''Introducing XFS'' [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]

* An editorial titled ''Tired of fscking? Try a journaling filesystem!'', Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]

* ''Who give a fsck about filesystems'' provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf html]]

* ''Journal File Systems'' in issue 55 of ''Linux Gazette'' provides a comparison of journaled filesystems.

* The original XFS beta release announcement was published in ''Linux Today'' (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]

* ''XFS: It's worth the wait'' was published on ''EarthWeb'' (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]

* An ''IRIX-XFS data sheet'' (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]

* The ''Getting Started with XFS'' book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]

* Original ''XFS design documents'' (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])

File:Xfs-scalability-lca2012.pdf

2012-01-30T03:37:30Z

Dgc: Slides to LCA 2012 presentation "XFS - Adventures in Scalability"

Slides to LCA 2012 presentation "XFS - Adventures in Scalability"

FITRIM/discard

2011-10-11T22:30:46Z

Dgc:

== Purpose ==

FITRIM is a mounted filesystem feature to discard (or "[http://en.wikipedia.org/wiki/TRIM trim]") blocks which are not in use by the filesystem. This is useful for solid-state drives (SSDs) and thinly-provisioned storage.

== Requirements ==

#The block device underneath the filesystem must support the FITRIM operation.
#The kernel must include TRIM support and XFS must include FITRIM support (this has been true for Linux since v2.6.38, Jan 18 2011)
#Realtime discard mode requires a more recent v3.0 kernel

This can be verified by viewing /sys/block/<dev>/queue/discard_max_bytes -- If the value is zero then your device doesn't support discard
operations.
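
The check can be scripted. The following Python sketch reads the sysfs file named above; the function name <tt>supports_discard</tt> and the <tt>sysfs_root</tt> parameter (there so the function can be pointed at a test directory) are made up for this example, and <tt>sda</tt> is only a placeholder device name.

```python
from pathlib import Path


def supports_discard(dev, sysfs_root="/sys/block"):
    """Return True/False from discard_max_bytes, or None if it is unreadable."""
    path = Path(sysfs_root) / dev / "queue" / "discard_max_bytes"
    try:
        value = int(path.read_text().strip())
    except (OSError, ValueError):
        # No such device, no such attribute, or unexpected contents.
        return None
    # A value of zero means the device does not support discard operations.
    return value > 0


if __name__ == "__main__":
    print(supports_discard("sda"))
```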

== Modes of Operation ==

* Realtime discard -- As files are removed, the filesystem issues discard requests automatically
* Batch Mode -- A user procedure that trims all or portions of the filesystem

=== Realtime discard ===

This mode issues discard requests automatically as files are removed from the filesystem. No other command or process is required.

Realtime discard is selected by adding the filesystem option <code>discard</code> while mounting.

This can be done in either of the following ways:

# placing <code>discard</code> in your /etc/fstab for the filesystem: <code>/dev/sda1 /mountpoint xfs defaults,discard 0 1</code>
# mount options: <code>mount -o discard /dev/sda1 /mountpoint</code>

=== Batch Mode ===

This mode requires user intervention. This intervention is in the form of the command <code>fstrim</code>. It has been included in [http://en.wikipedia.org/wiki/Util-linux util-linux-ng] since about Nov 26, 2010.

Usage example:
<code>fstrim /mountpoint</code>

----

== References ==
# FITRIM description - Lukas Czerner <lczerner at redhat.com> http://patchwork.xfs.org/patch/1490/
# Block requirements - Dave Chinner <david at fromorbit.com> http://oss.sgi.com/pipermail/xfs/2011-October/053379.html
# util-linux-ng addition - Karel Zak <kzak@xxxxxxxxxx> http://www.spinics.net/lists/util-linux-ng/msg03646.html

Improving Metadata Performance By Reducing Journal Overhead

2010-12-23T04:07:09Z

Dgc: /* Improving Metadata Performance By Reducing Journal Overhead */

== Improving Metadata Performance By Reducing Journal Overhead ==

XFS currently uses asynchronous write-ahead logging to ensure that changes to
the filesystem structure are preserved on crash. It does this by logging
detailed records of the changes being made to each object on disk during a
transaction. Every byte that is modified needs to be recorded in the journal.

There are two issues with this approach. The first is that transactions can
modify a *lot* of metadata to complete a single operation. Worse is the fact
that the average size of a transaction grows as structures get larger and
deeper, so performance on larger, fuller filesystems drops off as log bandwidth
is consumed by fewer, larger transactions.

The second is that we re-log previous changes that are active in the journal
if the object is modified again. Hence if an object is modified repeatedly, the
dirty parts of the object get rewritten over and over again. In the worst case,
frequently logged buffers will be entirely dirty and so even if we only change
a single byte in the buffer we'll log the entire buffer.

The problem can be approached along two different axes:

- reduce the amount we log in a given transaction
- change the way we re-log objects.

Both of these things give the same end result - we require less bandwidth to
the journal to log changes that are happening in the filesystem.
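
The cost of re-logging can be illustrated with a toy model in Python. It is not how XFS tracks dirty regions (the byte-granular set, the function name <tt>journal_bytes</tt> and the <tt>aggregate</tt> flag are all assumptions of this sketch); it only counts how many bytes reach the journal when every transaction re-logs the accumulated dirty parts of one buffer, compared with logging the final dirty state once.

```python
def journal_bytes(modifications, buffer_size=4096, aggregate=False):
    """Count journal bytes for a series of changes to a single buffer.

    modifications: list of (offset, length) byte ranges, one per transaction.
    aggregate:     if False, each transaction re-logs every byte dirtied so
                   far; if True, the dirty bytes are logged once at the end.
    """
    dirty = set()
    logged = 0
    for offset, length in modifications:
        dirty.update(range(offset, min(offset + length, buffer_size)))
        if not aggregate:
            # Re-logging: earlier changes are written to the journal again.
            logged += len(dirty)
    if aggregate:
        logged = len(dirty)
    return logged


# 100 transactions, each changing one byte at a different offset.
mods = [(i * 40, 1) for i in range(100)]
print(journal_bytes(mods))                  # re-logging on every transaction
print(journal_bytes(mods, aggregate=True))  # one aggregated write
```

In this model the re-logged total grows quadratically with the number of modifications (5050 bytes for the 100 changes above), while the aggregated total is just the 100 bytes that actually changed.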

== Asynchronous Transaction Aggregation ==

Status: Done, known as delayed logging.

Experimental in 2.6.35, stable for production in 2.6.37, planned for default
in 2.6.39.

Design documentation can be found here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt

== Atomic Multi-Transaction Operations ==

A feature asynchronous transaction aggregation makes possible is atomic
multi-transaction operations. On the first transaction we hold the queue in
memory, preventing it from being committed. We can then do further transactions
that will end up in the same commit record, and on the final transaction we
unlock the async transaction queue. This will allow all those transactions to be
applied atomically. This is far simpler than any other method I've been looking
at to do this.

After a bit of reflection, I think this feature may be necessary for correct
implementation of existing logging techniques. The way we currently implement
rolling transactions (with permanent log reservations and rolling
dup/commit/re-reserve sequences) would seem to require all the commits in a
rolling transaction to be included in a single commit record. If I understand
history and the original design correctly, these rolling transactions were
implemented so that large, complex transactions would not pin the tail of the
log as they progressed. IOWs, they implicitly use re-logging to keep the tail
of the log moving forward as they progress and continue to modify items in the
transaction.

Given we are using asynchronous transaction aggregation as a method of reducing
re-logging, it would make sense to prevent these sorts of transactions from
pinning the tail of the log at all. Further, because we are effectively
disturbing the concept of unique transactions, I don't think that allowing a
rolling transaction to span aggregated commits is valid as we are going to be
ignoring the transaction IDs that are used to identify individual transactions.

Hence I think it is a good idea to simply replace rolling transactions with
atomic multi-transaction operations. This may also allow us to split some of
the large compound transactions into smaller, more self contained transactions.
This would reduce reservation pressure on log space in the common case where
all the corner cases in the transactions are not taken. In terms of
implementation, I think we can initially augment the permanent transaction
reservation/release interface to achieve this. With a working implementation,
we can then look to changing to a more explicit interface and slowly work to
remove the 'permanent log transaction' concept entirely. This should simplify
the log code somewhat....

Note: This asynchronous transaction aggregation is originally based on a
concept floated by Nathan Scott called 'Delayed Logging' after observing how
ext3 implemented journalling. This never passed more than a concept
description phase....

== Operation Based Logging ==

The second approach to reducing log traffic is to change exactly what we
log in the transactions. At the moment, what we log is the exact change to
the item that is being made. For things like inodes and dquots, this isn't
particularly expensive because it is already a very compact form. The issue
comes with changes that are logged in buffers.

The prime example of this is a btree modification that involves either removing
or inserting a record into a buffer. The records are kept in compact form, so an
insert or remove will also move other records around in the buffer. In the worst
case, a single insert or remove of a 16 byte record can dirty an entire block
(4k generally, but could be up to 64k). In this case, if we were to log the
btree operation (e.g. insert {record, index}) rather than the resultant change
on the buffer the overhead of a btree operation is fixed. Such logging also
allows us to avoid needing to log the changes due to splits and merges - we just
replay the operation and subsequent splits/merges get done as part of replay.
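
The difference between the two logging styles can be put into numbers with a small Python sketch. It models a btree block as a compact array of 16 byte records and ignores the block header; the 4 byte index in the logical record and both function names are assumptions made for this illustration, not an XFS log format.

```python
RECORD_SIZE = 16  # bytes per btree record, as in the example above


def physical_log_bytes(num_records, index):
    """Bytes dirtied in the buffer by inserting a record at `index`.

    The new record is written and every record at or after `index`
    is shifted up by one slot, so all of those bytes change.
    """
    return (num_records - index + 1) * RECORD_SIZE


def logical_log_bytes(index_size=4):
    """Bytes needed to log 'insert {record, index}'; independent of fill."""
    return RECORD_SIZE + index_size


# A 4k block holds up to 256 records; insert at the front of 255 records.
print(physical_log_bytes(255, 0))    # the whole 4k block is dirtied
print(physical_log_bytes(255, 255))  # appending dirties only one record
print(logical_log_bytes())           # fixed cost either way
```

The physical cost ranges from 16 bytes to the full 4096 bytes depending on where the record lands, while the logical cost stays fixed.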

The result of this is that complex transactions no longer need as much log space
as all possible changes they can cause - we only log the basic operations that
are occurring and their result. Hence transactions end up being much smaller,
vary less in size between empty and full filesystems, etc. An example set of
operations describing all the changes made by an extent allocation on an inode
would be:

- inode X intent to allocate extent {off, len}
- AGCNT btree update record in AG X {old rec} {new rec values}
- AGBNO btree delete record in AG X {block, len}
- inode X BMBT btree insert record {off, block, len}
- inode X delta

This comes down to a relatively small, bounded amount of space which is close to the
minimum an existing allocation transaction would consume. However, with this
method of logging the transaction size does not increase with the size of
structures or the amount of updates necessary to complete the operations.

A major difference to the existing transaction system is that re-logging
of items doesn't fit very neatly with operation based logging.

There are three main disadvantages to this approach:

- recovery becomes more complex - it will need to change substantially
to accommodate operation replay rather than just reading from disk
and applying deltas.
- we have to create a whole new set of item types and add the necessary
hooks into the code to log all the operations correctly.
- re-logging is probably not possible, and that introduces
differences to the way we'll need to track objects for flushing. It
may, in fact, require transaction IDs in all objects to allow us
to determine what the last transaction that modified the item
on disk was during recovery.

Changing the logging strategy as described is a much more fundamental change to
XFS than asynchronous transaction aggregation. It will be difficult to change
to such a model in an evolutionary manner; it is more of a 'flag day' style
change where the entire functionality needs to be added in one hit. Given that
we will also still have to support the old log format, it doesn't enable us to
remove any code, either.

Given that we are likely to see major benefits in the problem workloads as a
result of asynchronous transaction aggregation, it may not be necessary to
completely rework the transaction subsystem. Combining aggregation with an
ongoing process of targeted reduction of transaction size will provide benefits
out to at least the medium term. It is unclear whether this direction will be
sufficient in the long run until we can measure the benefit that aggregation
will provide.

== Reducing Transaction Overhead ==

Per iclog callback list locks: Done

AIL tail pushing in its own thread: Done

Bulk AIL insert and delete operations: Done

Log grant lock split-up: Done

Lock free transaction reserve path: Done

Moving all of the log interfacing out of the direct transaction commit path may provide similar benefits to moving the AIL pushing into its own thread. This will mean that there will typically only be a single thread formatting and writing to iclog buffers. This will remove much of the parallelism that puts excessive pressure on many of these locks.

== Reducing Recovery Time ==

With 2GB logs, recovery can take an awfully long time due to the need
to read each object synchronously as we process the journal. An obvious
way to avoid this is to add another pass to the processing to do asynchronous
readahead of all the objects in the log before doing the processing passes.
This will populate the cache as quickly as possible and hide any read latency
that could occur as we process commit records.

A logical extension to this is to sort the objects in ascending offset order
before issuing I/O on them. That will further optimise the readahead I/O
to reduce seeking and hence should speed up the read phase of recovery
further.
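
The effect of sorting before issuing readahead can be illustrated with a short Python sketch. It treats each logged object as a single disk address and uses total head movement as a stand-in for seek cost; both simplifications, and the function names, are assumptions of this example rather than a description of the recovery code.

```python
def readahead_order(log_items):
    """Unique object addresses from the log, in ascending offset order."""
    return sorted(set(log_items))


def seek_distance(addresses, start=0):
    """Total distance travelled when reading the addresses in the given order."""
    total, position = 0, start
    for address in addresses:
        total += abs(address - position)
        position = address
    return total


# Object addresses in the order they appear in the journal (with repeats).
journal_order = [900, 10, 500, 10, 20, 900]
print(seek_distance(journal_order))
print(seek_distance(readahead_order(journal_order)))
```

Deduplicating and sorting cuts the travel distance in this example from 3660 to 900, and each object is read only once.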

== ToDo ==
Further investigation of recovery for future optimisation.

Improving Metadata Performance By Reducing Journal Overhead

2010-12-23T04:03:30Z

Dgc: /* Reducing Transaction Overhead */

== Improving Metadata Performance By Reducing Journal Overhead ==

XFS currently uses asynchronous write-ahead logging to ensure that changes to
the filesystem structure are preserved on crash. It does this by logging
detailed records of the changes being made to each object on disk during a
transaction. Every byte that is modified needs to be recorded in the journal.

There are two issues with this approach. The first is that transactions can
modify a *lot* of metadata to complete a single operation. Worse is the fact
that the average size of a transaction grows as structures get larger and
deeper, so performance on larger, fuller filesystems drops off as log bandwidth
is consumed by fewer, larger transactions.

The second is that we re-log previous changes that are active in the journal
if the object is modified again. Hence if an object is modified repeatedly, the
dirty parts of the object get rewritten over and over again. In the worst case,
frequently logged buffers will be entirely dirty and so even if we only change
a single byte in the buffer we'll log the entire buffer.

An example of how needless this can be is removing all the files in a
directory, which results in the directory blocks being logged over and over
again before finally being freed and made stale in the log. If we are freeing
the entire contents of the directory, the only transaction we really need in
the journal w.r.t. directory buffers is the 'remove, stale and free'
transaction; all other changes are irrelevant because we don't care about
changes to free space. Depending on the directory block size, we might log each
directory buffer tens to hundreds of times before making it stale...

Clearly we have two different axes along which to approach this problem:

- reduce the amount we log in a given transaction
- reduce the number of times we re-log objects.

Both of these things give the same end result - we require less bandwidth to
the journal to log changes that are happening in the filesystem. Let's start
by looking at how to reduce re-logging of objects.

== Asynchronous Transaction Aggregation ==

Status: Done, known as delayed logging.

Experimental in 2.6.35, stable for production in 2.6.37, planned for default
in 2.6.39.

Design documentation can be found here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt

== Atomic Multi-Transaction Operations ==

A feature asynchronous transaction aggregation makes possible is atomic
multi-transaction operations. On the first transaction we hold the queue in
memory, preventing it from being committed. We can then do further transactions
that will end up in the same commit record, and on the final transaction we
unlock the async transaction queue. This will allow all those transactions to be
applied atomically. This is far simpler than any other method I've been looking
at to do this.

After a bit of reflection, I think this feature may be necessary for correct
implementation of existing logging techniques. The way we currently implement
rolling transactions (with permanent log reservations and rolling
dup/commit/re-reserve sequences) would seem to require all the commits in a
rolling transaction to be included in a single commit record. If I understand
history and the original design correctly, these rolling transactions were
implemented so that large, complex transactions would not pin the tail of the
log as they progressed. IOWs, they implicitly use re-logging to keep the tail
of the log moving forward as they progress and continue to modify items in the
transaction.

Given we are using asynchronous transaction aggregation as a method of reducing
re-logging, it would make sense to prevent these sorts of transactions from
pinning the tail of the log at all. Further, because we are effectively
disturbing the concept of unique transactions, I don't think that allowing a
rolling transaction to span aggregated commits is valid as we are going to be
ignoring the transaction IDs that are used to identify individual transactions.

Hence I think it is a good idea to simply replace rolling transactions with
atomic multi-transaction operations. This may also allow us to split some of
the large compound transactions into smaller, more self contained transactions.
This would reduce reservation pressure on log space in the common case where
all the corner cases in the transactions are not taken. In terms of
implementation, I think we can initially augment the permanent transaction
reservation/release interface to achieve this. With a working implementation,
we can then look to changing to a more explicit interface and slowly work to
remove the 'permanent log transaction' concept entirely. This should simplify
the log code somewhat....

Note: This asynchronous transaction aggregation is originally based on a
concept floated by Nathan Scott called 'Delayed Logging' after observing how
ext3 implemented journalling. This never passed more than a concept
description phase....

== Operation Based Logging ==

The second approach to reducing log traffic is to change exactly what we
log in the transactions. At the moment, what we log is the exact change to
the item that is being made. For things like inodes and dquots, this isn't
particularly expensive because it is already a very compact form. The issue
comes with changes that are logged in buffers.

The prime example of this is a btree modification that involves either removing
or inserting a record into a buffer. The records are kept in compact form, so an
insert or remove will also move other records around in the buffer. In the worst
case, a single insert or remove of a 16 byte record can dirty an entire block
(4k generally, but could be up to 64k). In this case, if we were to log the
btree operation (e.g. insert {record, index}) rather than the resultant change
on the buffer the overhead of a btree operation is fixed. Such logging also
allows us to avoid needing to log the changes due to splits and merges - we just
replay the operation and subsequent splits/merges get done as part of replay.

The result of this is that complex transactions no longer need as much log space
as all possible changes they can cause - we only log the basic operations that
are occurring and their result. Hence transactions end up being much smaller,
vary less in size between empty and full filesystems, etc. An example set of
operations describing all the changes made by an extent allocation on an inode
would be:

- inode X intent to allocate extent {off, len}
- AGCNT btree update record in AG X {old rec} {new rec values}
- AGBNO btree delete record in AG X {block, len}
- inode X BMBT btree insert record {off, block, len}
- inode X delta

This comes down to a relatively small, bounded amount of space which is close to the
minimum an existing allocation transaction would consume. However, with this
method of logging the transaction size does not increase with the size of
structures or the amount of updates necessary to complete the operations.

A major difference to the existing transaction system is that re-logging
of items doesn't fit very neatly with operation based logging.

There are three main disadvantages to this approach:

- recovery becomes more complex - it will need to change substantially
to accommodate operation replay rather than just reading from disk
and applying deltas.
- we have to create a whole new set of item types and add the necessary
hooks into the code to log all the operations correctly.
- re-logging is probably not possible, and that introduces
differences to the way we'll need to track objects for flushing. It
may, in fact, require transaction IDs in all objects to allow us
to determine what the last transaction that modified the item
on disk was during recovery.

Changing the logging strategy as described is a much more fundamental change to
XFS than asynchronous transaction aggregation. It will be difficult to change
to such a model in an evolutionary manner; it is more of a 'flag day' style
change where the entire functionality needs to be added in one hit. Given that
we will also still have to support the old log format, it doesn't enable us to
remove any code, either.

Given that we are likely to see major benefits in the problem workloads as a
result of asynchronous transaction aggregation, it may not be necessary to
completely rework the transaction subsystem. Combining aggregation with an
ongoing process of targeted reduction of transaction size will provide benefits
out to at least the medium term. It is unclear whether this direction will be
sufficient in the long run until we can measure the benefit that aggregation
will provide.

== Reducing Transaction Overhead ==

Per iclog callback list locks: Done

AIL tail pushing in its own thread: Done

Bulk AIL insert and delete operations: Done

Log grant lock split-up: Done

Lock free transaction reserve path: Done

Moving all of the log interfacing out of the direct transaction commit path may provide similar benefits to moving the AIL pushing into its own thread. This will mean that there will typically only be a single thread formatting and writing to iclog buffers. This will remove much of the parallelism that puts excessive pressure on many of these locks.

== Reducing Recovery Time ==

With 2GB logs, recovery can take an awfully long time due to the need
to read each object synchronously as we process the journal. An obvious
way to avoid this is to add another pass to the processing to do asynchronous
readahead of all the objects in the log before doing the processing passes.
This will populate the cache as quickly as possible and hide any read latency
that could occur as we process commit records.

A logical extension to this is to sort the objects in ascending offset order
before issuing I/O on them. That will further optimise the readahead I/O
to reduce seeking and hence should speed up the read phase of recovery
further.
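
As a rough illustration of why sorting helps, the sketch below builds a readahead order from the offsets referenced in the log and compares the total seek distance with issuing reads in log order. The function names and the use of absolute offset difference as a stand-in for seek cost are assumptions for illustration, not a description of the recovery code.

```python
def readahead_order(log_offsets):
    """Unique disk offsets referenced by the log, in ascending order."""
    return sorted(set(log_offsets))


def seek_distance(offsets):
    """Total head movement if the offsets are read in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))
```

With log offsets `[900, 100, 500, 100, 300]`, reading in log order moves a total of 1800 units, while the deduplicated ascending order `[100, 300, 500, 900]` moves 800.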

== ToDo ==
Further investigation of recovery for future optimisation.

Improving Metadata Performance By Reducing Journal Overhead

2010-12-23T03:58:17Z

Dgc: /* Asynchronous Transaction Aggregation */

== Improving Metadata Performance By Reducing Journal Overhead ==

XFS currently uses asynchronous write-ahead logging to ensure that changes to
the filesystem structure are preserved on crash. It does this by logging
detailed records of the changes being made to each object on disk during a
transaction. Every byte that is modified needs to be recorded in the journal.

There are two issues with this approach. The first is that transactions can
modify a *lot* of metadata to complete a single operation. Worse is the fact
that the average size of a transaction grows as structures get larger and
deeper, so performance on larger, fuller filesystems drops off as log bandwidth
is consumed by fewer, larger transactions.

The second is that we re-log previous changes that are active in the journal
if the object is modified again. Hence if an object is modified repeatedly, the
dirty parts of the object get rewritten over and over again. In the worst case,
frequently logged buffers will be entirely dirty and so even if we only change
a single byte in the buffer we'll log the entire buffer.

An example of how needless this can be is that removing all the files in a
directory results in the directory blocks being logged over and over
again before finally being freed and made stale in the log. If we are freeing
the entire contents of the directory, the only transaction we really need in
the journal w.r.t. directory buffers is the 'remove, stale and free'
transaction; all other changes are irrelevant because we don't care about
changes to free space. Depending on the directory block size, we might log each
directory buffer tens to hundreds of times before making it stale...
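
The cost of re-logging can be estimated with a simplified model. The sketch below assumes that each modification rewrites the whole accumulated dirty region of the buffer to the journal, and compares that with logging the final dirty region once. The byte-granular tracking and the function names are assumptions for illustration; the real log tracks dirty regions in coarser chunks.

```python
def relogged_bytes(changes):
    """Journal bytes written when every change re-logs the dirty region.

    changes is a list of (offset, length) modifications to one buffer.
    """
    dirty = set()
    total = 0
    for offset, length in changes:
        dirty.update(range(offset, offset + length))
        total += len(dirty)  # the accumulated dirty region is written again
    return total


def aggregated_bytes(changes):
    """Journal bytes written when the buffer is logged once at the end."""
    dirty = set()
    for offset, length in changes:
        dirty.update(range(offset, offset + length))
    return len(dirty)
```

For 100 removals that each touch a distinct 32 byte entry in one directory block, this model writes 161600 bytes with re-logging and 3200 bytes when the changes are aggregated.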

Clearly we have two different axes along which to approach this problem:

- reduce the amount we log in a given transaction
- reduce the number of times we re-log objects.

Both of these things give the same end result - we require less bandwidth to
the journal to log changes that are happening in the filesystem. Let's start
by looking at how to reduce re-logging of objects.

== Asynchronous Transaction Aggregation ==

Status: Done, known as delayed logging.

Experimental in 2.6.35, stable for production in 2.6.37, planned for default
in 2.6.39.

Design documentation can be found here:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt

== Atomic Multi-Transaction Operations ==

A feature asynchronous transaction aggregation makes possible is atomic
multi-transaction operations. On the first transaction we hold the queue in
memory, preventing it from being committed. We can then do further transactions
that will end up in the same commit record, and on the final transaction we
unlock the async transaction queue. This will allow all those transactions to be
applied atomically. This is far simpler than any other method I've been looking
at to do this.
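
The hold/unlock behaviour described above can be sketched as a queue with a hold count. The class and method names here are hypothetical and the model ignores locking, log space and ordering concerns; it only shows that transactions added while the queue is held end up in the same commit record.

```python
class AggregationQueue:
    """Toy model of an async transaction queue that can be held in memory."""

    def __init__(self):
        self.pending = []    # transactions waiting for the next commit record
        self.committed = []  # commit records, each a list of transactions
        self.holds = 0       # number of multi-transaction operations in flight

    def hold(self):
        self.holds += 1

    def release(self):
        self.holds -= 1

    def add(self, tx):
        self.pending.append(tx)

    def try_commit(self):
        """Write one commit record unless the queue is held or empty."""
        if self.holds or not self.pending:
            return False
        self.committed.append(self.pending)
        self.pending = []
        return True
```

A caller would `hold()` before the first transaction, `add()` each transaction in turn, and `release()` after the last one; any `try_commit()` in between is refused, so the whole group lands in one commit record.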

After a bit of reflection, I think this feature may be necessary for correct
implementation of existing logging techniques. The way we currently implement
rolling transactions (with permanent log reservations and rolling
dup/commit/re-reserve sequences) would seem to require all the commits in a
rolling transaction to be included in a single commit record. If I understand
history and the original design correctly, these rolling transactions were
implemented so that large, complex transactions would not pin the tail of the
log as they progressed. IOWs, they implicitly use re-logging to keep the tail
of the log moving forward as they progress and continue to modify items in the
transaction.

Given we are using asynchronous transaction aggregation as a method of reducing
re-logging, it would make sense to prevent these sorts of transactions from
pinning the tail of the log at all. Further, because we are effectively
disturbing the concept of unique transactions, I don't think that allowing a
rolling transaction to span aggregated commits is valid as we are going to be
ignoring the transaction IDs that are used to identify individual transactions.

Hence I think it is a good idea to simply replace rolling transactions with
atomic multi-transaction operations. This may also allow us to split some of
the large compound transactions into smaller, more self contained transactions.
This would reduce reservation pressure on log space in the common case where
all the corner cases in the transactions are not taken. In terms of
implementation, I think we can initially augment the permanent transaction
reservation/release interface to achieve this. With a working implementation,
we can then look to changing to a more explicit interface and slowly work to
remove the 'permanent log transaction' concept entirely. This should simplify
the log code somewhat....

Note: This asynchronous transaction aggregation is originally based on a
concept floated by Nathan Scott called 'Delayed Logging' after observing how
ext3 implemented journalling. This never passed more than a concept
description phase....

== Operation Based Logging ==

The second approach to reducing log traffic is to change exactly what we
log in the transactions. At the moment, what we log is the exact change to
the item that is being made. For things like inodes and dquots, this isn't
particularly expensive because it is already a very compact form. The issue
comes with changes that are logged in buffers.

The prime example of this is a btree modification that involves either removing
or inserting a record into a buffer. The records are kept in compact form, so an
insert or remove will also move other records around in the buffer. In the worst
case, a single insert or remove of a 16 byte record can dirty an entire block
(4k generally, but could be up to 64k). In this case, if we were to log the
btree operation (e.g. insert {record, index}) rather than the resultant change
on the buffer, the overhead of a btree operation is fixed. Such logging also
allows us to avoid needing to log the changes due to splits and merges - we just
replay the operation and subsequent splits/merges get done as part of replay.

The result of this is that complex transactions no longer need as much log space
as all possible changes they can cause - we only log the basic operations that
are occurring and their result. Hence transactions end up being much smaller,
vary less in size between empty and full filesystems, etc. An example set of
operations describing all the changes made by an extent allocation on an inode
would be:

- inode X intent to allocate extent {off, len}
- AGCNT btree update record in AG X {old rec} {new rec values}
- AGBNO btree delete record in AG X {block, len}
- inode X BMBT btree insert record {off, block, len}
- inode X delta

This comes down to a relatively small, bounded amount of space which is close to
the minimum an existing allocation transaction would consume. However, with this
method of logging the transaction size does not increase with the size of
structures or the amount of updates necessary to complete the operations.

A major difference to the existing transaction system is that re-logging
of items doesn't fit very neatly with operation based logging.

There are three main disadvantages to this approach:

- recovery becomes more complex - it will need to change substantially
to accommodate operation replay rather than just reading from disk
and applying deltas.
- we have to create a whole new set of item types and add the necessary
hooks into the code to log all the operations correctly.
- re-logging is probably not possible, and that introduces
differences to the way we'll need to track objects for flushing. It
may, in fact, require transaction IDs in all objects to allow us
to determine what the last transaction that modified the item
on disk was during recovery.

Changing the logging strategy as described is a much more fundamental change to
XFS than asynchronous transaction aggregation. It will be difficult to change
to such a model in an evolutionary manner; it is more of a 'flag day' style
change where the entire functionality needs to be added in one hit. Given that
we will also still have to support the old log format, it doesn't enable us to
remove any code, either.

Given that we are likely to see major benefits in the problem workloads as a
result of asynchronous transaction aggregation, it may not be necessary to
completely rework the transaction subsystem. Combining aggregation with an
ongoing process of targeted reduction of transaction size will provide benefits
out to at least the medium term. It is unclear whether this direction will be
sufficient in the long run until we can measure the benefit that aggregation
will provide.

== Reducing Transaction Overhead ==

To switch tracks completely, I have not addressed general issues with overhead
in the transaction subsystem itself. There are several points where the
transaction subsystem will single thread because of filesystem scope locks and
structures. We have, for example, the log grant lock for protecting
reservation and used log space, the AIL lock for tracking dirty metadata, the
log state lock for state transition of log buffers and other associated
structure modifications.

We have already started down the path of reducing contention in
various paths. For example:

- changing iclog reference counts to atomics to avoid needing the log
state lock on every transaction commit
- protecting iclog callback lists with a per-iclog lock instead of the log
state lock
- removing the AIL lock from the transaction reserve path by isolating
AIL tail pushing to a single thread instead of being done
synchronously.

Asynchronous transaction aggregation is likely to perturb the current known
behaviour and bottlenecks as a result of moving all of the log interfacing out
of the direct transaction commit path. Similar to moving the AIL pushing into
its own thread, this will mean that there will typically only be a single
thread formatting and writing to iclog buffers. This will remove much of the
parallelism that puts excessive pressure on many of these locks.

I am certain that asynchronous transaction aggregation will open up new areas
of optimisation in the log formatting and dispatch code - it will probably
enable us to remove a lot of the complexity because we will be able to directly
control the parallelism in the formatting and dispatch of log buffers. This
implies that we may not need to be limited to a fixed pool of fixed sized log
buffers for writing transactions to disk.

However, it is probably best to leave consideration of such optimisations until
after the asynchronous transaction aggregation is implemented and we can
directly observe the pain points that become apparent as a result of such a
change.

== Reducing Recovery Time ==

With 2GB logs, recovery can take an awfully long time due to the need
to read each object synchronously as we process the journal. An obvious
way to avoid this is to add another pass to the processing to do asynchronous
readahead of all the objects in the log before doing the processing passes.
This will populate the cache as quickly as possible and hide any read latency
that could occur as we process commit records.

A logical extension to this is to sort the objects in ascending offset order
before issuing I/O on them. That will further optimise the readahead I/O
to reduce seeking and hence should speed up the read phase of recovery
further.

== ToDo ==
Further investigation of recovery for future optimisation.

Improving inode Caching

2010-12-23T03:51:48Z

Dgc: /* Compressed Inode Cache */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==

Status: Done (October 2008)

== Compressed Inode Cache ==

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have an LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
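
The delayed read queue and its periodic purge could be modelled along the following lines. This is only a sketch: the class name, the explicit `now` timestamps and the use of an ordered dictionary in place of real buffer references are all assumptions for illustration. The 5 second default age is the example value from the text.

```python
from collections import OrderedDict


class DelayedReadQueue:
    """Holds a reference on readahead buffers until read or aged out."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age
        self.held = OrderedDict()  # buffer id -> completion time, oldest first

    def readahead_done(self, buf, now):
        """I/O completion: take a reference by queueing the buffer."""
        self.held.pop(buf, None)
        self.held[buf] = now

    def real_read(self, buf):
        """Follow-up read: drop the queue's reference if we held one."""
        return self.held.pop(buf, None) is not None

    def purge(self, now):
        """Drop references on buffers older than max_age; return them."""
        dropped = []
        while self.held:
            buf, completed = next(iter(self.held.items()))
            if now - completed <= self.max_age:
                break
            del self.held[buf]
            dropped.append(buf)
        return dropped
```

Because entries are kept in completion order, the purge only has to look at the head of the queue and can stop at the first buffer that is still young enough.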

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.
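
A minimal sketch of the two passes, assuming a sorted walk over dirty inode numbers stands in for the radix tree traversal and that a fixed number of inodes share each buffer. The constant, the function name and the integer buffer ids are hypothetical.

```python
INODES_PER_BUFFER = 32  # assumed number of inodes per inode buffer


def writeback_plan(dirty_inodes, uptodate_buffers):
    """Return (readahead, writes): buffer ids in ascending order.

    Pass one issues readahead only for buffers that are not up to date;
    pass two writes every buffer that backs a dirty inode.
    """
    readahead, writes = [], []
    for ino in sorted(dirty_inodes):  # ascending, like a radix tree walk
        buf = ino // INODES_PER_BUFFER
        if buf not in uptodate_buffers and (not readahead or readahead[-1] != buf):
            readahead.append(buf)
        if not writes or writes[-1] != buf:
            writes.append(buf)
    return readahead, writes
```

With dirty inodes `{70, 3, 35, 5, 64}` and buffer 1 already up to date, the readahead pass covers buffers `[0, 2]` and the write pass covers `[0, 1, 2]`, both in ascending order.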

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small amount of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
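
The difference between logging the buffers and logging the intent can be sketched as two log formats that replay to the same on-disk state. The record tuples, the function names and the use of all-zero bytes as the "initialised" inode image are simplifying assumptions for illustration.

```python
INODE_SIZE = 256  # bytes per inode, a common configuration


def log_buffers(start_ino, count):
    """Physical logging: one full inode image per initialised inode."""
    return [("buffer", start_ino + i, bytes(INODE_SIZE)) for i in range(count)]


def log_intent(start_ino, count):
    """Intent logging: a single record describing the range to initialise."""
    return [("init_inodes", start_ino, count)]


def replay(log):
    """Recovery: rebuild the on-disk inode images from either log format."""
    disk = {}
    for record in log:
        if record[0] == "buffer":
            _, ino, image = record
            disk[ino] = image
        elif record[0] == "init_inodes":
            _, start, count = record
            for ino in range(start, start + count):
                disk[ino] = bytes(INODE_SIZE)  # built during recovery
    return disk
```

Both logs replay to identical contents, but the intent log carries one small record where the physical log carries every initialised byte.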

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
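
As a sketch of this idea, the lookup below decodes aligned inode numbers with shifts and masks and only falls back to a btree lookup when the high bit is set. The bit widths, the 64 bit flag position and the `btree_lookup` callback are assumptions for illustration; real field widths depend on the filesystem geometry.

```python
INOPBLOG = 4               # log2(inodes per block): 4k blocks, 256 byte inodes
AGBLKLOG = 20              # log2(blocks per AG), assumed for this example
UNALIGNED_FLAG = 1 << 63   # hypothetical 'non-aligned chunk' marker bit


def locate(ino, btree_lookup):
    """Map an inode number to (agno, agbno, offset in block)."""
    if ino & UNALIGNED_FLAG:
        # Non-aligned chunk: the location must come from the AGI btree.
        return btree_lookup(ino & ~UNALIGNED_FLAG)
    offset = ino & ((1 << INOPBLOG) - 1)
    agbno = (ino >> INOPBLOG) & ((1 << AGBLKLOG) - 1)
    agno = ino >> (INOPBLOG + AGBLKLOG)
    return agno, agbno, offset
```

Inodes in correctly aligned chunks never invoke the callback, so they keep the cheap conversion path.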

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.
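
To illustrate the resolution of such a bitmask, the sketch below assumes one mask bit per filesystem block within a 64 inode chunk. The function names and the one-bit-per-block layout are assumptions for illustration, not a proposed on-disk format.

```python
CHUNK_INODES = 64  # inodes covered by one AGI btree record


def blocks_per_chunk(block_size, inode_size):
    """Number of filesystem blocks spanned by one inode chunk."""
    return CHUNK_INODES * inode_size // block_size


def allocatable_inodes(block_mask, block_size, inode_size):
    """Chunk-relative inode indexes whose block is set in block_mask."""
    per_block = block_size // inode_size
    return [i for i in range(CHUNK_INODES)
            if (block_mask >> (i // per_block)) & 1]
```

With 4k blocks, a chunk of 256 byte inodes spans 4 blocks and a chunk of 512 byte inodes spans 8, so a small mask per record is enough to describe which blocks may hold inodes in those configurations.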

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
freeing the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. As a result
of the lack of frequency that this needs to be done, it can probably be
handled by a single thread or delayed workqueue.
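
The periodic scan could look something like the sketch below, which walks chunk records in inode order and reports runs of adjacent, completely free chunks. The `(start_ino, free_count)` record shape, the function name and the `min_chunks` threshold are assumptions for illustration.

```python
CHUNK_INODES = 64  # inodes per chunk record


def free_runs(records, min_chunks=2):
    """Find contiguous runs of fully free chunks.

    records is an iterable of (start_ino, free_count) tuples.
    Returns a list of (first_start_ino, number_of_chunks) runs.
    """
    runs, run = [], []

    def flush():
        if len(run) >= min_chunks:
            runs.append((run[0], len(run)))

    for start, free in sorted(records):
        fully_free = free == CHUNK_INODES
        if fully_free and run and start == run[-1] + CHUNK_INODES:
            run.append(start)  # extends the current contiguous run
            continue
        flush()
        run = [start] if fully_free else []
    flush()
    return runs
```

Each run returned corresponds to one extent that could be freed in one go, together with a contiguous set of records to remove from the btree.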

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done

Using RCU for radix tree reclaim walks: Done

Non-blocking background reclaim: Done

Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

Per-buftarg buffer LRU reclaim: Done

Per-buftarg shrinker: Done

Per-buffer type reclaim prioritisation: Done

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:51:34Z

Dgc: /* Combining XFS and VFS inodes */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have a LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]
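
As a rough illustration of the density gain, the following Python sketch compares how many inodes fit in a fixed memory budget using the per-inode figures quoted above. The 1 GiB budget, the reading of "1-1.1k" as 1024-1126 bytes and the function name are assumptions made for illustration only; they are not measurements.

```python
# Per-inode footprints quoted in the text, in bytes. Treating "1-1.1k" as
# 1024-1126 bytes is an interpretation, not a measured value.
UNCOMPRESSED_BYTES = (1024, 1126)
COMPRESSED_BYTES = (200, 300)


def inodes_per_budget(budget_bytes, inode_bytes):
    """Number of whole inodes that fit in budget_bytes."""
    return budget_bytes // inode_bytes


BUDGET = 1 << 30  # hypothetical 1 GiB of cache memory

# Compare the worst case of each form (largest footprint per inode).
worst_uncompressed = inodes_per_budget(BUDGET, UNCOMPRESSED_BYTES[1])
worst_compressed = inodes_per_budget(BUDGET, COMPRESSED_BYTES[1])

print("uncompressed inodes:", worst_uncompressed)
print("compressed inodes:  ", worst_compressed)
print("density gain:        %.1fx" % (worst_compressed / worst_uncompressed))
```

Even with the least favourable figures on both sides, the compressed form holds more than three times as many inodes in the same memory.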

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.
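
The 'static slab' behaviour can be sketched in a few lines of Python. This is only a model of the policy described above (a bounded number of objects, with the least recently used object reclaimed for reuse when the cache is full); the class and method names are hypothetical and nothing here reflects the kernel slab allocator API.

```python
from collections import OrderedDict


class FixedSizeCache:
    """Bounded cache: when full, the least recently used entry is reclaimed."""

    def __init__(self, max_objects):
        self.max_objects = max_objects
        self._lru = OrderedDict()  # oldest entry first

    def lookup(self, ino):
        obj = self._lru.get(ino)
        if obj is not None:
            self._lru.move_to_end(ino)  # mark as most recently used
        return obj

    def insert(self, ino, obj):
        if ino in self._lru:
            self._lru.move_to_end(ino)
        elif len(self._lru) >= self.max_objects:
            # Cache is full: reclaim from the LRU end so the slot is reused
            # instead of growing the cache.
            self._lru.popitem(last=False)
        self._lru[ino] = obj

    def __len__(self):
        return len(self._lru)
```

The cache never grows past `max_objects`, which is the property that matters for tightly constrained systems.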

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
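
A minimal model of the delayed read queue, assuming buffers are identified by a hashable key and time is passed in explicitly, might look like the sketch below. The class name, method names and the use of an ordered dictionary are all illustrative assumptions; only the 5 second age comes from the text.

```python
from collections import OrderedDict


class DelayedReadQueue:
    """Holds a reference on readahead buffers until they are used or too old."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age
        self._queue = OrderedDict()  # buffer key -> completion time, oldest first

    def readahead_done(self, buf, now):
        # I/O completion: queue the buffer, which stands in for taking a
        # reference. Re-queueing a buffer refreshes its age.
        self._queue.pop(buf, None)
        self._queue[buf] = now

    def real_read(self, buf):
        # Follow-up read: drop the queue's reference.
        # Returns True if the buffer was still pinned by the queue.
        return self._queue.pop(buf, None) is not None

    def purge(self, now):
        # Periodic purge: release every buffer older than max_age.
        released = []
        while self._queue:
            buf, stamp = next(iter(self._queue.items()))
            if now - stamp <= self.max_age:
                break  # everything behind this entry is younger
            del self._queue[buf]
            released.append(buf)
        return released
```

Because entries are kept in completion order, the purge only ever has to look at the head of the queue.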

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.
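
The two passes can be sketched as follows. This is a simplification that assumes a fixed number of inodes per buffer and uses a sorted walk in place of the radix tree traversal; `plan_writeback` and its parameters are hypothetical names.

```python
def plan_writeback(dirty_inodes, inodes_per_buffer, uptodate_buffers):
    """Return (readahead, writes): buffer indexes in ascending order.

    readahead lists the buffers that must be read before they can be
    modified; writes lists every buffer that will be written back.
    """
    # A sorted walk stands in for the ascending radix tree traversal.
    buffers = sorted({ino // inodes_per_buffer for ino in dirty_inodes})
    # Pass 1: issue readahead only for buffers that are not up to date.
    readahead = [b for b in buffers if b not in uptodate_buffers]
    # Pass 2: modify and write all buffers, again in ascending order.
    writes = buffers
    return readahead, writes


readahead, writes = plan_writeback([70, 3, 130, 5], 64, uptodate_buffers={0})
print(readahead, writes)
```

Both lists come out in ascending offset order, which is what allows the elevator to merge the I/O.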

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small amount of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could alsobe done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
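
To make the saving concrete, here is a small sketch of the idea: the log carries only a range record, and recovery rebuilds the initialised buffer from it. The record layout, field names and the two-byte marker written into each inode slot are placeholders for illustration, not the on-disk or in-log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InodeInitRecord:
    """Hypothetical log record: 'initialise count inodes starting at start'."""
    agno: int
    start: int
    count: int


def replay(record, inode_size):
    """Recovery side: rebuild the initialised inode buffer from the record."""
    buf = bytearray(record.count * inode_size)
    for i in range(record.count):
        # Stamp a placeholder marker at the start of each inode slot.
        buf[i * inode_size:i * inode_size + 2] = b"IN"
    return bytes(buf)


record = InodeInitRecord(agno=0, start=128, count=4096)
rebuilt = replay(record, inode_size=256)
print(len(rebuilt), "bytes rebuilt from one small record")
```

With 256 byte inodes, 4096 inodes is one megabyte of buffer contents that never has to pass through the log.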

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
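
A simplified sketch of this dispatch is shown below. It ignores the AG number that real inode numbers also encode, and the flag position, function name and callback are assumptions made for illustration.

```python
UNALIGNED_FLAG = 1 << 63  # hypothetical: high bit marks a non-aligned chunk


def locate_inode(ino, inodes_per_block_log2, btree_lookup):
    """Return the location of ino, using the btree only when flagged."""
    if ino & UNALIGNED_FLAG:
        # Slow path: non-aligned chunk, fall back to a btree search.
        return btree_lookup(ino & ~UNALIGNED_FLAG)
    # Fast path: aligned chunk, a single set of shifts and masks.
    block = ino >> inodes_per_block_log2
    offset = ino & ((1 << inodes_per_block_log2) - 1)
    return block, offset
```

Inodes in aligned chunks never touch the btree, so the common case keeps its current cost.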

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
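
The per-block bitmask can be illustrated with a short sketch, assuming the 64-inode chunk size mentioned earlier and one mask bit per filesystem block in the chunk. The function name and the mask encoding are hypothetical.

```python
INODES_PER_CHUNK = 64  # chunk size stated earlier in this document


def allocatable_inodes(block_mask, block_size, inode_size):
    """Indexes within the chunk whose backing block is marked usable.

    Bit n of block_mask set means block n of the chunk may hold inodes.
    """
    per_block = block_size // inode_size
    return [i for i in range(INODES_PER_CHUNK)
            if block_mask & (1 << (i // per_block))]


# 4k blocks and 256 byte inodes: 16 inodes per block, 4 blocks per chunk.
usable = allocatable_inodes(0b0101, block_size=4096, inode_size=256)
print(len(usable), usable[0], usable[-1])
```

For 4k blocks the mask needs 4 bits with 256 byte inodes and 8 bits with 512 byte inodes, so a small field in the record is enough to describe every block in the chunk.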

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.
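
The requeue behaviour can be modelled with a toy sketch. Here the in-memory unlinked list is a plain Python list, a set stands in for the inodes whose locks are currently held elsewhere, and `remove_pass` is a hypothetical name; real locking is not modelled.

```python
from collections import deque


def remove_pass(work, unlinked, locked):
    """Try to remove each inode in work from the unlinked list.

    Removing an inode means updating its predecessor, so the predecessor's
    lock is needed. If that lock is held, the inode is requeued for later.
    """
    requeue = deque()
    for ino in work:
        idx = unlinked.index(ino)
        prev = unlinked[idx - 1] if idx > 0 else None
        if prev is not None and prev in locked:
            requeue.append(ino)  # cannot update the predecessor right now
            continue
        unlinked.pop(idx)
    return requeue


unlinked = [10, 20, 30]
pending = remove_pass([20, 30], unlinked, locked={10})
print(list(pending), unlinked)
```

A later pass over the requeued work, once the predecessor has been unlocked, completes the removal.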

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. Because it
needs to run so rarely, it can probably be handled by a single thread or
delayed workqueue.
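
A sketch of the scan, assuming the btree records are available as a sorted list of (starting inode number, allocated inode count) pairs for 64-inode chunks, could look like this. The record shape, the function name and the minimum run length are assumptions for illustration.

```python
CHUNK = 64  # inodes per chunk, as stated earlier in this document


def find_free_runs(records, min_chunks=2):
    """Find runs of adjacent, completely free chunks.

    records: sorted list of (start_ino, allocated_count).
    Returns a list of (first_start_ino, chunk_count) for each run.
    """
    runs = []
    run_start, run_len, prev = None, 0, None
    for start, allocated in records:
        contiguous = prev is not None and start == prev + CHUNK
        if allocated == 0 and run_len and contiguous:
            run_len += 1  # extend the current free run
        else:
            if run_len >= min_chunks:
                runs.append((run_start, run_len))
            # Start a new run if this chunk is free, otherwise reset.
            run_start, run_len = (start, 1) if allocated == 0 else (None, 0)
        prev = start
    if run_len >= min_chunks:
        runs.append((run_start, run_len))
    return runs
```

Each run returned corresponds to one extent to free and one contiguous set of records to remove.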

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
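
Queue depth throttling of this kind is easy to model with a bounded queue: once the queue is at its maximum depth, the caller trying to defer more work has to wait. The class below is a sketch using Python's standard `queue` module; the class name and depth are hypothetical, and a kernel implementation would look quite different.

```python
import queue


class InactivationQueue:
    """Bounded queue of deferred inactivation work."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)

    def defer(self, ino, timeout=None):
        # Blocks while the queue is at max depth, which throttles the
        # caller. Raises queue.Full if a timeout is given and it expires.
        self._q.put(ino, timeout=timeout)

    def take(self):
        # Background worker side: fetch the oldest deferred inode.
        return self._q.get_nowait()

    def depth(self):
        return self._q.qsize()
```

The only tunable is the maximum depth, which bounds how much inactivation work can be outstanding at once.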

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done

Using RCU for radix tree reclaim walks: Done

Non-blocking background reclaim: Done

Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
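
The exclusive two-level arrangement can be sketched as below. Note the assumptions: `zlib` and `json` are only stand-ins for whatever compact in-memory encoding the compressed inode would actually use, inodes are modelled as plain dictionaries, and all names are hypothetical.

```python
import json
import zlib


class TwoLevelInodeCache:
    """Primary cache of active inodes plus an exclusive compressed cache."""

    def __init__(self):
        self.active = {}      # ino -> inode dict (primary cache)
        self.compressed = {}  # ino -> compressed bytes (second level)

    def reclaim(self, ino):
        # Clean, unreferenced inode: compress it instead of freeing it.
        inode = self.active.pop(ino)
        self.compressed[ino] = zlib.compress(json.dumps(inode).encode())

    def lookup(self, ino, read_from_disk):
        if ino in self.active:
            return self.active[ino]
        # Exclusive: an inode lives in exactly one of the two levels.
        blob = self.compressed.pop(ino, None)
        if blob is not None:
            inode = json.loads(zlib.decompress(blob))  # CPU cost, no I/O
        else:
            inode = read_from_disk(ino)  # miss in both levels
        self.active[ino] = inode
        return inode
```

A hit in the second level costs CPU time to uncompress but never calls `read_from_disk`, which is the point of keeping the compressed copy.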

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

Per-buftarg buffer LRU reclaim: Done

Per-buftarg shrinker: Done

Per-buffer type reclaim prioritisation: Done

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
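
The arithmetic behind these figures can be checked directly. The sketch below derives the per-bufferhead size implied by the numbers in the text; it assumes "14MB" means 14 MiB, and the resulting size is an inference from the quoted total rather than a statement about any particular kernel's structure layout.

```python
EXTENT_BYTES = 1 << 30      # the 1GB extent from the text
BLOCK_SIZE = 4096           # 4k block size, as assumed in the text
EXTENT_RECORD_BYTES = 24    # size of the XFS mapping quoted in the text

bufferheads = EXTENT_BYTES // BLOCK_SIZE

# Assumption: "roughly 14MB" is read as 14 MiB.
quoted_total = 14 * (1 << 20)
implied_bytes_per_bufferhead = quoted_total // bufferheads

print("bufferheads needed:          ", bufferheads)
print("implied bytes per bufferhead:", implied_bytes_per_bufferhead)
print("overhead ratio vs one extent:", quoted_total // EXTENT_RECORD_BYTES)
```

One bufferhead per 4k block gives 262,144 of them for the extent, and the quoted total works out to 56 bytes each.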

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:51:08Z

Dgc: /* Reclaim Optimizations */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have an LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]
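
To put rough numbers on that density gain, the sketch below divides an assumed
1GB cache budget by the per-inode footprints quoted above. The budget and the
helper name are illustrative assumptions only; the footprints are the figures
from the text.

```python
# Rough cache density estimate. BUDGET is an arbitrary assumption;
# the per-inode footprints are the figures quoted in the text.
BUDGET = 1 << 30                      # 1GB of memory for cached inodes

def inodes_cached(bytes_per_inode, budget=BUDGET):
    """Number of inodes that fit in the budget at a given footprint."""
    return budget // bytes_per_inode

uncompressed = inodes_cached(1024)    # ~1k per XFS/linux inode pair
compressed_lo = inodes_cached(300)    # largest compressed footprint
compressed_hi = inodes_cached(200)    # smallest compressed footprint

print("uncompressed:", uncompressed)
print("compressed:  ", compressed_lo, "to", compressed_hi)
print("density gain: %.1fx to %.1fx" % (compressed_lo / uncompressed,
                                        compressed_hi / uncompressed))
```

With these inputs the compressed form holds roughly three to five times as many
inodes in the same amount of memory.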

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.
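
A minimal sketch of the 'static slab' policy, assuming a simple LRU keyed by
inode number. The class and method names are hypothetical, and this models only
the bounded-size behaviour, not a real slab allocator.

```python
from collections import OrderedDict

class FixedCache:
    """Bounded cache: once full, the least recently used object is recycled."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()          # oldest entry first

    def lookup(self, ino):
        obj = self.lru.get(ino)
        if obj is not None:
            self.lru.move_to_end(ino)     # mark as most recently used
        return obj

    def insert(self, ino, obj):
        if ino in self.lru:
            self.lru.move_to_end(ino)
        elif len(self.lru) >= self.capacity:
            # Stand-in for triggering the shrinker: reclaim from the LRU end
            # and reuse the slot rather than growing the cache.
            self.lru.popitem(last=False)
        self.lru[ino] = obj
```

The cache never grows past its capacity, so its memory footprint is known up
front, which is the property wanted for tightly constrained systems.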

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
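
The delayed read queue described above can be modelled roughly as follows. This
is a sketch under stated assumptions: the class and method names are
hypothetical, buffers are represented by plain identifiers, timestamps are
passed in explicitly, and "holding a reference" is modelled simply as
membership of the queue.

```python
from collections import OrderedDict

class DelayedReadQueue:
    """Pins readahead buffers until a real read arrives or they age out."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age            # seconds; 5 is the text's example
        self.queue = OrderedDict()        # buffer -> completion time, oldest first

    def readahead_done(self, buf, now):
        # I/O completion (or a delayed read of an already cached buffer):
        # take a reference by (re)queueing the buffer at the tail.
        self.queue.pop(buf, None)
        self.queue[buf] = now

    def real_read(self, buf):
        # Follow-up read: remove from the queue and drop the queue's
        # reference. Returns True if the buffer was still pinned.
        return self.queue.pop(buf, None) is not None

    def purge(self, now):
        # Periodic purge: release every buffer older than max_age.
        released = []
        while self.queue:
            buf, stamp = next(iter(self.queue.items()))
            if now - stamp <= self.max_age:
                break                     # everything behind it is newer
            del self.queue[buf]
            released.append(buf)
        return released
```

Because entries are kept in completion order, the purge only ever has to look
at the head of the queue.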

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.
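
As a rough illustration of the two passes, the sketch below plans the readahead
pass and the writeback pass for a set of dirty inodes. The number of inodes per
buffer and the function name are assumptions made for the example, and sorting
stands in for the ascending order a radix tree traversal would give us.

```python
INODES_PER_BUFFER = 32    # assumed inode cluster size, for illustration only

def plan_writeback(dirty_inodes, uptodate_buffers):
    """Return (readahead, writes) as ascending lists of buffer numbers."""
    buffers = sorted({ino // INODES_PER_BUFFER for ino in dirty_inodes})
    # Pass 1: readahead only for buffers that are not already up to date.
    readahead = [b for b in buffers if b not in uptodate_buffers]
    # Pass 2: modify and write every buffer, again in ascending order.
    return readahead, buffers

readahead, writes = plan_writeback([70, 3, 65, 130], uptodate_buffers={0})
print("readahead:", readahead)
print("writes:   ", writes)
```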

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
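
A sketch of the idea, assuming a hypothetical log record that carries only the
AG number, the first block and the block count. The record layout, the function
names and the default sizes are all assumptions for illustration, not an
existing XFS log format.

```python
import struct

# Hypothetical "initialise inodes" record: AG number, first block in the
# AG, and number of blocks. The layout is an assumption for illustration.
INIT_RECORD = struct.Struct("<IIQ")

def log_range(agno, agbno, nblocks):
    """Log only the range to initialise, not the buffers themselves."""
    return INIT_RECORD.pack(agno, agbno, nblocks)

def recover(record, block_size=4096, inode_size=256):
    """On recovery, rebuild the buffers for the logged range."""
    agno, agbno, nblocks = INIT_RECORD.unpack(record)
    empty_inode = bytes(inode_size)       # stand-in for an initialised inode
    block = empty_inode * (block_size // inode_size)
    return agno, agbno, [block] * nblocks # buffers to be written out

record = log_range(agno=2, agbno=1024, nblocks=256)   # 1MB of 4k blocks
print(len(record), "bytes logged instead of", 256 * 4096)
```

The cost of the record is constant no matter how large the initialised range
is, whereas logging the buffers scales with the size of the allocation.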

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
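
The fast path and the proposed slow path might look something like the sketch
below. The field widths are assumed values chosen for the example (4k blocks,
256 byte inodes, 2^20 blocks per AG), the use of bit 63 as the flag is
hypothetical, and btree_lookup is a caller-supplied stand-in for the AGI btree
search.

```python
INOPBLOG = 4          # log2(inodes per block): 4k block / 256 byte inode
AGBLKLOG = 20         # log2(blocks per AG), assumed for illustration
UNALIGNED = 1 << 63   # hypothetical flag: chunk is not aligned, use the btree

def ino_to_location(ino):
    """Decode (agno, agbno, offset) using only shifts and masks."""
    offset = ino & ((1 << INOPBLOG) - 1)
    agbno = (ino >> INOPBLOG) & ((1 << AGBLKLOG) - 1)
    agno = ino >> (INOPBLOG + AGBLKLOG)
    return agno, agbno, offset

def lookup(ino, btree_lookup):
    """Pick the index to use from the inode number alone."""
    if ino & UNALIGNED:
        return btree_lookup(ino & ~UNALIGNED)   # slow path: btree search
    return ino_to_location(ino)                 # fast path: no lookup at all
```

Aligned chunks never touch the btree, so only inodes allocated from
non-aligned chunks pay the extra lookup cost.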

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
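
For the common cases mentioned, the per-block mask works out as in the sketch
below. It assumes one mask bit per filesystem block within a 64 inode chunk;
the function names and the mask encoding are hypothetical.

```python
CHUNK_INODES = 64     # inodes covered by one AGI btree record

def blocks_per_chunk(block_size, inode_size):
    """Number of filesystem blocks (and hence mask bits) per chunk."""
    return CHUNK_INODES * inode_size // block_size

def usable_inodes(block_mask, block_size, inode_size):
    """Chunk-relative inode indexes living in blocks set in block_mask."""
    per_block = block_size // inode_size
    return [i for i in range(CHUNK_INODES)
            if (block_mask >> (i // per_block)) & 1]

print(blocks_per_chunk(4096, 256), "mask bits for 256 byte inodes")
print(blocks_per_chunk(4096, 512), "mask bits for 512 byte inodes")
```

With 4k blocks a chunk spans 4 blocks for 256 byte inodes and 8 blocks for
512 byte inodes, so a handful of bits in the record is enough to describe a
fully sparse chunk in these configurations.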

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. Instead, we should periodically scan the AGI
btree looking for contiguous chunks that have no inodes allocated in them, and
then free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. As a result
of the lack of frequency that this needs to be done, it can probably be
handled by a single thread or delayed workqueue.
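
The scan itself is simple. The sketch below assumes the AGI btree records are
available as a sorted list of (first inode, allocated count) pairs; that
representation, the function name and the minimum run length are assumptions
made for the example.

```python
CHUNK_INODES = 64     # inodes per AGI btree record

def free_runs(records, min_chunks=2):
    """Find runs of adjacent, completely free chunks.

    records: sorted (start_ino, allocated_count) pairs.
    Returns a list of (start_ino, number_of_chunks) worth freeing in one go.
    """
    runs = []
    run_start, run_len = None, 0
    for start, allocated in records:
        adjacent = run_len > 0 and start == run_start + run_len * CHUNK_INODES
        if allocated == 0 and adjacent:
            run_len += 1                  # extend the current free run
            continue
        if run_len >= min_chunks:
            runs.append((run_start, run_len))
        run_start, run_len = (start, 1) if allocated == 0 else (None, 0)
    if run_len >= min_chunks:
        runs.append((run_start, run_len))
    return runs
```

Each run found corresponds to one extent to free plus a contiguous set of
records to remove from the btree.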

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
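
A minimal sketch of queue depth throttling, using a bounded queue so that
callers block once the limit is reached. The limit and the function names are
hypothetical, and the inactivation work itself is left as a placeholder.

```python
import queue

MAX_QUEUED = 4        # assumed depth limit, for illustration only

inactive_queue = queue.Queue(maxsize=MAX_QUEUED)

def queue_inactivation(ino, block=True):
    """Queue an inode; the caller is throttled when the queue is full."""
    inactive_queue.put(ino, block=block)

def worker_step():
    """Background worker: take one inode off the queue and process it."""
    ino = inactive_queue.get()
    # ... the inactivation transactions for this inode would run here ...
    inactive_queue.task_done()
    return ino
```

Whoever is unlinking inodes faster than the background thread can inactivate
them ends up waiting, which bounds the amount of deferred work.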

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done

Using RCU for radix tree reclaim walks: Done

Non-blocking background reclaim: Done

Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
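
The exclusive second level cache can be modelled as below. This is only a
sketch: zlib stands in for whatever compact in-memory form the compressed inode
would really use, inodes are plain byte strings, and all names are
hypothetical.

```python
import zlib

class TwoLevelInodeCache:
    """Primary cache of active inodes backed by an exclusive compressed cache."""

    def __init__(self, read_from_disk):
        self.active = {}                  # ino -> uncompressed inode
        self.compressed = {}              # ino -> compressed inode
        self.read_from_disk = read_from_disk
        self.disk_reads = 0

    def reclaim(self, ino):
        # A clean inode leaves the primary cache: compress it, don't free it.
        self.compressed[ino] = zlib.compress(self.active.pop(ino))

    def lookup(self, ino):
        if ino in self.active:
            return self.active[ino]
        if ino in self.compressed:
            # Exclusive: the inode moves back up, it is never in both levels.
            data = zlib.decompress(self.compressed.pop(ino))
        else:
            self.disk_reads += 1
            data = self.read_from_disk(ino)
        self.active[ino] = data
        return data
```

A lookup that hits the compressed level costs some CPU time to uncompress the
inode but performs no I/O.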

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

Per-buftarg buffer LRU reclaim: Done

Per-buftarg shrinker: Done

Per-buffer type reclaim prioritisation: Done

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
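
The arithmetic behind those figures can be checked with a few lines. The
per-bufferhead size is not stated in the text; the value computed below is
simply what the quoted 14MB total implies, not a measured structure size.

```python
EXTENT = 1 << 30          # 1GB extent
BLOCK = 4096              # 4k block size
EXTENT_RECORD = 24        # bytes XFS needs for the extent, from the text

bufferheads = EXTENT // BLOCK
total = 14 * (1 << 20)    # the ~14MB quoted in the text
implied_size = total / bufferheads

print("bufferheads:", bufferheads)
print("implied bytes per bufferhead:", implied_size)
print("overhead ratio: %dx" % (total // EXTENT_RECORD))
```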

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc. that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:50:49Z

Dgc: /* Accelerated Reclaim of buftarg Page Cache for Inodes */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have an LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on-disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.
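
To make the "shifts and masks" conversion concrete, here is a minimal sketch.
It assumes an inode number is laid out as AG number, then block within the AG,
then inode slot within the block; the Geometry tuple, the function names and
the bit widths used are illustrative assumptions, not values read from a real
filesystem.

```python
from collections import namedtuple

# Hypothetical geometry: bit widths are illustrative, not from a real superblock.
Geometry = namedtuple("Geometry", "agblklog inopblog")


def ino_to_location(ino, geo):
    """Split an inode number into (AG number, AG block, slot within block)."""
    offset = ino & ((1 << geo.inopblog) - 1)
    agbno = (ino >> geo.inopblog) & ((1 << geo.agblklog) - 1)
    agno = ino >> (geo.inopblog + geo.agblklog)
    return agno, agbno, offset


def location_to_ino(agno, agbno, offset, geo):
    """Inverse of ino_to_location: pack the three fields back together."""
    return (agno << (geo.inopblog + geo.agblklog)) | (agbno << geo.inopblog) | offset


geo = Geometry(agblklog=20, inopblog=4)   # 2^20 blocks per AG, 16 inodes per block
ino = location_to_ino(3, 1000, 5, geo)
location = ino_to_location(ino, geo)      # no btree lookup needed
```

The conversion stays a pure calculation only while every chunk obeys the size
and alignment rules; a chunk that breaks them has no fixed relationship between
its inode numbers and its disk location.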

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
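
As a rough illustration of the per-block bitmask idea, the sketch below works
out how many blocks a 64 inode chunk covers and which inode slots a given mask
makes usable. The mask layout (one bit per block, bit 0 for the first block of
the chunk) and the function names are assumptions made for the example only.

```python
INODES_PER_CHUNK = 64   # chunk size stated in the text


def blocks_per_chunk(block_size, inode_size):
    inodes_per_block = block_size // inode_size
    return INODES_PER_CHUNK // inodes_per_block


def usable_inodes(block_mask, block_size, inode_size):
    """Return chunk-relative inode indexes whose block bit is set in block_mask."""
    inodes_per_block = block_size // inode_size
    usable = []
    for blk in range(blocks_per_chunk(block_size, inode_size)):
        if block_mask & (1 << blk):
            first = blk * inodes_per_block
            usable.extend(range(first, first + inodes_per_block))
    return usable


# 4k blocks with 256 byte inodes: 16 inodes per block, 4 blocks per chunk.
# Mask 0b0101 means only blocks 0 and 2 of the chunk hold inodes.
slots = usable_inodes(0b0101, 4096, 256)
```

With 4k blocks, 256 byte inodes need a 4 bit mask and 512 byte inodes need an
8 bit mask, so a full per-block mapping for these common cases fits in a single
byte.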

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. Because it
runs so infrequently, it can probably be handled by a single thread or delayed
workqueue.
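
The periodic scan can be pictured as a single pass over the inode chunk records
in ascending order. The sketch below is a simplified model, assuming each
record is reduced to a (start inode, allocated count) pair; the function name
and the min_chunks threshold are hypothetical.

```python
INODES_PER_CHUNK = 64


def free_regions(records, min_chunks=2):
    """records: sorted list of (start_ino, allocated_count), one per inode chunk.

    Returns (start_ino, chunk_count) for each run of adjacent, fully free chunks.
    """
    regions = []
    run_start, run_len, prev_start = None, 0, None
    for start, allocated in records:
        adjacent = prev_start is not None and start == prev_start + INODES_PER_CHUNK
        if allocated == 0 and run_len and adjacent:
            run_len += 1                      # extend the current free run
        else:
            if run_len >= min_chunks:
                regions.append((run_start, run_len))
            run_start, run_len = (start, 1) if allocated == 0 else (None, 0)
        prev_start = start
    if run_len >= min_chunks:
        regions.append((run_start, run_len))
    return regions


records = [(0, 10), (64, 0), (128, 0), (192, 0), (256, 3),
           (320, 0), (512, 0), (576, 0)]
regions = free_regions(records)
```

Each region returned corresponds to one extent to free plus one contiguous set
of records to remove, which is why a single transaction per region looks
plausible.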

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done

Using RCU for radix tree reclaim walks: Done

Non-blocking background reclaim: Done

Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

Per-buftarg buffer LRU reclaim: Done

Per-buftarg shrinker: Done

Per-buffer type reclaim prioritisation: Done

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure, we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
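
The figures above can be checked with a few lines of arithmetic. The 56 byte
bufferhead size below is an assumption inferred from the 14MB total; the real
structure size depends on the kernel version and architecture.

```python
EXTENT_BYTES = 1 << 30       # the 1GB extent from the text
BLOCK_SIZE = 4096            # 4k block size, as assumed in the text
XFS_EXTENT_RECORD = 24       # bytes, figure from the text
BUFFERHEAD_SIZE = 56         # bytes; assumed, inferred from the 14MB total

bufferheads = EXTENT_BYTES // BLOCK_SIZE
bufferhead_bytes = bufferheads * BUFFERHEAD_SIZE
ratio = bufferhead_bytes // XFS_EXTENT_RECORD

print(bufferheads)                     # 262144
print(bufferhead_bytes / (1 << 20))    # 14.0
```

Under these assumptions the per-block representation costs several hundred
thousand times more memory than the single extent record for the same mapping.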

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once they get to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:50:37Z

Dgc: /* Accelerated Reclaim of buftarg Page Cache for Inodes */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have a LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]
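
To put the footprint numbers in perspective, the following sketch counts how
many inodes fit in 1GiB of memory at each size. The 1GiB budget and the choice
of 1024 bytes for the uncompressed case (the low end of the quoted range) are
assumptions picked for the example.

```python
def inodes_cached(memory_bytes, bytes_per_inode):
    return memory_bytes // bytes_per_inode


GIB = 1 << 30
uncompressed = inodes_cached(GIB, 1024)     # about 1k per inode today
compressed_best = inodes_cached(GIB, 200)   # 200 bytes per compressed inode
compressed_worst = inodes_cached(GIB, 300)  # 300 bytes per compressed inode
```

In this model the compressed form holds somewhere between three and five times
as many inodes in the same amount of memory.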

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.
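
A 'static slab' of this kind behaves like a fixed-capacity LRU cache. The
sketch below is a userspace model only: the class and method names are
hypothetical, and evicting the least recently used entry stands in for
triggering the shrinker to reclaim from the LRU.

```python
from collections import OrderedDict


class StaticSlabCache:
    """Toy fixed-capacity cache: never grows, reuses the LRU slot when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lru = OrderedDict()   # inode number -> object, least recent first
        self.reclaimed = 0

    def lookup(self, ino):
        obj = self.lru.get(ino)
        if obj is not None:
            self.lru.move_to_end(ino)   # mark as most recently used
        return obj

    def insert(self, ino, obj):
        if ino in self.lru:
            self.lru.move_to_end(ino)
        elif len(self.lru) >= self.capacity:
            # Cache full: reclaim the least recently used object for reuse.
            self.lru.popitem(last=False)
            self.reclaimed += 1
        self.lru[ino] = obj


cache = StaticSlabCache(capacity=2)
cache.insert(100, "inode-100")
cache.insert(101, "inode-101")
cache.lookup(100)                 # 100 becomes most recently used
cache.insert(102, "inode-102")    # evicts 101, the least recently used
```

Because the capacity is fixed up front, the memory used by the cache is bounded
no matter how many inodes pass through it.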

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on-disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
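
The saving from logging only the range can be sketched as follows. The
InitRange record, its field names and its 32 byte size are hypothetical, chosen
only to show the idea; the replay function models recovery rebuilding the
inodes instead of reading them from the log.

```python
from collections import namedtuple

# Hypothetical log record: field names and the 32 byte size are illustrative.
InitRange = namedtuple("InitRange", "agno start_ino count")
RANGE_RECORD_BYTES = 32
INODE_SIZE = 256                  # bytes; one of the common inode sizes


def logged_bytes_full_buffers(count):
    # Behaviour described in the text: every initialised buffer logged in whole.
    return count * INODE_SIZE


def replay(record):
    # Recovery side: rebuild the initialised (here simply zeroed) inodes.
    return {record.start_ino + i: bytes(INODE_SIZE) for i in range(record.count)}


record = InitRange(agno=0, start_ino=128, count=(1 << 20) // INODE_SIZE)
full = logged_bytes_full_buffers(record.count)   # a megabyte of inodes
rebuilt = replay(record)
```

In this model a megabyte of logged buffers is replaced by one small record, and
the cost of building the buffers moves to recovery, which only runs after a
crash.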

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree, so it should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. Because it
needs to run so infrequently, it can probably be handled by a single thread
or delayed workqueue.
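
The scan described above can be illustrated with a minimal sketch. It assumes the AGI btree records are available as an ascending list of (starting inode number, free count) pairs and that a chunk holds 64 inodes; the function name and the record layout are hypothetical and are not the on-disk format.

```python
CHUNK = 64  # inodes per chunk, as stated in the text


def free_runs(records):
    """Return (start_inode, chunk_count) for each run of adjacent, fully free chunks.

    records: iterable of (start_inode, free_count) in ascending inode order.
    This is a hypothetical stand-in for walking the AGI btree.
    """
    runs = []
    run_start = None
    run_len = 0
    for start, free in records:
        fully_free = free == CHUNK
        # Extend the current run only if this chunk is free and directly adjacent.
        if fully_free and run_start is not None and start == run_start + run_len * CHUNK:
            run_len += 1
            continue
        # Otherwise close any open run before deciding whether to start a new one.
        if run_start is not None:
            runs.append((run_start, run_len))
            run_start = None
        if fully_free:
            run_start, run_len = start, 1
    if run_start is not None:
        runs.append((run_start, run_len))
    return runs
```

Each returned run corresponds to one extent that could be freed together with a contiguous set of btree records, which is why a single transaction looks plausible.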

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
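
As a rough illustration of queue depth throttling, the sketch below uses a bounded queue so that the thread queuing inodes blocks once a fixed number of inactivations are outstanding. The limit of 4 and the function name are hypothetical, and appending to a list stands in for the real inactivation transaction.

```python
import queue
import threading

MAX_QUEUE_DEPTH = 4  # hypothetical limit on deferred inactivations


def inactivate_in_background(inode_numbers, max_depth=MAX_QUEUE_DEPTH):
    """Queue inodes for background inactivation, blocking when the queue is full."""
    pending = queue.Queue(maxsize=max_depth)
    processed = []

    def worker():
        while True:
            ino = pending.get()
            if ino is None:          # sentinel: no more work
                break
            processed.append(ino)    # stand-in for the inactivation transaction

    thread = threading.Thread(target=worker)
    thread.start()
    for ino in inode_numbers:
        pending.put(ino)             # blocks while max_depth items are queued
    pending.put(None)
    thread.join()
    return processed
```

The producer never gets more than `max_depth` items ahead of the worker, which is the whole of the throttle.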

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done
Using RCU for radix tree reclaim walks: Done
Non-blocking background reclaim: Done
Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
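
The exclusive two-level behaviour can be modelled in a few lines. This is only a sketch under stated assumptions: the packed three-field record, the class name and the fake disk read are all hypothetical, and real inode compression would keep far more state.

```python
import struct

_FMT = "<QQI"  # hypothetical compact form: inode number, size, generation


class TwoLevelInodeCache:
    """Primary cache of full inodes backed by an exclusive compressed cache."""

    def __init__(self):
        self.primary = {}      # ino -> full (uncompressed) inode
        self.compressed = {}   # ino -> packed bytes
        self.disk_reads = 0

    def _read_from_disk(self, ino):
        self.disk_reads += 1   # stand-in for real inode cluster I/O
        return {"ino": ino, "size": 0, "gen": 1}

    def lookup(self, ino):
        inode = self.primary.get(ino)
        if inode is not None:
            return inode
        # Exclusive: an inode lives in exactly one level, so remove it here.
        packed = self.compressed.pop(ino, None)
        if packed is not None:
            number, size, gen = struct.unpack(_FMT, packed)
            inode = {"ino": number, "size": size, "gen": gen}
        else:
            inode = self._read_from_disk(ino)
        self.primary[ino] = inode
        return inode

    def reclaim(self, ino):
        """Compress a clean inode instead of freeing it."""
        inode = self.primary.pop(ino)
        self.compressed[ino] = struct.pack(
            _FMT, inode["ino"], inode["size"], inode["gen"])
```

A lookup after reclaim pays the unpacking cost but performs no additional disk read, which is the property the text relies on.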

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

Per-buftarg buffer LRU reclaim: Done
Per-buftarg shrinker: Done
Per-buffer type reclaim prioritisation: Done

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
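
The arithmetic behind these figures can be checked with a short calculation. The 56 byte bufferhead size used here is an assumption back-derived from the 14MB figure in the text; the real structure size depends on kernel version and architecture.

```python
GIB = 1 << 30
BLOCK_SIZE = 4096            # 4k block size, as in the text
EXTENT_RECORD_BYTES = 24     # size of the XFS extent mapping quoted in the text
BUFFERHEAD_BYTES = 56        # assumed per-bufferhead size, implied by the 14MB figure


def bufferhead_overhead(extent_bytes=GIB, block_size=BLOCK_SIZE,
                        bufferhead_bytes=BUFFERHEAD_BYTES):
    """Return (bufferhead count, total bytes) needed to map extent_bytes."""
    count = extent_bytes // block_size
    return count, count * bufferhead_bytes


count, total = bufferhead_overhead()
print(count, total / (1 << 20), total // EXTENT_RECORD_BYTES)
```

Under that assumption the block-based mapping costs several hundred thousand times the memory of the single extent record.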

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once they get to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:48:26Z

Dgc: /* Reclaim Optimizations */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have an LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
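
A minimal model of the delayed read queue and its periodic purge might look like the following. The class and method names are hypothetical, buffers are represented by plain keys, and the caller supplies the clock so that the behaviour is deterministic; holding an entry in the queue stands in for holding a buffer reference.

```python
from collections import OrderedDict


class DelayedReadQueue:
    """Pins readahead buffers until a real read arrives or they age out."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age          # seconds, the "say 5 seconds" in the text
        self._held = OrderedDict()      # buffer key -> completion time, oldest first

    def readahead_done(self, buf, now):
        """Called on I/O completion: take a reference by queuing the buffer."""
        self._held[buf] = now
        self._held.move_to_end(buf)

    def real_read(self, buf):
        """A follow-up read drops the queue's reference. True if it was pinned."""
        return self._held.pop(buf, None) is not None

    def purge(self, now):
        """Drop references on buffers that have reached max_age; return them."""
        expired = []
        while self._held:
            buf, stamp = next(iter(self._held.items()))
            if now - stamp < self.max_age:
                break                   # everything after this is younger
            del self._held[buf]
            expired.append(buf)
        return expired
```

Because entries are kept in completion order, the purge stops at the first buffer that is still young enough, so its cost is proportional to the number of buffers actually expired.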

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.
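
The two passes can be sketched as a simple planning function. It assumes dirty inodes are identified by number, that a fixed number of inodes share one cluster buffer (32 here, a hypothetical value), and that sorting stands in for the ascending radix tree traversal.

```python
def plan_writeback(dirty_inodes, cached_buffers, inodes_per_buffer=32):
    """Return (readahead_order, write_order) as ascending lists of buffer indexes.

    dirty_inodes: inode numbers needing writeback.
    cached_buffers: set of buffer indexes already up to date in memory.
    """
    # Ascending traversal: each dirty inode maps to the buffer that contains it.
    buffers = sorted({ino // inodes_per_buffer for ino in dirty_inodes})
    # Pass 1: issue readahead only for buffers that are not already cached.
    readahead = [b for b in buffers if b not in cached_buffers]
    # Pass 2: modify and write every buffer, again in ascending order.
    return readahead, buffers
```

For example, `plan_writeback([70, 3, 35, 5], {0})` reads ahead buffers 1 and 2 and then writes buffers 0, 1 and 2 in that order.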

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
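
To make the saving concrete, the sketch below logs only a small range record and rebuilds the inode buffers at replay time. The record layout, the field names and the 256 byte inode template are hypothetical and chosen only for illustration.

```python
import struct

INODE_SIZE = 256      # bytes per inode, one of the common sizes named in the text
_RECORD = "<IIII"     # hypothetical record: AG number, AG block, inode count, generation


def log_inode_init(agno, agbno, count, gen):
    """Log the intent to initialise a range of inodes, not the buffers themselves."""
    return struct.pack(_RECORD, agno, agbno, count, gen)


def replay_inode_init(record):
    """During recovery, rebuild the initialised inode buffers from the record."""
    agno, agbno, count, gen = struct.unpack(_RECORD, record)
    template = struct.pack("<I", gen).ljust(INODE_SIZE, b"\0")
    return agno, agbno, template * count
```

With these assumptions, initialising 4096 inodes (one megabyte of inode buffers) costs a 16 byte log record instead of a megabyte of log traffic.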

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
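
A sketch of the decode path, under assumed geometry, shows how the flag bit would select between the two lookups. The bit widths below are hypothetical (16 inodes per 4k block, 2^20 blocks per AG), as is the choice of bit 63 as the high bit.

```python
INOPBLOG = 4                 # log2(inodes per block): 4k blocks, 256 byte inodes
AGBLKLOG = 20                # log2(blocks per AG), hypothetical geometry
UNALIGNED_FLAG = 1 << 63     # the proposed "high bit" marking non-aligned chunks


def decode_inode_number(ino):
    """Split an inode number into (agno, agbno, offset, needs_btree_lookup)."""
    needs_btree = bool(ino & UNALIGNED_FLAG)
    ino &= ~UNALIGNED_FLAG                        # strip the flag before decoding
    offset = ino & ((1 << INOPBLOG) - 1)          # inode index within the block
    agbno = (ino >> INOPBLOG) & ((1 << AGBLKLOG) - 1)
    agno = ino >> (INOPBLOG + AGBLKLOG)
    return agno, agbno, offset, needs_btree
```

Only inode numbers with the flag set would fall back to the btree search; everything else keeps the shift and mask fast path.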

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
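
The per-block mask can be illustrated as follows, assuming one mask bit per filesystem block inside a 64 inode chunk. The function name and the bit ordering are hypothetical.

```python
INODES_PER_CHUNK = 64


def allocatable_inodes(block_mask, block_size=4096, inode_size=256):
    """Return the chunk-relative inode indexes that the mask allows.

    Bit n of block_mask set means block n of the chunk may contain inodes.
    """
    inodes_per_block = block_size // inode_size
    blocks_per_chunk = INODES_PER_CHUNK // inodes_per_block
    usable = []
    for blk in range(blocks_per_chunk):
        if block_mask & (1 << blk):
            start = blk * inodes_per_block
            usable.extend(range(start, start + inodes_per_block))
    return usable
```

With 4k blocks, 256 byte inodes need 4 mask bits per chunk and 512 byte inodes need 8, so both common cases fit comfortably in a small bitmask.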

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree, so it should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. Because it
needs to run so infrequently, it can probably be handled by a single thread
or delayed workqueue.

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.

== Reclaim Optimizations ==

Tracking inodes for reclaim in radix tree: Done
Using RCU for radix tree reclaim walks: Done
Non-blocking background reclaim: Done
Parallelised shrinker based reclaim: Done

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

One method we could use for this would be to introduce our own page LRUs into
the buftarg cache that we can reclaim from. This would allow us to sort pages
according to their contents into different LRUs and periodically reclaim pages
of specific types that were not referenced. This, however, would introduce a
fair amount of complexity into the buffer cache that doesn't currently exist.
Also, from a higher perspective, it makes the buffer cache a complex
part-buffer cache, part VM frankenstein.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

As an aside, we already have the mechanism in place to vary buffer aging based
on their type. The Irix buffer cache used this to great effect when under
memory pressure and the XFS code that configured it still exists in the Linux
code base. However, the Linux XFS buffer cache has never implemented any
mechanism to allow this functionality to be exploited. A delayed buffer reclaim
mechanism as described above could be greatly enhanced by making use of this
code in XFS.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
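The arithmetic behind these figures can be checked with a short sketch. The 24 byte extent record and 4k block size come from the text; the 56 byte bufferhead size is an assumption chosen only because it reproduces the "roughly 14MB" figure, and the real structure size varies with architecture and kernel version.

```python
# Memory needed to map a 1 GiB extent, using the figures quoted in the text.
EXTENT_BYTES = 1 << 30        # one 1 GiB extent
BLOCK_SIZE = 4096             # 4k block size (from the text)
EXTENT_RECORD_BYTES = 24      # in-memory XFS extent record (from the text)
BUFFERHEAD_BYTES = 56         # ASSUMPTION: illustrative size of one bufferhead

# One bufferhead is needed per block of the extent.
bufferheads = EXTENT_BYTES // BLOCK_SIZE
bufferhead_bytes = bufferheads * BUFFERHEAD_BYTES

print("bufferheads needed:", bufferheads)
print("bufferhead memory (MiB):", bufferhead_bytes / (1 << 20))
print("overhead vs one extent record: %dx" % (bufferhead_bytes // EXTENT_RECORD_BYTES))
```

Whatever the exact structure size, the cost scales with the number of blocks, while the extent record stays at a fixed size.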

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.
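A minimal sketch of how unread regions could trigger demand loading, written in Python for brevity. The class name, the fixed region size and the loader callback are all hypothetical; the sketch also assumes the loader never returns an extent that crosses a region boundary.

```python
import bisect

class PagedExtentMap:
    """Extent map split into regions that are loaded on first access."""

    def __init__(self, loader, region_blocks=1024):
        self.loader = loader                # hypothetical: region index -> [(start, length, disk_block)]
        self.region_blocks = region_blocks  # ASSUMPTION: fixed-size regions
        self.regions = {}                   # only regions currently in memory
        self.loads = 0                      # number of demand loads performed

    def lookup(self, file_block):
        idx = file_block // self.region_blocks
        if idx not in self.regions:
            # Access to an unread region: demand load it.
            self.regions[idx] = sorted(self.loader(idx))
            self.loads += 1
        extents = self.regions[idx]
        # Find the last extent starting at or before file_block.
        key = (file_block, float("inf"), float("inf"))
        pos = bisect.bisect_right(extents, key) - 1
        if pos >= 0:
            start, length, disk_block = extents[pos]
            if start <= file_block < start + length:
                return disk_block + (file_block - start)
        return None                         # file_block falls in a hole

    def evict(self, idx):
        # Page a region out again, e.g. under memory pressure.
        self.regions.pop(idx, None)


def demo_loader(idx):
    # Illustrative backing store: first half of each region mapped, rest a hole.
    base = idx * 1024
    return [(base, 512, 100000 + base)]

emap = PagedExtentMap(demo_loader)
print(emap.lookup(10), emap.lookup(600), emap.loads)
```

Evicting a region only drops it from the dictionary; the next lookup in that range reloads it through the same path.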

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:42:48Z

Dgc: /* Avoiding the Generic pdflush Code */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have a LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

Writeback of inodes via AIL: Done

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
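The delayed read queue described above can be sketched as follows. This is a simplified single-threaded model in Python, not kernel code: the class and method names are hypothetical, buffers are plain keys, and "holding a reference" is modelled as membership in the queue.

```python
from collections import OrderedDict

class DelayedReadQueue:
    """Pins readahead buffers until a real read arrives or they age out."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age          # "say 5 seconds" from the text
        self.queue = OrderedDict()      # buffer -> completion time, oldest first

    def readahead_done(self, buf, now):
        # On I/O completion, queue the buffer; the queue entry is its reference.
        self.queue[buf] = now
        self.queue.move_to_end(buf)

    def real_read(self, buf):
        # A followup read removes the buffer from the queue and drops the
        # queue's reference. Returns True if the buffer was still pinned.
        return self.queue.pop(buf, None) is not None

    def purge(self, now):
        # Periodic purge: drop every buffer older than max_age.
        dropped = []
        while self.queue:
            buf, stamp = next(iter(self.queue.items()))
            if now - stamp <= self.max_age:
                break                   # everything behind it is younger
            self.queue.popitem(last=False)
            dropped.append(buf)
        return dropped


q = DelayedReadQueue()
q.readahead_done("inode-buf-a", now=0.0)
q.readahead_done("inode-buf-b", now=3.0)
print(q.real_read("inode-buf-a"), q.purge(now=9.0))
```

Because entries are kept in completion order, the purge only ever needs to look at the head of the queue.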

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could alsobe done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
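The difference in log traffic can be illustrated with a small sketch. The record layout, the function names and the 256 byte inode image below are assumptions made for illustration only; they do not describe the real XFS log format.

```python
import struct

INODE_SIZE = 256   # ASSUMPTION: 256 byte on-disk inodes

def init_inode_buffer(start_ino, count):
    # Build freshly initialised inode images (illustrative layout:
    # a two byte magic followed by the inode number, zero padded).
    buf = bytearray()
    for ino in range(start_ino, start_ino + count):
        buf += struct.pack(">2sQ", b"IN", ino).ljust(INODE_SIZE, b"\0")
    return bytes(buf)

def log_physical(start_ino, count):
    # Current behaviour: the whole initialised buffer goes into the log.
    return init_inode_buffer(start_ino, count)

def log_logical(start_ino, count):
    # Proposed behaviour: log only "initialise inodes in this range".
    return struct.pack(">QI", start_ino, count)

def recover_logical(record):
    # Recovery rebuilds the buffers from the range and would write them out.
    start_ino, count = struct.unpack(">QI", record)
    return init_inode_buffer(start_ino, count)


physical = log_physical(128, 4096)      # a megabyte of 256 byte inodes
logical = log_logical(128, 4096)
print(len(physical), len(logical))
```

In this model recovery produces the same bytes either way; only the amount written to the log differs.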

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
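A rough sketch of the idea, assuming an inode number packed as AG number, AG block and offset within the block. The field widths and the flag name are hypothetical values picked for the example; only the "shifts and masks unless the high bit is set" behaviour comes from the text.

```python
AGBLKLOG = 20                  # ASSUMPTION: log2(blocks per AG)
INOPBLOG = 4                   # log2(inodes per block): 4k block / 256 byte inode = 16
INO_BITS = 64
UNALIGNED_FLAG = 1 << (INO_BITS - 1)   # hypothetical "look me up in the btree" bit

def make_ino(agno, agbno, offset, unaligned=False):
    ino = (agno << (AGBLKLOG + INOPBLOG)) | (agbno << INOPBLOG) | offset
    if unaligned:
        ino |= UNALIGNED_FLAG
    return ino

def decode_ino(ino, btree_lookup):
    if ino & UNALIGNED_FLAG:
        # Non-aligned chunk: pay for the btree search.
        return btree_lookup(ino & ~UNALIGNED_FLAG)
    # Aligned chunk: a single set of shifts and masks.
    offset = ino & ((1 << INOPBLOG) - 1)
    agbno = (ino >> INOPBLOG) & ((1 << AGBLKLOG) - 1)
    agno = ino >> (AGBLKLOG + INOPBLOG)
    return agno, agbno, offset


aligned = make_ino(agno=3, agbno=12345, offset=7)
print(decode_ino(aligned, btree_lookup=None))
```

The fast path never touches the lookup callback, so correctly aligned chunks keep their current cost.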

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
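As a sketch of what such a record could look like, the following expands a per-block bitmask into a per-inode mask for a 64 inode chunk. The function names and the choice of one mask bit per block are assumptions; the 4k block and 256/512 byte inode sizes are the common cases named in the text.

```python
INODES_PER_CHUNK = 64

def block_mask_to_inode_mask(block_mask, inodes_per_block):
    """Expand a per-block 'may hold inodes' bitmask into a 64 bit per-inode mask."""
    blocks = INODES_PER_CHUNK // inodes_per_block
    per_block = (1 << inodes_per_block) - 1
    mask = 0
    for b in range(blocks):
        if block_mask & (1 << b):
            mask |= per_block << (b * inodes_per_block)
    return mask

def allocatable(inode_mask, free_mask):
    """Chunk offsets that are backed by an allocated block and currently free."""
    usable = inode_mask & free_mask
    return [i for i in range(INODES_PER_CHUNK) if (usable >> i) & 1]


# 4k blocks with 256 byte inodes: 16 inodes per block, 4 blocks per chunk.
# Only blocks 0 and 2 of the chunk are backed by disk space.
mask = block_mask_to_inode_mask(0b0101, inodes_per_block=16)
print(hex(mask), len(allocatable(mask, free_mask=(1 << 64) - 1)))
```

With 512 byte inodes the same chunk spans 8 blocks, so the block mask needs 8 bits instead of 4.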

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. As a result
of the lack of frequency that this needs to be done, it can probably be
handled by a single thread or delayed workqueue.
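The background scan could look something like the sketch below. It models the AGI btree as a sorted list of (first inode number, allocated count) records, one per 64 inode chunk; that representation, the function name and the minimum run length are assumptions for illustration.

```python
CHUNK = 64   # inodes per chunk record

def free_runs(records, min_chunks=2):
    """Find runs of contiguous, completely free chunks.

    records: sorted list of (start_ino, allocated_count), one per chunk.
    Returns a list of (first_start_ino, number_of_chunks).
    """
    runs, run = [], []
    for start, allocated in records:
        contiguous = bool(run) and start == run[-1] + CHUNK
        if allocated == 0 and (not run or contiguous):
            run.append(start)
            continue
        # The current run ends here: keep it if it is large enough.
        if len(run) >= min_chunks:
            runs.append((run[0], len(run)))
        run = [start] if allocated == 0 else []
    if len(run) >= min_chunks:
        runs.append((run[0], len(run)))
    return runs


records = [(0, 0), (64, 0), (128, 5), (192, 0), (256, 0), (320, 0), (512, 0)]
print(free_runs(records))
```

Each returned run corresponds to one extent to free plus a contiguous set of records to remove, which is what makes a single transaction plausible.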

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
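Simple queue depth throttling might be sketched like this, using a bounded queue from the Python standard library as a stand-in for a kernel workqueue. The class name and the depth limit are hypothetical.

```python
import queue

class InactivationQueue:
    """Bounded queue of inodes waiting for background inactivation."""

    def __init__(self, max_depth):
        self.q = queue.Queue(maxsize=max_depth)

    def defer(self, ino, block=True, timeout=None):
        # When the queue is full the caller is throttled: it either blocks
        # until a worker makes room, or (block=False) is told to back off.
        try:
            self.q.put(ino, block=block, timeout=timeout)
            return True
        except queue.Full:
            return False

    def process_one(self, inactivate):
        # Worker side: take one inode and run its inactivation transaction.
        ino = self.q.get_nowait()
        inactivate(ino)
        self.q.task_done()


iq = InactivationQueue(max_depth=2)
print(iq.defer(101, block=False), iq.defer(102, block=False), iq.defer(103, block=False))
```

The third call fails because the queue is at its depth limit; once a worker processes an entry, new work is accepted again.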

== Reclaim Optimizations ==

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
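The exclusive two-level arrangement can be sketched as below. This is a toy model: the class name is hypothetical, inodes are plain dictionaries, and zlib over pickle merely stands in for whatever compact in-memory form a compressed inode would really use.

```python
from collections import OrderedDict
import pickle
import zlib

class TwoLevelInodeCache:
    """Primary cache of full inodes backed by an exclusive compressed cache."""

    def __init__(self, primary_max, compressed_max, read_from_disk):
        self.primary = OrderedDict()       # ino -> full inode, LRU order
        self.compressed = OrderedDict()    # ino -> compact bytes, LRU order
        self.primary_max = primary_max
        self.compressed_max = compressed_max
        self.read_from_disk = read_from_disk
        self.disk_reads = 0

    def get(self, ino):
        if ino in self.primary:
            self.primary.move_to_end(ino)
            return self.primary[ino]
        if ino in self.compressed:
            # Exclusive cache: the inode leaves the second level when promoted.
            inode = pickle.loads(zlib.decompress(self.compressed.pop(ino)))
        else:
            inode = self.read_from_disk(ino)
            self.disk_reads += 1
        self._insert_primary(ino, inode)
        return inode

    def _insert_primary(self, ino, inode):
        self.primary[ino] = inode
        while len(self.primary) > self.primary_max:
            # Reclaim of a clean inode: compress it instead of freeing it.
            old_ino, old = self.primary.popitem(last=False)
            self.compressed[old_ino] = zlib.compress(pickle.dumps(old))
            while len(self.compressed) > self.compressed_max:
                self.compressed.popitem(last=False)   # shrinker drops the coldest


cache = TwoLevelInodeCache(2, 4, lambda ino: {"ino": ino, "size": ino * 10})
for n in (1, 2, 3, 1):
    cache.get(n)
print(cache.disk_reads)
```

The second access to inode 1 is served from the compressed level, so it costs CPU time for the expansion but no disk read.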

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.
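One way to picture the state-dependent lookup is the sketch below. The state names and the dictionary-based cache entry are hypothetical, and the model is single-threaded: it shows only the decisions described above, not the atomic swaps or reference counting a real implementation would need.

```python
from enum import Enum, auto

class ReclaimState(Enum):
    ACTIVE = auto()             # normal, uncompressed inode
    COMPRESSING = auto()        # being moved to the compressed cache
    RECLAIM_UNLINKED = auto()   # unlinked, queued to be freed
    FREED = auto()              # already gone

def lookup(entry, allocating=False):
    """Return the usable cache entry, or None if the caller must look elsewhere."""
    state = entry["state"]
    if state is ReclaimState.ACTIVE:
        return entry
    if state is ReclaimState.COMPRESSING:
        # Abort the compression and hand back the existing inode rather than
        # waiting, re-allocating and uncompressing it again.
        entry["state"] = ReclaimState.ACTIVE
        return entry
    if state is ReclaimState.RECLAIM_UNLINKED and allocating:
        # Allocation raced with reclaim of an unlinked inode: cancel the reclaim.
        entry["state"] = ReclaimState.ACTIVE
        entry["unlinked"] = False
        return entry
    return None


entry = {"state": ReclaimState.COMPRESSING, "unlinked": False}
print(lookup(entry) is entry, entry["state"])
```

A plain lookup of an unlinked inode under reclaim returns nothing in this model; only an allocation is allowed to pull it back.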

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

One method we could use for this would be to introduce our own page LRUs into
the buftarg cache that we can reclaim from. This would allow us to sort pages
according to their contents into different LRUs and periodically reclaim pages
of specific types that were not referenced. This, however, would introduce a
fair amount of complexity into the buffer cache that doesn't currently exist.
Also, from a higher perspective, it makes the buffer cache a complex
part-buffer cache, part VM frankenstein.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

As an aside, we already have the mechanism in place to vary buffer aging based
on their type. The Irix buffer cache used this to great effect when under
memory pressure and the XFS code that configured it still exists in the Linux
code base. However, the Linux XFS buffer cache has never implemented any
mechanism to allow this functionality to be exploited. A delayed buffer reclaim
mechanism as described above could be greatly enhanced by making use of this
code in XFS.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:41:39Z

Dgc: /* Bypassing the Linux Inode Cache */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more that offsets any comparitive increase we will see from inodes in
reclaim. the compressed cache should really have a LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]
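
To put the footprint figures above into perspective, here is a small back-of-the-envelope calculation. It is only a sketch: treating "1k" as 1024 bytes is an assumption, and the helper name <tt>inodes_per_gib</tt> is made up for illustration.

```python
GIB = 1024 ** 3

def inodes_per_gib(bytes_per_inode):
    """How many cached inodes fit in 1 GiB at a given per-inode footprint."""
    return GIB // bytes_per_inode

# Footprints are taken from the text; "1k" == 1024 bytes is an assumption.
uncompressed = inodes_per_gib(1024)
compressed_worst = inodes_per_gib(300)
compressed_best = inodes_per_gib(200)

print(uncompressed, compressed_worst, compressed_best)
```

Under these assumptions the same gigabyte holds roughly 3.4 to 5.1 times as many inodes in compressed form.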

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

Lookups: Done (October 2008)

Tracking dirty inodes: Done

Writeback of dirty inodes: Done

Writeback of dirty pages: still executed by the VFS

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

As it is, once data writeback is separated from inode writeback, we could
simply use pushing the AIL as a method of writing back metadata in the
background. There is no good reason for writing the inode immediately after
data if the inode is in the AIL - it will get written soon enough as the tail
of the AIL gets moved along. If we log all inode changes, then we'll be
unlikely to write the inode multiple times over its dirty life-cycle as it
will continue to be moved forward in the AIL each time it is logged...

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
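
The delayed read queue and its periodic purge could look something like the following. This is a minimal user-space sketch, not kernel code: the class and method names are hypothetical, buffers are represented by plain identifiers, and "holding a reference" is modelled simply by membership in the queue.

```python
from collections import OrderedDict

class DelayedReadQueue:
    """Hypothetical model: buffers are pinned from readahead completion
    until a real read claims them or they age out."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age          # "say 5 seconds" from the text
        self._queued = OrderedDict()    # buffer id -> completion time

    def readahead_done(self, buf_id, now):
        # Take (or refresh) the queue's reference; newest entries go last.
        self._queued.pop(buf_id, None)
        self._queued[buf_id] = now

    def real_read(self, buf_id):
        # The follow-up read removes the buffer and drops the queue reference.
        return self._queued.pop(buf_id, None) is not None

    def purge(self, now):
        # Drop references on buffers older than max_age, oldest first.
        expired = []
        while self._queued:
            buf_id, queued_at = next(iter(self._queued.items()))
            if now - queued_at <= self.max_age:
                break
            self._queued.popitem(last=False)
            expired.append(buf_id)
        return expired
```

The sketch assumes timestamps passed in are monotonically increasing, so the queue stays ordered by age and the purge can stop at the first buffer that is still young enough.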

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.
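
The two passes can be sketched as below. This is an illustration only: the function name is hypothetical, and mapping an inode to its buffer with a simple integer division is a simplification of the real inode cluster mapping.

```python
def plan_writeback(dirty_inodes, inodes_per_buffer, uptodate_buffers):
    """Return (readahead, writes): buffer indexes in ascending order.

    Pass 1 issues readahead only for buffers that are not up to date.
    Pass 2 then modifies and writes every buffer holding a dirty inode.
    """
    order = sorted(dirty_inodes)  # stands in for a radix tree traversal

    readahead = []
    for ino in order:
        buf = ino // inodes_per_buffer
        if buf not in uptodate_buffers and (not readahead or readahead[-1] != buf):
            readahead.append(buf)

    writes = []
    for ino in order:
        buf = ino // inodes_per_buffer
        if not writes or writes[-1] != buf:
            writes.append(buf)

    return readahead, writes
```

Because both passes walk the inodes in ascending order, each buffer appears once and the resulting I/O lists are already sorted for the elevator.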

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small number of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
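
As a rough sketch of the idea, the log item only needs to carry the range, and recovery rebuilds the buffers from it. The record layout and function names below are hypothetical, and the zero-filled images are a placeholder for properly initialised inode cores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InodeInitRecord:
    """Hypothetical log item: initialise `count` inodes from `first_inode`."""
    first_inode: int
    count: int

def log_bytes_if_buffers_logged(record, inode_size):
    # What we would log today: every initialised inode buffer in whole.
    return record.count * inode_size

def replay(record, inode_size):
    # Recovery side: rebuild the inode images from the range alone.
    # Placeholder contents; real recovery would stamp valid inode cores.
    return [bytes(inode_size) for _ in range(record.count)]
```

For example, a record covering 4096 inodes of 256 bytes replaces a full megabyte of logged buffers with a single two-field record.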

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.
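
The shift-and-mask conversion with a high-bit escape can be illustrated as follows. The field widths are hypothetical constants chosen for the example (real filesystems derive them from the superblock geometry), and using bit 63 as the flag is just the suggestion from the text, not an existing on-disk format.

```python
INOPBLOG = 4               # log2(inodes per block): 4k block / 256 byte inode
AGBLKLOG = 20              # log2(blocks per AG), hypothetical
UNALIGNED_FLAG = 1 << 63   # high bit: chunk is not aligned, use the btree

def encode(agno, agbno, offset, unaligned=False):
    ino = (agno << (INOPBLOG + AGBLKLOG)) | (agbno << INOPBLOG) | offset
    return ino | UNALIGNED_FLAG if unaligned else ino

def decode(ino):
    """Return (agno, agbno, offset, needs_btree_lookup)."""
    needs_btree = bool(ino & UNALIGNED_FLAG)
    ino &= ~UNALIGNED_FLAG
    offset = ino & ((1 << INOPBLOG) - 1)
    agbno = (ino >> INOPBLOG) & ((1 << AGBLKLOG) - 1)
    agno = ino >> (INOPBLOG + AGBLKLOG)
    return agno, agbno, offset, needs_btree
```

Only inodes with the flag set pay for the AGI btree search; everything else keeps the cheap arithmetic path.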

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.
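
A sketch of how such a per-block bitmask could be interpreted is shown below. The function name and the bit ordering (bit 0 covers the first block of the chunk) are assumptions made for illustration.

```python
CHUNK_INODES = 64  # inodes per chunk record, as described above

def allocatable_inodes(block_mask, block_size=4096, inode_size=256):
    """Indexes within the chunk that sit in blocks the mask marks as usable."""
    per_block = block_size // inode_size
    return [i for i in range(CHUNK_INODES)
            if (block_mask >> (i // per_block)) & 1]

def blocks_in_chunk(block_size=4096, inode_size=256):
    """Number of mask bits needed for one chunk."""
    return CHUNK_INODES // (block_size // inode_size)
```

With 4k blocks, 256 byte inodes need 4 mask bits per chunk and 512 byte inodes need 8, so both common cases fit comfortably in a small bitmask.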

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. As a result
of the lack of frequency that this needs to be done, it can probably be
handled by a single thread or delayed workqueue.
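
The periodic scan could be sketched like this. The record format (first inode number, allocated count) is a simplified stand-in for AGI btree records, and the function name is hypothetical.

```python
def free_runs(records, chunk_inodes=64):
    """Find runs of physically contiguous chunks with no allocated inodes.

    records: iterable of (first_inode, allocated_count) per chunk.
    Returns a list of (first_inode_of_run, chunks_in_run).
    """
    runs = []
    start = None
    length = 0
    for first_ino, allocated in sorted(records):
        contiguous = (start is not None
                      and first_ino == start + length * chunk_inodes)
        if allocated == 0 and contiguous:
            length += 1
            continue
        if start is not None:
            runs.append((start, length))
            start, length = None, 0
        if allocated == 0:
            start, length = first_ino, 1
    if start is not None:
        runs.append((start, length))
    return runs
```

A background thread could then pick only the runs above some minimum length to free, leaving small isolated free chunks in place for reuse.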

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
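
Simple queue depth throttling amounts to a bounded queue. The sketch below uses Python's standard <tt>queue</tt> module purely as an illustration; the limit and the function name are hypothetical.

```python
import queue

MAX_QUEUED_INACTIVATIONS = 4  # hypothetical limit, kept tiny for illustration

inactivation_queue = queue.Queue(maxsize=MAX_QUEUED_INACTIVATIONS)

def queue_inactivation(ino, block=False):
    """Queue an inode for background inactivation.

    With block=True the caller sleeps until there is room, which is the
    throttle. With block=False a full queue is reported to the caller.
    """
    try:
        inactivation_queue.put(ino, block=block)
        return True
    except queue.Full:
        return False
```

Whether callers should sleep or fall back to doing the inactivation themselves when the queue is full is a policy decision the text leaves open.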

== Reclaim Optimizations ==

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
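
The exclusive second level cache behaviour can be modelled in a few lines. This is only a sketch: the class is hypothetical, inodes are plain dictionaries, and "compression" is modelled as conversion to a compact tuple rather than any real packed format.

```python
class TwoLevelInodeCache:
    """An inode lives in exactly one level: active or compressed."""

    def __init__(self, read_from_disk):
        self.active = {}
        self.compressed = {}
        self.read_from_disk = read_from_disk
        self.disk_reads = 0

    def lookup(self, ino):
        if ino in self.active:
            return self.active[ino]
        packed = self.compressed.pop(ino, None)   # exclusive: remove on hit
        if packed is not None:
            inode = dict(packed)                  # "uncompress"
        else:
            self.disk_reads += 1
            inode = self.read_from_disk(ino)
        self.active[ino] = inode
        return inode

    def reclaim(self, ino):
        # Clean, still-linked inode: compress it instead of freeing it.
        inode = self.active.pop(ino)
        self.compressed[ino] = tuple(sorted(inode.items()))
```

A lookup that hits the compressed level costs some CPU to rebuild the inode, but no disk read is issued.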

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==
----------------------------------------------------

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

One method we could use for this would be to introduce our own page LRUs into
the buftarg cache that we can reclaim from. This would allow us to sort pages
according to their contents into different LRUs and periodically reclaim pages
of specific types that were not referenced. This, however, would introduce a
fair amount of complexity into the buffer cache that doesn't currently exist.
Also, from a higher perspective, it makes the buffer cache a complex
part-buffer cache, part VM frankenstein.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

As an aside, we already have the mechanism in place to vary buffer aging based
on their type. The Irix buffer cache used this to great effect when under
memory pressure and the XFS code that configured it still exists in the Linux
code base. However, the Linux XFS buffer cache has never implemented any
mechanism to allow this functionality to be exploited. A delayed buffer reclaim
mechanism as described above could be greatly enhanced by making use of this
code in XFS.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
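
The arithmetic behind these figures is shown below. The per-bufferhead size of 56 bytes is an assumption inferred from the 14MB figure; the real size of a bufferhead depends on architecture and kernel version.

```python
EXTENT_BYTES = 1 << 30     # 1GB extent
BLOCK_SIZE = 4096          # 4k block size, one bufferhead per block
BUFFERHEAD_BYTES = 56      # assumption: inferred from the ~14MB figure
XFS_EXTENT_BYTES = 24      # in-memory extent record size quoted above

bufferheads = EXTENT_BYTES // BLOCK_SIZE
bufferhead_total = bufferheads * BUFFERHEAD_BYTES
overhead_ratio = bufferhead_total // XFS_EXTENT_BYTES

print(bufferheads, bufferhead_total / (1 << 20), overhead_ratio)
```

Under that assumption the bufferheads take exactly 14MB, several hundred thousand times the size of the single extent record.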

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once it gets to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

Improving inode Caching

2010-12-23T03:37:35Z

Dgc: /* Combining XFS and VFS inodes */

Future Directions for XFS

== Improving Inode Caching and Operation in XFS ==
--------------------------------------------

Thousand foot view:

We want to drive inode lookup in a manner that is as parallel, scalable and low
overhead as possible. This means efficient indexing, lowering memory
consumption, simplifying the caching hierarchy, removing duplication and
reducing/removing lock traffic.

In addition, we want to provide a good foundation for simplifying inode I/O,
improving writeback clustering, preventing RMW of inode buffers under memory
pressure, reducing creation and deletion overhead and removing writeback of
unlogged changes completely.

There are a variety of features in disconnected trees and patch sets that need
to be combined to achieve this - the basic structure needed to implement this is
already in mainline and that is the radix tree inode indexing. Further
improvements are going to be based around this structure and using it
effectively to avoid needing other indexing mechanisms.

Discussion:

== Combining XFS and VFS inodes ==
----------------------------

Status: Done (October 2008)

== Compressed Inode Cache ==
-----------------------------

The XFS inode cache uses a lot of memory. We can avoid this problem by making
use of the compressed inode cache - only the active inodes are held in a
non-compressed form, hence most inodes will end up being cached in compressed
form rather than in the XFS/linux inode form. The compressed form can reduce
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that
they currently take on a 64bit system. Hence by moving to a compressed cache we
can greatly increase the number of inodes cached in a given amount of memory
which more than offsets any comparative increase we will see from inodes in
reclaim. The compressed cache should really have an LRU and a shrinker as well
so that memory pressure will slowly trim it as memory demands occur. [Note:
this compressed cache is discussed further later on in the reclaim context.]

== Fixed Inode Cache Size ==

It is worth noting that for embedded systems and appliances it may be worthwhile allowing
the size of the caches to be fixed. Also, to prevent memory fragmentation
problems, we could simply allocate that memory to the compressed cache slab.
In effect, this would become a 'static slab' in that it has a bound maximum
size and never frees any memory. When the cache is full, we reclaim an
object out of it for reuse - this could be done by triggering the shrinker
to reclaim from the LRU. This would prevent the compressed inode cache from
consuming excessive amounts of memory in tightly constrained environments.
Such an extension to the slab caches does not look difficult to implement,
and would allow such customisation with minimal deviation from mainline code.

== Bypassing the Linux Inode Cache ==

With a larger number of cached inodes than the linux inode cache could possibly
hold, it makes sense to completely remove the linux inode cache from the lookup
path. That is, we do all our inode lookup based on the XFS cache, and if we find
a compressed inode we uncompress it and turn it into a combined linux+XFS inode.
If we also fast-path igrab() to avoid the inode_lock in the common case
(refcount > 0) then we will substantially reduce the traffic on the inode_lock.

If we have not hashed the inode in the linux inode cache, we now have to take
care of tracking dirty inodes ourselves - unhashed inodes are not added to the
superblock dirty inode list by __mark_inode_dirty(). However, we do get a
callout (->dirty_inode) that allows us to do this ourselves. We can use this
callout and a tag in the inode radix tree to track all dirty inodes, or even
just use the superblock list ourselves. Either way, we now have a mechanism that
allows us to track all dirty inodes our own way.
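
Tracking dirty inodes with a tag in the inode index could be modelled as below. This is a user-space sketch with hypothetical names: a dictionary stands in for the radix tree, and a set stands in for the dirty tag.

```python
class InodeIndex:
    """Stand-in for an inode radix tree carrying a 'dirty' tag."""

    def __init__(self):
        self.inodes = {}
        self.dirty_tag = set()

    def insert(self, ino, inode):
        self.inodes[ino] = inode

    def dirty_inode(self, ino):
        # Model of the ->dirty_inode callout: tag the inode in the index.
        if ino in self.inodes:
            self.dirty_tag.add(ino)

    def clear_dirty(self, ino):
        # Called once the inode has been written back.
        self.dirty_tag.discard(ino)

    def dirty_in_order(self):
        # A tagged lookup walks dirty inodes in ascending inode order.
        return sorted(self.dirty_tag)
```

Walking the tagged entries in ascending order is what later makes ordered, clustered inode writeback straightforward.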

Now that we can track dirty inodes ourselves, we can pretty much isolate
writeback of both data and inodes from the generic pdflush code. If we add a
hook high up in the pdflush path that simply passes us a writeback control
structure with the current writeback guidelines, we can do writeback within
those guidelines in the most optimal fashion for XFS.

== Avoiding the Generic pdflush Code ==

For pdflush driven writeback, we only want to write back data; all other inode
writeback should be driven from the AIL (our time ordered dirty metadata list)
or xfssyncd in a manner that is most optimal for XFS.

Furthermore, if we implement our own pdflush method, we can parallelise it in
several ways. We can ensure that each filesystem has its own flush thread or
thread pool, we can have a thread pool shared by all filesystems (like pdflush
currently operates), we can have a flush thread per inode radix tree, and so
on. The method of parallelisation is open for interpretation, but enabling
multiple flush threads to operate on a single filesystem is one of the necessary
requirements to avoid data writeback (and hence delayed allocation) being
limited to the throughput of a single CPU per filesystem.

As it is, once data writeback is separated from inode writeback, we could
simply use pushing the AIL as a method of writing back metadata in the
background. There is no good reason for writing the inode immediately after
data if the inode is in the AIL - it will get written soon enough as the tail
of the AIL gets moved along. If we log all inode changes, then we'll be
unlikely to write the inode multiple times over its dirty life-cycle as it
will continue to be moved forward in the AIL each time it is logged...

== Improving Inode Writeback ==

To optimise inode writeback, we really need to reduce the impact of inode
buffer read-modify-write cycles. XFS is capable of caching far more inodes in
memory than it has buffer space available for, so RMW cycles during inode
writeback under memory pressure are quite common. Firstly, we want to avoid
blocking pdflush at all costs. Secondly, we want to issue as much localised
readahead as possible in ascending offset order to allow both elevator merging
of readahead and as little seeking as possible. Finally, we want to issue all
the write cycles as close together as possible to allow the same elevator and
I/O optimisations to take place.

To do this, firstly we need the non-blocking inode flush semantics to issue
readahead on buffers that are not up-to-date rather than reading them
synchronously. Inode writeback already has the interface to handle inodes that
weren't flushed - we return EAGAIN from xfs_iflush() and the higher inode
writeback layers handle this appropriately. It would be easy to add another
flag to pass down to the buffer layer to say 'issue but don't wait for any
read'. If we use a radix tree traversal to issue readahead in such a manner,
we'll get ascending offset readahead being issued.

One problem with this is that we can issue too much readahead and thrash the
cache. A possible solution to this is to make the readahead a 'delayed read'
and on I/O completion add it to a queue that holds a reference on the buffer.
If a followup read occurs soon after, we remove it from the queue and drop that
reference. This prevents the buffer from being reclaimed in between the
readahead completing and the real read being issued. We should also issue this
delayed read on buffers that are in the cache so that they don't get reclaimed
to make room for the readahead.

To prevent buildup of delayed read buffers, we can periodically purge them -
those that are older than a given age (say 5 seconds) can be removed from the
list and their reference dropped. This will free the buffer and allow its
pages to be reclaimed.
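
The delayed read queue and its periodic purge can be sketched in a few lines. This is only an illustration of the idea, not kernel code: the <tt>Buffer</tt> and <tt>DelayedReadQueue</tt> names, the injectable clock and the 5 second default are all assumptions made for the example.

```python
import collections
import time


class Buffer:
    """Stand-in for a metadata buffer with a reference count."""

    def __init__(self, blkno):
        self.blkno = blkno
        self.refcount = 0

    def hold(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1


class DelayedReadQueue:
    """Pins readahead buffers until a real read arrives or they age out."""

    def __init__(self, max_age=5.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        # blkno -> (buffer, time queued); insertion order is age order
        self.queue = collections.OrderedDict()

    def readahead_done(self, buf):
        """On I/O completion: take a reference and queue the buffer."""
        if buf.blkno in self.queue:
            # Already pinned, so only refresh its age
            self.queue.move_to_end(buf.blkno)
            self.queue[buf.blkno] = (buf, self.clock())
            return
        buf.hold()
        self.queue[buf.blkno] = (buf, self.clock())

    def real_read(self, blkno):
        """Follow-up read: dequeue the buffer and drop the queue's reference.

        A real caller would take its own reference before this one is dropped.
        """
        entry = self.queue.pop(blkno, None)
        if entry is None:
            return None
        buf, _ = entry
        buf.release()
        return buf

    def purge(self):
        """Drop references on buffers older than max_age; return the count."""
        now = self.clock()
        purged = 0
        while self.queue:
            blkno, (buf, queued) = next(iter(self.queue.items()))
            if now - queued < self.max_age:
                break  # everything behind this entry is younger
            del self.queue[blkno]
            buf.release()
            purged += 1
        return purged
```

Because entries are kept in age order, the purge only ever looks at the head of the queue, so a periodic purge is cheap even when many buffers are pinned.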

Once we have done the readahead pass, we can then do a modify and writeback
pass over all the inodes, knowing that there will be no read cycles to delay
this step. Once again, a radix tree traversal gives us ascending order
writeback and hence the modified buffers we send to the device will be in
optimal order for merging and minimal seek overhead.

== Contiguous Inode Allocation ==

To make optimal use of the radix tree cache and enable wide-scale clustering of
inode writeback across multiple clusters, we really need to ensure that inode
allocation occurs in large contiguous chunks on disk. Right now we only
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe
unit (or multiple of) full of inodes at a time. This would allow inode
writeback clustering to do full stripe writes to the underlying RAID if there
are dirty inodes spanning the entire stripe unit.

The problem with doing this is that we don't want to introduce the latency of
creating megabytes of inodes when only one is needed for the current operation.
Hence we need to push the inode creation into a background thread and use that
to create contiguous inode chunks asynchronously. This moves the actual on-disk
allocation of inodes out of the normal create path; it should always be able to
find a free inode without doing on disk allocation. This will simplify the
create path by removing the allocate-on-disk-then-retry-the-create double
transaction that currently occurs.

As an aside, we could preallocate a small amount of inodes in each AG (10-20MB
of inodes per AG?) without impacting mkfs time too greatly. This would allow
the filesystem to be used immediately on the first mount without triggering
lots of background allocation. This could also be done after the first mount
occurs, but that could interfere with typical benchmarking situations. Another
good reason for this preallocation is that it will help reduce xfs_repair
runtime for most common filesystem usages.

One of the issues that the background create will cause is a substantial amount
of log traffic - every inode buffer initialised will be logged in whole. Hence
if we create a megabyte of inodes, we'll be causing a megabyte of log traffic
just for the inode buffers we've initialised. This is relatively simple to fix
- we don't log the buffer, we just log the fact that we need to initialise
inodes in a given range. In recovery, when we see this transaction, then we
build the buffers, initialise them and write them out. Hence, we don't need to
log the buffers used to initialise the inodes.
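
To give a feel for the saving, here is a rough sketch of logging a range instead of the buffers. The record layout, the field names and the "generation number followed by zeroes" inode image are all hypothetical; the real log item and on-disk inode formats are not described here.

```python
import struct

# Hypothetical log record: AG number, start block within the AG, length in
# blocks, number of inodes to initialise, inode size, generation number.
INODE_INIT_RECORD = struct.Struct("<IIIIII")


def pack_init_record(agno, agbno, length, count, isize, gen):
    """Build the small record that is logged instead of the buffers."""
    return INODE_INIT_RECORD.pack(agno, agbno, length, count, isize, gen)


def replay_init_record(record, block_size):
    """Recovery side: rebuild the inode buffers the record describes."""
    agno, agbno, length, count, isize, gen = INODE_INIT_RECORD.unpack(record)
    # Placeholder inode image: the generation number followed by zeroes
    inode = struct.pack("<I", gen).ljust(isize, b"\0")
    per_block = block_size // isize
    buffers = {}
    for i in range(length):
        n = max(0, min(per_block, count - i * per_block))
        buffers[(agno, agbno + i)] = (inode * n).ljust(block_size, b"\0")
    return buffers
```

With these assumptions, a megabyte of 256 byte inodes (4096 inodes in 256 blocks of 4k) is described by a 24 byte record, and recovery regenerates the full megabyte of buffers from it.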

Also, we can use the background allocations to keep track of recently allocated
inode regions in the per-ag. Using that information to select the next inode to
be used rather than requiring btree searches on every create will greatly reduce
the CPU overhead of workloads that create lots of new inodes. It is not clear
whether a single background thread will be able to allocate enough inodes
to keep up with demand from the rest of the system - we may need multiple
threads for large configurations.

== Single Block Inode Allocation ==

One of the big problems we have with filesystems that are approaching
full is that it can be hard to find a large enough extent to hold 64 inodes.
We've had ENOSPC errors on inode allocation reported on filesystems that
are only 85% full. This is a sign of free space fragmentation, and it
prevents inode allocation from succeeding. We could (and should) write
a free space defragmenter, but that does not solve the problem - it's
reactive, not preventative.

The main problem we have is that XFS uses inode chunk size and alignment
to optimise inode number to disk location conversion. That is, the conversion
becomes a single set of shifts and masks instead of an AGI btree lookup.
This optimisation substantially reduces the CPU and I/O overhead of
inode lookups, but it does limit our flexibility. If we break the
alignment restriction, every lookup has to go back to a btree search.
Hence we really want to avoid breaking chunk alignment and size
rules.
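
As an illustration of why the conversion is so cheap, the sketch below decodes an inode number with shifts and masks only. It assumes the offset within the block sits in the low bits, followed by the block number within the AG, followed by the AG number; the function names and the example geometry values are made up for the example.

```python
def split_inode_number(ino, agblklog, inopblog):
    """Decode an inode number into (AG number, AG block, offset in block).

    agblklog: log2 of the number of blocks per AG (rounded up)
    inopblog: log2 of the number of inodes per block
    """
    offset = ino & ((1 << inopblog) - 1)
    agbno = (ino >> inopblog) & ((1 << agblklog) - 1)
    agno = ino >> (inopblog + agblklog)
    return agno, agbno, offset


def make_inode_number(agno, agbno, offset, agblklog, inopblog):
    """Inverse of split_inode_number()."""
    return (agno << (inopblog + agblklog)) | (agbno << inopblog) | offset


# Example geometry: 4k blocks and 256 byte inodes give 16 inodes per block
# (inopblog = 4); agblklog = 20 is an arbitrary value for illustration.
ino = make_inode_number(agno=3, agbno=1000, offset=5, agblklog=20, inopblog=4)
print(split_inode_number(ino, agblklog=20, inopblog=4))
```

No btree lookup or I/O is involved; the disk location falls directly out of the inode number, which is what breaking the alignment rules would cost us.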

An approach to avoiding violation of this rule is to be able to determine which
index to look up when parsing the inode number. For example, we could use the
high bit of the inode number to indicate that it is located in a non-aligned
inode chunk and hence needs to be looked up in the btree. This would avoid
the lookup penalty for correctly aligned inode chunks.

If we then redefine the meaning of the contents of the AGI btree record for
such inode chunks, we do not need a new index to keep these in. Effectively,
we need to add a bitmask to the record to indicate which blocks inside
the chunk can actually contain inodes. We still use aligned/sized records,
but mask out the sections that we are not allowed to allocate inodes in.
Effectively, this would allow sparse inode chunks. There may be limitations
on the resolution of sparseness depending on inode size and block size,
but for the common cases of 4k block size and 256 or 512 byte inodes I
think we can run a fully sparse mapping for each inode chunk.

This would allow us to allocate inode extents of any alignment and size
that fits *inside* the existing alignment/size limitations. That is,
a single extent allocation could not span two btree records, but can
lie anywhere inside a single record. It also means that we can do
multiple extent allocations within one btree record to make optimal
use of the fragmented free space.
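
A minimal sketch of such a sparse record follows, assuming one mask bit per filesystem block inside an aligned 64-inode chunk. The class, its fields and its methods are hypothetical and only show how the mask restricts which inode numbers are usable.

```python
INODES_PER_CHUNK = 64


class SparseChunkRecord:
    """Hypothetical AGI btree record for an aligned 64-inode chunk."""

    def __init__(self, start_ino, block_size=4096, inode_size=256):
        self.start_ino = start_ino
        self.inodes_per_block = block_size // inode_size
        self.blocks = INODES_PER_CHUNK // self.inodes_per_block
        self.block_mask = 0  # bit n set: block n of the chunk holds inodes

    def add_extent(self, first_block, nblocks):
        """Back part of the chunk with a newly allocated extent."""
        if first_block < 0 or nblocks <= 0 or first_block + nblocks > self.blocks:
            raise ValueError("extent must lie inside a single record")
        extent = ((1 << nblocks) - 1) << first_block
        if self.block_mask & extent:
            raise ValueError("blocks already backed")
        self.block_mask |= extent

    def usable_inodes(self):
        """Inode numbers in this chunk that are backed by disk blocks."""
        return [
            self.start_ino + i
            for i in range(INODES_PER_CHUNK)
            if (self.block_mask >> (i // self.inodes_per_block)) & 1
        ]
```

With 4k blocks, 256 byte inodes give a 4 bit mask and 512 byte inodes an 8 bit mask. Several small extents can be added to the same record, but an extent that would cross the end of the record is rejected.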

It should be noted that this will probably have impact on some of the
inode cluster buffer mapping and clustering algorithms. It is not clear
exactly what impact yet, but certainly write clustering will be affected.
Fortunately we'll be able to detect the inodes that will have this problem
by the high bit in the inode number.

== Inode Unlink ==

If we turn to look at unlink and reclaim interactions, there are a few
optimisations that can be made. Firstly, we don't need to do inode inactivation
in reclaim threads - these transactions can easily be pushed to a background
thread. This means that xfs_inactive would be little more than a vmtruncate()
call and queuing to a workqueue. This will substantially speed up the processing
of prune_icache() - we'll get inodes moved into reclaim much faster than we do
right now.

This will have a noticeable effect, though. When inodes are unlinked the space
consumed by those inodes may not be immediately freed - it will be returned as
the inodes are processed through the reclaim threads. This means that userspace
monitoring tools such as 'df' may not immediately reflect the result of a
completed unlink operation. This will be a user visible change in behaviour,
though in most cases should not affect anyone and for those that it does affect
a 'sync' should be sufficient to wait for the space to be returned.

Now that inodes to be unlinked are out of general circulation, we can make the
unlinked path more complex. It is desirable to move the unlinked list from the
inode buffer to the inode core, but that has locking implications for incore
unlinked. Hence we really need background thread processing to enable this to
work (i.e. being able to requeue inodes for later processing). To ensure that
the overhead of this work is not a limiting factor, we will probably need
multiple workqueue processing threads for this.

Moving the logging to the inode core enables two things - it allows us to keep
an in-memory copy of the unlinked list off the perag and that allows us to remove
xfs_inotobp(). The in-memory unlinked list means we don't have to read and
traverse the buffers every time we need to find the previous buffer to remove an
inode from the list, but it does mean we have to take the inode lock. If the
previous inode is locked, then we can't remove the inode from the unlinked list
so we must requeue it for this to occur at a later time.

Combined with the changes to inode create, we effectively will only use the
inode buffer in the transaction subsystem for marking the region stale when
freeing an inode chunk from disk (i.e. the default noikeep configuration). If
we are using large inode allocation, we don't want to be freeing random inode
chunks - this will just leave us with fragmented inode regions and undo all the
good work that was done originally.

To avoid this, we should not be freeing inode chunks as soon as they no longer
have any allocated inodes in them. We should periodically scan the AGI btree
looking for contiguous chunks that have no inodes allocated in them, and then
free the large contiguous regions we find in one go. It is likely this can
be done in a single transaction; it's one extent to be freed, along with a
contiguous set of records to be removed from the AGI btree so should not
require logging much at all. Also, the background scanning could be triggered
by a number of different events - low space in an AG, a large number of free
inodes in an AG, etc - as it doesn't need to be done frequently. As a result
of the lack of frequency that this needs to be done, it can probably be
handled by a single thread or delayed workqueue.
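
The scan itself is simple. The sketch below walks a sorted list of (start inode, free count) pairs, standing in for an in-order walk of the AGI btree, and reports runs of adjacent chunks that are completely free. The function name and the <tt>min_chunks</tt> threshold are assumptions for the example.

```python
INODES_PER_CHUNK = 64


def find_free_runs(records, min_chunks=2):
    """Find runs of adjacent, completely free inode chunks.

    records: iterable of (start_ino, free_count) sorted by start_ino.
    Returns a list of (first_start_ino, number_of_chunks).
    """
    runs = []
    run_start, run_len = None, 0
    prev_start = None
    for start_ino, free_count in records:
        fully_free = free_count == INODES_PER_CHUNK
        adjacent = (prev_start is not None
                    and start_ino == prev_start + INODES_PER_CHUNK)
        if fully_free and run_len and adjacent:
            run_len += 1  # extend the current run
        else:
            # Close the previous run, then maybe start a new one here
            if run_len >= min_chunks:
                runs.append((run_start, run_len))
            run_start, run_len = (start_ino, 1) if fully_free else (None, 0)
        prev_start = start_ino
    if run_len >= min_chunks:
        runs.append((run_start, run_len))
    return runs
```

Each run found this way corresponds to one extent to free and one contiguous set of btree records to remove.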

Further optimisations are possible here - if we rule that the AGI btree is the
sole place that inodes are marked free or in-use (with the exception of
unlinked inodes attached to the AGI lists), then we can avoid the need to
write back unlinked inodes or read newly created inodes from disk. This would
require all inodes to effectively use a random generation number assigned at
create time as we would not be reading it from disk - writing/reading the current
generation number appears to be the only real reason for doing this I/O. This
would require extra checks to determine if an inode is unlinked - we
need to do an imap lookup rather than reading it and then checking it is
valid if it is not already in memory. Avoiding the I/O, however, will greatly speed
up create and remove workloads. Note: the impact of this on the bulkstat algorithm
has not been determined yet.

One of the issues we need to consider with this background inactivation is that
we will be able to defer a large quantity of inactivation transactions so we are
going to need to be careful about how much we allow to be queued. Simple queue
depth throttling should be all that is needed to keep this under control.
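
Queue depth throttling of this kind can be sketched with a bounded queue: once the limit is reached, whoever tries to queue more work blocks until the background thread catches up. The class name and the single worker thread are assumptions for the example.

```python
import queue
import threading


class InactivationQueue:
    """Bounded work queue: callers block once max_depth items are pending."""

    def __init__(self, max_depth, handler):
        self.work = queue.Queue(maxsize=max_depth)
        self.handler = handler
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def defer(self, inode):
        # put() blocks while the queue is full, which throttles the caller
        self.work.put(inode)

    def _run(self):
        while True:
            inode = self.work.get()
            try:
                self.handler(inode)
            finally:
                self.work.task_done()

    def drain(self):
        """Wait until every queued inactivation has been processed."""
        self.work.join()
```

The depth limit bounds how much deferred inactivation work can accumulate, at the cost of occasionally stalling the thread that queues it.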

== Reclaim Optimizations ==

Now that we have efficient unlink, we've got to handle the reclaim of all the
inodes that are now dead or simply not referenced. For inodes that are dirty,
we need to write them out to clean them. For inodes that are clean and not
unlinked, we need to compress them down for more compact storage. This involves
some CPU overhead, but it is worth noting that reclaiming of clean inodes
typically only occurs when we are under memory pressure.

By compressing the XFS inode in this case, we are effectively reducing the
memory usage of the inode rather than freeing it directly. If we then get
another operation on that inode (e.g. the working set is slightly larger than
can be held in linux+XFS inode pairs), we avoid having to read the inode off
disk again - it simply gets uncompressed out of the cache. In essence we use
the compressed inode cache as an exclusive second level cache - it has higher
density than the primary cache and higher load latency and CPU overhead,
but it still avoids I/O in exactly the same manner as the primary cache.
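
The exclusive two-level arrangement can be sketched as follows. This is a toy model only: inodes are plain dictionaries, zlib and pickle stand in for whatever compact representation would really be used, and all names are hypothetical.

```python
import pickle
import zlib


class TwoLevelInodeCache:
    """Primary cache of live inodes backed by a compressed second level."""

    def __init__(self, read_from_disk):
        self.primary = {}      # ino -> live inode (a dict in this sketch)
        self.compressed = {}   # ino -> compressed bytes
        self.read_from_disk = read_from_disk

    def reclaim(self, ino):
        """Move a clean inode from the primary to the compressed cache."""
        inode = self.primary.pop(ino)
        self.compressed[ino] = zlib.compress(pickle.dumps(inode))

    def lookup(self, ino):
        if ino in self.primary:
            return self.primary[ino]
        # Exclusive caches: an inode lives in one level only, so a hit
        # removes it from the compressed cache.
        blob = self.compressed.pop(ino, None)
        if blob is not None:
            inode = pickle.loads(zlib.decompress(blob))
        else:
            inode = self.read_from_disk(ino)  # the only path that does I/O
        self.primary[ino] = inode
        return inode
```

A lookup that hits the compressed cache pays for decompression but never reaches <tt>read_from_disk</tt>, which is the property we care about.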

We cannot allow unrestricted build-up of reclaimable inodes - the memory they
consume will be large, so we should be aiming to compress reclaimable inodes as
soon as they are clean. This will prevent buildup of memory consuming
uncompressed inodes that are not likely to be referenced again immediately.

This clean inode reclamation process can be accelerated by triggering reclaim
on inode I/O completion. If the inode is clean and reclaimable we should
trigger immediate reclaim processing of that inode. This will mean that
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty
inodes.

For inodes that are unlinked, we can simply free them in reclaim as they
are no longer in use. We don't want to poison the compressed cache with
unlinked inodes, nor do we need to because we can allocate new inodes
without incurring I/O.

Still, we may end up with lots of inodes queued for reclaim. We may need
to implement a throttle mechanism to slow down the rate at which inodes
are queued for reclamation in the situation where the reclaim process
is not able to keep up. It should be noted that if we parallelise inode
writeback we should also be able to parallelise inode reclaim via
the same mechanism, so the need for throttling may be relatively low
if we can have multiple inodes under reclaim at once.

It should be noted that complexity is exposed by interactions with concurrent
lookups, especially if we move to RCU locking on the radix tree. Firstly, we
need to be able to do an atomic swap of the compressed inode for the
uncompressed inode in the radix tree (and vice versa), to be able to tell them
apart (magic #), and to have atomic reference counts to ensure we can avoid use
after free situations when lookups race with compression or freeing.

Secondly, with the complex unlink/reclaim interactions we will need to be
careful to detect inodes in the process of reclaim - the lookup process
will need to do different things depending on the state of reclaim. Indeed,
we will need to be able to cancel reclaim of an unlinked inode if we try
to allocate it before it has been fully unlinked or reclaimed. The same
can be said for an inode in the process of being compressed - if we get
a lookup during the compression process, we want to return the existing
inode, not have to wait, re-allocate and uncompress it again. These
are all solvable issues - they just add complexity.

== Accelerated Reclaim of buftarg Page Cache for Inodes ==

For single use inodes or even read-only inodes, we read them in, use them, then
reclaim them. With the compressed cache, they'll get compressed and live a lot
longer in memory. However, we also will have the inode cluster buffer pages
sitting in memory for some length of time after the inode was read in. This can
consume a large amount of memory that will never be used again, and does not
get reclaimed until they are purged from the LRU by the VM. It would be
advantageous to accelerate the reclaim of these pages so that they do not build
up unnecessarily.

One method we could use for this would be to introduce our own page LRUs into
the buftarg cache that we can reclaim from. This would allow us to sort pages
according to their contents into different LRUs and periodically reclaim pages
of specific types that were not referenced. This, however, would introduce a
fair amount of complexity into the buffer cache that doesn't currently exist.
Also, from a higher perspective, it makes the buffer cache a complex
part-buffer cache, part VM frankenstein.

A better method would appear to be to leverage the delayed read queue
mechanism. This delayed read queue pins read buffers for a short period of
time, and then if they have not been referenced they get torn down. If, as
part of this delayed read buffer teardown procedure we also free the backing
pages completely, we achieve the exact same result as having our own LRUs to
manage the page cache. This seems much simpler and a much more holistic
approach to solving the problem than implementing page LRUs.

As an aside, we already have the mechanism in place to vary buffer aging based
on their type. The Irix buffer cache used this to great effect when under
memory pressure and the XFS code that configured it still exists in the Linux
code base. However, the Linux XFS buffer cache has never implemented any
mechanism to allow this functionality to be exploited. A delayed buffer reclaim
mechanism as described above could be greatly enhanced by making use of this
code in XFS.

== Killing Bufferheads (a.k.a "Die, buggerheads, Die!") ==

[This is not strictly about inode caching, but doesn't fit into
other areas of development as closely as it does to inode caching
optimisations.]

XFS is extent based. The Linux page cache is block based. Hence for
every cached page in memory, we have to attach a structure for mapping
the blocks on that page back to the on-disk location. In XFS, we also
use this to hold state for delayed allocation and unwritten extent blocks
so the generic code can do the right thing when necessary. We also
use it to avoid extent lookups at various times within the XFS I/O
path.

However, this has a massive cost. While XFS might represent the
disk mapping of a 1GB extent in 24 bytes of memory, the page cache
requires 262,144 bufferheads (assuming 4k block size) to represent the
same mapping. That's roughly 14MB of memory needed to represent that.
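
The arithmetic behind these figures can be checked quickly. The 56 byte bufferhead size is an assumption inferred from the 14MB figure above; the actual structure size depends on the kernel version and architecture.

```python
GIB = 1 << 30
BLOCK_SIZE = 4096          # 4k block size, as above
BUFFERHEAD_SIZE = 56       # bytes; assumed, inferred from the ~14MB figure
EXTENT_RECORD_SIZE = 24    # bytes for the XFS extent mapping, as above

bufferheads = GIB // BLOCK_SIZE
bufferhead_bytes = bufferheads * BUFFERHEAD_SIZE

print(bufferheads)                     # 262144
print(bufferhead_bytes / (1 << 20))    # 14.0
print(EXTENT_RECORD_SIZE)              # 24
```

So under these assumptions the page cache needs on the order of 14MB to describe what the extent format holds in 24 bytes.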

Chris Mason wrote an extent map representation for page cache state
and mappings for BTRFS; that code is mostly generic and could be
adapted to XFS. This would allow us to hold all the page cache state
in extent format and greatly reduce the memory overhead that it currently
has. The tradeoff is increased CPU overhead due to tree lookups where
structure lookups currently are used. Still, this has much lower
overhead than xfs_bmapi() based lookups, so the penalty is going to
be lower than if we did these lookups right now.

If we make this change, we would then have three levels of extent
caching:

- the BMBT buffers
- the XFS incore inode extent tree (iext*)
- the page cache extent map tree

Effectively, the XFS incore inode extent tree becomes redundant - all
the extent state it holds can be moved to the generic page cache tree
and we can do all our incore operations there. Our logging of changes
is based on the BMBT buffers, so getting rid of the iext layer would
not impact the transaction subsystem at all.

Such integration with the generic code will also allow development
of generic writeback routines for delayed allocation, unwritten
extents, etc that are not specific to a given filesystem.

== Demand Paging of Large Inode Extent Maps ==

Currently the inode extent map is pinned in memory until the inode is
reclaimed. Hence an inode with millions of extents will pin a large
amount of memory and this can cause serious issues in low memory
situations. Ideally we would like to be able to page the extent
map in and out once they get to a certain size to avoid this
problem. This feature requires more investigation before an overall
approach can be detailed here.

It should be noted that if we move to an extent-based page cache mapping
tree, the associated extent state tree can be used to track sparse
regions. That is, regions of the extent map that are not in memory
can be easily represented and accesses to an unread region can then
be used to trigger demand loading.

== Food For Thought (Crazy Ideas) ==

If we are not using inode buffers for logging changes to inodes, we should
consider whether we need them at all. What benefit do the buffers bring us when
all we will use them for is read or write I/O? Would it be better to go
straight to the buftarg page cache and do page based I/O via submit_bio()?

XFS FAQ

2010-12-09T01:33:03Z

Dgc:

Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]

Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.

== Q: Where can I find documentation about XFS? ==

The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.

You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the '''<nowiki>#xfs</nowiki>''' IRC channel on ''irc.freenode.net''.

== Q: Where can I find documentation about ACLs? ==

Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/

The '''acl(5)''' manual page is also quite extensive.

== Q: Where can I find information about the internals of XFS? ==

An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.

Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.

== Q: What partition type should I use for XFS on Linux? ==

Linux native filesystem (83).

== Q: What mount options does XFS have? ==

There are a number of mount options influencing XFS filesystems - refer to the '''mount(8)''' manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])

== Q: Is there any relation between the XFS utilities and the kernel version? ==

No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.

== Q: Does it run on platforms other than i386? ==

XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It's also well tested on the IA64 platform since that's the platform SGI Linux products use.

== Q: Quota: Do quotas work on XFS? ==

Yes.

To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/ http://sourceforge.net/projects/linuxquota/] or use '''xfs_quota(8)'''.

== Q: Quota: What's project quota? ==

The project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.

== Q: Quota: Can group quota and project quota be used at the same time? ==

No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.

== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the fs (and vice versa)? ==

To be answered.

== Q: Are there any dump/restore tools for XFS? ==

'''xfsdump(8)''' and '''xfsrestore(8)''' are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.

== Q: Does LILO work with XFS? ==

This depends on where you install LILO.

Yes, for MBR (Master Boot Record) installations.

No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.

== Q: Does GRUB work with XFS? ==

There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.

== Q: Can XFS be used for a root filesystem? ==

Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the "rootflags=" kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit "logdev=" specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]

== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==

Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be "clean" when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.

== Q: Is there a way to make a XFS filesystem larger or smaller? ==

You can ''NOT'' make a XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.

An XFS filesystem may be enlarged by using '''xfs_growfs(8)'''.

If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the ''exact same'' starting point. Run '''xfs_growfs''' to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.

Using XFS filesystems on top of a volume manager makes this a lot easier.

== Q: What information should I include when reporting a problem? ==

Things to include are what version of XFS you are using (if this is a CVS version, from what date) and the version of the kernel. If you have problems with userland packages please report the version of the package you are using.

If the problem relates to a particular filesystem, the output from the '''xfs_info(8)''' command and any '''mount(8)''' options in use will also be useful to the developers.

If you experience an oops, please run it through '''ksymoops''' so that it can be interpreted.

If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run '''xfs_metadump(8)''' to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.

== Q: Mounting an XFS filesystem does not work - what is wrong? ==

If mount prints an error message something like:

mount: /dev/hda5 has wrong major or minor number

you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the "-t xfs" option on mount or the "xfs" option in <tt>/etc/fstab</tt>.

If you get something like:

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
or too many mounted file systems

Refer to your system log file (<tt>/var/log/messages</tt>) for a detailed diagnostic message from the kernel.

== Q: Does the filesystem have an undelete capability? ==

There is no undelete in XFS. However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].
Such implementations do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.

This applies to most recent Linux distributions, as well as to most popular NAS boxes that use embedded linux and XFS file system.

Anyway, the best is to always keep backups.

== Q: How can I backup a XFS filesystem and ACLs? ==

You can backup a XFS filesystem with utilities like '''xfsdump(8)''' and standard '''tar(1)''' for standard files. If you want to backup ACLs you will need to use '''xfsdump''' or [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (> version 3.1.4) or [http://rsync.samba.org/ rsync] (>= version 3.0.0) to backup ACLs and EAs. '''xfsdump''' can also be integrated with [http://www.amanda.org/ amanda(8)].

== Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong? ==

The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], "Structure needs cleaning."
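
From an application's point of view, both error numbers can be recognised with a small check such as the following sketch. The helper names are made up for the example, and the value 117 for EUCLEAN is what common Linux architectures such as x86 use; it is only a fallback for platforms where the name is not defined.

```python
import errno
import os

# EUCLEAN is 117 on common Linux architectures; fall back to that value
# where the errno module does not define the name.
EUCLEAN = getattr(errno, "EUCLEAN", 117)
EFSCORRUPTED_OLD = 990  # value used before the conversion to EUCLEAN


def is_xfs_corruption_error(exc):
    """True if an OSError carries EUCLEAN or the older 990 value."""
    return isinstance(exc, OSError) and exc.errno in (EUCLEAN, EFSCORRUPTED_OLD)


def probe(path):
    """List a directory and report whether corruption was signalled."""
    try:
        os.listdir(path)
    except OSError as exc:
        if is_xfs_corruption_error(exc):
            return "corruption reported, check the kernel log"
        raise
    return "ok"
```

Any system call on the affected filesystem can return this error, so the directory listing here is just one convenient way to trigger it.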

The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.

There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.

You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).

== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==

Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.

Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the '''xfs_bmap(8)''' command).

== Q: What is the problem with the write cache on journaled filesystems? ==

Many drives use a write back cache in order to speed up the performance of writes. However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk. Further, the drive can de-stage data from the write cache to the platters in any order that it chooses. This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk. When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.

With hard disk cache sizes of currently (Jan 2009) up to 32MB that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance. But then you *must* disable the individual hard disk write cache to ensure the filesystem stays intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.

== Q: How can I tell if I have the disk write cache enabled? ==

For SCSI/SATA:

* Look in dmesg(8) output for a driver line, such as: "SCSI device sda: drive cache: write back"
* <nowiki># sginfo -c /dev/sda | grep -i 'write cache' </nowiki>

For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):

* <nowiki># hdparm -I /dev/sda</nowiki> and look under "Enabled Supported" for "Write cache"

For RAID controllers:

* See the section about RAID controllers below
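
The dmesg check above can be scripted. The following is a minimal sketch, assuming the kernel log uses the exact "SCSI device sda: drive cache: write back" wording quoted above (other kernel versions may word this line differently); the function name report_drive_cache is made up for this example.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<device> <cache mode>" for every drive cache line it finds.
# Assumes the message format quoted in this FAQ.
report_drive_cache() {
    sed -n 's/.*SCSI device \([a-z]*\): drive cache: \(write [a-z]*\).*/\1 \2/p'
}

# Usage: dmesg | report_drive_cache
```

A line reading "write back" means the drive write cache is enabled.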

== Q: How can I address the problem with the disk write cache? ==

=== Disabling the disk write back cache. ===

For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough): 

* <nowiki># hdparm -W0 /dev/sda</nowiki> (or <nowiki># hdparm -W0 /dev/hda</nowiki>)
* <nowiki># blktool /dev/sda wcache off</nowiki> (or <nowiki># blktool /dev/hda wcache off</nowiki>)

For SCSI:

* Using sginfo(8), which is a little tedious: it takes 3 steps. For example:
*# <nowiki>#sginfo -c /dev/sda</nowiki> which gives a list of attribute names and values
*# <nowiki>#sginfo -cX /dev/sda</nowiki> which gives an array of cache values which you must match up with the attribute names from step 1, e.g. 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0
*# <nowiki>#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0</nowiki> allows you to reset the value of the cache attributes.
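
Step 3 amounts to repeating the list from step 2 with one value changed. Here is a small sketch of that edit, assuming (as in the example above, where the sixth value goes from 1 to 0) that you have already worked out which position holds the write cache setting on your drive. set_field is a hypothetical helper, not part of sginfo.

```shell
# Hypothetical helper: replace the Nth whitespace-separated value on stdin.
# set_field POSITION NEW_VALUE
set_field() {
    awk -v n="$1" -v v="$2" '{ $n = v; print }'
}

# Example with the values shown above. Position 6 is an assumption taken
# from that example; match it against the names from step 1 on your drive.
echo "0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0" | set_field 6 0
```

The printed list is what would then be passed to the -cXR invocation.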

For RAID controllers:

* See the section about RAID controllers below

This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled. 
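
Because SATA/PATA drives forget the setting, the hdparm call has to be repeated, for instance from a boot script. A minimal sketch, assuming the drive list is maintained by hand; wcache_off_cmds is a hypothetical helper that only prints the commands so they can be reviewed before being run as root. It does nothing about resets caused by error recovery while the system is running.

```shell
# Hypothetical helper: print one "hdparm -W0" command per device given.
wcache_off_cmds() {
    for dev in "$@"; do
        printf 'hdparm -W0 %s\n' "$dev"
    done
}

# Review the output first, then run it, e.g.:
#   wcache_off_cmds /dev/sda /dev/sdb | sh
```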

=== Using an external log. ===

Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will '''not''' solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won't be able to guarantee that if the metadata is on a drive with the write cache enabled.

In fact using an external log will disable XFS' write barrier support.

=== Write barrier support. ===

Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with "nobarrier". Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the following 3 scenarios occurs:

* "Disabling barriers, not supported with external log device"
* "Disabling barriers, not supported by the underlying device"
* "Disabling barriers, trial barrier write failed"

If the filesystem is mounted with an external log device then we currently don't support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn't support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.
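
Checking the log for these messages is easy to script. A minimal sketch, assuming the messages contain the "Disabling barriers, ..." text exactly as listed above; whatever the kernel prints before that text is ignored, and barrier_warnings is a hypothetical helper name.

```shell
# Hypothetical helper: print any "Disabling barriers" reason found in
# kernel log text on stdin; the exit status is non-zero if none is found.
barrier_warnings() {
    grep -o 'Disabling barriers, .*'
}

# Usage: dmesg | barrier_warnings
```

No output (and a non-zero exit status) means none of the three messages was logged.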

== Q. Should barriers be enabled with storage which has a persistent write cache? ==

Many hardware RAID controllers have a persistent write cache which preserves its contents across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with "nobarrier". But take care about the hard disk write cache, which should be off.

== Q. Which settings does my RAID controller need ? ==

It's hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:

Real RAID controllers (not those found onboard of mainboards) normally have a battery backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory "[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]") which is used for buffering writes to improve speed. Even if it's battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents in that case.

* onboard RAID controllers: there are so many different types it's hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to a bad situation after a powerfail with RAID-1: when only parts of the disk cache have been written, the controller doesn't even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info, but then lost different data contents. So, turn off disk write caches before using the RAID function.

* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86

* Adaptec: allows setting individual drives cache
arcconf setcache <disk> wb|wt
wb=write back, which means write cache on, wt=write through, which means write cache off. So "wt" should be chosen.

* Areca: In archttp under "System Controls" -> "System Configuration" there's the option "Disk Write Cache Mode" (defaults to "Auto")

"Off": disk write cache is turned off

"On": disk write cache is enabled, this is not safe for your data but fast

"Auto": If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write cache off, to protect your data. In case no BBM is attached, the controller switches to "On", because neither controller cache nor disk cache is safe so you don't seem to care about your data and just want high speed (which you get then).

That's a very sensible default so you can leave it at "Auto" or enforce "Off" to be sure.

* LSI MegaRAID: allows setting individual disks cache:
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL -EnDskCache|DisDskCache

* Xyratex: from the docs: "Write cache includes the disk drive cache and controller cache.". So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.

== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==

The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don't work anymore, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system, even with a RAID controller with battery backed cache and the hard disk cache turned off (which is safe on a normal host), you can destroy a database within the virtual machine (client, domU, whatever you call it).

In qemu you can specify cache=off on the line specifying the virtual disk. For the others, information is missing.

== Q: What is the issue with directory corruption in Linux 2.6.17? ==

In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some "sparse" endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.

''Update: the fix is included in 2.6.17.7 and later kernels.''

To add insult to injury, '''xfs_repair(8)''' is currently not correcting these directories on detection of this corrupt state either. This '''xfs_repair''' issue is actively being worked on, and a fixed version will be available shortly.

''Update: a fixed '''xfs_repair''' is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.''

No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem like other kernels are affected.

The '''xfs_check''' tool, or '''xfs_repair -n''', should be able to detect any directory corruption.

Until a fixed '''xfs_repair''' binary is available, one can make use of the '''xfs_db(8)''' command to mark the problem directory for removal (see the example below). A subsequent '''xfs_repair''' invocation will remove the directory and move all contents into "lost+found", named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 3 (btree)
...
xfs_db> write core.mode 0
xfs_db> quit
</nowiki>

A subsequent '''xfs_repair''' will clear the directory, and add new entries (named by inode number) in lost+found.

The easiest way to map inode numbers to full paths is via '''xfs_ncheck(8)'''<nowiki>: </nowiki>

<nowiki>
# xfs_ncheck -i 14101 -i 14102 /dev/sdXXX
14101 full/path/mumble_fratz_foo_bar_1495
14102 full/path/mumble_fratz_foo_bar_1494
</nowiki>

Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
...
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:3628
xfs_db> fsblock 3628
xfs_db> type bmapbtd
xfs_db> print
magic = 0x424d4150
level = 0
numrecs = 19
leftsib = null
rightsib = null
recs[1-19] = [startoff,startblock,blockcount,extentflag]
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
</nowiki>

At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the '''xfs_db''' dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:

...
xfs_db> dblock 20
xfs_db> print
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0
dhdr.bestfree[0].length = 0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 13937
du[0].namelen = 25
du[0].name = "mumble_fratz_foo_bar_1595"
du[0].tag = 0x10
du[1].inumber = 13938
du[1].namelen = 25
du[1].name = "mumble_fratz_foo_bar_1594"
du[1].tag = 0x38
...

So, here we can see that inode number 13938 matches up with name "mumble_fratz_foo_bar_1594". Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at "lost+found" (once '''xfs_repair''' has removed the corrupt directory).
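
Picking the data extents out of a long bmapbtd listing by eye is error-prone, so here is a rough sketch that prints the start offset of every data extent, ready to be passed to dblock. It assumes the recs[] output format shown above and takes the start offset of the first leaf extent (33554432 in the example above) as its argument. data_extent_offsets is a hypothetical helper, not an xfs_db feature.

```shell
# data_extent_offsets LEAF_START
# Reads xfs_db "print" output of a bmapbtd block on stdin and prints the
# startoff of every record whose startoff is below LEAF_START.
data_extent_offsets() {
    tr ' ' '\n' | awk -v leaf="$1" '
        /^[0-9]+:\[[0-9]+,[0-9]+,[0-9]+,[0-9]+\]$/ {
            split($0, f, /[^0-9]+/)   # f[1] = record index, f[2] = startoff
            if (f[2] + 0 < leaf + 0) print f[2]
        }'
}

# Usage (with the print output saved to a file beforehand):
#   data_extent_offsets 33554432 < bmapbtd.txt
```

Each printed number can then be used as the argument of a dblock command in '''xfs_db'''.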

== Q: Why does my > 2TB XFS partition disappear when I reboot ? ==

Strictly speaking this is not an XFS problem.

To support > 2TB partitions you need two things: a kernel that supports large block devices (<tt>CONFIG_LBD=y</tt>) and a partition table format that can hold large partitions. The default DOS partition tables don't. The best partition format for > 2TB partitions is the EFI GPT format (<tt>CONFIG_EFI_PARTITION=y</tt>).

Without <tt>CONFIG_LBD=y</tt> you can't even create the filesystem, but without <tt>CONFIG_EFI_PARTITION=y</tt> it works fine until you reboot, at which point the partition will disappear. Note that you need to enable the <tt>CONFIG_PARTITION_ADVANCED</tt> option before you can set <tt>CONFIG_EFI_PARTITION=y</tt>.
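
Whether a kernel was built with both options can be checked against its configuration file. A minimal sketch, assuming a kernel that has both options as described above and a config available as a plain text file (often /boot/config-$(uname -r), though where it lives depends on the distribution); check_large_partition_config is a hypothetical helper.

```shell
# check_large_partition_config CONFIG_FILE
# Prints each required option that is not set to "y"; succeeds only if
# both CONFIG_LBD and CONFIG_EFI_PARTITION are enabled.
check_large_partition_config() {
    missing=0
    for opt in CONFIG_LBD CONFIG_EFI_PARTITION; do
        if ! grep -q "^${opt}=y\$" "$1"; then
            echo "missing: ${opt}=y"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_large_partition_config "/boot/config-$(uname -r)"
```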

== Q: Why do I receive <tt>No space left on device</tt> after <tt>xfs_growfs</tt>? ==

After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:

The only way to fix this is to move data around to free up space
below 1TB. Find your oldest data (i.e. that was around before even
the first grow) and move it off the filesystem (move, not copy).
Then if you copy it back on, the data blocks will end up above 1TB
and that should leave you with plenty of space for inodes below 1TB.

A complete dump and restore will also fix the problem ;)

Also, you can add 'inode64' to your mount options to allow inodes to live above 1TB.

== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==

The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons.

Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.

== Q: How to get around a bad inode repair is unable to clean up ==

The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.

xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX

== Q: How to calculate the correct sunit,swidth values for optimal performance ==

XFS lets you optimize for a given RAID stripe size and number of disks via mkfs.xfs options (su/sw or sunit/swidth) or via the sunit/swidth mount options. The calculation of these values is quite simple:

su = <RAID controllers stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks>

So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use

su = 64k
sw = 6 (RAID-6 of 8 disks has 6 data disks)

A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

su = 256k
sw = 8 (RAID-10 of 16 disks has 8 data disks)
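
The two worked examples follow the same rule, which can be written down as a short sketch. The RAID-0 and RAID-5 cases are not from the text above but follow the usual definitions (no parity disk and one parity disk, respectively), and RAID-10 is assumed to use two-way mirrors. Both function names are hypothetical.

```shell
# raid_data_disks LEVEL TOTAL_DISKS
# Prints the number of disks that hold data (not parity or mirror copies).
raid_data_disks() {
    case $1 in
        0)  echo "$2" ;;
        5)  echo $(($2 - 1)) ;;
        6)  echo $(($2 - 2)) ;;
        10) echo $(($2 / 2)) ;;
        *)  echo "unsupported RAID level: $1" >&2; return 1 ;;
    esac
}

# stripe_opts STRIPE_KIB LEVEL TOTAL_DISKS
# Prints the su/sw pair in the form used above.
stripe_opts() {
    sw=$(raid_data_disks "$2" "$3") || return 1
    echo "su=${1}k,sw=${sw}"
}

stripe_opts 64 6 8      # su=64k,sw=6
stripe_opts 256 10 16   # su=256k,sw=8
```

The resulting string has the form accepted by the -d option of mkfs.xfs.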

Alternatively, you can use "sunit" instead of "su", but then the value means "number of 512B sectors", and you *must* then use "swidth", also given as a "number of 512B sectors", instead of "sw"!

Please be aware that when xfs_info or mkfs.xfs report the sunit/swidth values, they use a different unit size than above. Whereas sunit and swidth are specified in 512B sectors, xfs_info and mkfs.xfs report them in multiples of your filesystem block size.

You can check this quite easily. Assuming a sunit/swidth of 1024/4096, and a block size of 4096, you should see a reported sunit/swidth of 128/512. 1024 * 512 == 128 * 4096.
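
The same check can be done for any geometry. The sketch below converts a value given in 512-byte sectors into filesystem blocks, which is the unit xfs_info reports according to the paragraph above; sectors_to_blocks is a hypothetical helper.

```shell
# sectors_to_blocks SECTORS BLOCK_SIZE_BYTES
sectors_to_blocks() {
    echo $(( $1 * 512 / $2 ))
}

sectors_to_blocks 1024 4096   # sunit  -> 128
sectors_to_blocks 4096 4096   # swidth -> 512
```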

== Q: Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work? ==

The default <tt>fsid</tt> type encodes only 32 bits of the inode number for subdirectory exports. However, exporting the root of the filesystem works, or using one of the non-default <tt>fsid</tt> types (<tt>fsid=uuid</tt> in <tt>/etc/exports</tt> with recent <tt>nfs-utils</tt>) should work as well. (Thanks, Christoph!)

== Q: What is the inode64 mount option for? ==

By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like "disk full" when you still have plenty of space free, but there's no more room in the first TB to create a new inode. Also, performance sucks.

To get around this, use the inode64 mount option for filesystems >1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.

Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.

== Q: Can I just try the inode64 option to see if it helps me? ==

Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.

== Q: Performance: mkfs.xfs -n size=64k option ==

Asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:

Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.

There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there's the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.

For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.

In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don't have any numbers on what the difference might be - I'm getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....

== Q: I want to tune my XFS filesystems for <something> ==

The standard answer you will get to this question is this: use the defaults.

There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.

There are a lot of "XFS tuning guides" that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don't expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.

In most cases, the only thing you need to consider for mkfs.xfs is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the logbsize and delaylog mount options. Increasing logbsize reduces the number of journal IOs for a given workload, and delaylog will reduce them even further. The trade off for this increase in metadata performance is that more operations may be "missing" after recovery if the system crashes while actively making modifications.

XFS FAQ

2010-12-09T01:12:48Z

Dgc: /* Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? */

Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]

Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.

== Q: Where can I find documentation about XFS? ==

The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.

You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the '''<nowiki>#xfs</nowiki>''' IRC channel on ''irc.freenode.net''.

== Q: Where can I find documentation about ACLs? ==

Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/

The '''acl(5)''' manual page is also quite extensive.

== Q: Where can I find information about the internals of XFS? ==

An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.

Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.

== Q: What partition type should I use for XFS on Linux? ==

Linux native filesystem (83).

== Q: What mount options does XFS have? ==

There are a number of mount options influencing XFS filesystems - refer to the '''mount(8)''' manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])

== Q: Is there any relation between the XFS utilities and the kernel version? ==

No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.

== Q: Does it run on platforms other than i386? ==

XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It's also well tested on the IA64 platform since that's the platform SGI Linux products use.

== Q: Quota: Do quotas work on XFS? ==

Yes.

To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/ http://sourceforge.net/projects/linuxquota/] or use '''xfs_quota(8)'''.

== Q: Quota: What's project quota? ==

The project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.

== Q: Quota: Can group quota and project quota be used at the same time? ==

No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.

== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the fs (and vice versa)? ==

To be answered.

== Q: Are there any dump/restore tools for XFS? ==

'''xfsdump(8)''' and '''xfsrestore(8)''' are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.

== Q: Does LILO work with XFS? ==

This depends on where you install LILO.

Yes, for MBR (Master Boot Record) installations.

No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.

== Q: Does GRUB work with XFS? ==

There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.

== Q: Can XFS be used for a root filesystem? ==

Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the "rootflags=" kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit "logdev=" specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]

== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==

Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be "clean" when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.

== Q: Is there a way to make a XFS filesystem larger or smaller? ==

You can ''NOT'' make a XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.

An XFS filesystem may be enlarged by using '''xfs_growfs(8)'''.

If using partitions, you need to have free space after this partition to do so. Remove the partition and recreate it larger with the ''exact same'' starting point, then run '''xfs_growfs''' to grow the filesystem into the larger partition. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.

Using XFS filesystems on top of a volume manager makes this a lot easier.

== Q: What information should I include when reporting a problem? ==

Things to include are the version of XFS you are using (if it is a CVS version, from what date) and the version of the kernel. If you have problems with userland packages please report the version of the package you are using.

If the problem relates to a particular filesystem, the output from the '''xfs_info(8)''' command and any '''mount(8)''' options in use will also be useful to the developers.

If you experience an oops, please run it through '''ksymoops''' so that it can be interpreted.

If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run '''xfs_metadump(8)''' to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.

== Q: Mounting an XFS filesystem does not work - what is wrong? ==

If mount prints an error message something like:

mount: /dev/hda5 has wrong major or minor number

you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the "-t xfs" option on mount or the "xfs" option in <tt>/etc/fstab</tt>.
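
Whether the running kernel currently knows about XFS can be checked in /proc/filesystems, which lists the registered filesystem types. A small sketch; xfs_registered is a hypothetical helper and takes the file to read as an optional argument so that it can be tried on a copy.

```shell
# xfs_registered [FILE]
# Succeeds if "xfs" appears as a filesystem type in FILE
# (default: /proc/filesystems).
xfs_registered() {
    grep -qw xfs "${1:-/proc/filesystems}"
}

# Usage:
#   xfs_registered || modprobe xfs
```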

If you get something like:

mount: wrong fs type, bad option, bad superblock on /dev/sda1,
or too many mounted file systems

Refer to your system log file (<tt>/var/log/messages</tt>) for a detailed diagnostic message from the kernel.

== Q: Does the filesystem have an undelete capability? ==

There is no undelete in XFS. However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].
This kind of XFS driver implementation does not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.

This applies to most recent Linux distributions, as well as to most popular NAS boxes that use embedded Linux and the XFS file system.

Anyway, the best is to always keep backups.

== Q: How can I backup a XFS filesystem and ACLs? ==

You can back up an XFS filesystem with utilities like '''xfsdump(8)''' and, for standard files, standard '''tar(1)'''. If you want to back up ACLs and EAs as well, you will need to use '''xfsdump''' or [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (> version 3.1.4) or [http://rsync.samba.org/ rsync] (>= version 3.0.0). '''xfsdump''' can also be integrated with [http://www.amanda.org/ amanda(8)].

== Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong? ==

The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], "Structure needs cleaning."

The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.

There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.

You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).

== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==

Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.

XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.

Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you'll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the '''xfs_bmap(8)''' command).

== Q: What is the problem with the write cache on journaled filesystems? ==

Many drives use a write back cache in order to speed up the performance of writes. However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk. Further, the drive can de-stage data from the write cache to the platters in any order that it chooses. This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk. When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.

With hard disk cache sizes currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage is very likely to cause major data loss.

With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued. A powerfail "only" loses data in the cache but no essential ordering is violated, and corruption will not occur.

With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you '''must''' disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.

== Q: How can I tell if I have the disk write cache enabled? ==

For SCSI/SATA:

* Look in dmesg(8) output for a driver line, such as: "SCSI device sda: drive cache: write back"
* <nowiki># sginfo -c /dev/sda | grep -i 'write cache' </nowiki>

For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):

* <nowiki># hdparm -I /dev/sda</nowiki> and look under "Enabled Supported" for "Write cache"

For RAID controllers:

* See the section about RAID controllers below
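
The dmesg check listed above for SCSI/SATA can be scripted. This is only a sketch: it assumes the log line has exactly the form quoted above ("SCSI device sda: drive cache: write back"); other kernel versions may word this message differently, and the function name is made up for this example.

```python
import re

# Matches kernel log lines of the form quoted in this FAQ, e.g.
#   "SCSI device sda: drive cache: write back"
_CACHE_RE = re.compile(r"(\w+): drive cache: (write back|write through)")


def write_cache_modes(dmesg_text):
    """Return {device: mode} for every 'drive cache' line found in dmesg_text."""
    modes = {}
    for line in dmesg_text.splitlines():
        match = _CACHE_RE.search(line)
        if match:
            modes[match.group(1)] = match.group(2)
    return modes
```

Feeding it the output of dmesg(8) yields a dictionary such as <tt>{"sda": "write back"}</tt>, where "write back" means the disk write cache is enabled.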

== Q: How can I address the problem with the disk write cache? ==

=== Disabling the disk write back cache. ===

For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough): 

* <nowiki># hdparm -W0 /dev/sda</nowiki> # hdparm -W0 /dev/hda
* <nowiki># blktool /dev/sda wcache off</nowiki> # blktool /dev/hda wcache off

For SCSI:

* Using sginfo(8), which is a little tedious. It takes 3 steps. For example:
*# <nowiki># sginfo -c /dev/sda</nowiki> which gives a list of attribute names and values
*# <nowiki># sginfo -cX /dev/sda</nowiki> which gives an array of cache values that you must match up with the attribute names from step 1, e.g. 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0
*# <nowiki># sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0</nowiki> which allows you to reset the values of the cache attributes. In this example only the sixth value differs from step 2 (1 becomes 0), which turns the write cache off.

For RAID controllers:

* See the section about RAID controllers below

This setting is persistent for a SCSI disk. However, for a SATA/PATA disk it needs to be reapplied after every reset, because the drive reverts to its default of write cache enabled. A reset can happen on reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.
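
The three sginfo steps for SCSI disks can be partly automated. The sketch below builds the step 3 command line from the array printed in step 2. It assumes that the write cache flag is the sixth value, as in the example above; verify this against the attribute list from step 1 for your drive. The function name and the <tt>wce_index</tt> parameter are made up for this example.

```python
def sginfo_disable_cache_command(device, values, wce_index=5):
    """Build the 'sginfo -cXR' command that clears the write cache flag.

    values    -- the whitespace-separated array printed by 'sginfo -cX'
    wce_index -- zero-based position of the write cache flag; 5 (the sixth
                 value) is an assumption taken from the example in this FAQ
    """
    fields = values.split()
    fields[wce_index] = "0"   # 0 = write cache disabled
    return "sginfo -cXR {} {}".format(device, " ".join(fields))


if __name__ == "__main__":
    step2 = "0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0"
    print(sginfo_disable_cache_command("/dev/sda", step2))
```

The function only builds the command string; it does not run sginfo or touch the device.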

=== Using an external log. ===

Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will '''not''' solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won't be able to guarantee that if the metadata is on a drive with the write cache enabled.

In fact using an external log will disable XFS' write barrier support.

=== Write barrier support. ===

Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with "nobarrier". Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this will be reported in the log, if any of the following 3 scenarios occurs:

* "Disabling barriers, not supported with external log device"
* "Disabling barriers, not supported by the underlying device"
* "Disabling barriers, trial barrier write failed"

If the filesystem is mounted with an external log device then we currently don't support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn't support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.
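
Checking the system logs for these messages can be scripted. This is a minimal sketch that searches log text for the three messages listed above; where the log text comes from (dmesg output, a syslog file) is up to you, and the function name is made up for this example.

```python
# The three messages quoted in this FAQ.
BARRIER_MESSAGES = (
    "Disabling barriers, not supported with external log device",
    "Disabling barriers, not supported by the underlying device",
    "Disabling barriers, trial barrier write failed",
)


def find_barrier_failures(log_text):
    """Return every log line that reports barriers being disabled."""
    return [
        line
        for line in log_text.splitlines()
        if any(message in line for message in BARRIER_MESSAGES)
    ]
```

An empty result suggests that none of the three scenarios was logged in the text you passed in; it does not prove that barriers are working.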

== Q. Should barriers be enabled with storage which has a persistent write cache? ==

Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with "nobarrier". But take care that the individual hard disk write caches are turned off.

== Q. Which settings does my RAID controller need ? ==

It's hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:

Real RAID controllers (not those integrated on mainboards) normally have a battery backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory "[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]") which is used for buffering writes to improve speed. Even if the controller cache is battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents in that case.

* onboard RAID controllers: there are so many different types it's hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to a bad situation after a power failure with RAID-1: when only parts of the disk cache have been written, the controller doesn't even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info, but then lost different data contents. So, turn off disk write caches before using the RAID function.

* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86

* Adaptec: allows setting individual drives cache
arcconf setcache <disk> wb|wt
wb=write back, which means write cache on, wt=write through, which means write cache off. So "wt" should be chosen.

* Areca: In archttp under "System Controls" -> "System Configuration" there's the option "Disk Write Cache Mode" (defaults to "Auto")

"Off": disk write cache is turned off

"On": disk write cache is enabled; this is fast but not safe for your data

"Auto": If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write cache off to protect your data. If no BBM is attached, the controller switches to "On": neither the controller cache nor the disk cache is safe in that case, so it assumes you care more about speed than about your data.

That's a very sensible default, so you can leave it at "Auto" or enforce "Off" to be sure.

* LSI MegaRAID: allows setting individual disks cache:
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL -EnDskCache|DisDskCache

* Xyratex: from the docs: "Write cache includes the disk drive cache and controller cache.". So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.

== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==

The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don't work anymore, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system, even with a RAID controller with battery backed cache and the hard disk cache turned off (which is safe on a normal host), you can destroy a database within the virtual machine (client, domU, whatever you call it).

In qemu you can specify cache=off on the line specifying the virtual disk. For the other products, information is missing.

== Q: What is the issue with directory corruption in Linux 2.6.17? ==

In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some "sparse" endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.

''Update: the fix is included in 2.6.17.7 and later kernels.''

To add insult to injury, '''xfs_repair(8)''' is currently not correcting these directories on detection of this corrupt state either. This '''xfs_repair''' issue is actively being worked on, and a fixed version will be available shortly.

''Update: a fixed '''xfs_repair''' is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.''

No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem like other kernels are affected.

The '''xfs_check''' tool, or '''xfs_repair -n''', should be able to detect any directory corruption.

Until a fixed '''xfs_repair''' binary is available, one can make use of the '''xfs_db(8)''' command to mark the problem directory for removal (see the example below). A subsequent '''xfs_repair''' invocation will remove the directory and move all contents into "lost+found", named by inode number (see second example on how to map inode number to directory entry name, which needs to be done ''before'' removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
core.mode = 040755
core.version = 2
core.format = 3 (btree)
...
xfs_db> write core.mode 0
xfs_db> quit
</nowiki>

A subsequent '''xfs_repair''' will clear the directory, and add new entries (named by inode number) in lost+found.

The easiest way to map inode numbers to full paths is via '''xfs_ncheck(8)''':

<nowiki>
# xfs_ncheck -i 14101 -i 14102 /dev/sdXXX
14101 full/path/mumble_fratz_foo_bar_1495
14102 full/path/mumble_fratz_foo_bar_1494
</nowiki>

Should this not work, we can manually map inode numbers in B-Tree format directory by taking the following steps:

<nowiki>
# xfs_db -x /dev/sdXXX
xfs_db> inode NNN
xfs_db> print
core.magic = 0x494e
...
next_unlinked = null
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:3628
xfs_db> fsblock 3628
xfs_db> type bmapbtd
xfs_db> print
magic = 0x424d4150
level = 0
numrecs = 19
leftsib = null
rightsib = null
recs[1-19] = [startoff,startblock,blockcount,extentflag]
1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]
5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]
9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]
12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]
15:[33554436,3488,8,0] 16:[33554444,3629,4,0]
17:[33554448,3748,4,0] 18:[33554452,3900,4,0]
19:[67108864,3364,4,0]
</nowiki>

At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the '''xfs_db''' dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:

...
xfs_db> dblock 20
xfs_db> print
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0
dhdr.bestfree[0].length = 0
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 13937
du[0].namelen = 25
du[0].name = "mumble_fratz_foo_bar_1595"
du[0].tag = 0x10
du[1].inumber = 13938
du[1].namelen = 25
du[1].name = "mumble_fratz_foo_bar_1594"
du[1].tag = 0x38
...

So, here we can see that inode number 13938 matches up with name "mumble_fratz_foo_bar_1594". Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at "lost+found" (once '''xfs_repair''' has removed the corrupt directory).
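
Collecting these mappings by hand is tedious for large directories. The sketch below extracts inode-number-to-name pairs from saved '''xfs_db''' output. It assumes the output has exactly the <tt>du[N].inumber</tt> / <tt>du[N].name</tt> layout shown above; the function name is made up for this example.

```python
import re

# Line formats as they appear in the dblock print output shown in this FAQ.
_INUMBER_RE = re.compile(r'^du\[(\d+)\]\.inumber = (\d+)$')
_NAME_RE = re.compile(r'^du\[(\d+)\]\.name = "(.*)"$')


def inode_to_name(dblock_output):
    """Map inode numbers to entry names found in one dblock print dump."""
    inumbers = {}   # du index -> inode number
    names = {}      # du index -> entry name
    for raw_line in dblock_output.splitlines():
        line = raw_line.strip()
        match = _INUMBER_RE.match(line)
        if match:
            inumbers[int(match.group(1))] = int(match.group(2))
            continue
        match = _NAME_RE.match(line)
        if match:
            names[int(match.group(1))] = match.group(2)
    # Keep only entries for which both fields were seen.
    return {inumbers[i]: names[i] for i in inumbers if i in names}
```

Run it on the output for each data extent and merge the resulting dictionaries before '''xfs_repair''' removes the directory.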

== Q: Why does my > 2TB XFS partition disappear when I reboot ? ==

Strictly speaking this is not an XFS problem.

To support > 2TB partitions you need two things: a kernel that supports large block devices (<tt>CONFIG_LBD=y</tt>) and a partition table format that can hold large partitions. The default DOS partition tables don't. The best partition format for > 2TB partitions is the EFI GPT format (<tt>CONFIG_EFI_PARTITION=y</tt>).

Without CONFIG_LBD=y you can't even create the filesystem, but without <tt>CONFIG_EFI_PARTITION=y</tt> it works fine until you reboot at which point the partition will disappear. Note that you need to enable the <tt>CONFIG_PARTITION_ADVANCED</tt> option before you can set <tt>CONFIG_EFI_PARTITION=y</tt>.
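
The 2TB figure follows from simple arithmetic: DOS partition tables store sector numbers in 32-bit fields. The sketch below works this out, assuming the traditional 512-byte sector size; the function name is made up for this example.

```python
SECTOR_BYTES = 512       # traditional sector size (assumption)
MAX_SECTORS = 2 ** 32    # sector counts in a DOS partition table are 32 bits wide


def dos_partition_limit_bytes(sector_bytes=SECTOR_BYTES):
    """Largest size a DOS partition table entry can describe, in bytes."""
    return MAX_SECTORS * sector_bytes


if __name__ == "__main__":
    limit = dos_partition_limit_bytes()
    print(limit // 2 ** 40, "TiB")   # prints "2 TiB"
```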

== Q: Why do I receive <tt>No space left on device</tt> after <tt>xfs_growfs</tt>? ==

After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:

The only way to fix this is to move data around to free up space
below 1TB. Find your oldest data (i.e. that was around before even
the first grow) and move it off the filesystem (move, not copy).
Then if you copy it back on, the data blocks will end up above 1TB
and that should leave you with plenty of space for inodes below 1TB.

A complete dump and restore will also fix the problem ;)

Also, you can add 'inode64' to your mount options to allow inodes to live above 1TB.

== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==

The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons.

Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.

== Q: How to get around a bad inode repair is unable to clean up ==

The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.

xfs_db -x -c 'inode XXX' -c 'write core.nextents 0' -c 'write core.size 0' /dev/hdXX

== Q: How to calculate the correct sunit,swidth values for optimal performance ==

XFS lets you optimize for a given RAID stripe size and number of disks via mount options. The calculation of these values is quite simple:

su = <RAID controllers stripe size in BYTES (or KiBytes when used with k)>
sw = <# of data disks>

So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use

su = 64k
sw = 6 (RAID-6 of 8 disks has 6 data disks)

A RAID stripe size of 256KB with a RAID-10 over 16 disks should use

su = 256k
sw = 8 (RAID-10 of 16 disks has 8 data disks)

Alternatively, you can use "sunit" instead of "su", but then the value is given as a number of 512B sectors, and you '''must''' then also use "swidth" instead of "sw", likewise given as a number of 512B sectors (sunit multiplied by the number of data disks).

Please be aware that when xfs_info or mkfs.xfs report the sunit/swidth values, they use a different unit size than above. Whereas sunit and swidth are first specified in 512B sectors, xfs_info and mkfs.xfs report them in multiples of your basic block size.

You can check this quite easily. Assuming a sunit/swidth of 1024/4096 (in 512B sectors) and a block size of 4096, you should see a reported sunit/swidth of 128/512, since 1024 * 512 == 128 * 4096 and 4096 * 512 == 512 * 4096.
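
The calculations above can be sketched in a few lines of Python. The function names are made up for this example, the RAID-10 case assumes two-way mirroring, and the default block size of 4096 is an assumption that you should replace with your filesystem's actual block size.

```python
SECTOR_BYTES = 512  # sunit and swidth are expressed in 512-byte sectors


def data_disks(raid_level, total_disks):
    """Number of data-bearing disks, following the examples in this FAQ."""
    if raid_level == "raid6":
        return total_disks - 2       # two disks' worth of parity
    if raid_level == "raid10":
        return total_disks // 2      # assumes two-way mirroring
    raise ValueError("unsupported RAID level: %r" % (raid_level,))


def stripe_geometry(stripe_bytes, n_data_disks, block_bytes=4096):
    """Return su/sw, sunit/swidth in sectors, and the same in fs blocks."""
    sunit = stripe_bytes // SECTOR_BYTES
    swidth = sunit * n_data_disks
    return {
        "su": stripe_bytes,
        "sw": n_data_disks,
        "sunit": sunit,
        "swidth": swidth,
        # Units used when xfs_info or mkfs.xfs report the values.
        "sunit_blocks": sunit * SECTOR_BYTES // block_bytes,
        "swidth_blocks": swidth * SECTOR_BYTES // block_bytes,
    }
```

For the RAID-6 example above (64KB stripe size, 8 disks), <tt>stripe_geometry(64 * 1024, data_disks("raid6", 8))</tt> gives sunit=128 and swidth=768 in sectors, which would be reported as 16 and 96 with a 4096-byte block size.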

== Q: Why doesn't NFS-exporting subdirectories of inode64-mounted filesystem work? ==

The default <tt>fsid</tt> type encodes only 32 bits of the inode number for subdirectory exports. However, exporting the root of the filesystem works, and using one of the non-default <tt>fsid</tt> types (<tt>fsid=uuid</tt> in <tt>/etc/exports</tt> with recent <tt>nfs-utils</tt>) should work as well. (Thanks, Christoph!)

== Q: What is the inode64 mount option for? ==

By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like "disk full" when you still have plenty of space free, but there's no more room in the first TB to create a new inode. Also, performance sucks.

To get around this, use the inode64 mount option for filesystems >1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.

Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.
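
As a rough illustration of where the 1TB figure may come from: an XFS inode number essentially encodes the inode's location on disk, so a 32-bit number can only reach a limited range. The sketch below assumes 256-byte inodes and ignores allocation group geometry, so treat the result as an approximation rather than an exact limit; the function name is made up for this example.

```python
def inode32_reach_bytes(inode_bytes=256):
    """Approximate disk range addressable by a 32-bit inode number.

    inode_bytes=256 is an assumption (a common default inode size);
    allocation group rounding is ignored in this approximation.
    """
    return 2 ** 32 * inode_bytes


if __name__ == "__main__":
    print(inode32_reach_bytes() // 2 ** 40, "TiB")   # prints "1 TiB"
```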

== Q: Can I just try the inode64 option to see if it helps me? ==

Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can't access files & dirs that have been created with an inode >32bit anymore.

== Q: Performance: mkfs.xfs -n size=64k option ==

Asked on the XFS mailing list about the implications of that mkfs option, Dave Chinner explained it this way:

Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.

There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there's the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.

For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.

In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don't have any numbers on what the difference might be - I'm getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....