<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Cattelan</id>
	<title>xfs.org - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Cattelan"/>
	<link rel="alternate" type="text/html" href="https://xfs.org/index.php/Special:Contributions/Cattelan"/>
	<updated>2026-04-20T10:10:07Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.3</generator>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=3053</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=3053"/>
		<updated>2024-11-29T19:28:18Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: update links&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Primary XFS Documentation ===&lt;br /&gt;
&lt;br /&gt;
The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format.  The material is suitable for experienced users as well as developers and support staff.  The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-xml-dev.git;a=summary git repository] and builds of the documentation are available here:&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]&lt;br /&gt;
** [http://sites.google.com/site/kandamotohiro/xfs Japanese translation] is also available.&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]&lt;br /&gt;
&lt;br /&gt;
* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://xfs.org/docs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)].)&lt;br /&gt;
&lt;br /&gt;
The format of &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt; has also been documented:&lt;br /&gt;
* [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
=== Papers, Presentations, Etc ===&lt;br /&gt;
&lt;br /&gt;
At the linux.conf.au 2012 event, Dave Chinner presented a talk on filesystem metadata scalability:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS - Recent and Future Adventures in Filesystem Scalability&#039;&#039; [[http://www.youtube.com/watch?v=FegjLbCnoBw Video]] [ [[:File:Xfs-scalability-lca2012.pdf|Presentation Slides]] ]&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://xfs.org/docs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://xfs.org/docs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://xfs.org/docs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Fixing XFS Filesystems Faster&#039;&#039; [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://xfs.org/docs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://xfs.org/docs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://xfs.org/docs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://xfs.org/docs/papers/als/als.ps ps]] [[http://xfs.org/docs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September 1999, SGI had a few presentations, including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://xfs.org/docs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://xfs.org/docs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://xfs.org/docs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://xfs.org/docs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who gives a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://xfs.org/docs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://xfs.org/docs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=3052</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=3052"/>
		<updated>2024-11-29T19:08:57Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: update links&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Primary XFS Documentation ===&lt;br /&gt;
&lt;br /&gt;
The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format.  The material is suitable for experienced users as well as developers and support staff.  The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-xml-dev.git;a=summary git repository] and builds of the documentation are available here:&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]&lt;br /&gt;
** [http://sites.google.com/site/kandamotohiro/xfs Japanese translation] is also available.&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]&lt;br /&gt;
&lt;br /&gt;
* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)].)&lt;br /&gt;
&lt;br /&gt;
The format of &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt; has also been documented:&lt;br /&gt;
* [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
=== Papers, Presentations, Etc ===&lt;br /&gt;
&lt;br /&gt;
At the linux.conf.au 2012 event, Dave Chinner presented a talk on filesystem metadata scalability:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS - Recent and Future Adventures in Filesystem Scalability&#039;&#039; [[http://www.youtube.com/watch?v=FegjLbCnoBw Video]] [ [[:File:Xfs-scalability-lca2012.pdf|Presentation Slides]] ]&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://xfs.org/docs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://xfs.org/docs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Fixing XFS Filesystems Faster&#039;&#039; [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September 1999, SGI had a few presentations, including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who gives a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=3045</id>
		<title>User talk:Cattelan</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=3045"/>
		<updated>2021-05-17T04:08:13Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Cattelan moved page User talk:Anshul.kundra to User talk:Cattelan: Automatically moved page while merging the account &amp;quot;Anshul.kundra&amp;quot; to &amp;quot;Cattelan&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS_IOCORE_R ==&lt;br /&gt;
&lt;br /&gt;
To Developers,&lt;br /&gt;
I have read about the new member xfs_extdelta that is passed to various XFS internal routines, e.g. xfs_bmapi. In the 2.4 versions, instead of being used, it is just passed as NULL. Can anyone provide info on where it should be initialized, and whether passing it as NULL has any adverse effect?&lt;br /&gt;
&lt;br /&gt;
XFS_IOCORE_RT is not used in the 2.6 version, so if I pass XFS_IOCORE_EXCL instead of this flag, will it be OK or will it cause any crash or adverse effects, or is there any alternative to sort out these two problems?&lt;br /&gt;
&lt;br /&gt;
Regards &lt;br /&gt;
Anshul Kundra &lt;br /&gt;
HCL TECHNOLOGIES &lt;br /&gt;
ERS&lt;br /&gt;
: Has been answered on the [http://www.spinics.net/lists/xfs/msg09007.html mailinglist] -- [[User:Ckujau|Ckujau]] 23:39, 16 February 2012 (UTC)&lt;br /&gt;
&lt;br /&gt;
== XFS File Inode number is changing using the utilities  ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To Developers,&lt;br /&gt;
&lt;br /&gt;
I have seen a different behaviour in XFS  &lt;br /&gt;
&lt;br /&gt;
Suppose I have a file with inode number &amp;quot;131&amp;quot;. I have noticed that the inode number of the file got changed without the file being deleted; every time we change the data of the file, the inode number changes. The complete description of the test is as follows.&lt;br /&gt;
&lt;br /&gt;
Steps are as follows:&lt;br /&gt;
&lt;br /&gt;
1) I have created a file using &amp;quot;dd&amp;quot; of size 100MB:&lt;br /&gt;
&lt;br /&gt;
#dd if=/dev/zero of=xfs.img bs=1M count=100&lt;br /&gt;
&lt;br /&gt;
2) Created a loopback device over the image:&lt;br /&gt;
#losetup /dev/loop1 xfs.img&lt;br /&gt;
&lt;br /&gt;
3) Created file system:&lt;br /&gt;
#mkfs.xfs /dev/loop1 &lt;br /&gt;
&lt;br /&gt;
4) Mounted:&lt;br /&gt;
#mount /dev/loop1 /mnt/xfs_mnt &lt;br /&gt;
&lt;br /&gt;
5) Please check the mount output:&lt;br /&gt;
# mount&lt;br /&gt;
&lt;br /&gt;
/dev/sdb2 on / type ext3 (rw,acl,user_xattr)&lt;br /&gt;
proc on /proc type proc (rw)&lt;br /&gt;
sysfs on /sys type sysfs (rw)&lt;br /&gt;
debugfs on /sys/kernel/debug type debugfs (rw)&lt;br /&gt;
devtmpfs on /dev type devtmpfs (rw,mode=0755)&lt;br /&gt;
tmpfs on /dev/shm type tmpfs (rw,mode=1777)&lt;br /&gt;
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)&lt;br /&gt;
fusectl on /sys/fs/fuse/connections type fusectl (rw)&lt;br /&gt;
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)&lt;br /&gt;
/dev/loop0 on /mnt/mount_test type xfs (rw)&lt;br /&gt;
/dev/loop1 on /mnt/xfs_mnt type xfs (rw)&lt;br /&gt;
&lt;br /&gt;
6) Created a file using &amp;quot;touch&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# touch kundra.txt&lt;br /&gt;
&lt;br /&gt;
7) Checking the file and its inode number:&lt;br /&gt;
&lt;br /&gt;
# ls -li &lt;br /&gt;
total 0&lt;br /&gt;
131 -rw-r--r-- 1 root root 0 2012-10-20 01:41 kundra.txt&lt;br /&gt;
&lt;br /&gt;
8) I have written some data using the vim editor; I can&#039;t provide a snapshot of vim on the list:&lt;br /&gt;
&lt;br /&gt;
#vim kundra.txt&lt;br /&gt;
&lt;br /&gt;
9) Now I checked the inode number using the &amp;quot;ls -li&amp;quot;&lt;br /&gt;
# ls -li&lt;br /&gt;
&lt;br /&gt;
total 4&lt;br /&gt;
133 -rw-r--r-- 1 root root 19 2012-10-20 01:43 kundra.txt&lt;br /&gt;
&lt;br /&gt;
Please check that the inode number (from &amp;quot;131&amp;quot; to &amp;quot;133&amp;quot;) and the total value (from &amp;quot;0&amp;quot; to &amp;quot;4&amp;quot;) in the filesystem got changed. I am assuming that the reason may be the small size of the filesystem, but it is showing unexpected behaviour.&lt;br /&gt;
&lt;br /&gt;
Please provide some explanation of this issue; I am working on Linux SLES.&lt;br /&gt;
&lt;br /&gt;
# cat /etc/issue&lt;br /&gt;
&lt;br /&gt;
Welcome to SUSE Linux Enterprise Server 11 SP1  (x86_64) - Kernel \r (\l).&lt;br /&gt;
&lt;br /&gt;
# uname -a &lt;br /&gt;
Linux linux-sles 2.6.32.19-0.6-default #1 SMP Fri Aug 31 01:37:50 IST 2012 x86_64 x86_64 x86_64 GNU/Linux&lt;br /&gt;
&lt;br /&gt;
Thanks &amp;amp; Best Regards  &lt;br /&gt;
Anshul Kundra&lt;br /&gt;
: Anshul, as I have suggested [[User:Anshul.kundra|earlier]]: please ask questions on the [[XFS_email_list_and_archives|mailing lists]]. Also, your question [https://encrypted.google.com/search?hl=en&amp;amp;q=vi%20inode%20change has been answered many times] already. -- [[User:Ckujau|Ckujau]] ([[User talk:Ckujau|talk]]) 19:07, 19 October 2012 (UTC)&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2993</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2993"/>
		<updated>2016-08-10T21:25:47Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* SSD disks or rotational disks but with hardware raid card that has cache enabled */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
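&lt;br /&gt;
For example, as a minimal sketch (the device /dev/sdb1 and mount point /home are placeholders), user and group quota accounting can be enabled at mount time and inspected with &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 # mount -o uquota,gquota /dev/sdb1 /home&lt;br /&gt;
 # xfs_quota -x -c &#039;report -h&#039; /home&lt;br /&gt;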
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
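&lt;br /&gt;
As an illustrative sketch (the device, directory, project name &amp;quot;myproj&amp;quot;, project ID 42 and the 10g limit below are placeholders), a directory tree quota is typically set up by mounting with prjquota, describing the project in /etc/projects and /etc/projid, and then initialising and limiting it with &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
 # mount -o prjquota /dev/sdb1 /srv&lt;br /&gt;
 # echo &amp;quot;42:/srv/data&amp;quot; &amp;gt;&amp;gt; /etc/projects&lt;br /&gt;
 # echo &amp;quot;myproj:42&amp;quot; &amp;gt;&amp;gt; /etc/projid&lt;br /&gt;
 # xfs_quota -x -c &#039;project -s myproj&#039; /srv&lt;br /&gt;
 # xfs_quota -x -c &#039;limit -p bhard=10g myproj&#039; /srv&lt;br /&gt;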
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: If a fs mounted with prjquota (project quota) is unmounted and mounted again with grpquota (group quota), are the prjquota limits previously set on the fs removed (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
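&lt;br /&gt;
As a rough sketch (the device names and the /sysroot target are placeholders), the init ramdisk would mount the root filesystem along these lines before switching to it:&lt;br /&gt;
&lt;br /&gt;
 # mount -t xfs -o logdev=/dev/sdb1 /dev/sda2 /sysroot&lt;br /&gt;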
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
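&lt;br /&gt;
For example, a minimal sketch assuming an LVM logical volume mounted at /srv (names and sizes are placeholders): grow the volume, then grow the mounted filesystem to fill it:&lt;br /&gt;
&lt;br /&gt;
 # lvextend -L +50G /dev/vg0/srv&lt;br /&gt;
 # xfs_growfs /srv&lt;br /&gt;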
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capturing the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
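&lt;br /&gt;
For example, mounting with the filesystem type given explicitly (the device and mount point here are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # mount -t xfs /dev/hda5 /mnt/data&lt;br /&gt;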
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
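&lt;br /&gt;
As a brief sketch (the dump file location and the /home filesystem are placeholders), a level 0 dump and the corresponding restore could look like this; rsync needs -A/-X to carry ACLs and extended attributes:&lt;br /&gt;
&lt;br /&gt;
 # xfsdump -l 0 -f /backup/home.xfsdump /home&lt;br /&gt;
 # xfsrestore -f /backup/home.xfsdump /home&lt;br /&gt;
 # rsync -aAX /home/ /backup/home/&lt;br /&gt;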
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
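&lt;br /&gt;
For example (assuming the affected filesystem is /dev/sdb1 mounted at /data; adjust for your setup):&lt;br /&gt;
&lt;br /&gt;
 # umount /data&lt;br /&gt;
 # xfs_repair /dev/sdb1&lt;br /&gt;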
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, leaving them enabled will be harmful to performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that, after a powerfail with RAID-1 when only parts of the disk cache have been written, the controller doesn&#039;t even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither controller cache nor disk cache is safe, so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is also disabled.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don&#039;t work any more, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system, even with a RAID controller with battery backed cache and the hard disk cache turned off (which is safe on a normal host), you can destroy a database within the virtual machine (client, domU, whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here, we have the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
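&lt;br /&gt;
As an illustrative sketch (assuming the large device is /dev/sdb; adjust names to your system), a GPT label and a single full-size partition can be created with parted(8) before running mkfs:&lt;br /&gt;
&lt;br /&gt;
 # parted -s /dev/sdb mklabel gpt&lt;br /&gt;
 # parted -s /dev/sdb mkpart primary 0% 100%&lt;br /&gt;
 # mkfs.xfs /dev/sdb1&lt;br /&gt;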
&lt;br /&gt;
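For illustration, a GPT label can be created with parted before making the filesystem (a minimal sketch only; /dev/sdX is a placeholder device name):&lt;br /&gt;
&lt;br /&gt;
 # parted -s /dev/sdX mklabel gpt&lt;br /&gt;
 # parted -s /dev/sdX mkpart primary xfs 0% 100%&lt;br /&gt;
 # mkfs.xfs /dev/sdX1&lt;br /&gt;
&lt;br /&gt;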
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to the lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present until kernel v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controller&#039;s stripe size in BYTES (or KiBytes when suffixed with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
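For instance, the 64KB stripe unit RAID-6 case above could be expressed at mkfs time like this (the device name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -d su=64k,sw=6 /dev/sdX&lt;br /&gt;
&lt;br /&gt;
The equivalent in 512B sectors would be sunit=128,swidth=768 (64KB = 128 sectors, 6 * 128 = 768).&lt;br /&gt;
&lt;br /&gt;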
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth input as being specified in units of 512B sectors, but unfortunately report them in multiples of your basic block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
For example, assume swidth=1024 is specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and the block size is 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth=128, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware raid, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32 bits of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, and using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
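As an illustration only (the path and client network are hypothetical), an explicit fsid can be set in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; like this:&lt;br /&gt;
&lt;br /&gt;
 /export/subdir  192.168.1.0/24(rw,fsid=1234)&lt;br /&gt;
&lt;br /&gt;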
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruptions, so distributions of that vintage appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
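The directory block size itself is set at mkfs time, for example (the device name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -n size=64k /dev/sdX&lt;br /&gt;
&lt;br /&gt;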
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
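For example, a larger log buffer size can be requested at mount time (a sketch only; device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # mount -o logbsize=256k /dev/sdX /storage&lt;br /&gt;
&lt;br /&gt;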
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with inode64. Remounting it with&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work OK again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
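A hypothetical fstab line using the same placeholder device and mount point would look like:&lt;br /&gt;
&lt;br /&gt;
 /dev/diskpart  /mnt/xfs  xfs  defaults,inode64  0 0&lt;br /&gt;
&lt;br /&gt;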
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
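To check the fragmentation factor and the layout of an individual file yourself (device and file names are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # xfs_db -r -c frag /dev/sdX&lt;br /&gt;
 # xfs_bmap -v /path/to/file&lt;br /&gt;
&lt;br /&gt;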
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel, if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the superblock as corruption, and emit the&lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero data is found on a Version 4 superblock, it will not be flagged as corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
&lt;br /&gt;
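For example, to shorten the scan interval to 60 seconds (illustration only; the default is 300):&lt;br /&gt;
&lt;br /&gt;
 # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
&lt;br /&gt;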
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation cannot be disabled, but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set, and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
&lt;br /&gt;
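For example, a fixed 64k preallocation size can be set at mount time (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # mount -o allocsize=64k /dev/sdX /storage&lt;br /&gt;
&lt;br /&gt;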
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and does not get a chance to apply these to disk (due to lockup, crash, hard reset etc). XFS will try to reapply these at mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
With a big log to be reapplied, that process can take a very long time (minutes or even hours). A similar problem can happen with unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk,&lt;br /&gt;
              it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: a good option that does not have this problem&lt;br /&gt;
&lt;br /&gt;
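The scheduler can be changed at runtime through sysfs, for example (sda is a placeholder for your disk):&lt;br /&gt;
&lt;br /&gt;
 # echo deadline &amp;gt; /sys/block/sda/queue/scheduler&lt;br /&gt;
&lt;br /&gt;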
Note that some kernels have block multiqueue (blk-mq) enabled, which (currently - 08/2016) does not support I/O schedulers at all, so there is no optimisation or reordering of IO for best seek order; disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter for the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware raid controller can be smart enough to cache and reorder I/O requests, so an additional layer of reordering (like a Linux I/O scheduler) can potentially conflict with it and make performance worse. If you have such a raid card, then try the method described below.&lt;br /&gt;
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Block multiqueue&#039;&#039; enabled (and thus no I/O scheduler at all), or block multiqueue disabled and the &#039;&#039;noop&#039;&#039; or &#039;&#039;deadline&#039;&#039; I/O scheduler activated, is a good solution. SSD disks do not really need I/O schedulers, while smart raid cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your raid is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests and thus could benefit from an I/O scheduler.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2992</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2992"/>
		<updated>2016-08-10T21:24:04Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* On rotational disks without hardware raid */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
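For example (device and mount point are placeholders), user and project quota accounting can be enabled at mount time:&lt;br /&gt;
&lt;br /&gt;
 # mount -o uquota,pquota /dev/sdX /mnt&lt;br /&gt;
&lt;br /&gt;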
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
The project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove prjquota limits previously set on the fs (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture - 4k for i386, ppc, ... 8k for alpha, sparc, ... - are possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make a XFS filesystem smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
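For example, after enlarging the underlying device or logical volume, growing the mounted filesystem is a single command (the mount point is a placeholder):&lt;br /&gt;
&lt;br /&gt;
 # xfs_growfs /mnt/xfs&lt;br /&gt;
&lt;br /&gt;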
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
 # iostat -x -d -m 5&lt;br /&gt;
 # vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can backup a XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to backup ACLs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039; or [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0) to backup ACLs and EAs. &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
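A minimal sketch of a level 0 dump and restore (the dump file and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
 # xfsdump -l 0 -f /backup/root.dump /mnt/xfs&lt;br /&gt;
 # xfsrestore -f /backup/root.dump /mnt/xfs&lt;br /&gt;
&lt;br /&gt;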
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but let the hard disk write cache on. That can lead to the bad situation that after a powerfail with RAID-1 when only parts of the disk cache have been written, the controller doesn&#039;t even see that the disks are out of sync, as the disks can resort cached blocks and might have saved the superblock info, but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns disk writes off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither controller cache nor disk cache is safe so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default so you can let it &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot;. So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly as also the controller write cache is disabled.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers don&#039;t work any more, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system - even with a RAID controller with battery backed cache and the hard disk cache turned off (which is safe on a normal host) - you can destroy a database within the virtual machine (client, domU or whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
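&lt;br /&gt;
As an illustration only (the device name is an example), a GPT label for a large disk can be created with &#039;&#039;&#039;parted(8)&#039;&#039;&#039; before making the filesystem:&lt;br /&gt;
&lt;br /&gt;
  # parted -s /dev/sdX mklabel gpt&lt;br /&gt;
  # parted -s /dev/sdX mkpart primary 0% 100%&lt;br /&gt;
  # mkfs.xfs /dev/sdX1&lt;br /&gt;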
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed in kernel v3.17 by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt;&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
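&lt;br /&gt;
For example (the mount point is only an illustration), on an affected kernel the workaround sequence would look like this:&lt;br /&gt;
&lt;br /&gt;
  # xfs_growfs /mnt/bigfs&lt;br /&gt;
  # mount -o remount,inode64 /mnt/bigfs&lt;br /&gt;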
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
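&lt;br /&gt;
Afterwards, run repair on the (unmounted) filesystem so it can finish the cleanup; for example:&lt;br /&gt;
&lt;br /&gt;
  # umount /dev/hdXX&lt;br /&gt;
  # xfs_repair /dev/hdXX&lt;br /&gt;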
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows the filesystem to be optimized for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
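&lt;br /&gt;
For the RAID-6 example above, the two equivalent &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; invocations would be (the device name is only an example; 64KB is 128 sectors of 512B, and 6 * 128 = 768):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -d su=64k,sw=6 /dev/sdX&lt;br /&gt;
  # mkfs.xfs -d sunit=128,swidth=768 /dev/sdX&lt;br /&gt;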
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; accept sunit and swidth in units of 512B sectors, but unfortunately that is not the unit they are reported in:&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example a swidth of 1024 specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and a block size of 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth 128, because 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating the XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there is no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruption, so those appear to be recent enough.&lt;br /&gt;
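&lt;br /&gt;
A typical &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry using the option might look like this (device and mount point are examples only):&lt;br /&gt;
&lt;br /&gt;
  /dev/sdb1   /data   xfs   defaults,inode64   0 0&lt;br /&gt;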
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again. For example, you can no longer access files and directories that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like this:&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
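&lt;br /&gt;
For reference, this is how the numbers can be obtained (the device, file name and reported figures below are only illustrative):&lt;br /&gt;
&lt;br /&gt;
  # xfs_db -r -c frag /dev/sdXXX&lt;br /&gt;
  actual 1544, ideal 1064, fragmentation factor 31.09%&lt;br /&gt;
  # xfs_bmap -v /path/to/file&lt;br /&gt;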
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
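&lt;br /&gt;
For example (the device name is only an example), check the installed version and then repair the unmounted filesystem:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -V&lt;br /&gt;
  # umount /dev/sdX1&lt;br /&gt;
  # xfs_repair /dev/sdX1&lt;br /&gt;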
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
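&lt;br /&gt;
As a quick way to see the effect (the file name is only an example), compare the apparent file size with the allocated blocks:&lt;br /&gt;
&lt;br /&gt;
  # stat -c &#039;size=%s bytes, allocated=%b blocks&#039; /data/somefile&lt;br /&gt;
  # du -k /data/somefile&lt;br /&gt;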
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
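&lt;br /&gt;
For example, to check the current value and lower the scan interval to 60 seconds:&lt;br /&gt;
&lt;br /&gt;
  # cat /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
  300&lt;br /&gt;
  # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;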
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
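&lt;br /&gt;
For example (device and mount point are examples only), either on the command line or via the corresponding option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
  # mount -o allocsize=64k /dev/sdb1 /data&lt;br /&gt;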
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and does not get a chance to apply them to disk (due to a lockup, crash, hard reset, etc.). XFS will try to replay these entries at the next mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
With a big log to replay, that process can take a very long time (minutes or even hours). A similar problem can happen with an unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
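&lt;br /&gt;
To confirm that a slow mount is caused by log recovery, check dmesg. If you only need read-only access to the data without replaying the log (the filesystem may then appear inconsistent), XFS also supports the &amp;lt;tt&amp;gt;norecovery&amp;lt;/tt&amp;gt; mount option together with &amp;lt;tt&amp;gt;ro&amp;lt;/tt&amp;gt;; the device and mount point below are examples:&lt;br /&gt;
&lt;br /&gt;
  # dmesg | grep -i recovery&lt;br /&gt;
  # mount -o ro,norecovery /dev/sdb1 /data&lt;br /&gt;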
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk,&lt;br /&gt;
              it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: good option, doesn&#039;t have such problem&lt;br /&gt;
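&lt;br /&gt;
The scheduler can be checked and changed at runtime through sysfs (the device name is only an example; for a permanent change use the &amp;lt;tt&amp;gt;elevator=&amp;lt;/tt&amp;gt; kernel boot parameter or a udev rule):&lt;br /&gt;
&lt;br /&gt;
  # cat /sys/block/sda/queue/scheduler&lt;br /&gt;
  noop deadline [cfq]&lt;br /&gt;
  # echo deadline &amp;gt; /sys/block/sda/queue/scheduler&lt;br /&gt;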
&lt;br /&gt;
Note that some kernels have block multiqueue (blk-mq) enabled, which (currently - 08/2016) does not support I/O schedulers at all, so there is no optimisation or reordering of IO for best seek order; disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter for the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware RAID controller can be smart enough to cache and reorder I/O requests itself, so an additional layer of reordering&lt;br /&gt;
(like the Linux I/O scheduler) can potentially conflict with it and make performance worse. If you have such a RAID card,&lt;br /&gt;
try the method described below.&lt;br /&gt;
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
Either block multiqueue enabled (and thus no I/O scheduler at all), or block multiqueue disabled with the noop or deadline I/O scheduler activated, is a good solution. SSD disks do not really need I/O schedulers, while smart RAID cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your RAID is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests itself and thus could benefit from an I/O scheduler.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2991</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2991"/>
		<updated>2016-08-10T21:22:55Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: trying to explain which I/O scheduler is good choice for xfs. Based on dchinner knowledge (hopefully translated without errors into FAQ entry).&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
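&lt;br /&gt;
A minimal example (device and mount point are assumptions) of enabling user and group quota at mount time and checking usage with &#039;&#039;&#039;xfs_quota&#039;&#039;&#039;:&lt;br /&gt;
&lt;br /&gt;
  # mount -o uquota,gquota /dev/sdb1 /data&lt;br /&gt;
  # xfs_quota -x -c &#039;report -h&#039; /data&lt;br /&gt;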
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
The project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
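&lt;br /&gt;
A rough sketch of setting up a directory tree quota (device, paths, project id and limit are made-up examples):&lt;br /&gt;
&lt;br /&gt;
  # mount -o prjquota /dev/sdb1 /data&lt;br /&gt;
  # xfs_quota -x -c &#039;project -s -p /data/projects/web 42&#039; /data&lt;br /&gt;
  # xfs_quota -x -c &#039;limit -p bhard=10g 42&#039; /data&lt;br /&gt;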
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the fs (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
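&lt;br /&gt;
For example with LVM (volume group, logical volume and mount point names are examples only), growing is just:&lt;br /&gt;
&lt;br /&gt;
  # lvextend -L +100G /dev/vgdata/lvdata&lt;br /&gt;
  # xfs_growfs /mnt/data&lt;br /&gt;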
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capturing the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; or standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
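&lt;br /&gt;
A simple level 0 dump to tape and the matching restore might look like this (tape device, labels and mount point are examples only):&lt;br /&gt;
&lt;br /&gt;
  # xfsdump -l 0 -L session1 -M tape1 -f /dev/st0 /data&lt;br /&gt;
  # xfsrestore -f /dev/st0 /data&lt;br /&gt;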
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure the filesystem stays intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the following 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that, after a power failure with RAID-1 where only parts of the disk caches have been written, the controller does not even see that the disks are out of sync: the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns disk writes off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither controller cache nor disk cache is safe so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot;. So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly as also the controller write cache is disabled.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk&lt;br /&gt;
writes in a way that even barriers do not work any more, which means even&lt;br /&gt;
an fsync is not reliable. Tests confirm that by unplugging the power from&lt;br /&gt;
such a system - even with a RAID controller with battery backed cache and&lt;br /&gt;
the hard disk cache turned off (which is safe on a normal host) - you can&lt;br /&gt;
destroy a database within the virtual machine (client, domU, whatever you&lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk. For the other products this information is still missing.&lt;br /&gt;
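&lt;br /&gt;
For example, a minimal qemu command line sketch (disk.img is a placeholder image; recent qemu versions spell the option cache=none rather than cache=off):&lt;br /&gt;
&lt;br /&gt;
  qemu-system-x86_64 -drive file=disk.img,format=raw,cache=none&lt;br /&gt;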
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
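&lt;br /&gt;
As a rough sketch (with /dev/sdX as a placeholder device), creating a GPT label and an XFS filesystem on the first partition might look like:&lt;br /&gt;
&lt;br /&gt;
  # parted /dev/sdX mklabel gpt&lt;br /&gt;
  # parted /dev/sdX mkpart primary 0% 100%&lt;br /&gt;
  # mkfs.xfs /dev/sdX1&lt;br /&gt;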
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to the lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also added a bug present from kernel v3.7 to v3.17 which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
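&lt;br /&gt;
For example (a sketch, with /mount/point standing in for the actual mount point):&lt;br /&gt;
&lt;br /&gt;
  # xfs_growfs /mount/point&lt;br /&gt;
  # mount -o remount,inode64 /mount/point&lt;br /&gt;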
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
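&lt;br /&gt;
A subsequent repair run on the same (placeholder) device then completes the cleanup:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair /dev/hdXX&lt;br /&gt;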
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
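&lt;br /&gt;
Taking the RAID-10 example above (256KB stripe size, 8 data disks), equivalent mkfs.xfs invocations would look roughly like this (/dev/sdX is a placeholder device; 256KB equals 512 sectors of 512B, and 8 * 512 = 4096 sectors):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -d su=256k,sw=8 /dev/sdX&lt;br /&gt;
  # mkfs.xfs -d sunit=512,swidth=4096 /dev/sdX&lt;br /&gt;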
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating the XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of an inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32 bits of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, and using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a 100TB disk, all inodes will be stuck in the first TB. This can lead to strange symptoms such as &amp;quot;disk full&amp;quot; while you still have plenty of free space, simply because there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the same region as their data, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inode numbers, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) over NFS and Samba without any corruption, so such distributions appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you later mount without inode64 again; for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that change metadata performance considerably are &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt;. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
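&lt;br /&gt;
A sketch of such a mount invocation (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o logbsize=256k /dev/sdX /storage&lt;br /&gt;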
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16Tb, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
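&lt;br /&gt;
For example, to allow repair to use up to 4GB of RAM on the example device above (a sketch; you can also simply omit -m and let xfs_repair take what it needs):&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -m 4096 /dev/vda&lt;br /&gt;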
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
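&lt;br /&gt;
An example fstab entry (a sketch, using the same placeholder device and mount point as above):&lt;br /&gt;
&lt;br /&gt;
  /dev/diskpart  /mnt/xfs  xfs  inode64  0 0&lt;br /&gt;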
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%; 4 extents per file would give you 75%.  This may or may not be a problem, depending especially on the size of the files in question (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
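&lt;br /&gt;
For example (a sketch; device and file names are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # xfs_db -r -c frag /dev/sdX&lt;br /&gt;
  # xfs_bmap -v /path/to/some/file&lt;br /&gt;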
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
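&lt;br /&gt;
For example, to check the current value and shorten the interval to 60 seconds (a sketch; the default of 300 corresponds to the 5 minute interval mentioned above):&lt;br /&gt;
&lt;br /&gt;
  # cat /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
  300&lt;br /&gt;
  # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;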
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
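&lt;br /&gt;
For example, reverting to a fixed 64k preallocation at mount time (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o allocsize=64k /dev/sdX /data&lt;br /&gt;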
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and does not get a chance to apply them to disk (due to a lockup, crash, hard reset etc.). XFS will then try to replay these entries at mount time (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
Replaying a big log can take a very long time (minutes or even hours). A similar problem can happen at unmount, which may take hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
&amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
&amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
&amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk, it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: a good option; it does not have this problem.&lt;br /&gt;
&lt;br /&gt;
Note that some kernels have block multiqueue enabled, which (as of 08/2016) does not support I/O schedulers at all, so there is no optimisation or reordering of IO for best seek order; therefore disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter of the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware RAID can be smart enough to cache and reorder I/O requests itself, so an additional layer of reordering (like the Linux I/O scheduler) can potentially conflict with it and make performance worse. If you have such a RAID card, try the method described below.&lt;br /&gt;
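&lt;br /&gt;
On a legacy (non-blk-mq) setup the scheduler for a given disk can be checked and changed via sysfs, for example (sda is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # cat /sys/block/sda/queue/scheduler&lt;br /&gt;
  noop deadline [cfq]&lt;br /&gt;
  # echo deadline &amp;gt; /sys/block/sda/queue/scheduler&lt;br /&gt;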
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
Either block multiqueue enabled (and thus no I/O scheduler at all), or block multiqueue disabled with the noop or deadline I/O scheduler activated, is a good solution. SSDs do not really need an I/O scheduler, and smart RAID cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your RAID is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests itself and thus could benefit from an I/O scheduler.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Terms_of_Service&amp;diff=2985</id>
		<title>Xfs.org:Terms of Service</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Terms_of_Service&amp;diff=2985"/>
		<updated>2015-12-22T02:26:28Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Created page with &amp;quot;Please be kind to all who enter. Share openly. Don&amp;#039;t spam anybody.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Please be kind to all who enter.&lt;br /&gt;
Share openly.&lt;br /&gt;
Don&#039;t spam anybody.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2983</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2983"/>
		<updated>2015-12-13T08:32:48Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Other link also fixed&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
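&lt;br /&gt;
For example, user and group quota accounting can be enabled at mount time and inspected with xfs_quota (a sketch; /dev/sdX and /home are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o uquota,gquota /dev/sdX /home&lt;br /&gt;
  # xfs_quota -x -c &#039;report -h&#039; /home&lt;br /&gt;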
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the filesystem (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after the partition to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the filesystem larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
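&lt;br /&gt;
For example, with LVM the sequence is roughly (a sketch; vg0/data and /mount/point are hypothetical names):&lt;br /&gt;
&lt;br /&gt;
  # lvextend -L +100G /dev/vg0/data&lt;br /&gt;
  # xfs_growfs /mount/point&lt;br /&gt;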
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration need to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem and demonstrate the bad behaviour that is occurring. If it is a performance problem, then 30 second to 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage carries a very high risk of major data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache that is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and does not reset randomly like some common ones do.  But take care that the individual hard disk write caches are turned off.&lt;br /&gt;
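&lt;br /&gt;
For example (a sketch with placeholder device and mount point):&lt;br /&gt;
&lt;br /&gt;
  # mount -o nobarrier /dev/sdX /data&lt;br /&gt;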
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that after a powerfail with RAID-1, when only parts of the disk cache have been written, the controller doesn&#039;t even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns disk writes off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither controller cache nor disk cache is safe so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot;. So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk &lt;br /&gt;
writes in a way that even barriers don&#039;t work any more, which means even &lt;br /&gt;
an fsync is not reliable. Tests confirm that unplugging the power from &lt;br /&gt;
such a system, even with a RAID controller with battery backed cache and &lt;br /&gt;
the hard disk cache turned off (which is safe on a normal host), can &lt;br /&gt;
destroy a database within the virtual machine (guest, domU, whatever you &lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual &lt;br /&gt;
disk. For other products this information is missing.&lt;br /&gt;
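&lt;br /&gt;
As an illustration, a qemu command line with host caching disabled for the guest disk (the image name is a placeholder; newer qemu versions spell this option cache=none or cache=directsync rather than cache=off):&lt;br /&gt;
&lt;br /&gt;
  # qemu-system-x86_64 -m 1024 -drive file=guest.img,if=virtio,cache=none&lt;br /&gt;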
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here, we have the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
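&lt;br /&gt;
For example, one way to put a GPT label and a single large partition on a disk with parted (the device name is a placeholder; this destroys any existing partition table):&lt;br /&gt;
&lt;br /&gt;
  # parted -s /dev/sdX mklabel gpt&lt;br /&gt;
  # parted -s /dev/sdX mkpart primary 0% 100%&lt;br /&gt;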
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also added a bug present from kernel v3.7 to v3.17 which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
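&lt;br /&gt;
For example, on an affected kernel the sequence would look like this (the mount point is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # xfs_growfs /mnt/data&lt;br /&gt;
  # mount -o remount,inode64 /mnt/data&lt;br /&gt;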
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
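&lt;br /&gt;
For example, a mkfs.xfs invocation matching the RAID-6 case above (the device name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -d su=64k,sw=6 /dev/sdX&lt;br /&gt;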
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
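&lt;br /&gt;
For illustration, a mount invocation using these options (device and mount point are placeholders; the values are only examples, and on kernels where delaylog is already the default the option does not need to be given):&lt;br /&gt;
&lt;br /&gt;
  # mount -o logbsize=256k,delaylog /dev/sdX /mnt&lt;br /&gt;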
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16Tb, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with inode64. Remounting with&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. 400GB files in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
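&lt;br /&gt;
For example, to obtain the fragmentation factor (the device name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # xfs_db -r -c frag /dev/sdX&lt;br /&gt;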
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
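&lt;br /&gt;
For example, to shorten the interval to 60 seconds (the value is only illustrative):&lt;br /&gt;
&lt;br /&gt;
  # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;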
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
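&lt;br /&gt;
For example (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o allocsize=64k /dev/sdX /mnt&lt;br /&gt;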
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and doesn&#039;t get a chance to apply these to disk (due to lockup, crash, hard reset etc). XFS will try to reapply these at mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
That process of reapplying a big log can take a very long time (minutes or even hours). A similar problem can happen with unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2982</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2982"/>
		<updated>2015-12-13T08:18:18Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Fix url&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
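&lt;br /&gt;
For example, enabling user and group quota at mount time (device and mount point are placeholders; see mount(8) for the full list of XFS quota options):&lt;br /&gt;
&lt;br /&gt;
  # mount -o uquota,gquota /dev/sdX /home&lt;br /&gt;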
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
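&lt;br /&gt;
For illustration, the mount performed from within the init ramdisk would look roughly like this (device names and the temporary root mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -t xfs -o logdev=/dev/sdc1 /dev/sdb1 /sysroot&lt;br /&gt;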
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
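&lt;br /&gt;
For example, to check the extent allocation of an affected file (the path is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # xfs_bmap -v /path/to/file&lt;br /&gt;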
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure the filesystem stays intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and not resetting randomly like some common ones do.  But take care about the hard disk write cache, which should be off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write caches on. That can lead to the bad situation where, after a power failure with RAID-1 in which only parts of the disk caches were written out, the controller doesn&#039;t even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the write cache of individual drives:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so you apparently don&#039;t care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So that means you can only set the drive caches and the controller caches together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system, even with a RAID controller with battery backed cache and the hard disk caches turned off (which is safe on a normal host), can destroy a database within the virtual machine (client, domU, whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off (spelled cache=none in newer versions) on the -drive option specifying the virtual disk. For other products this information is missing.&lt;br /&gt;
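&lt;br /&gt;
As a purely illustrative sketch (the disk image name is made up; newer qemu versions spell the option cache=none):&lt;br /&gt;
&lt;br /&gt;
 # qemu-system-x86_64 -m 1024 -drive file=guest.img,if=virtio,cache=none&lt;br /&gt;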
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug (present until v3.17) which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs options and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
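&lt;br /&gt;
For example, for the RAID-6 case above the two spellings are equivalent (the device name is illustrative):&lt;br /&gt;
&lt;br /&gt;
  mkfs.xfs -d su=64k,sw=6 /dev/sdX&lt;br /&gt;
  mkfs.xfs -d sunit=128,swidth=768 /dev/sdX&lt;br /&gt;
&lt;br /&gt;
(64KB is 128 sectors of 512B, and swidth = sunit * sw = 128 * 6 = 768 sectors.)&lt;br /&gt;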
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors, but unfortunately report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
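&lt;br /&gt;
As a hedged example, an &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; line using a non-default fsid type might look like this (the path, client specification and UUID are made up; the UUID would normally be the filesystem UUID as reported by blkid):&lt;br /&gt;
&lt;br /&gt;
 /export/subdir  *(rw,fsid=0b56a9e3-8c9f-4d7a-9a3e-1234567890ab)&lt;br /&gt;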
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so those seem to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
When asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
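&lt;br /&gt;
For reference, the directory block size being discussed is set at mkfs time, e.g. (the device name is illustrative):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -n size=64k /dev/sdX&lt;br /&gt;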
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only options that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
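&lt;br /&gt;
As an illustrative example only (the values shown are common choices, not a recommendation):&lt;br /&gt;
&lt;br /&gt;
 # mount -o logbsize=256k,delaylog /dev/sdX /mnt&lt;br /&gt;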
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
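&lt;br /&gt;
The scheduler for a given block device can be inspected and changed at runtime through sysfs, for example (the device name and output are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # cat /sys/block/sda/queue/scheduler&lt;br /&gt;
 noop deadline [cfq]&lt;br /&gt;
 # echo deadline &amp;gt; /sys/block/sda/queue/scheduler&lt;br /&gt;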
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
  (The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem must be mounted with inode64; remounting it with&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
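&lt;br /&gt;
For example, to shorten the interval to 60 seconds (the value is purely illustrative):&lt;br /&gt;
&lt;br /&gt;
 # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;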
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
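&lt;br /&gt;
For example (the device and mount point are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # mount -o allocsize=64k /dev/sdX /mnt&lt;br /&gt;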
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and doesn&#039;t get a chance to apply them to disk (due to a lockup, crash, hard reset, etc.). XFS will then try to reapply them at mount time (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
Replaying a big log can take a very long time (minutes or even hours). A similar problem can happen with unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2981</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2981"/>
		<updated>2015-10-23T07:33:03Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: long mount/umount&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
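&lt;br /&gt;
A hedged sketch of how such a directory tree quota is typically set up with &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039; (the paths, project name, project id and limit are made up for illustration):&lt;br /&gt;
&lt;br /&gt;
 # echo &amp;quot;42:/mnt/shared&amp;quot; &amp;gt;&amp;gt; /etc/projects&lt;br /&gt;
 # echo &amp;quot;shared:42&amp;quot; &amp;gt;&amp;gt; /etc/projid&lt;br /&gt;
 # mount -o prjquota /dev/sdX /mnt&lt;br /&gt;
 # xfs_quota -x -c &#039;project -s shared&#039; /mnt&lt;br /&gt;
 # xfs_quota -x -c &#039;limit -p bhard=10g shared&#039; /mnt&lt;br /&gt;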
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data loss on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and the cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device, flushing both the data and log devices is currently not supported (this may change in the future). If the driver tells the block layer that the device cannot flush its write cache while the cache is enabled, XFS reports that the underlying device does not support barriers. Finally, XFS performs a trial barrier write to the superblock and checks its error state afterwards, reporting a failure if one occurs.&lt;br /&gt;
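&lt;br /&gt;
To check whether barriers stayed enabled, search the kernel log for the messages above; if your storage makes barriers unnecessary (see the next question), they can be turned off with the nobarrier mount option. A minimal sketch, assuming the filesystem lives on /dev/sdXXX and is mounted at /mnt/data:&lt;br /&gt;
&lt;br /&gt;
  # dmesg | grep -i barrier&lt;br /&gt;
  # mount -o nobarrier /dev/sdXXX /mnt/data&lt;br /&gt;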
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which preserves its contents across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn barrier support off and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is reliable and does not reset randomly like some common ones do. But take care that the individual hard disk write caches are turned off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (as opposed to the onboard controllers found on many mainboards) normally have a battery- or flash-backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This backed cache ensures that if power fails or a PSU dies, the contents of the cache are written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected against a power failure and will simply lose their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types that it is hard to generalize. Typically these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a power failure on a RAID-1 where only parts of the disk caches were written out, the controller does not even notice that the disks are out of sync: the disks can reorder cached blocks, so both may have saved the superblock information but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off disables both the controller cache and the disk caches (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86)&lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the cache of individual drives:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back (write cache on), wt = write through (write cache off); so &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If a BBM (battery backup module, which you really should use if you care about your data) is present, the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;; since neither the controller cache nor the disk caches are safe in that configuration, the assumption is that you prefer speed over data safety, and that is what you get.&lt;br /&gt;
&lt;br /&gt;
That is a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the controller cache can only be set together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is then disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to also virtualize disk writes in a way that breaks barriers, which means that even an fsync is not reliable. Tests confirm that pulling the power from such a system can destroy a database inside the virtual machine (guest, domU, or whatever you call it), even when the host uses a RAID controller with a battery-backed cache and has the hard disk caches turned off, a configuration which is safe on a normal host.&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off (spelled cache=none in current versions) on the -drive option for the virtual disk. For other products this information is missing.&lt;br /&gt;
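&lt;br /&gt;
A minimal sketch of a qemu invocation with host caching of the guest disk disabled (recent qemu versions spell the option cache=none; guest.img is just a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # qemu-system-x86_64 -drive file=guest.img,format=raw,if=virtio,cache=none&lt;br /&gt;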
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified on disk, making it seem as if other kernels were affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
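&lt;br /&gt;
For reference, a minimal sketch of creating a GPT label and one large partition with parted, then putting XFS on it (assuming the disk is /dev/sdb; this is destructive, so double-check the device name first):&lt;br /&gt;
&lt;br /&gt;
  # parted -s /dev/sdb mklabel gpt&lt;br /&gt;
  # parted -s /dev/sdb mkpart primary 0% 100%&lt;br /&gt;
  # mkfs.xfs /dev/sdb1&lt;br /&gt;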
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to the lower filesystem blocks. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also added a bug present from kernel v3.7 to v3.17 which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
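&lt;br /&gt;
On such a kernel the sequence looks roughly like this (a sketch, assuming the filesystem is mounted at /mnt/data):&lt;br /&gt;
&lt;br /&gt;
  # xfs_growfs /mnt/data&lt;br /&gt;
  # mount -o remount,inode64 /mnt/data&lt;br /&gt;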
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These values can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
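These values are passed to mkfs.xfs via its -d option. A minimal sketch for the RAID-6 example above (assuming the array appears as /dev/sdXXX):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -d su=64k,sw=6 /dev/sdXXX&lt;br /&gt;
&lt;br /&gt;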
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; accept sunit and swidth in units of 512B sectors, but that is unfortunately not the unit they are reported in: both tools report these values in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume, for example, swidth=1024 specified on the mkfs.xfs command line (i.e. 1024 sectors of 512B) and a block size of 4096 (the bsize reported in the mkfs.xfs output). You should then see swidth 128 reported by mkfs.xfs, because 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, use the same sunit/swidth values as when creating the filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a 100TB disk, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; errors while you still have plenty of space free, simply because there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to the data they refer to, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions, serving both NFS and Samba, without any corruption, so distributions of that vintage appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you later mount without inode64 again: for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
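&lt;br /&gt;
For completeness, the directory block size is chosen at mkfs time via the -n option; a minimal sketch (the device name is just a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -n size=64k /dev/sdXXX&lt;br /&gt;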
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
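&lt;br /&gt;
The I/O scheduler can be changed per block device at runtime; a minimal sketch, assuming the device is sda and the deadline scheduler is built into your kernel:&lt;br /&gt;
&lt;br /&gt;
  # cat /sys/block/sda/queue/scheduler&lt;br /&gt;
  # echo deadline &amp;gt; /sys/block/sda/queue/scheduler&lt;br /&gt;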
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
This should make it work again.&lt;br /&gt;
If it does, add the option to your fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later detect this non-zero part of the superblock as corruption, and emit the&lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
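One way to see the effect is to compare the apparent file size with the allocated block count; a minimal sketch using GNU stat (the file name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # stat -c &#039;size=%s bytes, allocated=%b blocks of %B bytes&#039; somefile&lt;br /&gt;
&lt;br /&gt;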
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
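&lt;br /&gt;
For example, to shorten the interval to 60 seconds:&lt;br /&gt;
&lt;br /&gt;
  # echo 60 &amp;gt; /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;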
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
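&lt;br /&gt;
A minimal sketch of setting a fixed preallocation size at mount time (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o allocsize=64k /dev/sdXXX /mnt/data&lt;br /&gt;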
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can accumulate many entries that never got applied to disk (for example due to a lockup, crash, or hard reset). XFS will then replay them at the next mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
Replaying a big log can take a very long time (minutes or even hours). A similar problem can occur at unmount, which may take hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=2979</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=2979"/>
		<updated>2015-10-15T09:50:49Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: added xfsprogs-dev repository information&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS Released/Stable source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mainline kernels&#039;&#039;&#039;&lt;br /&gt;
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Vendor kernels&#039;&#039;&#039;&lt;br /&gt;
:All modern Linux distributions include support for XFS. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XFS userspace&#039;&#039;&#039;&lt;br /&gt;
:[ftp://oss.sgi.com/projects/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Development and bleeding edge Development &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
* [[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
=== Current XFS kernel source ===&lt;br /&gt;
&lt;br /&gt;
* [https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/ xfs]&lt;br /&gt;
&lt;br /&gt;
 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git &lt;br /&gt;
&lt;br /&gt;
Note: the old kernel tree on [http://oss.sgi.com/cgi-bin/gitweb.cgi oss.sgi.com] is no longer kept up to date with the master tree on kernel.org.&lt;br /&gt;
&lt;br /&gt;
=== XFS user space tools ===&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsprogs.git;a=summary xfsprogs]&lt;br /&gt;
&lt;br /&gt;
 git clone git://oss.sgi.com/xfs/cmds/xfsprogs&lt;br /&gt;
&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfsprogs-dev.git/ xfsprogs development version]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git&lt;br /&gt;
&lt;br /&gt;
A few packages are needed to compile &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;, depending on your package manager:&lt;br /&gt;
&lt;br /&gt;
 apt-get install libtool automake gettext libblkid-dev uuid-dev&lt;br /&gt;
 yum     install libtool automake gettext libblkid-devel libuuid-devel&lt;br /&gt;
&lt;br /&gt;
=== XFS dump ===&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsdump.git;a=summary xfsdump]&lt;br /&gt;
 $ git clone git://oss.sgi.com/xfs/cmds/xfsdump&lt;br /&gt;
&lt;br /&gt;
=== XFS tests ===&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=summary xfstests]&lt;br /&gt;
 $ git clone git://oss.sgi.com/xfs/cmds/xfstests&lt;br /&gt;
&lt;br /&gt;
=== DMAPI user space tools ===&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/dmapi.git;a=summary dmapi]&lt;br /&gt;
 $ git clone git://oss.sgi.com/xfs/cmds/dmapi&lt;br /&gt;
&lt;br /&gt;
=== git-cvsimport generated trees ===&lt;br /&gt;
&lt;br /&gt;
The Git trees are automatically mirrored copies of the CVS trees, created with [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].&lt;br /&gt;
Since git-cvsimport uses the tool [http://www.cobite.com/cvsps/ cvsps] to recreate the atomic commits of a ptools &amp;quot;mod&amp;quot;, it is easier to see the entire change that was committed when using git.&lt;br /&gt;
&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
Before building in the &amp;lt;tt&amp;gt;xfsdump&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;dmapi&amp;lt;/tt&amp;gt; directories (after building &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;), you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-dev&lt;br /&gt;
to create &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt; and install appropriate files there.&lt;br /&gt;
&lt;br /&gt;
Before building in the xfstests directory, you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-qa&lt;br /&gt;
to install a somewhat larger set of files in &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The CVS trees were created using a script that converted SGI&#039;s internal&lt;br /&gt;
ptools repository to a CVS repository, so the CVS trees were considered read-only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is managed in the git trees; the CVS trees&lt;br /&gt;
are therefore no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
* [[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Quota_check_parallelization&amp;diff=2978</id>
		<title>Quota check parallelization</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Quota_check_parallelization&amp;diff=2978"/>
		<updated>2015-06-28T00:34:04Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Created page with &amp;quot;Currently quota check done at mount time is not parallelized.  Parallelizing would make time needed for mount to finish much smaller.&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Currently quota check done at mount time is not parallelized.&lt;br /&gt;
&lt;br /&gt;
Parallelizing would make time needed for mount to finish much smaller.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Ideas_for_XFS&amp;diff=2977</id>
		<title>Ideas for XFS</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Ideas_for_XFS&amp;diff=2977"/>
		<updated>2015-06-28T00:05:29Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* Future Directions for XFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Future Directions for XFS =&lt;br /&gt;
&lt;br /&gt;
Dave Chinner&#039;s ideas from 2008:&lt;br /&gt;
&lt;br /&gt;
* [[Improving inode Caching]]&lt;br /&gt;
&lt;br /&gt;
* [[Improving Metadata Performance By Reducing Journal Overhead]]&lt;br /&gt;
&lt;br /&gt;
* [[Reliable Detection and Repair of Metadata Corruption]]&lt;br /&gt;
&lt;br /&gt;
Other ideas:&lt;br /&gt;
&lt;br /&gt;
* [[Splitting project quota support from group quota support]]&lt;br /&gt;
&lt;br /&gt;
* [[Assigning project quota to a linux container]]&lt;br /&gt;
&lt;br /&gt;
* [[Quota check parallelization]]&lt;br /&gt;
&lt;br /&gt;
* [[Support discarding of unused sectors]] (status: completed)&lt;br /&gt;
&lt;br /&gt;
* Superblock flag for when 64-bit inodes are present (see [http://oss.sgi.com/pipermail/xfs/2009-May/041379.html xfs: regarding the inode64 mount option])&lt;br /&gt;
&lt;br /&gt;
* Wishlist: Please integrate &#039;&#039;xfs_irecover&#039;&#039; or provide [http://www.who.is.free.fr/wiki/doku.php?id=recover inode recovery feature]&lt;br /&gt;
&lt;br /&gt;
* [[Host Aware SMR architecture]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=2825</id>
		<title>User talk:Cattelan</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=2825"/>
		<updated>2012-10-19T14:34:54Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* XFS File Inode number is changing using the utilities  */ new section&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS_IOCORE_R ==&lt;br /&gt;
&lt;br /&gt;
To Developers , &lt;br /&gt;
I have read about the new member named as xfs_extdelta that is passed in different xfs internal routines i.e xfs_bmapi , In the 2.4 versions instead of using it is just passed as NULL can anyone provide info regarding that where to initialize and if I pass it NULl then is there any adverse effect of it &lt;br /&gt;
&lt;br /&gt;
XFS_IOCORE_RT  not been used in 2.6 version , so if instead of this flag I will pass XFS_IOCORE_EXCL it will be ok or will cause any crash or adverse effects or either there is any alternative present to sought out from these two problems &lt;br /&gt;
&lt;br /&gt;
Regards &lt;br /&gt;
Anshul Kundra &lt;br /&gt;
HCL TECHNOLOGIES &lt;br /&gt;
ERS&lt;br /&gt;
: Has been answered on the [http://www.spinics.net/lists/xfs/msg09007.html mailinglist] -- [[User:Ckujau|Ckujau]] 23:39, 16 February 2012 (UTC)&lt;br /&gt;
&lt;br /&gt;
== XFS File Inode number is changing using the utilities  ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To Developers,&lt;br /&gt;
&lt;br /&gt;
I have seen a different behaviour in XFS  &lt;br /&gt;
&lt;br /&gt;
Suppose I have a file with inode number &amp;quot;131&amp;quot;, I have noticed that the inode number of file got changed without deleting the file. When we change the data of the file everytime it changes the inode number. The complete description over the test is as follows &lt;br /&gt;
&lt;br /&gt;
Steps are as follows:&lt;br /&gt;
&lt;br /&gt;
1) I have created a file using &amp;quot;dd&amp;quot; of size 100MB:&lt;br /&gt;
&lt;br /&gt;
#dd if=/dev/zero of=xfs.img bs=1M count=100&lt;br /&gt;
&lt;br /&gt;
2) Created a loopback device over the image:&lt;br /&gt;
#losetup /dev/loop1 xfs.img&lt;br /&gt;
&lt;br /&gt;
3) Created file system:&lt;br /&gt;
#mkfs.xfs /dev/loop1 &lt;br /&gt;
&lt;br /&gt;
4) Mounted:&lt;br /&gt;
#mount /dev/loop1 /mnt/xfs_mnt &lt;br /&gt;
&lt;br /&gt;
5) Please check the mount output:&lt;br /&gt;
# mount&lt;br /&gt;
&lt;br /&gt;
/dev/sdb2 on / type ext3 (rw,acl,user_xattr)&lt;br /&gt;
proc on /proc type proc (rw)&lt;br /&gt;
sysfs on /sys type sysfs (rw)&lt;br /&gt;
debugfs on /sys/kernel/debug type debugfs (rw)&lt;br /&gt;
devtmpfs on /dev type devtmpfs (rw,mode=0755)&lt;br /&gt;
tmpfs on /dev/shm type tmpfs (rw,mode=1777)&lt;br /&gt;
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)&lt;br /&gt;
fusectl on /sys/fs/fuse/connections type fusectl (rw)&lt;br /&gt;
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)&lt;br /&gt;
/dev/loop0 on /mnt/mount_test type xfs (rw)&lt;br /&gt;
/dev/loop1 on /mnt/xfs_mnt type xfs (rw)&lt;br /&gt;
&lt;br /&gt;
6) Created a file using &amp;quot;touch&amp;quot;&lt;br /&gt;
&lt;br /&gt;
# touch kundra.txt&lt;br /&gt;
&lt;br /&gt;
7) Checking the file and its inode number:&lt;br /&gt;
&lt;br /&gt;
# ls -li &lt;br /&gt;
total 0&lt;br /&gt;
131 -rw-r--r-- 1 root root 0 2012-10-20 01:41 kundra.txt&lt;br /&gt;
&lt;br /&gt;
8) I have written some data using the vim editor, I can&#039;t provide snapshot of vim on the list:&lt;br /&gt;
&lt;br /&gt;
#vim kundra.txt&lt;br /&gt;
&lt;br /&gt;
9) Now I checked the inode number using the &amp;quot;ls -li&amp;quot;&lt;br /&gt;
# ls -li&lt;br /&gt;
&lt;br /&gt;
total 4&lt;br /&gt;
133 -rw-r--r-- 1 root root 19 2012-10-20 01:43 kundra.txt&lt;br /&gt;
&lt;br /&gt;
Please check that the inode number ( from &amp;quot;131&amp;quot; to &amp;quot;133&amp;quot; )  and total value (from &amp;quot;0&amp;quot; to &amp;quot;4&amp;quot; )in the filesystem got changed, I am assuming that the reasom may be due to filesystem of small size but it is showing unexpected behaviour. &lt;br /&gt;
&lt;br /&gt;
Please provide some description over this issue, I am working on Linux SLES &lt;br /&gt;
&lt;br /&gt;
# cat /etc/issue&lt;br /&gt;
&lt;br /&gt;
Welcome to SUSE Linux Enterprise Server 11 SP1  (x86_64) - Kernel \r (\l).&lt;br /&gt;
&lt;br /&gt;
# uname -a &lt;br /&gt;
Linux linux-sles 2.6.32.19-0.6-default #1 SMP Fri Aug 31 01:37:50 IST 2012 x86_64 x86_64 x86_64 GNU/Linux&lt;br /&gt;
&lt;br /&gt;
Thanks &amp;amp; Best Regards  &lt;br /&gt;
Anshul Kundra&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org_talk:Anshul.kundra&amp;diff=2821</id>
		<title>Xfs.org talk:Anshul.kundra</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org_talk:Anshul.kundra&amp;diff=2821"/>
		<updated>2012-10-15T14:50:11Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Anshul.kundra moved page User:Anshul.kundra to XFS.org talk:Anshul.kundra: Query&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To Developers&lt;br /&gt;
I am using xfs utilities xfs_db, Suppose I want to corrupt a particular inode (inode no 131 ) in the file system present then will it be possible as per the manuals blocktrash in xfs_db randomly corrupt the blocks according to the type specified but if I use the xfs_db command like this &lt;br /&gt;
&lt;br /&gt;
xfs_db &amp;quot;blockget -i &amp;lt;inode number (131)&amp;gt; &amp;quot; -c &amp;quot;blocktrash -t inode &amp;quot; device name &lt;br /&gt;
&lt;br /&gt;
Then can it be possibe that the block I am corrupting is for inode with number 131. If it takes random blocks (inode ) then there will be no medium to corrupt the inode of our choice &lt;br /&gt;
&lt;br /&gt;
I have tried but can be able to corrupt specific Inode can anybody correct me if I am using a wrong way for corrupting the inode &lt;br /&gt;
&lt;br /&gt;
Thanks &amp;amp; Regards &lt;br /&gt;
Anshul Kundra&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org_talk:Anshul.kundra&amp;diff=2820</id>
		<title>Xfs.org talk:Anshul.kundra</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org_talk:Anshul.kundra&amp;diff=2820"/>
		<updated>2012-10-15T14:49:32Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: xfs_db corrupting specific inode&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To Developers&lt;br /&gt;
I am using xfs utilities xfs_db, Suppose I want to corrupt a particular inode (inode no 131 ) in the file system present then will it be possible as per the manuals blocktrash in xfs_db randomly corrupt the blocks according to the type specified but if I use the xfs_db command like this &lt;br /&gt;
&lt;br /&gt;
xfs_db &amp;quot;blockget -i &amp;lt;inode number (131)&amp;gt; &amp;quot; -c &amp;quot;blocktrash -t inode &amp;quot; device name &lt;br /&gt;
&lt;br /&gt;
Then can it be possibe that the block I am corrupting is for inode with number 131. If it takes random blocks (inode ) then there will be no medium to corrupt the inode of our choice &lt;br /&gt;
&lt;br /&gt;
I have tried but can be able to corrupt specific Inode can anybody correct me if I am using a wrong way for corrupting the inode &lt;br /&gt;
&lt;br /&gt;
Thanks &amp;amp; Regards &lt;br /&gt;
Anshul Kundra&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User:Spamuser&amp;diff=2809</id>
		<title>User:Spamuser</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User:Spamuser&amp;diff=2809"/>
		<updated>2012-07-24T22:02:19Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Cattelan moved page User:Arlene0E to User:Spamuser: Automatically moved page while merging the user &amp;quot;Arlene0E&amp;quot; to &amp;quot;Spamuser&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;My name&#039;s Elmo Salazar, I&#039;m super excited about pitching in as a writer for this site! My day job is in publishing, but in my heart I&#039;m just an internet junkie. While my company does a relatively good job keeping me off the internet at work, I&#039;m positive I can help.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Feel free to surf my web-site ... [http://mobisocial.stanford.edu/wiki/index.php?title=User:IvanJwj download firefox free]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User:WikiSysop&amp;diff=2808</id>
		<title>User:WikiSysop</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User:WikiSysop&amp;diff=2808"/>
		<updated>2012-07-24T19:17:44Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: Wiki Sysop&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Wiki Sysop&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=2386</id>
		<title>User talk:Cattelan</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User_talk:Cattelan&amp;diff=2386"/>
		<updated>2012-01-11T15:44:33Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: XFS information query&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;To Developers , &lt;br /&gt;
I have read about the new member named as xfs_extdelta that is passed in different xfs internal routines i.e xfs_bmapi , In the 2.4 versions instead of using it is just passed as NULL can anyone provide info regarding that where to initialize and if I pass it NULl then is there any adverse effect of it &lt;br /&gt;
&lt;br /&gt;
XFS_IOCORE_RT  not been used in 2.6 version , so if instead of this flag I will pass XFS_IOCORE_EXCL it will be ok or will cause any crash or adverse effects or either there is any alternative present to sought out from these two problems &lt;br /&gt;
&lt;br /&gt;
Regards &lt;br /&gt;
Anshul Kundra &lt;br /&gt;
HCL TECHNOLOGIES &lt;br /&gt;
ERS&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Community_Portal&amp;diff=2311</id>
		<title>Xfs.org:Community Portal</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Community_Portal&amp;diff=2311"/>
		<updated>2011-04-05T23:11:55Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Add some content here&lt;br /&gt;
&lt;br /&gt;
Checking for update info&lt;br /&gt;
&lt;br /&gt;
New email stuff checking&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2288</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2288"/>
		<updated>2011-03-27T12:05:40Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* Q: Why do I receive No space left on device after xfs_growfs? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward-compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
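As a minimal sketch of enabling quotas at mount time and then setting a limit with &#039;&#039;&#039;xfs_quota&#039;&#039;&#039; (device, mount point, user name and limit values are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # mount -o uquota,gquota /dev/sdb1 /data&lt;br /&gt;
 # xfs_quota -x -c &#039;limit bsoft=10g bhard=12g someuser&#039; /data&lt;br /&gt;
&lt;br /&gt;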
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
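A minimal sketch, assuming a hypothetical project named &#039;myproj&#039; with project id 42 already listed in /etc/projects and /etc/projid, and a hypothetical device and mount point:&lt;br /&gt;
&lt;br /&gt;
 # mount -o prjquota /dev/sdb1 /data&lt;br /&gt;
 # xfs_quota -x -c &#039;project -s myproj&#039; /data&lt;br /&gt;
 # xfs_quota -x -c &#039;limit -p bhard=50g myproj&#039; /data&lt;br /&gt;
&lt;br /&gt;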
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used together with group quota. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the filesystem (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
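For example, a level 0 dump to a tape drive and a restore from it might look like this (tape device and mount point are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # xfsdump -l 0 -f /dev/st0 /data&lt;br /&gt;
 # xfsrestore -f /dev/st0 /data&lt;br /&gt;
&lt;br /&gt;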
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
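Inside the init ramdisk, such a mount might look something like this sketch (device names and the /sysroot target are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # mount -o logdev=/dev/sdc1 /dev/sda2 /sysroot&lt;br /&gt;
&lt;br /&gt;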
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the filesystem larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
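For example, after enlarging the underlying device or partition, growing the data section to the largest possible size is a single command run against the mounted filesystem (mount point hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # xfs_growfs /data&lt;br /&gt;
&lt;br /&gt;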
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
Things to include are which version of XFS you are using (if it is a CVS version, the date of the checkout) and which version of the kernel. If you have problems with userland packages please report the version of the package you are using.&lt;br /&gt;
&lt;br /&gt;
If the problem relates to a particular filesystem, the output from the &#039;&#039;&#039;xfs_info(8)&#039;&#039;&#039; command and any &#039;&#039;&#039;mount(8)&#039;&#039;&#039; options in use will also be useful to the developers.&lt;br /&gt;
&lt;br /&gt;
If you experience an oops, please run it through &#039;&#039;&#039;ksymoops&#039;&#039;&#039; so that it can be interpreted.&lt;br /&gt;
&lt;br /&gt;
If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
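For example (mount point and device name are hypothetical; run &#039;&#039;&#039;xfs_metadump&#039;&#039;&#039; against the unmounted device):&lt;br /&gt;
&lt;br /&gt;
 # xfs_info /data&lt;br /&gt;
 # xfs_metadump -g /dev/sdb1 /tmp/fs.metadump&lt;br /&gt;
&lt;br /&gt;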
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS. However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions, as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
In any case, it is best to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
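For example, with the filesystem unmounted (device name hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # xfs_check /dev/sdb1&lt;br /&gt;
 # xfs_repair /dev/sdb1&lt;br /&gt;
&lt;br /&gt;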
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB currently (Jan 2009), that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage is very likely to cause a big data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and the cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it needs to be redone after every reset, as the drive resets back to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this will be reported in the log, if any of the following 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
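A quick way to check whether barriers were disabled at mount time is to search the kernel log for the messages listed above, e.g.:&lt;br /&gt;
&lt;br /&gt;
 # dmesg | grep -i &#039;disabling barriers&#039;&lt;br /&gt;
&lt;br /&gt;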
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;. But take care that the individual hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
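For example, a hypothetical /etc/fstab entry for such a setup might look like:&lt;br /&gt;
&lt;br /&gt;
 /dev/sdb1  /data  xfs  nobarrier  0  2&lt;br /&gt;
&lt;br /&gt;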
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. Even if the controller cache is battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all contents in that case.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache of their own, but leave the hard disk write caches on. That can lead to a bad situation: after a power failure with RAID-1, when only parts of the disk caches have been written, the controller does not even see that the disks are out of sync, because the disks can reorder cached blocks and might each have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86&lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back, which means write cache on; wt = write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you do not care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL -EnDskCache|DisDskCache&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly as the controller write cache is then disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system - even with a RAID controller with battery backed cache and the hard disk caches turned off (which is safe on a normal host) - you can destroy a database within the virtual machine (guest, domU, whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
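As a sketch of where that option goes on the qemu command line (image name and virtio choice are hypothetical; recent qemu versions spell the setting cache=none rather than cache=off):&lt;br /&gt;
&lt;br /&gt;
 qemu-system-x86_64 -drive file=guest.img,if=virtio,cache=none&lt;br /&gt;
&lt;br /&gt;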
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;xfs_check&#039;&#039;&#039; tool, or &#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039;, should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here, we have the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
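In kernel configuration terms, the options mentioned above are:&lt;br /&gt;
&lt;br /&gt;
 CONFIG_LBD=y&lt;br /&gt;
 CONFIG_PARTITION_ADVANCED=y&lt;br /&gt;
 CONFIG_EFI_PARTITION=y&lt;br /&gt;
&lt;br /&gt;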
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
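A subsequent repair run on the unmounted device (same placeholder device name as above) then completes the cleanup:&lt;br /&gt;
&lt;br /&gt;
 # xfs_repair /dev/hdXX&lt;br /&gt;
&lt;br /&gt;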
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support) but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
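Passed to &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt;, the values above would look something like this (device name is hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -d su=64k,sw=6 /dev/sdX&lt;br /&gt;
&lt;br /&gt;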
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware raid, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
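As a hedged sketch of one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types, an /etc/exports line assigning an explicit fsid to a subdirectory export might look like this (path, client network and fsid value are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 /data/subdir  192.168.1.0/24(rw,fsid=1234)&lt;br /&gt;
&lt;br /&gt;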
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there&#039;s no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
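A minimal sketch (device and mount point are hypothetical); the option can equally be added to the corresponding /etc/fstab entry:&lt;br /&gt;
&lt;br /&gt;
 # mount -o inode64 /dev/sdb1 /data&lt;br /&gt;
&lt;br /&gt;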
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so those might be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
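For reference, the directory block size option being discussed is set at mkfs time, e.g. (device name hypothetical):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -n size=64k /dev/sdX&lt;br /&gt;
&lt;br /&gt;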
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt;. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2287</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2287"/>
		<updated>2011-03-27T12:04:25Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* Q: Why do I receive No space left on device after xfs_growfs? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Is umounting prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) removing prjquota limits previously set from fs (and vice versa) ? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
Things to include are which version of XFS you are using (if it is a CVS version, the date of the checkout) and which version of the kernel. If you have problems with userland packages please report the version of the package you are using.&lt;br /&gt;
&lt;br /&gt;
If the problem relates to a particular filesystem, the output from the &#039;&#039;&#039;xfs_info(8)&#039;&#039;&#039; command and any &#039;&#039;&#039;mount(8)&#039;&#039;&#039; options in use will also be useful to the developers.&lt;br /&gt;
&lt;br /&gt;
If you experience an oops, please run it through &#039;&#039;&#039;ksymoops&#039;&#039;&#039; so that it can be interpreted.&lt;br /&gt;
&lt;br /&gt;
If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS. However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions, as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
In any case, it is best to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB currently (Jan 2009), that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage is very likely to cause a big data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and the cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it has to be redone after every reset, because the drive reverts to its default of write cache enabled. A reset can happen after a reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;. But take care about the hard disk write cache, which should be off.&lt;br /&gt;
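&lt;br /&gt;
As an illustration (device name and mount point are placeholders), barriers can be disabled on the command line or via /etc/fstab:&lt;br /&gt;
&lt;br /&gt;
  mount -o nobarrier /dev/sdXXX /srv/data&lt;br /&gt;
&lt;br /&gt;
  /dev/sdXXX  /srv/data  xfs  nobarrier  0 0&lt;br /&gt;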
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. Even if it&#039;s battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents in that case.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that, after a powerfail with RAID-1 where only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks and might have saved the superblock info while losing different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86&lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither the controller cache nor the disk cache is safe, so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL -EnDskCache|DisDskCache&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system can destroy a database within the virtual machine (client, domU, whatever you call it), even with a RAID controller with battery backed cache and the hard disk caches turned off - a configuration that is safe on a normal host.&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk. For other products this information is missing.&lt;br /&gt;
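&lt;br /&gt;
As a rough illustration only (image name and other options are placeholders; newer qemu versions spell the same setting cache=none), the disk definition would look something like:&lt;br /&gt;
&lt;br /&gt;
  qemu-system-x86_64 -m 1024 -drive file=guest.img,if=virtio,cache=none&lt;br /&gt;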
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;xfs_check&#039;&#039;&#039; tool, or &#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039;, should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
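&lt;br /&gt;
As a sketch (device name is a placeholder), one way to put a GPT label and a single large partition on a disk is with parted(8), after which mkfs.xfs can be run as usual:&lt;br /&gt;
&lt;br /&gt;
  parted /dev/sdXXX mklabel gpt&lt;br /&gt;
  parted /dev/sdXXX mkpart primary xfs 0% 100%&lt;br /&gt;
  mkfs.xfs /dev/sdXXX1&lt;br /&gt;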
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show plenty of free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38|No space left on device on xfs filesystem with 7.7TB free]]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (and does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs (and mount) options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
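&lt;br /&gt;
For example (device name is a placeholder), the two configurations above would be passed to mkfs.xfs as:&lt;br /&gt;
&lt;br /&gt;
  mkfs.xfs -d su=64k,sw=6 /dev/sdXXX&lt;br /&gt;
  mkfs.xfs -d su=256k,sw=8 /dev/sdXXX&lt;br /&gt;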
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors, but unfortunately they report them in a different unit: multiples of the filesystem block size (bsize), not 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware raid, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
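&lt;br /&gt;
For illustration (device and mount point are placeholders), the option is given at mount time or in /etc/fstab:&lt;br /&gt;
&lt;br /&gt;
  mount -o inode64 /dev/sdXXX /srv/bigfs&lt;br /&gt;
&lt;br /&gt;
  /dev/sdXXX  /srv/bigfs  xfs  inode64  0 0&lt;br /&gt;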
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) over NFS and Samba without any corruption, so those are probably recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
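&lt;br /&gt;
For reference, the directory block size being discussed is set at filesystem creation time and cannot be changed afterwards without a new mkfs (device name is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  mkfs.xfs -n size=64k /dev/sdXXX&lt;br /&gt;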
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Administrators&amp;diff=2159</id>
		<title>Xfs.org:Administrators</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Administrators&amp;diff=2159"/>
		<updated>2011-02-16T17:56:27Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;cattelan&lt;br /&gt;
dgc&lt;br /&gt;
hch&lt;br /&gt;
sandeen&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2095</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2095"/>
		<updated>2010-08-23T14:37:12Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;* File System Structure [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html]&lt;br /&gt;
&lt;br /&gt;
* User Guide [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html]&lt;br /&gt;
&lt;br /&gt;
* XFS Labs [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html]&lt;br /&gt;
&lt;br /&gt;
* Someone managed to document &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt;: [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
The XFS team has been working on a training course aimed at developers, support staff and experienced users, that explores the internals and ondisk format of XFS.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS Overview and Internals&#039;&#039; [[http://oss.sgi.com/projects/xfs/training/index.html Index]]&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented most of the XFS ondisk format, including examples on how to traverse the structure and diagnose ondisk problems:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS Filesystem Structure&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* Fixing XFS Filesystems Faster [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September, 1999, SGI had a few presentations including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who give a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Help:Contents&amp;diff=2080</id>
		<title>Help:Contents</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Help:Contents&amp;diff=2080"/>
		<updated>2010-05-08T20:21:06Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Please add&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2054</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2054"/>
		<updated>2009-11-06T22:55:33Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [training/index.html SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [papers/xfs_filesystem_structure.doc XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
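&lt;br /&gt;
For example (device and mount point are placeholders), user and group quota accounting and enforcement can be enabled with the XFS mount options:&lt;br /&gt;
&lt;br /&gt;
  mount -o uquota,gquota /dev/sdXXX /home&lt;br /&gt;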
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
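&lt;br /&gt;
A rough sketch of setting up a directory tree quota with &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;, assuming the filesystem is mounted at /srv/fs with the prjquota mount option and using 42 as an arbitrary project id (paths and limits are placeholders):&lt;br /&gt;
&lt;br /&gt;
  mount -o prjquota /dev/sdXXX /srv/fs&lt;br /&gt;
  xfs_quota -x -c &#039;project -s -p /srv/fs/work 42&#039; /srv/fs&lt;br /&gt;
  xfs_quota -x -c &#039;limit -p bhard=10g 42&#039; /srv/fs&lt;br /&gt;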
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
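&lt;br /&gt;
For example, once the underlying device or logical volume has been enlarged and the filesystem is mounted (the mount point below is a placeholder), growing to use all available space is simply:&lt;br /&gt;
&lt;br /&gt;
  xfs_growfs /srv/data&lt;br /&gt;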
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
Things to include are which version of XFS you are using (if it is a CVS version, of what date) and which version of the kernel. If you have problems with userland packages, please report the version of the package you are using.&lt;br /&gt;
&lt;br /&gt;
If the problem relates to a particular filesystem, the output from the &#039;&#039;&#039;xfs_info(8)&#039;&#039;&#039; command and any &#039;&#039;&#039;mount(8)&#039;&#039;&#039; options in use will also be useful to the developers.&lt;br /&gt;
&lt;br /&gt;
If you experience an oops, please run it through &#039;&#039;&#039;ksymoops&#039;&#039;&#039; so that it can be interpreted.&lt;br /&gt;
&lt;br /&gt;
If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
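&lt;br /&gt;
For example (device and output file are placeholders):&lt;br /&gt;
&lt;br /&gt;
  xfs_metadump /dev/sdXXX /tmp/fs.metadump&lt;br /&gt;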
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS. However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions, as well as to most popular NAS boxes that use embedded Linux and the XFS file system.&lt;br /&gt;
&lt;br /&gt;
Anyway, the best is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
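&lt;br /&gt;
As a minimal sketch (tape device, mount point and restore directory are placeholders), a level 0 dump and a matching restore look like:&lt;br /&gt;
&lt;br /&gt;
  xfsdump -l 0 -f /dev/st0 /srv/data&lt;br /&gt;
  xfsrestore -f /dev/st0 /srv/restore&lt;br /&gt;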
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is high enough that a power outage carries a very real risk of large data losses.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it has to be redone after every reset, because the drive reverts to its default of write cache enabled. A reset can happen after a reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;. But take care about the hard disk write cache, which should be off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. Even if it&#039;s battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents in that case.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation where, after a power failure on RAID-1 with only part of the disk caches written out, the controller does not even notice that the disks are out of sync: the disks can reorder cached blocks, so both may have written the superblock information but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86&lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled; this is fast but not safe for your data&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you do not care about your data and just want high speed (which is what you get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting the individual disk caches:&lt;br /&gt;
MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL -EnDskCache|DisDskCache&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is then disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that, by unplugging the power from such a system, you can destroy a database within the virtual machine (guest, domU, or whatever you call it) even with a RAID controller with a battery-backed cache and the hard disk caches turned off (which is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk (see the example below). For the other products, information is still missing.&lt;br /&gt;
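&lt;br /&gt;
For example (a sketch only; the disk image name is hypothetical, and depending on the qemu version the option may be spelled cache=off or cache=none):&lt;br /&gt;
&lt;br /&gt;
 $ qemu -drive file=guest.img,cache=off&lt;br /&gt;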
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem as though other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;xfs_check&#039;&#039;&#039; tool, or &#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039;, should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see the second example below on how to map inode numbers to directory entry names, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), and finally the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate the progression through each of the three types. For recovering file names we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions. The default DOS partition tables don&#039;t. The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
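&lt;br /&gt;
As an illustration (assuming parted(8) and a hypothetical /dev/sdb), a GPT label and a single large partition can be created with:&lt;br /&gt;
&lt;br /&gt;
 # parted /dev/sdb mklabel gpt&lt;br /&gt;
 # parted /dev/sdb mkpart primary 0% 100%&lt;br /&gt;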
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
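&lt;br /&gt;
For illustration only (the device and mount point are hypothetical); the same option can also be added to the relevant /etc/fstab entry:&lt;br /&gt;
&lt;br /&gt;
 # mount -o inode64 /dev/sdb1 /data&lt;br /&gt;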
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
See: http://everything2.com/index.pl?node_id=1479435&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that xfs_repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause xfs_repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Consulting_Resources&amp;diff=2047</id>
		<title>Consulting Resources</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Consulting_Resources&amp;diff=2047"/>
		<updated>2009-09-09T19:24:50Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.digitalelves.com  Digital Elves Inc]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=2030</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=2030"/>
		<updated>2009-07-06T21:29:15Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;!-- Welcome --&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#C5C5FF; align:right;&amp;quot;&amp;gt;&lt;br /&gt;
Welcome to XFS.org. This site is set up to help with the XFS file system.&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|style=&amp;quot;vertical-align:top&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Information --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#E2EAFF; align:right;&amp;quot;&amp;gt;&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
* [http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
* [[XFS FAQ]]&lt;br /&gt;
* [[XFS Status Updates]]&lt;br /&gt;
* [[XFS Papers and Documentation]]&lt;br /&gt;
* [[Linux Distributions shipping XFS]]&lt;br /&gt;
* [[XFS Rpm for RedHat]]&lt;br /&gt;
* [[XFS Companies]]&lt;br /&gt;
* [[OLD News]]&lt;br /&gt;
* [http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/XFS Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Consulting --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#fffff0; align:right; &amp;quot;&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[Consulting Resources]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;50%&amp;quot; style=&amp;quot;vertical-align:top&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!-- Developers --&amp;gt;&lt;br /&gt;
&amp;lt;div style=&amp;quot;margin:0; margin-top:10px; margin-right:10px; border:1px solid #dfdfdf; padding:0 1em 1em 1em; background-color:#F8F8FF; align:right;&amp;quot;&amp;gt;&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
* [[XFS email list and archives]]&lt;br /&gt;
* [http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
* [http://oss.sgi.com/bugzilla/ Bugzilla @ oss.sgi.com]&lt;br /&gt;
* [http://bugzilla.kernel.org/ Bugzilla @ kernel.org]&lt;br /&gt;
* [[Getting the latest source code]]&lt;br /&gt;
* [[Unfinished work]]&lt;br /&gt;
* [[Shrinking Support]]&lt;br /&gt;
* [[Ideas for XFS]]&lt;br /&gt;
&amp;lt;/div&amp;gt;&lt;br /&gt;
&lt;br /&gt;
{{#meta: | u+4/+rib+YG96TifD0SN88xS84YSDm2cl61IU7ZIk9g= | verify-v1 }}&lt;br /&gt;
&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=2007</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=2007"/>
		<updated>2009-05-13T15:50:07Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* XFS cvs trees  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS Released/Stable source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mainline kernels&#039;&#039;&#039;&amp;lt;br /&amp;gt; XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the SGI XFS development team.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Vendor kernels&#039;&#039;&#039;&amp;lt;br /&amp;gt; All modern Linux distributions include support for XFS. SGI actively works with [http://www.suse.com/  SUSE] to provide a supported version of XFS in that distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XFS userspace&#039;&#039;&#039;&amp;lt;br /&amp;gt; SGI also provides [ftp://oss.sgi.com/projects/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Development and bleeding edge Development &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
[[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
Development git trees&lt;br /&gt;
&lt;br /&gt;
Current XFS kernel source&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=summary xfs]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/xfs&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS user space tools&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsprogs.git;a=summary xfsprogs]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfsprogs&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS dump&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsdump.git;a=summary xfsdump]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfsdump&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS tests&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=summary xfstests]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfstests&amp;lt;/pre&amp;gt;&lt;br /&gt;
DMAPI user space tools&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/dmapi.git;a=summary dmapi]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/dmapi&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The git trees are automatically mirrored copies of the cvs trees, created using git-cvsimport.&lt;br /&gt;
Since git-cvsimport uses the cvsps tool to recreate the atomic commits of ptools&lt;br /&gt;
(or &amp;quot;mods&amp;quot;), it is easier to see the entire change that was committed using git.&lt;br /&gt;
&lt;br /&gt;
git-cvsimport generated trees.&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=linux-2.6-xfs-from-cvs/.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs-cmds/.git;a=summary xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
Before building in the xfsdump or dmapi directories (after building xfsprogs), you will need to run:&lt;br /&gt;
&amp;lt;pre&amp;gt;# cd xfsprogs&lt;br /&gt;
# make install-dev&amp;lt;/pre&amp;gt;&lt;br /&gt;
to create /usr/include/xfs and install appropriate files there.&lt;br /&gt;
&lt;br /&gt;
Before building in the xfstests directory, you will need to run:&lt;br /&gt;
&amp;lt;pre&amp;gt;# cd xfsprogs&lt;br /&gt;
# make install-qa&amp;lt;/pre&amp;gt;&lt;br /&gt;
to install a somewhat larger set of files in /usr/include/xfs.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees were created using a script that converted sgi&#039;s internal&lt;br /&gt;
ptools repository to a cvs repository, so the cvs trees were considered read only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is being managed in the git trees, so the cvs trees&lt;br /&gt;
are no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
[[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1944</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1944"/>
		<updated>2009-01-09T04:07:42Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to:&lt;br /&gt;
&lt;br /&gt;
[mailto:xfs@oss.sgi.com xfs@oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs List archives]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/pipermail/xfs List archives using pipermail]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
The easiest method is to use the mailman web interface.&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/mailman/listinfo/xfs http://oss.sgi.com/mailman/listinfo/xfs]&lt;br /&gt;
&lt;br /&gt;
The email interface is also available by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:xfs-request@oss.sgi.com xfs-request@oss.sgi.com]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=OLD_News&amp;diff=1943</id>
		<title>OLD News</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=OLD_News&amp;diff=1943"/>
		<updated>2009-01-08T23:42:54Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: {| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot; | bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; July-2007 &amp;lt;/font&amp;gt; | bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Next r...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; July-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Next round of xfs_repair(8) performance improvements (xfsprogs 2.9.2 onwards). It substantially improves the I/O performance compared to previous versions through the use of batch reads, background metadata prefetching and smart priority based libxfs caching. Lost+found behaviour has also been changed and will not re-orphan existing lost+found files. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; July-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added new data allocator called &amp;quot;filestreams&amp;quot; to 2.6.22 kernels. It allows a directory to reserve an allocation group for exclusive use by files created within that directory. Files being written in other directories will not use the same allocation group and so files within different directories will not interleave extents on disk. The reservation is only active while files are being created and written into the directory. Filestreams can be enabled filesystem wide with a &amp;quot;-o filestreams&amp;quot; mount option, or on a per directory basis with an xfs_io(8) chattr flag. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; June-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added a new tool, xfs_metadump(8), to capture the metadata of a corrupted filesystem for implementing improvements to xfs_repair(8). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Released &amp;quot;Lazy Superblock Counters&amp;quot; which significantly improves transaction intensive workloads which operate on the free space and inode usage counters in the superblock. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; March-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Fixed the notorious &amp;quot;NULL files&amp;quot; problem after a crash. The fix improves the synchronisation between updates of the file size and writes that extend a file. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; January-2007 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added [training/index.html XFS Training course] (still under development) and the documentation for the [papers/xfs_filesystem_structure.pdf XFS ondisk format]. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; August-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added the paper &amp;quot;High Bandwidth Filesystems on Large Systems&amp;quot; which was presented at the Ottawa Linux Symposium in July 2006. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; July-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Several bulkstat related performance improvements, mostly improving DMAPI scans for filesystems with many millions of inodes, but also generic readahead improvements that will help other bulkstat users (e.g. xfsdump and quota). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; July-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; SLES10 is released, with full XFS and DMAPI support. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; June-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; First round of xfs_repair(8) performance improvements, and buffer/inode caching now done in libxfs in preparation for a multi-threaded version of xfs_repair at a later date. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Additional inheritable inode flag (nodefrag) to allow specified inodes to be skipped by xfs_fsr(8). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Updated the [http://oss.sgi.com/projects/xfs/faq.html#wcache FAQ] to discuss device write barriers. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Mar-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Incore extent management rework, more efficiently using memory when working with files with large numbers of extents. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Feb-2006 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Per-CPU superblock accounting, improving buffered I/O throughput significantly for parallel I/O workloads. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Dec-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Rework the page writeout code paths within XFS to make better use of advances in 2.6 kernels (page clustering and larger block I/O requests). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Dec-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; SLES9 Service Pack 3 released with next set of XFS updates. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Nov-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Support for block layer write barriers, allowing devices to be used with their write cache enabled (2.6+ kernels). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Nov-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Preferred extent size allocator hint for the B+ tree allocator. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Sep-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Support for inline extended attribute format 2, which significantly improves performance when using extended attributes. Driven by needs of the Samba folks for Samba version 4, but also helpful to many other people. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added support for project quota, and the ability to inherit project identifiers. Provides the mechanism for implementing a form of directory tree quota. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Merged all XFS fixes since SP1 into SLES9 SP2. Thanks Andreas! &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; [http://people.freebsd.org/~rodrigc/xfs/ XFS for FreeBSD] website announced (an independent porting effort not sponsored by SGI), supported by FreeBSD developers Craig Rodrigues and Alexander Kabaev. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jan-2005 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Support use of 64 bit inode numbers with NFS. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Dec-2004 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added ihashsize mount option and reworked inode hash sizing algorithms for improved scalability. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Oct-2004 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS web pages updated for the first time in several years. (*cough*) &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Oct-2004 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Merged all XFS fixes since 2.6.5 into SLES9 Service Pack 1. Thanks Andreas! &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Feb-2004 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS is merged into Marcelo&#039;s 2.4.25 kernel. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Dec-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; 2.6.0 kernel is released, with full XFS support. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Oct-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added support for allocation groups larger than four gigabytes (up to 1 terabyte per AG supported now). Big scalability improvement for large filesystems. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Sep-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Additional per-inode flags introduced (immutable, append-only, noatime, nodump, sync). &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Aug-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.3 is released. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Apr-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS merged in Alan Cox&#039;s 2.4.21-rc1-ac3 kernel. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Feb-2003 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.2 is released. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Dec-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added support for sector sizes larger than 512 bytes. Particularly useful with MD RAID5 setups, significant performance win when a sector size matching the block size is used. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Sep-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS is merged into Linus&#039; 2.5 development tree. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Added support for version 2 logs, allowing larger incore log buffers and log write alignment. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS quota syscall support and VFS interfaces merged in 2.5. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Apr-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; SuSE 8.0 is available with native XFS support. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Apr-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.1 is released. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Feb-2002 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Extended attribute syscalls and VFS interfaces merged in 2.5. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Nov-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Work starts on the 2.5 kernel tree. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Nov-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.0.2 Release &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Sep-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Mandrake 8.1 is available with native XFS support. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jul-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.0.1 Release &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Debian &amp;quot;Woody&amp;quot; XFS install discs are now available. Thanks Zoltan! &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.0 patches now in the Debian testing (&amp;quot;Woody&amp;quot;) distribution. Thanks again to Ed Boraas. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.0 patches now in the Debian unstable (&amp;quot;Sid&amp;quot;) distribution. Thanks go out to Ed Boraas. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Mandrake packages are now available, thanks to Chmouel Boudjnah. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Debian kernel packages available from Marcos Pinto. Thanks! &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; May-2001 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS 1.0 Release &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Sep-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS Beta Release &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Aug-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Initial FAQ now available, thanks to Thomas Graichen. &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Usenix 2000: Porting XFS to Linux talk &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Jun-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Usenix 2000 XFS pre-beta ISO image available &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Apr-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Web interface to CVS repository now available &amp;lt;/font&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| bgcolor=&amp;quot;#88ee88&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; Mar-2000 &amp;lt;/font&amp;gt;&lt;br /&gt;
| bgcolor=&amp;quot;#99cccc&amp;quot; valign=&amp;quot;top&amp;quot; | &amp;lt;font face=&amp;quot;Helvetica, Arial&amp;quot;&amp;gt; XFS source code officially available! &amp;lt;/font&amp;gt;&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=1942</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=1942"/>
		<updated>2009-01-08T23:42:20Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* Information about XFS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to XFS.org ==&lt;br /&gt;
&lt;br /&gt;
This site is set up to help with the XFS file system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Xfs Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[[XFS FAQ]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Status Updates]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
&lt;br /&gt;
[[XFS Papers and Documentation]]&lt;br /&gt;
&lt;br /&gt;
[[Linux Distributions shipping XFS]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Rpm for RedHat]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Companies]]&lt;br /&gt;
&lt;br /&gt;
[[OLD News]]&lt;br /&gt;
&lt;br /&gt;
{{#widget:Ohloh Project|id=xfs|type=thin_badge}}&lt;br /&gt;
&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
[[ XFS email list and archives ]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/bugzilla/ Bugzilla @ oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://bugzilla.kernel.org/ Bugzilla @ kernel.org]&lt;br /&gt;
&lt;br /&gt;
[[Getting the latest source code]]&lt;br /&gt;
&lt;br /&gt;
[[Unfinished work]]&lt;br /&gt;
&lt;br /&gt;
[[Shrinking Support]]&lt;br /&gt;
&lt;br /&gt;
[[Ideas for XFS from Dave Chinner]]&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[ Consulting Resources ]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1941</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1941"/>
		<updated>2009-01-08T23:07:24Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to:&lt;br /&gt;
&lt;br /&gt;
[mailto:xfs@oss.sgi.com xfs@oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs List archives]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/pipermail/xfs List archives using pipermail]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
The easiest method is to use the mailman web interface.&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/mailmain/listinfo/xfs http://oss.sgi.com/mailmain/listinfo/xfs]&lt;br /&gt;
&lt;br /&gt;
The email interface is also available by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:xfs-request@oss.sgi.com xfs-request@oss.sgi.com]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1940</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=1940"/>
		<updated>2009-01-08T23:06:10Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: == XFS email list == Patches, comments, requests and questions should go to:  [mailto:xfs@oss.sgi.com xfs@oss.sgi.com]  [http://oss.sgi.com/archives/xfs List archives]   == Subscribing to ...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to:&lt;br /&gt;
&lt;br /&gt;
[mailto:xfs@oss.sgi.com xfs@oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs List archives]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
The easiest method is to use the mailman web interface.&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/mailmain/listinfo/xfs http://oss.sgi.com/mailmain/listinfo/xfs]&lt;br /&gt;
&lt;br /&gt;
The email interface is also available by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:xfs-request@oss.sgi.com xfs-request@oss.sgi.com]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=1939</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=1939"/>
		<updated>2009-01-08T22:26:17Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* XFS Developer Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to XFS.org ==&lt;br /&gt;
&lt;br /&gt;
This site is set up to help with the XFS file system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Xfs Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[[XFS FAQ]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Status Updates]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
&lt;br /&gt;
[[XFS Papers and Documentation]]&lt;br /&gt;
&lt;br /&gt;
[[Linux Distributions shipping XFS]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Rpm for RedHat]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Companies]]&lt;br /&gt;
&lt;br /&gt;
{{#widget:Ohloh Project|id=xfs|type=thin_badge}}&lt;br /&gt;
&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
[[ XFS email list and archives ]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/bugzilla/ Bugzilla @ oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://bugzilla.kernel.org/ Bugzilla @ kernel.org]&lt;br /&gt;
&lt;br /&gt;
[[Getting the latest source code]]&lt;br /&gt;
&lt;br /&gt;
[[Unfinished work]]&lt;br /&gt;
&lt;br /&gt;
[[Shrinking Support]]&lt;br /&gt;
&lt;br /&gt;
[[Ideas for XFS from Dave Chinner]]&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[ Consulting Resources ]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=1938</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=1938"/>
		<updated>2009-01-08T21:55:04Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* XFS Developer Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to XFS.org ==&lt;br /&gt;
&lt;br /&gt;
This site is set up to help with the XFS file system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Xfs Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[[XFS FAQ]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Status Updates]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
&lt;br /&gt;
[[XFS Papers and Documentation]]&lt;br /&gt;
&lt;br /&gt;
[[Linux Distributions shipping XFS]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Rpm for RedHat]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Companies]]&lt;br /&gt;
&lt;br /&gt;
{{#widget:Ohloh Project|id=xfs|type=thin_badge}}&lt;br /&gt;
&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs xfs@oss.sgi.com mailing list archives]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/bugzilla/ Bugzilla @ oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
[http://bugzilla.kernel.org/ Bugzilla @ kernel.org]&lt;br /&gt;
&lt;br /&gt;
[[Getting the latest source code]]&lt;br /&gt;
&lt;br /&gt;
[[Unfinished work]]&lt;br /&gt;
&lt;br /&gt;
[[Shrinking Support]]&lt;br /&gt;
&lt;br /&gt;
[[Ideas for XFS from Dave Chinner]]&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[ Consulting Resources ]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=1937</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=1937"/>
		<updated>2009-01-08T21:54:18Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: /* XFS Developer Resources */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to XFS.org ==&lt;br /&gt;
&lt;br /&gt;
This site is set up to help with the XFS file system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Xfs Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[[XFS FAQ]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Status Updates]]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
&lt;br /&gt;
[[XFS Papers and Documentation]]&lt;br /&gt;
&lt;br /&gt;
[[Linux Distributions shipping XFS]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Rpm for RedHat]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Companies]]&lt;br /&gt;
&lt;br /&gt;
{{#widget:Ohloh Project|id=xfs|type=thin_badge}}&lt;br /&gt;
&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs xfs@oss.sgi.com mailing list archives]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/bugzilla/ oss.sgi.com bugzilla]&lt;br /&gt;
&lt;br /&gt;
[http://bugzilla.kernel.org/ kernel.org bugzilla]&lt;br /&gt;
&lt;br /&gt;
[[Getting the latest source code]]&lt;br /&gt;
&lt;br /&gt;
[[Unfinished work]]&lt;br /&gt;
&lt;br /&gt;
[[Shrinking Support]]&lt;br /&gt;
&lt;br /&gt;
[[Ideas for XFS from Dave Chinner]]&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[ Consulting Resources ]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=1936</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=1936"/>
		<updated>2009-01-08T21:52:13Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS Released/Stable source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mainline kernels&#039;&#039;&#039;&amp;lt;br /&amp;gt; XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with Linux 2.4 and is frequently updated with the latest stable fixes and features from the SGI XFS development team.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Vendor kernels&#039;&#039;&#039;&amp;lt;br /&amp;gt; All modern Linux distributions include support for XFS. SGI actively works with [http://www.suse.com/  SUSE] to provide a supported version of XFS in that distribution.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XFS userspace&#039;&#039;&#039;&amp;lt;br /&amp;gt; SGI also provides [ftp://oss.sgi.com/projects/xfs source code tarballs] of the XFS userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Development and bleeding edge source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
[[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
Development git trees&lt;br /&gt;
&lt;br /&gt;
Current XFS kernel source&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=summary xfs]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/xfs&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS user space tools&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsprogs.git;a=summary xfsprogs]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfsprogs&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS dump&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfsdump.git;a=summary xfsdump]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfsdump&amp;lt;/pre&amp;gt;&lt;br /&gt;
XFS tests&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/xfstests.git;a=summary xfstests]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/xfstests&amp;lt;/pre&amp;gt;&lt;br /&gt;
DMAPI user space tools&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/cmds/dmapi.git;a=summary dmapi]&lt;br /&gt;
&amp;lt;pre&amp;gt;$ git clone git://oss.sgi.com/xfs/cmds/dmapi&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The git trees are automatically mirrored copies of the cvs trees, created using git-cvsimport.&lt;br /&gt;
Since git-cvsimport uses the cvsps tool to recreate the atomic commits of a ptools&lt;br /&gt;
&amp;quot;mod&amp;quot;, it is easier to see the entire change that was committed using git.&lt;br /&gt;
&lt;br /&gt;
git-cvsimport generated trees.&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=linux-2.6-xfs-from-cvs/.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs-cmds/.git;a=summary xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees were created using a script that converted sgi&#039;s internal&lt;br /&gt;
ptools repository to a cvs repository, so the cvs trees were considered read only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is managed in the git trees, so the cvs trees&lt;br /&gt;
are no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
[[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_CVS_howto&amp;diff=1935</id>
		<title>XFS CVS howto</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_CVS_howto&amp;diff=1935"/>
		<updated>2009-01-08T17:30:28Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: = &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS: Source Code &amp;lt;/font&amp;gt; =  * &amp;#039;&amp;#039;&amp;#039;CVS web&amp;#039;&amp;#039;&amp;#039;&amp;lt;br /&amp;gt; Browse the XFS source trees. ** [http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/ 2.6.x-xfs] **...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS: Source Code &amp;lt;/font&amp;gt; =&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;CVS web&#039;&#039;&#039;&amp;lt;br /&amp;gt; Browse the XFS source trees.&lt;br /&gt;
** [http://oss.sgi.com/cgi-bin/cvsweb.cgi/linux-2.6-xfs/ 2.6.x-xfs]&lt;br /&gt;
** [http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/ xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;CVS trees&#039;&#039;&#039;&amp;lt;br /&amp;gt; Direct CVS access to the most recent XFS changes. See below.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Using CVS trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees are automated mirrors of the SGI internal ptools-managed source trees.&lt;br /&gt;
&lt;br /&gt;
[http://www.cvshome.org/new_users.html CVS for new users] contains links to general CVS documentation.&lt;br /&gt;
&lt;br /&gt;
Set the CVSROOT environment variable.&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#DFDFDF&amp;quot; | &amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ export CVSROOT=&#039;:pserver:cvs@oss.sgi.com:/cvs&#039;&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;&#039;&#039;(for sh, bash, ksh, or similar shells)&#039;&#039;&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ setenv CVSROOT :pserver:cvs@oss.sgi.com:/cvs&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;&#039;&#039;(for csh or tcsh shells)&#039;&#039;&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Login to the CVS server (this only needs to be done ONCE, not every time you access CVS).&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#DFDFDF&amp;quot; | &amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ cvs login&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&#039;&#039;(the password is &amp;quot;&#039;&#039;cvs&#039;&#039;&amp;quot;)&#039;&#039;&amp;lt;br /&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Now grab the XFS source tree(s) of interest:&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#DFDFDF&amp;quot; | &amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ cvs checkout linux-2.6-xfs&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ cvs checkout xfs-cmds&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Subsequently, you can checkout new code using:&lt;br /&gt;
&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;2&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#DFDFDF&amp;quot; | &amp;lt;br /&amp;gt;&amp;lt;tt&amp;gt;$ cvs update -d&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=1934</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=1934"/>
		<updated>2009-01-08T17:29:01Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Using GIT trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
[[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
The git trees are automatically mirrored copies of the cvs trees, created using git-cvsimport.&lt;br /&gt;
Since git-cvsimport uses the cvsps tool to recreate the atomic commits of a ptools&lt;br /&gt;
&amp;quot;mod&amp;quot;, it is easier to see the entire change that was committed using git.&lt;br /&gt;
&lt;br /&gt;
git-cvsimport generated trees.&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=linux-2.6-xfs-from-cvs/.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs-cmds/.git;a=summary xfs-cmds]&lt;br /&gt;
Changes headed for the main linux 2.6 tree, manual merges.&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs-2.6.git;a=summary xfs-2.6]&lt;br /&gt;
&lt;br /&gt;
Cloning the git trees for local use:&lt;br /&gt;
{| width=&amp;quot;100%&amp;quot; cellspacing=&amp;quot;1&amp;quot;&lt;br /&gt;
| bgcolor=&amp;quot;#DFDFDF&amp;quot; | &amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;$ git clone git://oss.sgi.com/linux-2.6-xfs-from-cvs&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;$ git clone git://oss.sgi.com/xfs-cmds&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;tt&amp;gt;$ git clone git://oss.sgi.com/xfs/xfs-2.6&amp;lt;/tt&amp;gt;&amp;lt;br /&amp;gt;&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees were created using a script that converted sgi&#039;s internal&lt;br /&gt;
ptools repository to a cvs repository, so the cvs trees were considered read only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is managed in the git trees, so the cvs trees&lt;br /&gt;
are no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
[[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Git&amp;diff=1933</id>
		<title>Git</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Git&amp;diff=1933"/>
		<updated>2009-01-08T17:06:02Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Git moved to XFS git howto: be a little more descriptive&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;#REDIRECT [[XFS git howto]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_git_howto&amp;diff=1932</id>
		<title>XFS git howto</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_git_howto&amp;diff=1932"/>
		<updated>2009-01-08T17:06:02Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: Git moved to XFS git howto: be a little more descriptive&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Where is it? ==&lt;br /&gt;
&lt;br /&gt;
A git server is set up on oss.sgi.com which serves&lt;br /&gt;
out of /oss/git.  A user, git, has been set up with the home directory&lt;br /&gt;
of /oss/git.  The xfs trees are located under /oss/git/xfs.  The main&lt;br /&gt;
development tree is a bare repository under /oss/git/xfs/xfs.git.  (So it&lt;br /&gt;
has no checked out files, just the .git database files at the top&lt;br /&gt;
level.) So far, it has a master, a mainline and an xfs-dev branch.&lt;br /&gt;
The master branch is used for checking in development; it&lt;br /&gt;
is mainline plus the latest XFS. xfs-dev will be set up to track ptools;&lt;br /&gt;
checkins are currently closed to that branch.&lt;br /&gt;
&lt;br /&gt;
== Checking out a tree: ==&lt;br /&gt;
    $ git clone git+ssh://oss.sgi.com/oss/git/xfs/xfs/&lt;br /&gt;
(for local trees you can use the path directly; if the machine is&lt;br /&gt;
running a git-daemon you can use git://, but that will not&lt;br /&gt;
automatically set up the push syntax)&lt;br /&gt;
&lt;br /&gt;
This will clone the tree (all the commit objects) and check out&lt;br /&gt;
the HEAD branch (master in our case). Other branches can be seen&lt;br /&gt;
with git branch -a; to check out a branch (local or remote) just&lt;br /&gt;
use:&lt;br /&gt;
    $ git checkout $branch&lt;br /&gt;
&lt;br /&gt;
== Tree Status: ==&lt;br /&gt;
    $ git status # lists modified, unmerged and untracked files.&lt;br /&gt;
    $ git log    # shows all committed modifications&lt;br /&gt;
( you can use git log $remote/branch to see the log of a remote )&lt;br /&gt;
&lt;br /&gt;
== Modifying files before checkins: ==&lt;br /&gt;
There is no need to mark files for modification; git will find out&lt;br /&gt;
about them automagically, just edit them.&lt;br /&gt;
    $ git add file # add&lt;br /&gt;
    $ git rm file  # remove&lt;br /&gt;
&lt;br /&gt;
    Note that if one uses &amp;quot;git-commit -a&amp;quot; or &amp;quot;git-finalize -a&amp;quot;&lt;br /&gt;
    then you don&#039;t have to add files which have been modified or&lt;br /&gt;
    deleted as git will detect them, you only have to add new files.&lt;br /&gt;
&lt;br /&gt;
== Committing/checking in: ==&lt;br /&gt;
    $ git commit&lt;br /&gt;
From man page - useful options:&lt;br /&gt;
--------------------------------&lt;br /&gt;
    -a|--all::&lt;br /&gt;
    Tell the command to automatically stage files that have&lt;br /&gt;
    been modified and deleted, but new files you have not&lt;br /&gt;
    told git about are not affected.&lt;br /&gt;
&lt;br /&gt;
    --amend::&lt;br /&gt;
    Used to amend the tip of the current branch. Prepare the tree&lt;br /&gt;
    object you would want to replace the latest commit as usual&lt;br /&gt;
    (this includes the usual -i/-o and explicit paths), and the&lt;br /&gt;
    commit log editor is seeded with the commit message from the&lt;br /&gt;
    tip of the current branch. The commit you create replaces the&lt;br /&gt;
    current tip -- if it was a merge, it will have the parents of&lt;br /&gt;
    the current tip as parents -- so the current top commit is&lt;br /&gt;
    discarded.&lt;br /&gt;
&lt;br /&gt;
    -s|--signoff::&lt;br /&gt;
    Add Signed-off-by line at the end of the commit message.&lt;br /&gt;
&lt;br /&gt;
I do like to git commit -asm &amp;quot;Commit message&amp;quot;&lt;br /&gt;
--------------------------------&lt;br /&gt;
Remember that a git commit only commits to YOUR local tree; you&lt;br /&gt;
then need to push things over:&lt;br /&gt;
&lt;br /&gt;
General form is:&lt;br /&gt;
    $ git push git://uri/of/the/other/rep +refspec&lt;br /&gt;
    &lt;br /&gt;
    $ git push oss&lt;br /&gt;
&lt;br /&gt;
Where in xfs/.git/config it has the few lines:&lt;br /&gt;
&lt;br /&gt;
    [remote &amp;quot;oss&amp;quot;]&lt;br /&gt;
    url = ssh://oss.sgi.com/oss/git/xfs/xfs.git&lt;br /&gt;
    push = master&lt;br /&gt;
&lt;br /&gt;
(Or modify the config file using &amp;quot;git remote add&amp;quot; mentioned below)&lt;br /&gt;
&lt;br /&gt;
== Going back in history - changing one&#039;s mind ==&lt;br /&gt;
[the STUPID (ptools) way]&lt;br /&gt;
    $ git revert $mod&lt;br /&gt;
will introduce a new commit reverting $mod.&lt;br /&gt;
&lt;br /&gt;
[ the OH GOD WE ARE DISTRIBUTED way (for mods not pushed anyway)]&lt;br /&gt;
&lt;br /&gt;
if the mods to revert are the last n ones:&lt;br /&gt;
    $ git reset HEAD~n&lt;br /&gt;
&lt;br /&gt;
if not (dangerous)&lt;br /&gt;
    $ git rebase -i $mod^1 # considered harmful, read the documentation!&lt;br /&gt;
&lt;br /&gt;
Other related commands:&lt;br /&gt;
    $ git reset     # see doc for --hard&lt;br /&gt;
&lt;br /&gt;
== Tracking remote trees: ==&lt;br /&gt;
    $ git remote add $name $uri # adds a remote tracking&lt;br /&gt;
    $ git remote update         # updates all remotes&lt;br /&gt;
    $ git branch -a             # shows all accessible branches&lt;br /&gt;
                                  ( including remotes in the&lt;br /&gt;
                                    $remote_name/ namespace )&lt;br /&gt;
&lt;br /&gt;
    $ git checkout -b $local_branch_name --track $remote_name/$remote_branch&lt;br /&gt;
          # creates a tracked local branch, git will warn whenever the&lt;br /&gt;
            remote adds commits.&lt;br /&gt;
&lt;br /&gt;
== Publishing one&#039;s tree: ==&lt;br /&gt;
Give shell access to your tree; use git+ssh://machine/path or&lt;br /&gt;
a direct path.&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
    $ sudo git-daemon --export-all --base-path=/srv/git --base-path-relaxed --reuseaddr --user-path=public_git&lt;br /&gt;
           # exports all git trees found under ~/public_git and /srv/git&lt;br /&gt;
&lt;br /&gt;
Or set it up via inetd, etc.&lt;br /&gt;
&lt;br /&gt;
== Reviews and requesting them ==&lt;br /&gt;
=== Developer ===&lt;br /&gt;
==== From a git tree: ====&lt;br /&gt;
* publishes a git tree at git://dev/tree, containing his feature1 branch&lt;br /&gt;
* Requests a pull from the reviewer.&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
==== publishes a series of patches: ====&lt;br /&gt;
        $ git format-patch $since_head # create ordered patches since&lt;br /&gt;
                                         head $since_head, e.g. on&lt;br /&gt;
                                         branch linus-create-ea,&lt;br /&gt;
                                         $since_head would be linus/master&lt;br /&gt;
        $ git send-email --compose *.patch&lt;br /&gt;
&lt;br /&gt;
== Importing changes to our development tree ==&lt;br /&gt;
=== Reviewer - typically someone at SGI ===&lt;br /&gt;
==== From a git tree: ====&lt;br /&gt;
* Adds a remote locally called &amp;quot;dev&amp;quot;:&lt;br /&gt;
        $ git remote add dev git://dev/tree&lt;br /&gt;
&lt;br /&gt;
* Looks at the differences between his tree (dev) and feature1:&lt;br /&gt;
        $ git log HEAD...dev/feature1 # differences in both ways,&lt;br /&gt;
                                        read man for more detail…&lt;br /&gt;
List patches of commits from HEAD or dev/feature1 but not in both&lt;br /&gt;
(A...B in one branch but not both)&lt;br /&gt;
(A..B in branch B but not in A)&lt;br /&gt;
&lt;br /&gt;
* Reviews the diffs: (-p adds commit change in patch form)&lt;br /&gt;
        $ git log -p dev/feature1..HEAD&lt;br /&gt;
&lt;br /&gt;
* For each commit he accepts, imports it to his tree, adding a&lt;br /&gt;
Signed-off-by: automatically:&lt;br /&gt;
&lt;br /&gt;
Whilst in our own development tree, cherry-pick from the remote&lt;br /&gt;
        $ git cherry-pick -s -e $commit # easily scriptable with git cherry &lt;br /&gt;
&lt;br /&gt;
* The only trick there is putting the description into our preferred format,&lt;br /&gt;
  with summary line with [XFS] prefix, body, and SGI-PV.&lt;br /&gt;
  I guess we&#039;ll need to do that manually.&lt;br /&gt;
&lt;br /&gt;
* Pushes it to tree git+ssh://oss.sgi.com/oss/git/xfs/xfs.git&lt;br /&gt;
        $ git push oss # if you&#039;ve set up tracking remotes correctly&lt;br /&gt;
&lt;br /&gt;
if not&lt;br /&gt;
          $ git push git://chook/xfs/xfs-dev &amp;lt;refspec&amp;gt; # read man…&lt;br /&gt;
&lt;br /&gt;
        of form: git push repository &amp;lt;refspec&amp;gt;&lt;br /&gt;
        where &amp;lt;refspec&amp;gt; of form:&lt;br /&gt;
          The canonical format of a &amp;lt;refspec&amp;gt; parameter is&lt;br /&gt;
          `+?&amp;lt;src&amp;gt;:&amp;lt;dst&amp;gt;`; that is, an optional plus `+`, followed&lt;br /&gt;
          by the source ref, followed by a colon `:`, followed by&lt;br /&gt;
          the destination ref.&lt;br /&gt;
          The local ref that matches &amp;lt;src&amp;gt; is used&lt;br /&gt;
          to fast forward the remote ref that matches &amp;lt;dst&amp;gt;.  If&lt;br /&gt;
          the optional plus `+` is used, the remote ref is updated&lt;br /&gt;
          even if it does not result in a fast forward update.&lt;br /&gt;
&lt;br /&gt;
or&lt;br /&gt;
&lt;br /&gt;
==== From emailed patches: ====&lt;br /&gt;
* From a plain patch:&lt;br /&gt;
      $ git apply $patch # evil&lt;br /&gt;
* From a mailbox:&lt;br /&gt;
      $ git am -s $mailbox # the way to go, adds Signed-off-by&lt;br /&gt;
&lt;br /&gt;
In order to modify the commit description, it may work to apply the committed&lt;br /&gt;
patches in another branch and then &amp;quot;cherry-pick -e&amp;quot; them into the development&lt;br /&gt;
branch.&lt;br /&gt;
&lt;br /&gt;
== Lost your quilt ? ==&lt;br /&gt;
Hope you use underwear.&lt;br /&gt;
&lt;br /&gt;
if not, you can look at many projects made of awesome:&lt;br /&gt;
=== guilt (written by Jeffpc, an XFS hacker): ===&lt;br /&gt;
Quilt for git; similar to Mercurial queues. Guilt (Git Quilt) is a&lt;br /&gt;
series of bash scripts which add Mercurial queues-like&lt;br /&gt;
functionality and an interface to git.  The one distinguishing&lt;br /&gt;
feature from other quilt-like porcelains is the format of the&lt;br /&gt;
patches directory. _All_ the information is stored as plain text&lt;br /&gt;
- a series file and the patches (one per file). This easily lends&lt;br /&gt;
itself to versioning the patches using any number of SCMs.&lt;br /&gt;
&lt;br /&gt;
=== stgit: ===&lt;br /&gt;
Manage stacks of patches in a git repository. stgit provides&lt;br /&gt;
similar functionality to quilt (i.e. pushing/popping patches&lt;br /&gt;
to/from a stack) on top of git.&lt;br /&gt;
&lt;br /&gt;
These operations are performed using git commands and the patches&lt;br /&gt;
are stored as git commit objects, allowing easy merging of the&lt;br /&gt;
stgit patches into other repositories using standard git&lt;br /&gt;
functionality.&lt;br /&gt;
&lt;br /&gt;
Homepage: http://www.procode.org/stgit/&lt;br /&gt;
&lt;br /&gt;
=== topgit ===&lt;br /&gt;
(house favourite, we may want to impose this one, but needs&lt;br /&gt;
a bit of git knowledge to fully understand):&lt;br /&gt;
&lt;br /&gt;
A Git patch queue manager. TopGit manages a patch queue using Git&lt;br /&gt;
topic branches, one patch per branch. It allows for patch&lt;br /&gt;
dependencies and can thus manage non-linear patch series.&lt;br /&gt;
&lt;br /&gt;
TopGit is a minimal layer on top of Git, which does not limit use&lt;br /&gt;
of Git&#039;s functionality (such as the index). It rigorously keeps&lt;br /&gt;
history until a patch is accepted upstream. It is also fully&lt;br /&gt;
usable across distributed repositories.&lt;br /&gt;
&lt;br /&gt;
Homepage: http://repo.or.cz/w/topgit.git&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Current_events&amp;diff=1931</id>
		<title>Xfs.org:Current events</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Current_events&amp;diff=1931"/>
		<updated>2009-01-05T09:05:52Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: Please see main page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Please see main page&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Consulting_Resources&amp;diff=1916</id>
		<title>Consulting Resources</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Consulting_Resources&amp;diff=1916"/>
		<updated>2008-12-21T18:33:16Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: [http://www.digitalelves.com Digital Elves Inc]&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[http://www.digitalelves.com Digital Elves Inc]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Main_Page&amp;diff=1915</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Main_Page&amp;diff=1915"/>
		<updated>2008-12-21T18:29:19Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Welcome to XFS.org ==&lt;br /&gt;
&lt;br /&gt;
This site is set up to help with the XFS file system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Information about XFS ==&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/Xfs Wikipedia xfs page, good detailed information.]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[[XFS FAQ]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Status Updates]]&lt;br /&gt;
&lt;br /&gt;
{{#widget:Ohloh Project|id=xfs|type=thin_badge}}&lt;br /&gt;
&lt;br /&gt;
== XFS Developer Resources ==&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs Main sgi xfs website]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/archives/xfs xfs@oss.sgi.com mailing list archives]&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/bugzilla/ oss.sgi.com bugzilla]&lt;br /&gt;
&lt;br /&gt;
[http://bugzilla.kernel.org/ kernel.org bugzilla]&lt;br /&gt;
&lt;br /&gt;
[[Getting the latest source code]]&lt;br /&gt;
&lt;br /&gt;
[[Git]]&lt;br /&gt;
&lt;br /&gt;
[[Shrinking_Support]]&lt;br /&gt;
&lt;br /&gt;
== XFS User Resources ==&lt;br /&gt;
&lt;br /&gt;
[http://oss.sgi.com/projects/xfs/training/index.html Link to XFS training material]&lt;br /&gt;
&lt;br /&gt;
[[XFS Papers and Documentation]]&lt;br /&gt;
&lt;br /&gt;
[[Linux Distributions shipping XFS]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Rpm for RedHat]]&lt;br /&gt;
&lt;br /&gt;
[[XFS Companies]]&lt;br /&gt;
&lt;br /&gt;
== XFS Future Development Thoughts ==&lt;br /&gt;
&lt;br /&gt;
[[Ideas for XFS from Dave Chinner]]&lt;br /&gt;
&lt;br /&gt;
== Professional XFS Consulting Services == &lt;br /&gt;
&lt;br /&gt;
[[ Consulting Resources ]]&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=User:Cattelan&amp;diff=1883</id>
		<title>User:Cattelan</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=User:Cattelan&amp;diff=1883"/>
		<updated>2008-10-09T02:23:50Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: Russell Cattelan&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Russell Cattelan&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Site_support&amp;diff=1880</id>
		<title>Xfs.org:Site support</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Site_support&amp;diff=1880"/>
		<updated>2008-10-08T04:08:55Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Donations are welcome&lt;br /&gt;
&lt;br /&gt;
Thank you&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:Site_support&amp;diff=1879</id>
		<title>Xfs.org:Site support</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:Site_support&amp;diff=1879"/>
		<updated>2008-10-08T03:45:26Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Donations are welcome&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Reliable_Detection_and_Repair_of_Metadata_Corruption&amp;diff=1877</id>
		<title>Reliable Detection and Repair of Metadata Corruption</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Reliable_Detection_and_Repair_of_Metadata_Corruption&amp;diff=1877"/>
		<updated>2008-10-08T03:07:29Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: == Reliable Detection and Repair of Metadata Corruption ==     This can be broken down into specific phases. Firstly, we cannot repair a corruption we have not detected. Hence the first th...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Reliable Detection and Repair of Metadata Corruption ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This can be broken down into specific phases. Firstly, we cannot repair a&lt;br /&gt;
corruption we have not detected. Hence the first thing we need to do is&lt;br /&gt;
reliable detection of errors and corruption. Once we can reliably detect errors&lt;br /&gt;
in structures and verified that we are propagating all the errors reported from&lt;br /&gt;
lower layers into XFS correctly, we can look at ways of handling them more&lt;br /&gt;
robustly. In many cases, the same type of error needs to be handled differently&lt;br /&gt;
due to the context in which the error occurs.  This introduces extra complexity&lt;br /&gt;
into this problem.&lt;br /&gt;
&lt;br /&gt;
Rather than continually referring to specific types of problems (such as&lt;br /&gt;
corruption or error handling) I&#039;ll refer to them as &#039;exceptions&#039;. This avoids&lt;br /&gt;
thinking about specific error conditions through specific paths and so helps us&lt;br /&gt;
to look at the issues from a more general or abstract point of view.&lt;br /&gt;
&lt;br /&gt;
== Exception Detection ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Our current approach to exception detection is entirely reactive and rather&lt;br /&gt;
slapdash - we read a metadata block from disk and check certain aspects of it&lt;br /&gt;
(e.g. the magic number) to determine if it is the block we wanted. We have no&lt;br /&gt;
way of verifying that it is the correct block of metadata of the type&lt;br /&gt;
we were trying to read; just that it is one of that specific type. We&lt;br /&gt;
do bounds checking on critical fields, but this can&#039;t detect bit errors&lt;br /&gt;
in those fields. There are many fields we don&#039;t even bother to check because&lt;br /&gt;
the range of valid values is not limited.&lt;br /&gt;
&lt;br /&gt;
Effectively, this can be broken down into three separate areas:&lt;br /&gt;
&lt;br /&gt;
	- ensuring what we&#039;ve read is exactly what we wrote&lt;br /&gt;
	- ensuring what we&#039;ve read is the block we were supposed to read&lt;br /&gt;
	- robust contents checking&lt;br /&gt;
&lt;br /&gt;
Firstly, if we introduce a mechanism that we can use to ensure what we read is&lt;br /&gt;
something that the filesystem wrote, we can detect a whole range of exceptions&lt;br /&gt;
that are caused in layers below the filesystem (software and hardware). The&lt;br /&gt;
best method for this is to use a guard value that travels with the metadata it&lt;br /&gt;
is guarding. The guard value needs to be derived from the contents of the&lt;br /&gt;
block being guarded. Any event that changes the guard or the contents it is&lt;br /&gt;
guarding will immediately trigger an exception handling process when the&lt;br /&gt;
metadata is read in. Some examples of what this will detect are:&lt;br /&gt;
&lt;br /&gt;
	- bit errors in media/busses/memory after guard is calculated&lt;br /&gt;
	- uninitialised blocks being returned from lower layers (dmcrypt&lt;br /&gt;
	  had a readahead cancelling bug that could do this)&lt;br /&gt;
	- zeroed sectors as a result of double sector failures&lt;br /&gt;
	  in RAID5 systems&lt;br /&gt;
	- overwrite by data blocks&lt;br /&gt;
	- partial overwrites (e.g. due to power failure)&lt;br /&gt;
&lt;br /&gt;
The simplest method for doing this is introducing a checksum or CRC into each&lt;br /&gt;
block. We can calculate this for each different type of metadata being written&lt;br /&gt;
just before they are written to disk, hence we are able to provide a guard that&lt;br /&gt;
travels all the way to and from disk with the metadata itself. Given that&lt;br /&gt;
metadata blocks can be a maximum of 64k in size, we don&#039;t need a hugely complex&lt;br /&gt;
CRC or number of bits to protect blocks of this size. A 32 bit CRC will allow&lt;br /&gt;
us to reliably detect 15 bit errors on a 64k block, so this would catch almost&lt;br /&gt;
all types of bit error exceptions that occur. It will also detect almost all&lt;br /&gt;
other types of major content change that might occur due to an exception.&lt;br /&gt;
It has been noted that we should select the guard algorithm to be one that&lt;br /&gt;
has (or is targeted for) widespread hardware acceleration support.&lt;br /&gt;
&lt;br /&gt;
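To make the guard idea concrete, here is a minimal sketch (illustration only, not XFS code; the structure layout and the crc32c() helper are assumptions):&lt;br /&gt;
&lt;br /&gt;
    /* Sketch only: a 32 bit guard computed over the whole metadata block&lt;br /&gt;
     * just before write and re-checked on read.  crc32c() is an assumed&lt;br /&gt;
     * helper, chosen because hardware acceleration for it is common. */&lt;br /&gt;
    #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
    #include &amp;lt;stddef.h&amp;gt;&lt;br /&gt;
    uint32_t crc32c(uint32_t seed, const void *buf, size_t len); /* assumed helper */&lt;br /&gt;
    struct guarded_hdr {&lt;br /&gt;
            uint32_t magic;   /* metadata type */&lt;br /&gt;
            uint32_t crc;     /* guard over the whole block */&lt;br /&gt;
            /* type specific header fields follow */&lt;br /&gt;
    };&lt;br /&gt;
    void guard_for_write(void *block, size_t len)&lt;br /&gt;
    {&lt;br /&gt;
            struct guarded_hdr *hdr = block;&lt;br /&gt;
            hdr-&amp;gt;crc = 0;                      /* guard excluded from its own CRC */&lt;br /&gt;
            hdr-&amp;gt;crc = crc32c(0, block, len);&lt;br /&gt;
    }&lt;br /&gt;
    int verify_on_read(void *block, size_t len)&lt;br /&gt;
    {&lt;br /&gt;
            struct guarded_hdr *hdr = block;&lt;br /&gt;
            uint32_t want = hdr-&amp;gt;crc;&lt;br /&gt;
            hdr-&amp;gt;crc = 0;&lt;br /&gt;
            int ok = crc32c(0, block, len) == want;&lt;br /&gt;
            hdr-&amp;gt;crc = want;&lt;br /&gt;
            return ok;        /* 0 means start exception handling */&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;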
The other advantage this provides us with is a very fast method of determining&lt;br /&gt;
if a corrupted btree is a result of a lower layer problem or indeed an XFS&lt;br /&gt;
problem. That is, instead of always getting a WANT_CORRUPTED_GOTO btree&lt;br /&gt;
exception and shutdown, we&#039;ll get a &#039;bad CRC&#039; exception before we even start&lt;br /&gt;
processing the contents. This will save us much time when triaging corrupt&lt;br /&gt;
btrees - we won&#039;t spend time chasing problems that result from (potentially&lt;br /&gt;
silent or unhandled) lower layer exceptions.&lt;br /&gt;
&lt;br /&gt;
While a metadata block guard will protect us against content change, it won&#039;t&lt;br /&gt;
protect us against blocks that are written to the wrong location on disk. This,&lt;br /&gt;
unfortunately, happens more often than anyone would like and can be very&lt;br /&gt;
difficult to track down when it does occur. To protect against this problem,&lt;br /&gt;
metadata needs to be self-describing on disk. That is, if we read a block&lt;br /&gt;
on disk, there needs to be enough information in that block to determine&lt;br /&gt;
that it is the correct block for that location.&lt;br /&gt;
&lt;br /&gt;
Currently we have a very simplistic method of determining that we really have&lt;br /&gt;
read the correct block - the magic numbers in each metadata structure.  This&lt;br /&gt;
only enables us to identify type - we still need location and filesystem to&lt;br /&gt;
really determine if the block we&#039;ve read is the correct one. We need the&lt;br /&gt;
filesystem identifier because misdirected writes can cross filesystem&lt;br /&gt;
boundaries.  This is easily done by including the UUID of the filesystem in&lt;br /&gt;
every individually referenceable metadata structure on disk.&lt;br /&gt;
&lt;br /&gt;
For block based metadata structures such as btrees, AG headers, etc, we&lt;br /&gt;
can add the block number directly to the header structures hence enabling&lt;br /&gt;
easy checking. e.g. for btree blocks, we already have sibling pointers in the&lt;br /&gt;
header, so adding a long &#039;self&#039; pointer makes a great deal of sense.&lt;br /&gt;
For inodes, adding the inode number into the inode core will provide exactly&lt;br /&gt;
the same protection - we&#039;ll now know that the inode we are reading is the&lt;br /&gt;
one we are supposed to have read. We can make similar modifications to dquots&lt;br /&gt;
to make them self identifying as well.&lt;br /&gt;
&lt;br /&gt;
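As a rough sketch of the self-describing idea, assuming illustrative field names rather than the actual on-disk format:&lt;br /&gt;
&lt;br /&gt;
    /* Sketch only: extra identifying fields carried in every individually&lt;br /&gt;
     * referenceable metadata block; names and layout are assumptions. */&lt;br /&gt;
    #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
    #include &amp;lt;string.h&amp;gt;&lt;br /&gt;
    struct self_desc_hdr {&lt;br /&gt;
            uint32_t magic;      /* what type of block this is */&lt;br /&gt;
            uint32_t crc;        /* content guard, as above */&lt;br /&gt;
            uint64_t blkno;      /* where this block is supposed to live */&lt;br /&gt;
            uint8_t  uuid[16];   /* which filesystem it belongs to */&lt;br /&gt;
    };&lt;br /&gt;
    int block_matches(const struct self_desc_hdr *hdr, uint64_t expected_blkno,&lt;br /&gt;
                      const uint8_t *fs_uuid)&lt;br /&gt;
    {&lt;br /&gt;
            if (hdr-&amp;gt;blkno != expected_blkno)&lt;br /&gt;
                    return 0;    /* misdirected write or stale block */&lt;br /&gt;
            return memcmp(hdr-&amp;gt;uuid, fs_uuid, 16) == 0;&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;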
So now we are able to verify the metadata we read from disk is what we wrote&lt;br /&gt;
and it&#039;s the correct metadata block, the only thing that remains is more&lt;br /&gt;
robust checking of the content. In many cases we already do this in DEBUG&lt;br /&gt;
code but not in runtime code. For example, when we read an inode cluster&lt;br /&gt;
in we only check the first inode for a matching magic number, whereas in&lt;br /&gt;
debug code we check every inode in the cluster.&lt;br /&gt;
&lt;br /&gt;
In some cases, there is not much point in doing this sort of detailed checking;&lt;br /&gt;
it&#039;s pretty hard to check the validity of the contents of a btree block without&lt;br /&gt;
doing a full walk of the tree and that is prohibitive overhead for production&lt;br /&gt;
systems. The added block guards and self identifiers should be sufficient to&lt;br /&gt;
catch all non-filesystem based exceptions in this case, whilst the existing&lt;br /&gt;
exception detection should catch all others. With the btree factoring that&lt;br /&gt;
is being done for this work, all of the btrees should end up protected by&lt;br /&gt;
WANT_CORRUPTED_GOTO runtime exception checking.&lt;br /&gt;
&lt;br /&gt;
We also need to verify that metadata is sane before we use it. For example, if&lt;br /&gt;
we pull a block number out of a btree record in a block that has passed all&lt;br /&gt;
other validity checks, it still may be invalid due to corruption prior to writing&lt;br /&gt;
it to disk. In these cases we need to ensure the block number lands&lt;br /&gt;
within the filesystem and/or within the bounds of the specific AG.&lt;br /&gt;
&lt;br /&gt;
Similar checking is needed for pretty much any forward or backwards reference&lt;br /&gt;
we are going to follow or using in an algorithm somewhere. This will help&lt;br /&gt;
prevent kernel panics by out of bound references (e.g. using an unchecked ag&lt;br /&gt;
number to index the per-AG array) by turning them into a handled exception&lt;br /&gt;
(which will initially be a shutdown). That is, we will turn a total system&lt;br /&gt;
failure into a (potentially recoverable) filesystem failure.&lt;br /&gt;
&lt;br /&gt;
Another failure that we often have reported is that XFS has &#039;hung&#039; and&lt;br /&gt;
triage indicates that the filesystem appears to be waiting for a metadata&lt;br /&gt;
I/O completion to occur. We have seen in the past I/O errors not being&lt;br /&gt;
propagated from the lower layers back into the filesystem, causing these&lt;br /&gt;
sorts of problems. We have also seen cases where there have been silent&lt;br /&gt;
I/O errors and the first sign of trouble is &#039;XFS has hung&#039;.&lt;br /&gt;
&lt;br /&gt;
To catch situations like this, we need to track all I/O we have in flight and&lt;br /&gt;
have some method of timing them out.  That is, if we haven&#039;t completed the I/O&lt;br /&gt;
in N seconds, issue a warning and enter an exception handling process that&lt;br /&gt;
attempts to deal with the problem.&lt;br /&gt;
&lt;br /&gt;
My initial thought is that this could be implemented via the MRU cache&lt;br /&gt;
without much extra code being needed.  The complexity with this is that we&lt;br /&gt;
can&#039;t catch data read I/O because we use the generic I/O path for read. We do&lt;br /&gt;
our own data write and metadata read/write, so we can easily add hooks to track&lt;br /&gt;
all these types of I/O. Hence we will initially target just metadata I/O as&lt;br /&gt;
this would only need to hook into the xfs_buf I/O submission layer.&lt;br /&gt;
&lt;br /&gt;
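A minimal sketch of the tracking idea, assuming a simple list of in-flight I/Os with their submit times (the names and the 30 second threshold are purely illustrative):&lt;br /&gt;
&lt;br /&gt;
    /* Sketch only: record the submit time of each metadata I/O and warn&lt;br /&gt;
     * about ones that have not completed within a threshold. */&lt;br /&gt;
    #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
    #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
    #include &amp;lt;time.h&amp;gt;&lt;br /&gt;
    struct inflight_io {&lt;br /&gt;
            struct inflight_io *next;&lt;br /&gt;
            uint64_t blkno;&lt;br /&gt;
            time_t submitted;&lt;br /&gt;
    };&lt;br /&gt;
    void check_inflight(const struct inflight_io *list)&lt;br /&gt;
    {&lt;br /&gt;
            time_t now = time(NULL);&lt;br /&gt;
            const struct inflight_io *io;&lt;br /&gt;
            for (io = list; io != NULL; io = io-&amp;gt;next) {&lt;br /&gt;
                    if (now - io-&amp;gt;submitted &amp;gt; 30)&lt;br /&gt;
                            printf(&amp;quot;I/O to block %llu stuck for %ld seconds\n&amp;quot;,&lt;br /&gt;
                                   (unsigned long long)io-&amp;gt;blkno,&lt;br /&gt;
                                   (long)(now - io-&amp;gt;submitted));&lt;br /&gt;
            }&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;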
To further improve exception detection, once guards and self-describing&lt;br /&gt;
structures are on disk, we can add filesystem scrubbing daemons that can verify&lt;br /&gt;
the structure of the filesystem pro-actively. That is, we can use background&lt;br /&gt;
processes to discover degradation in the filesystem before it is found by a&lt;br /&gt;
user-initiated operation. This gives us the ability to do exception handling in&lt;br /&gt;
a context that enables further checking and potential repair of the exception.&lt;br /&gt;
This sort of exception handling may not be possible if we are in a&lt;br /&gt;
user-initiated I/O context, and certainly not if we are in a transaction&lt;br /&gt;
context.&lt;br /&gt;
&lt;br /&gt;
This will also allow us to detect errors in rarely referenced parts of&lt;br /&gt;
the filesystem, thereby giving us advance warning of degradation in filesystems&lt;br /&gt;
that we might not otherwise get (e.g. in systems without media scrubbing).&lt;br /&gt;
Ideally, data scrubbing would need to be done as well, but without data guards&lt;br /&gt;
it is rather hard to detect that there&#039;s been a change in the data....&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exception Handling ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Once we can detect exceptions, we need to handle them in a sane manner.&lt;br /&gt;
The method of exception handling is two-fold:&lt;br /&gt;
&lt;br /&gt;
	- retry (write) or cancel (read) asynchronous I/O&lt;br /&gt;
	- shut down the filesystem (fatal).&lt;br /&gt;
&lt;br /&gt;
Effectively, we either defer non-critical failures to a later point in&lt;br /&gt;
time or we come to a complete halt and prevent the filesystem from being&lt;br /&gt;
accessed further. We have no other methods of handling exceptions.&lt;br /&gt;
&lt;br /&gt;
If we look at the different types of exceptions we can have, they&lt;br /&gt;
broadly fall into:&lt;br /&gt;
&lt;br /&gt;
	- media read errors&lt;br /&gt;
	- media write errors&lt;br /&gt;
	- successful media read, corrupted contents&lt;br /&gt;
&lt;br /&gt;
The context in which the errors occur also influences the exception processing&lt;br /&gt;
that is required. For example, an unrecoverable metadata read error within a&lt;br /&gt;
dirty transaction is a fatal error, whilst the same error during a read-only&lt;br /&gt;
operation will simply log the error to syslog and return an error to userspace.&lt;br /&gt;
&lt;br /&gt;
Furthermore, the storage subsystem plays a part in deciding how to handle&lt;br /&gt;
errors. The reason is that in many storage configurations I/O errors can be&lt;br /&gt;
transient. For example, in a SAN a broken fibre can cause a failover to a&lt;br /&gt;
redundant path; however, the inflight I/O on the failed path is usually timed out and&lt;br /&gt;
an error returned. We don&#039;t want to shut down the filesystem on such an error -&lt;br /&gt;
we want to wait for failover to a redundant path and then retry the I/O. If the&lt;br /&gt;
failover succeeds, then the I/O will succeed. Hence any robust method of&lt;br /&gt;
exception handling needs to consider that I/O exceptions may be transient.&lt;br /&gt;
&lt;br /&gt;
In the absence of redundant metadata, there is little we can do right now&lt;br /&gt;
on a permanent media read error. There are a number of approaches we&lt;br /&gt;
can take for handling the exception:&lt;br /&gt;
&lt;br /&gt;
	- try reading the block again. Normally we don&#039;t get an error&lt;br /&gt;
	  returned until the device has given up on trying to recover it.&lt;br /&gt;
	  If it&#039;s a transient failure, then we should eventually get a&lt;br /&gt;
	  good block back. If a retry fails, then:&lt;br /&gt;
&lt;br /&gt;
	- inform the lower layer that it needs to perform recovery on that&lt;br /&gt;
	  block before trying to read it again. For path failover situations,&lt;br /&gt;
	  this should block until a redundant path is brought online. If no&lt;br /&gt;
	  redundant path exists or recovery from parity/error coding blocks&lt;br /&gt;
	  fails, then we cannot recover the block and we have a fatal error&lt;br /&gt;
	  situation.&lt;br /&gt;
&lt;br /&gt;
Ultimately, however, we reach a point where we have to give up - the metadata&lt;br /&gt;
no longer exists on disk and we have to enter a repair process to fix the&lt;br /&gt;
problem. That is, shut down the filesystem and get a human to intervene&lt;br /&gt;
and fix the problem.&lt;br /&gt;
&lt;br /&gt;
At this point, the only way we can prevent a shutdown situation from occurring&lt;br /&gt;
is to have redundant metadata on disk. That is, whenever we get an error&lt;br /&gt;
reported, we can immediately retry by reading from an alternate metadata block.&lt;br /&gt;
If we can read from the alternate block, we can continue onwards without&lt;br /&gt;
the user even knowing there is a problem in the filesystem. Of course, we&#039;d&lt;br /&gt;
need to log the event for the administrator to take action on at some point&lt;br /&gt;
in the future.&lt;br /&gt;
&lt;br /&gt;
Even better, we can mostly avoid this intervention if we have alternate&lt;br /&gt;
metadata blocks. That is, we can repair blocks that are returning read errors&lt;br /&gt;
during the exception processing. In the case of media errors, they can&lt;br /&gt;
generally be corrected simply by re-writing the block that was returning the&lt;br /&gt;
error. This will force drives to remap the bad blocks internally so the next&lt;br /&gt;
read from that location will return valid data. This, if my understanding is&lt;br /&gt;
correct, is the same process that ZFS and BTRFS use to recover from and correct&lt;br /&gt;
such errors.&lt;br /&gt;
&lt;br /&gt;
NOTE: Adding redundant metadata can be done in several different ways. I&#039;m not&lt;br /&gt;
going to address that here as it is a topic all to itself. The focus of this&lt;br /&gt;
document is to outline how the redundant metadata could be used to enhance&lt;br /&gt;
exception processing and prevent a large number of cases where we currently&lt;br /&gt;
shut down the filesystem.&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
	Transient write error&lt;br /&gt;
	Permanent write error&lt;br /&gt;
	Corrupted data on read&lt;br /&gt;
	Corrupted data on write (detected during guard calculation)&lt;br /&gt;
	I/O timeouts&lt;br /&gt;
	Memory corruption&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reverse Mapping ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is worth noting that even redundant metadata doesn&#039;t solve all our&lt;br /&gt;
problems. Realistically, all that redundant metadata gives us is the ability&lt;br /&gt;
to recover from top-down traversal exceptions. It does not help exception&lt;br /&gt;
handling of occurrences such as double sector failures (i.e. loss of redundancy&lt;br /&gt;
and a metadata block). Double sector failures are the most common cause&lt;br /&gt;
of RAID5 data loss - loss of a disk followed by a sector read error during&lt;br /&gt;
rebuild on one of the remaining disks.&lt;br /&gt;
&lt;br /&gt;
In this case, we&#039;ve got a block on disk that is corrupt. We know what block it&lt;br /&gt;
is, but we have no idea who the owner of the block is. If it is a metadata&lt;br /&gt;
block, then we can recover it if we have redundant metadata.  Even if this is&lt;br /&gt;
user data, we still want to be able to tell them what file got corrupted by the&lt;br /&gt;
failure event.  However, without doing a top-down traverse of the filesystem we&lt;br /&gt;
cannot find the owner of the block that was corrupted.&lt;br /&gt;
&lt;br /&gt;
This is where we need a reverse block map. Every time we do an allocation of&lt;br /&gt;
an extent we know who the owner of the block is. If we record this information&lt;br /&gt;
in a separate tree then we can do a simple lookup to find the owner of any&lt;br /&gt;
block and start an exception handling process to repair the damage. Ideally&lt;br /&gt;
we also need to include information about the type of block as well. For&lt;br /&gt;
example, an inode can own:&lt;br /&gt;
&lt;br /&gt;
	- data blocks&lt;br /&gt;
	- data fork BMBT blocks&lt;br /&gt;
	- attribute blocks&lt;br /&gt;
	- attribute fork BMBT blocks&lt;br /&gt;
&lt;br /&gt;
So keeping track of owner + type would help indicate what sort of exception&lt;br /&gt;
handling needs to take place. For example, a missing data fork BMBT block means there&lt;br /&gt;
will be unreferenced extents across the filesystem. These &#039;lost extents&#039;&lt;br /&gt;
could be recovered by reverse map traversal to find all the BMBT and data&lt;br /&gt;
blocks owned by that inode and finding the ones that are not referenced.&lt;br /&gt;
If the reverse map held sufficient extra metadata - such as the offset within the&lt;br /&gt;
file for the extent - the exception handling process could rebuild the BMBT&lt;br /&gt;
tree completely without needing any external help.&lt;br /&gt;
&lt;br /&gt;
It would seem to me that the reverse map needs to be a long-pointer format&lt;br /&gt;
btree and held per-AG. It needs long pointers because the owner of an extent&lt;br /&gt;
can be anywhere in the filesystem, and it needs to be per-AG to avoid adverse&lt;br /&gt;
effect on allocation parallelism.&lt;br /&gt;
&lt;br /&gt;
The format of the reverse map record will be dependent on the amount of&lt;br /&gt;
metadata we need to store. We need (a possible layout is sketched after this list):&lt;br /&gt;
&lt;br /&gt;
	- owner (64 bit, primary record)&lt;br /&gt;
	- {block, len} extent descriptor&lt;br /&gt;
	- type&lt;br /&gt;
	- per-type specific metadata (e.g. offset for data types).&lt;br /&gt;
&lt;br /&gt;
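One possible record layout, assuming the 32 bytes per record used in the sizing estimate below (the field names and layout are illustrative only, not an on-disk format definition):&lt;br /&gt;
&lt;br /&gt;
    /* Sketch only: a 32 byte reverse map record. */&lt;br /&gt;
    #include &amp;lt;stdint.h&amp;gt;&lt;br /&gt;
    struct rmap_record {&lt;br /&gt;
            uint64_t owner;        /* inode number or metadata owner */&lt;br /&gt;
            uint64_t startblock;   /* first block of the extent */&lt;br /&gt;
            uint32_t blockcount;   /* length of the extent */&lt;br /&gt;
            uint32_t type;         /* data, BMBT, attr fork, ... */&lt;br /&gt;
            uint64_t offset;       /* type specific, e.g. file offset */&lt;br /&gt;
    };                             /* 8 + 8 + 4 + 4 + 8 = 32 bytes */&lt;br /&gt;
&lt;br /&gt;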
Looking at the worst case here, say we have 32 bytes per record, the worst case&lt;br /&gt;
space usage of the reverse map btree would be roughly 62 records per 4k&lt;br /&gt;
block. With a 1TB allocation group, we have 2^28 4k blocks in the AG&lt;br /&gt;
that could require unique reverse mappings. That gives us roughly 2^22&lt;br /&gt;
4k blocks for the reverse map, or 2^34 bytes - roughly 16GB per 1TB&lt;br /&gt;
of space.&lt;br /&gt;
&lt;br /&gt;
It may be a good idea to allocate this space at mkfs time (tagged as unwritten&lt;br /&gt;
so it doesn&#039;t need zeroing) to avoid allocation overhead and potential free&lt;br /&gt;
space fragmentation as the reverse map index grows and shrinks. If we do&lt;br /&gt;
this we could even treat this as an array/skip list where a given block in the&lt;br /&gt;
AG has a fixed location in the map. This will require more study to determine&lt;br /&gt;
the advantages and disadvantages of such approaches.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Recovering From Errors During Transactions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One of the big problems we face with exception recovery is what to do&lt;br /&gt;
when we take an exception inside a dirty transaction. At present, any&lt;br /&gt;
error is treated as a fatal error, the transaction is cancelled and&lt;br /&gt;
the filesystem is shut down. Even though we may have a context which&lt;br /&gt;
can return an error, we are unable to revert the changes we have&lt;br /&gt;
made during the transaction and so cannot back out.&lt;br /&gt;
&lt;br /&gt;
Effectively, a cancelled dirty transaction looks exactly like in-memory&lt;br /&gt;
structure corruption. That is, what is in memory is different to that&lt;br /&gt;
on disk, in the log or in asynchronous transactions yet to be written&lt;br /&gt;
to the log. Hence we cannot simply return an error and continue.&lt;br /&gt;
&lt;br /&gt;
To be able to do this, we need to be able to undo changes made in a given&lt;br /&gt;
transaction. The method XFS uses for journalling - write-ahead logging -&lt;br /&gt;
makes this difficult to do. A transaction proceeds in the following&lt;br /&gt;
order:&lt;br /&gt;
&lt;br /&gt;
	- allocate transaction&lt;br /&gt;
	- reserve space in the journal for transaction&lt;br /&gt;
	- repeat until change is complete:&lt;br /&gt;
		- lock item&lt;br /&gt;
		- join item to transaction&lt;br /&gt;
		- modify item&lt;br /&gt;
		- record region of change to item&lt;br /&gt;
	- transaction commit&lt;br /&gt;
&lt;br /&gt;
Effectively, we modify structures in memory then record where we&lt;br /&gt;
changed them for the transaction commit to write to disk. Unfortunately,&lt;br /&gt;
this means we overwrite the original state of the items in memory,&lt;br /&gt;
leaving us with no way to back out those changes from memory if&lt;br /&gt;
something goes wrong.&lt;br /&gt;
&lt;br /&gt;
However, based on the observation that we are supposed to join an item to the&lt;br /&gt;
transaction *before* we start modifying it, it is possible to record the state&lt;br /&gt;
of the item before we start changing it. That is, we can add a hook that&lt;br /&gt;
allows us to take a copy of the unmodified item when we join it to the&lt;br /&gt;
transaction.&lt;br /&gt;
&lt;br /&gt;
If we have an unmodified copy of the item in memory, then if the transaction&lt;br /&gt;
is cancelled when dirty, we have the information necessary to undo, or roll&lt;br /&gt;
back, the changes made in the transaction. This would allow us to return&lt;br /&gt;
the in-memory state to that prior to the transaction starting, thereby&lt;br /&gt;
ensuring that the in-memory state matches the rest of the filesystem and&lt;br /&gt;
allowing us to return an error to the calling context.&lt;br /&gt;
&lt;br /&gt;
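A minimal sketch of the copy-on-join and rollback idea under these assumptions (the names are illustrative, not existing XFS interfaces):&lt;br /&gt;
&lt;br /&gt;
    /* Sketch only: stash a pristine copy of an item when it joins a&lt;br /&gt;
     * transaction so a dirty cancel can copy it back. */&lt;br /&gt;
    #include &amp;lt;stdlib.h&amp;gt;&lt;br /&gt;
    #include &amp;lt;string.h&amp;gt;&lt;br /&gt;
    struct undo_copy {&lt;br /&gt;
            void   *item;     /* the live in-memory item */&lt;br /&gt;
            void   *saved;    /* copy taken at join time */&lt;br /&gt;
            size_t  length;&lt;br /&gt;
    };&lt;br /&gt;
    int save_on_join(struct undo_copy *undo, void *item, size_t len)&lt;br /&gt;
    {&lt;br /&gt;
            undo-&amp;gt;saved = malloc(len);&lt;br /&gt;
            if (undo-&amp;gt;saved == NULL)&lt;br /&gt;
                    return -1;     /* could now fail the transaction with ENOMEM */&lt;br /&gt;
            memcpy(undo-&amp;gt;saved, item, len);&lt;br /&gt;
            undo-&amp;gt;item = item;&lt;br /&gt;
            undo-&amp;gt;length = len;&lt;br /&gt;
            return 0;&lt;br /&gt;
    }&lt;br /&gt;
    void rollback_on_cancel(struct undo_copy *undo)&lt;br /&gt;
    {&lt;br /&gt;
            memcpy(undo-&amp;gt;item, undo-&amp;gt;saved, undo-&amp;gt;length);&lt;br /&gt;
            free(undo-&amp;gt;saved);&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;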
This is not without overhead. We would have to copy every metadata item&lt;br /&gt;
entirely in every transaction. This will increase the CPU overhead&lt;br /&gt;
of each transaction as well as the memory required. It is the memory&lt;br /&gt;
requirement more than the CPU overhead that concerns me - we may need&lt;br /&gt;
to ensure we have a memory pool associated with transaction reservation&lt;br /&gt;
that guarantees us enough memory is available to complete the transaction.&lt;br /&gt;
However, given that we could roll back transactions, we could now *fail&lt;br /&gt;
transactions* with ENOMEM and not have to shut down the filesystem, so this&lt;br /&gt;
may be an acceptable trade-off.&lt;br /&gt;
&lt;br /&gt;
In terms of implementation, it is worth noting that there is debug code in&lt;br /&gt;
the xfs_buf_log_item for checking that all the modified regions of a buffer&lt;br /&gt;
were logged. Importantly, this is implemented by copying the original buffer&lt;br /&gt;
in the item initialisation when it is first attached to a transaction. In&lt;br /&gt;
other words, this debug code implements the mechanism we need to be able&lt;br /&gt;
to rollback changes made in a transaction. Other item types would require&lt;br /&gt;
similar changes to be made.&lt;br /&gt;
&lt;br /&gt;
Overall, this doesn&#039;t look like a particularly complex change to make; the&lt;br /&gt;
only real question is how much overhead is it going to introduce. With CPUs&lt;br /&gt;
growing more cores all the time, and XFS being aimed at extremely&lt;br /&gt;
multi-threaded workloads, this overhead may not be a concern for long.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Failure Domains ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If we plan to have redundant metadata, or even try to provide fault isolation&lt;br /&gt;
between different parts of the filesystem namespace, we need to know about&lt;br /&gt;
independent regions of the filesystem. &#039;Independent Regions&#039; (IR) are ranges&lt;br /&gt;
of the filesystem block address space that don&#039;t share resources with&lt;br /&gt;
any other range.&lt;br /&gt;
&lt;br /&gt;
A classic example of a filesystem made up of multiple IRs is a linear&lt;br /&gt;
concatenation of multiple drives into a larger address space.  The address&lt;br /&gt;
space associated with each drive can operate independently from the other&lt;br /&gt;
drives, and a failure of one drive will not affect the operation of the address&lt;br /&gt;
spaces associated with other drives in the linear concatenation.&lt;br /&gt;
&lt;br /&gt;
A Failure Domain (FD) is made up of one or more IRs. IRs cannot be shared&lt;br /&gt;
between FDs - IRs are not independent if they are shared! Effectively, an&lt;br /&gt;
FD is an encoding of the address space within the filesystem that lower level&lt;br /&gt;
failures (from below the filesystem) will not propagate outside. The geometry&lt;br /&gt;
and redundancy in the underlying storage will determine the nature of the&lt;br /&gt;
IRs available to the filesystem.&lt;br /&gt;
&lt;br /&gt;
To use redundant metadata effectively for recovering from fatal lower layer&lt;br /&gt;
loss or corruption, we really need to be able to place said redundant&lt;br /&gt;
metadata in different FDs. That way a loss in one domain can be recovered&lt;br /&gt;
from a domain that is still intact. It also means that it is extremely&lt;br /&gt;
difficult to lose or corrupt all copies of a given piece of metadata;&lt;br /&gt;
that would require multiple independent faults to occur in a localised&lt;br /&gt;
temporal window. Concurrent multiple component failure in multiple&lt;br /&gt;
IRs is considered to be quite unlikely - if such an event were to&lt;br /&gt;
occur, it is likely that there is more to worry about than filesystem&lt;br /&gt;
consistency (like putting out the fire in the data center).&lt;br /&gt;
&lt;br /&gt;
Another use of FDs is to try to minimise the number of domain boundaries&lt;br /&gt;
each object in the filesystem crosses. If an object is wholly contained&lt;br /&gt;
within a FD, and that object is corrupted, then the repair problem is&lt;br /&gt;
isolated to that FD, not the entire filesystem. That is, by making&lt;br /&gt;
allocation strategies and placement decisions aware of failure domain&lt;br /&gt;
boundaries we can constrain the location of related data and metadata.&lt;br /&gt;
Once locality is constrained, the scope of repairing an object if&lt;br /&gt;
it becomes corrupted is reduced to that of ensuring the FD is consistent.&lt;br /&gt;
&lt;br /&gt;
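To make this concrete, an IR can be described as nothing more than a range of&lt;br /&gt;
the filesystem block address space, and an FD as a set of such ranges; a&lt;br /&gt;
hypothetical allocation policy could then check that a proposed extent stays&lt;br /&gt;
inside a single domain. The structures and names below are illustrative only:&lt;br /&gt;
&lt;br /&gt;
	#include &lt;stdbool.h&gt;&lt;br /&gt;
	#include &lt;stdint.h&gt;&lt;br /&gt;
	&lt;br /&gt;
	/* an IR: a contiguous range of the filesystem block address space */&lt;br /&gt;
	struct ind_region {&lt;br /&gt;
		uint64_t	start_fsb;&lt;br /&gt;
		uint64_t	len_fsb;&lt;br /&gt;
	};&lt;br /&gt;
	&lt;br /&gt;
	/* an FD: one or more IRs that are not shared with any other FD */&lt;br /&gt;
	struct failure_domain {&lt;br /&gt;
		int			nr_regions;&lt;br /&gt;
		struct ind_region	*regions;&lt;br /&gt;
	};&lt;br /&gt;
	&lt;br /&gt;
	/* does the extent [fsb, fsb + len) lie wholly inside this FD? */&lt;br /&gt;
	static bool fd_contains_extent(const struct failure_domain *fd,&lt;br /&gt;
				       uint64_t fsb, uint64_t len)&lt;br /&gt;
	{&lt;br /&gt;
		for (int i = 0; i &lt; fd-&gt;nr_regions; i++) {&lt;br /&gt;
			const struct ind_region *ir = &amp;fd-&gt;regions[i];&lt;br /&gt;
			if (fsb &gt;= ir-&gt;start_fsb &amp;&amp;&lt;br /&gt;
			    fsb + len &lt;= ir-&gt;start_fsb + ir-&gt;len_fsb)&lt;br /&gt;
				return true;&lt;br /&gt;
		}&lt;br /&gt;
		return false;&lt;br /&gt;
	}&lt;br /&gt;
&lt;br /&gt;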
There are many ways of limiting cross-domain dependencies; I will&lt;br /&gt;
not try to detail them here. Likewise, there are many ways of introducing&lt;br /&gt;
such information into XFS - mkfs, dynamically via allocation policies,&lt;br /&gt;
etc - so I won&#039;t try to detail them, either. The main point to be&lt;br /&gt;
made is that to make full use of redundant metadata and to reduce&lt;br /&gt;
the scope of common repair problems we need to pay attention to&lt;br /&gt;
how the system can fail, so that we can recover from failures&lt;br /&gt;
as quickly as possible.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=1876</id>
		<title>Improving Metadata Performance By Reducing Journal Overhead</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=1876"/>
		<updated>2008-10-08T03:03:55Z</updated>

		<summary type="html">&lt;p&gt;Cattelan: New page: == Improving Metadata Performance By Reducing Journal Overhead ==     XFS currently uses asynchronous write-ahead logging to ensure that changes to the filesystem structure are preserved o...&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Improving Metadata Performance By Reducing Journal Overhead ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
XFS currently uses asynchronous write-ahead logging to ensure that changes to&lt;br /&gt;
the filesystem structure are preserved on crash.  It does this by logging&lt;br /&gt;
detailed records of the changes being made to each object on disk during a&lt;br /&gt;
transaction. Every byte that is modified needs to be recorded in the journal.&lt;br /&gt;
&lt;br /&gt;
There are two issues with this approach. The first is that transactions can&lt;br /&gt;
modify a *lot* of metadata to complete a single operation. Worse is the fact&lt;br /&gt;
that the average size of a transaction grows as structures get larger and&lt;br /&gt;
deeper, so performance on larger, fuller filesystems drops off as log bandwidth&lt;br /&gt;
is consumed by fewer, larger transactions.&lt;br /&gt;
&lt;br /&gt;
The second is that we re-log previous changes that are active in the journal&lt;br /&gt;
if the object is modified again. Hence if an object is modified repeatedly, the&lt;br /&gt;
dirty parts of the object get rewritten over and over again. In the worst case,&lt;br /&gt;
frequently logged buffers will be entirely dirty and so even if we only change&lt;br /&gt;
a single byte in the buffer we&#039;ll log the entire buffer.&lt;br /&gt;
&lt;br /&gt;
An example of how needless this can be is removing all the files in a&lt;br /&gt;
directory, which results in the directory blocks being logged over and over&lt;br /&gt;
again before finally being freed and made stale in the log. If we are freeing&lt;br /&gt;
the entire contents of the directory, the only transactions we really need in&lt;br /&gt;
the journal w.r.t. directory buffers is the &#039;remove, stale and free&#039;&lt;br /&gt;
transaction; all other changes are irrelevant because we don&#039;t care about&lt;br /&gt;
changes to free space. Depending on the directory block size, we might log each&lt;br /&gt;
directory buffer tens to hundreds of times before making it stale...&lt;br /&gt;
&lt;br /&gt;
Clearly we have two different axes to approach this problem along:&lt;br /&gt;
&lt;br /&gt;
	- reduce the amount we log in a given transaction&lt;br /&gt;
	- reduce the number of times we re-log objects.&lt;br /&gt;
&lt;br /&gt;
Both of these things give the same end result - we require less bandwidth to&lt;br /&gt;
the journal to log changes that are happening in the filesystem. Let&#039;s start&lt;br /&gt;
by looking at how to reduce re-logging of objects.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Asynchronous Transaction Aggregation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The first observation that we need to make is that we are already doing&lt;br /&gt;
asynchronous journalling for everything other than explicitly synchronous&lt;br /&gt;
transactions. This means we are aggregating completed transactions in memory&lt;br /&gt;
before writing them to disk. This reduces the number of disk I/Os needed to&lt;br /&gt;
write the log, but it does nothing to help prevent relogging of items.&lt;br /&gt;
&lt;br /&gt;
The second observation is that we store asynchronous committed transactions&lt;br /&gt;
in two forms while they are being written to disk:&lt;br /&gt;
&lt;br /&gt;
	- the physical form in the log buffer that will be written&lt;br /&gt;
	- the logical form attached to the log buffer so that on I/O completion&lt;br /&gt;
	  of the log buffer the items in the transaction can be unpinned and&lt;br /&gt;
	  moved to or updated in the AIL for later writeback.&lt;br /&gt;
&lt;br /&gt;
The fact that we store the logical form of the transaction until after the&lt;br /&gt;
log buffer is written to the journal is important - it means the transaction&lt;br /&gt;
and all its dirty items live longer than the process that creates and commits&lt;br /&gt;
the transaction. This allows us to redefine what &#039;transaction commit&#039; actually&lt;br /&gt;
means.&lt;br /&gt;
&lt;br /&gt;
A transaction commit currently takes the following steps:&lt;br /&gt;
&lt;br /&gt;
	- apply superblock and dquot changes to in-core structures&lt;br /&gt;
	- build a vector array pointing to all the dirty regions in all the items in&lt;br /&gt;
	  the transaction.&lt;br /&gt;
	- write the vector array into the log buffer (may trigger log I/O)&lt;br /&gt;
	- release unused transaction reservations to in-core structures&lt;br /&gt;
	- attach transaction to log buffer callbacks&lt;br /&gt;
	- write a commit record into the log buffer for the transaction&lt;br /&gt;
	- unlock all the items locked in the transaction&lt;br /&gt;
	- release the log buffer (may trigger log I/O)&lt;br /&gt;
	- if synchronous transaction, issue a synchronous log force to&lt;br /&gt;
	  get the transaction on disk.&lt;br /&gt;
&lt;br /&gt;
Based on the observation that the transaction structure exists until it is&lt;br /&gt;
freed during log buffer I/O completion, we really don&#039;t have to format the&lt;br /&gt;
transaction into a log buffer during the transaction commit - we could&lt;br /&gt;
simply queue it into a list for later processing. Synchronous&lt;br /&gt;
transactions don&#039;t change this - they just queue the transaction then&lt;br /&gt;
do a log force to cause the transaction queue to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
Now that we have an asynchronous transaction queue in logical format, we can&lt;br /&gt;
take our time deciding when and how best to write it to disk. If we have&lt;br /&gt;
the situation where we are relogging items, we will have a given item&lt;br /&gt;
in multiple transactions. If we write out each transaction as an individual&lt;br /&gt;
commit like we currently do, we&#039;d then have the problem of changes from later&lt;br /&gt;
transactions being written as part of the first transaction that we write,&lt;br /&gt;
because the in-memory item already carries those later modifications. This&lt;br /&gt;
will cause problems for recovery.&lt;br /&gt;
&lt;br /&gt;
Hence what we really want to do is aggregate all those transactions into a&lt;br /&gt;
single large journal commit. This makes the journalling model more of a&lt;br /&gt;
&#039;checkpoint of changes&#039; than a &#039;transactional change&#039; model. By committing&lt;br /&gt;
a set of transactions rather than just a single transaction per commit&lt;br /&gt;
record, we can avoid needing to commit items several times to the log&lt;br /&gt;
if they are modified in multiple transactions. During recovery, we only&lt;br /&gt;
recover the entire commit so we only need a single version of each item&lt;br /&gt;
that encompasses all the changes in the commit period.&lt;br /&gt;
&lt;br /&gt;
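A sketch of what this could look like - purely illustrative, not the existing&lt;br /&gt;
log code, and ignoring locking entirely - is a commit path that just appends&lt;br /&gt;
the logical transaction to a queue, and a checkpoint writer that later walks&lt;br /&gt;
the queue keeping only the most recent version of each item before formatting&lt;br /&gt;
a single commit record:&lt;br /&gt;
&lt;br /&gt;
	#include &lt;stddef.h&gt;&lt;br /&gt;
	&lt;br /&gt;
	#define MAX_ITEM_ID	1024&lt;br /&gt;
	&lt;br /&gt;
	/* the logical form of a change to one item */&lt;br /&gt;
	struct log_item { int id; int version; };&lt;br /&gt;
	&lt;br /&gt;
	/* a committed transaction: a handful of logical item changes */&lt;br /&gt;
	struct txn { int nr; struct log_item items[8]; struct txn *next; };&lt;br /&gt;
	&lt;br /&gt;
	static struct txn *queue_head, *queue_tail;	/* async transaction queue */&lt;br /&gt;
	&lt;br /&gt;
	/* commit: no formatting into log buffers, just queue for later */&lt;br /&gt;
	static void txn_commit_async(struct txn *tp)&lt;br /&gt;
	{&lt;br /&gt;
		tp-&gt;next = NULL;&lt;br /&gt;
		if (queue_tail)&lt;br /&gt;
			queue_tail-&gt;next = tp;&lt;br /&gt;
		else&lt;br /&gt;
			queue_head = tp;&lt;br /&gt;
		queue_tail = tp;&lt;br /&gt;
	}&lt;br /&gt;
	&lt;br /&gt;
	/* checkpoint: aggregate everything queued, keeping only the latest&lt;br /&gt;
	 * version of each item, then write one commit record for the set */&lt;br /&gt;
	static void checkpoint_write(void)&lt;br /&gt;
	{&lt;br /&gt;
		struct log_item latest[MAX_ITEM_ID] = { 0 };&lt;br /&gt;
		&lt;br /&gt;
		for (struct txn *tp = queue_head; tp; tp = tp-&gt;next)&lt;br /&gt;
			for (int i = 0; i &lt; tp-&gt;nr; i++)&lt;br /&gt;
				if (tp-&gt;items[i].version &gt;= latest[tp-&gt;items[i].id].version)&lt;br /&gt;
					latest[tp-&gt;items[i].id] = tp-&gt;items[i];&lt;br /&gt;
		&lt;br /&gt;
		/* format and write the contents of latest[] as one commit record */&lt;br /&gt;
		queue_head = queue_tail = NULL;&lt;br /&gt;
	}&lt;br /&gt;
&lt;br /&gt;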
As an aside, if we have large numbers of items per commit record now,&lt;br /&gt;
it makes sense to start optimising the recovery read-ahead by sorting&lt;br /&gt;
all the items in the commit record before issuing readahead on them.&lt;br /&gt;
This will reduce the seeking the readahead triggers somewhat, so should&lt;br /&gt;
lead to faster recovery times.&lt;br /&gt;
&lt;br /&gt;
The down sides to this approach are:&lt;br /&gt;
&lt;br /&gt;
	- holds items pinned in memory for longer, thereby increasing&lt;br /&gt;
	  the chances of triggering a log force to unpin them.&lt;br /&gt;
	- partial log forces (i.e. those to a specific LSN) are no longer&lt;br /&gt;
	  really possible as we do not have multiple independent queues&lt;br /&gt;
	  (iclogbufs) holding the transactions.&lt;br /&gt;
	- log forces become more expensive by having to drain the entire&lt;br /&gt;
	  async transaction queue.&lt;br /&gt;
	- synchronous transactions become more expensive by having to&lt;br /&gt;
	  drain the entire async transaction queue.&lt;br /&gt;
	- possible &#039;interesting&#039; interactions with tail-pushing if we&lt;br /&gt;
	  allow too many async transactions to be queued without flushing&lt;br /&gt;
	  them.&lt;br /&gt;
&lt;br /&gt;
The main concern with this approach is ensuring that we don&#039;t adversely affect&lt;br /&gt;
fsync() performance. For example, ext3 uses a checkpoint based journalling&lt;br /&gt;
system that has a very long checkpoint period (5 seconds).  As a result, a&lt;br /&gt;
synchronous operation such as an fsync() can be forced to flush up to 5 seconds&lt;br /&gt;
worth of transactions to disk. In ordered mode, this also involves flushing&lt;br /&gt;
data, so the fsync() latency can be measured in tens of seconds on a busy&lt;br /&gt;
filesystem. This is known as the &#039;sync the world&#039; problem, and currently XFS&lt;br /&gt;
does not suffer from this at all. &lt;br /&gt;
&lt;br /&gt;
[Data point: Recent testing of this phenomenon by Chris Mason showed XFS took&lt;br /&gt;
less than one second to fsync a 4k write in the presence of a background&lt;br /&gt;
streaming write; BTRFS took two seconds and ext3 took five seconds. ]&lt;br /&gt;
&lt;br /&gt;
To avoid this type of latency, we should not be allowing too many transactions&lt;br /&gt;
to accumulate in the async transaction queue. If we look at optimising&lt;br /&gt;
workloads such as sequential creates or deletes in a single directory then, in&lt;br /&gt;
theory, accumulating just 10 async transactions into a single commit record&lt;br /&gt;
should provide close to an order of magnitude reduction in bandwidth to the log&lt;br /&gt;
under these workloads. We also reduce the number of log writes by aggregating&lt;br /&gt;
like this and that will give us even larger gains by avoiding seeks to write&lt;br /&gt;
log buffers out.&lt;br /&gt;
&lt;br /&gt;
Hence I don&#039;t think the async transaction queue needs to be all that deep&lt;br /&gt;
to realise substantial gains, and hence the impact on synchronous transaction&lt;br /&gt;
latency can be kept quite low as a result.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Atomic Multi-Transaction Operations ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A feature that asynchronous transaction aggregation makes possible is atomic&lt;br /&gt;
multi-transaction operations.  On the first transaction we hold the queue in&lt;br /&gt;
memory, preventing it from being committed. We can then do further transactions&lt;br /&gt;
that will end up in the same commit record, and on the final transaction we&lt;br /&gt;
unlock the async transaction queue. This will allow all those transactions to be&lt;br /&gt;
applied atomically. This is far simpler than any other method I&#039;ve been looking&lt;br /&gt;
at to do this.&lt;br /&gt;
&lt;br /&gt;
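One possible shape for such an interface - hypothetical names, building on the&lt;br /&gt;
aggregation sketch earlier - is a simple hold count on the checkpoint: the&lt;br /&gt;
first transaction takes the hold, the final one drops it, and the queue is&lt;br /&gt;
only flushed when nothing holds it open:&lt;br /&gt;
&lt;br /&gt;
	/* nonzero: an atomic multi-transaction operation is still in progress */&lt;br /&gt;
	static int checkpoint_holds;&lt;br /&gt;
	&lt;br /&gt;
	static void checkpoint_hold(void)    { checkpoint_holds++; }&lt;br /&gt;
	static void checkpoint_release(void) { checkpoint_holds--; }&lt;br /&gt;
	&lt;br /&gt;
	/* the checkpoint writer may only flush the queue when this is true */&lt;br /&gt;
	static int checkpoint_can_write(void)&lt;br /&gt;
	{&lt;br /&gt;
		return checkpoint_holds == 0;&lt;br /&gt;
	}&lt;br /&gt;
	&lt;br /&gt;
	/* usage: checkpoint_hold(); commit txn 1 ... txn N; checkpoint_release(); */&lt;br /&gt;
&lt;br /&gt;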
After a bit of reflection, I think this feature may be necessary for correct&lt;br /&gt;
implementation of existing logging techniques. The way we currently implement&lt;br /&gt;
rolling transactions (with permanent log reservations and rolling&lt;br /&gt;
dup/commit/re-reserve sequences) would seem to require all the commits in a&lt;br /&gt;
rolling transaction to be included in a single commit record.  If I understand&lt;br /&gt;
history and the original design correctly, these rolling transactions were&lt;br /&gt;
implemented so that large, complex transactions would not pin the tail of the&lt;br /&gt;
log as they progressed.  IOWs, they implicitly use re-logging to keep the tail&lt;br /&gt;
of the log moving forward as they progress and continue to modify items in the&lt;br /&gt;
transaction.&lt;br /&gt;
&lt;br /&gt;
Given we are using asynchronous transaction aggregation as a method of reducing&lt;br /&gt;
re-logging, it would make sense to prevent these sorts of transactions from&lt;br /&gt;
pinning the tail of the log at all. Further, because we are effectively&lt;br /&gt;
disturbing the concept of unique transactions, I don&#039;t think that allowing a&lt;br /&gt;
rolling transaction to span aggregated commits is valid as we are going to be&lt;br /&gt;
ignoring the transaction IDs that are used to identify individual transactions.&lt;br /&gt;
&lt;br /&gt;
Hence I think it is a good idea to simply replace rolling transactions with&lt;br /&gt;
atomic multi-transaction operations. This may also allow us to split some of&lt;br /&gt;
the large compound transactions into smaller, more self contained transactions.&lt;br /&gt;
This would reduce reservation pressure on log space in the common case where&lt;br /&gt;
all the corner cases in the transactions are not taken. In terms of&lt;br /&gt;
implementation, I think we can initially augment the permanent transaction&lt;br /&gt;
reservation/release interface to achieve this. With a working implementation,&lt;br /&gt;
we can then look to changing to a more explicit interface and slowly work to&lt;br /&gt;
remove the &#039;permanent log transaction&#039; concept entirely. This should simplify&lt;br /&gt;
the log code somewhat....&lt;br /&gt;
&lt;br /&gt;
Note: This asynchronous transaction aggregation is originally based on a&lt;br /&gt;
concept floated by Nathan Scott called &#039;Delayed Logging&#039; after observing how&lt;br /&gt;
ext3 implemented journalling.  This never progressed beyond a concept&lt;br /&gt;
description phase....&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Operation Based Logging ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The second approach to reducing log traffic is to change exactly what we&lt;br /&gt;
log in the transactions. At the moment, what we log is the exact change to&lt;br /&gt;
the item that is being made. For things like inodes and dquots, this isn&#039;t&lt;br /&gt;
particularly expensive because it is already a very compact form. The issue&lt;br /&gt;
comes with changes that are logged in buffers.&lt;br /&gt;
&lt;br /&gt;
The prime example of this is a btree modification that involves either removing&lt;br /&gt;
or inserting a record into a buffer. The records are kept in compact form, so an&lt;br /&gt;
insert or remove will also move other records around in the buffer. In the worst&lt;br /&gt;
case, a single insert or remove of a 16 byte record can dirty an entire block&lt;br /&gt;
(4k generally, but could be up to 64k). In this case, if we were to log the&lt;br /&gt;
btree operation (e.g. insert {record, index}) rather than the resultant change&lt;br /&gt;
on the buffer the overhead of a btree operation is fixed. Such logging also&lt;br /&gt;
allows us to avoid needing to log the changes due to splits and merges - we just&lt;br /&gt;
replay the operation and subsequent splits/merges get done as part of replay.&lt;br /&gt;
&lt;br /&gt;
The result of this is that complex transactions no longer need as much log space&lt;br /&gt;
as all the possible changes they could cause - we only log the basic operations that&lt;br /&gt;
are occurring and their result. Hence transactions end up being much smaller,&lt;br /&gt;
vary less in size between empty and full filesystems, etc. An example set of&lt;br /&gt;
operations describing all the changes made by an extent allocation on an inode&lt;br /&gt;
would be:&lt;br /&gt;
&lt;br /&gt;
	- inode X intent to allocate extent {off, len}&lt;br /&gt;
	- AGCNT btree update record in AG X {old rec} {new rec values}&lt;br /&gt;
	- AGBNO btree delete record in AG X {block, len}&lt;br /&gt;
	- inode X BMBT btree insert record {off, block, len}&lt;br /&gt;
	- inode X delta&lt;br /&gt;
&lt;br /&gt;
This comes down to a relatively small, bounded amount of space which is close to&lt;br /&gt;
the minimum an existing allocation transaction would consume.  However, with this&lt;br /&gt;
method of logging the transaction size does not increase with the size of&lt;br /&gt;
structures or the amount of updates necessary to complete the operations.&lt;br /&gt;
&lt;br /&gt;
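To illustrate what such operation records might look like - again purely&lt;br /&gt;
hypothetical, not an existing XFS log format - each record would name the&lt;br /&gt;
operation and carry only its arguments, with splits, merges and record&lt;br /&gt;
movement recreated at replay time rather than logged:&lt;br /&gt;
&lt;br /&gt;
	#include &lt;stdint.h&gt;&lt;br /&gt;
	&lt;br /&gt;
	/* hypothetical operation codes for the extent allocation example */&lt;br /&gt;
	enum op_type {&lt;br /&gt;
		OP_INODE_ALLOC_INTENT,	/* inode X intent to allocate {off, len} */&lt;br /&gt;
		OP_CNTBT_UPDATE,	/* by-size free space btree: update record */&lt;br /&gt;
		OP_BNOBT_DELETE,	/* by-block free space btree: delete record */&lt;br /&gt;
		OP_BMBT_INSERT,		/* inode extent map: insert record */&lt;br /&gt;
		OP_INODE_DELTA,		/* inode core changes */&lt;br /&gt;
	};&lt;br /&gt;
	&lt;br /&gt;
	/* a fixed-size record: the operation and its arguments, not the&lt;br /&gt;
	 * resulting byte-level changes to the buffers it touches */&lt;br /&gt;
	struct op_record {&lt;br /&gt;
		uint32_t	op;		/* enum op_type */&lt;br /&gt;
		uint32_t	agno;		/* AG the operation applies to, if any */&lt;br /&gt;
		uint64_t	ino;		/* inode the operation applies to, if any */&lt;br /&gt;
		uint64_t	args[4];	/* e.g. {offset, block, len, state} */&lt;br /&gt;
	};&lt;br /&gt;
&lt;br /&gt;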
A major difference to the existing transaction system is that re-logging&lt;br /&gt;
of items doesn&#039;t fit very neatly with operation based logging. &lt;br /&gt;
&lt;br /&gt;
There are three main disadvantages to this approach:&lt;br /&gt;
&lt;br /&gt;
	- recovery becomes more complex - it will need to change substantially&lt;br /&gt;
	  to accommodate operation replay rather than just reading from disk&lt;br /&gt;
	  and applying deltas.&lt;br /&gt;
	- we have to create a whole new set of item types and add the necessary&lt;br /&gt;
	  hooks into the code to log all the operations correctly.&lt;br /&gt;
	- re-logging is probably not possible, and that introduces &lt;br /&gt;
	  differences to the way we&#039;ll need to track objects for flushing. It&lt;br /&gt;
	  may, in fact, require transaction IDs in all objects to allow us&lt;br /&gt;
	  to determine what the last transaction that modified the item&lt;br /&gt;
	  on disk was during recovery.&lt;br /&gt;
&lt;br /&gt;
Changing the logging strategy as described is a much more fundamental change to&lt;br /&gt;
XFS than asynchronous transaction aggregation. It will be difficult to change&lt;br /&gt;
to such a model in an evolutionary manner; it is more of a &#039;flag day&#039; style&lt;br /&gt;
change where the entire functionality needs to be added in one hit. Given that&lt;br /&gt;
we will also still have to support the old log format, it doesn&#039;t enable us to&lt;br /&gt;
remove any code, either.&lt;br /&gt;
&lt;br /&gt;
Given that we are likely to see major benefits in the problem workloads as a&lt;br /&gt;
result of asynchronous transaction aggregation, it may not be necessary to&lt;br /&gt;
completely rework the transaction subsystem. Combining aggregation with an&lt;br /&gt;
ongoing process of targeted reduction of transaction size will provide benefits&lt;br /&gt;
out to at least the medium term. It is unclear whether this direction will be&lt;br /&gt;
sufficient in the long run until we can measure the benefit that aggregation&lt;br /&gt;
will provide.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reducing Transaction Overhead ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To switch tracks completely, I have not addressed general issues with overhead&lt;br /&gt;
in the transaction subsystem itself. There are several points where the&lt;br /&gt;
transaction subsystem will single thread because of filesystem scope locks and&lt;br /&gt;
structures.  We have, for example, the log grant lock for protecting&lt;br /&gt;
reservation and used log space, the AIL lock for tracking dirty metadata, the&lt;br /&gt;
log state lock for state transition of log buffers and other associated&lt;br /&gt;
structure modifications.&lt;br /&gt;
&lt;br /&gt;
We have already started down the path of reducing contention in&lt;br /&gt;
various paths. For example:&lt;br /&gt;
&lt;br /&gt;
	- changing iclog reference counts to atomics to avoid needing the log&lt;br /&gt;
	  state lock on every transaction commit&lt;br /&gt;
	- protecting iclog callback lists with a per-iclog lock instead of the log&lt;br /&gt;
	  state lock&lt;br /&gt;
	- removing the AIL lock from the transaction reserve path by isolating&lt;br /&gt;
	  AIL tail pushing to a single thread instead of being done&lt;br /&gt;
	  synchronously.&lt;br /&gt;
&lt;br /&gt;
Asynchronous transaction aggregation is likely to perturb the current known&lt;br /&gt;
behaviour and bottlenecks as a result of moving all of the log interfacing out&lt;br /&gt;
of the direct transaction commit path.  Similar to moving the AIL pushing into&lt;br /&gt;
its own thread, this will mean that there will typically only be a single&lt;br /&gt;
thread formatting and writing to iclog buffers. This will remove much of the&lt;br /&gt;
parallelism that puts excessive pressure on many of these locks.&lt;br /&gt;
&lt;br /&gt;
I am certain that asynchronous transaction aggregation will open up new areas&lt;br /&gt;
of optimisation in the log formatting and dispatch code - it will probably&lt;br /&gt;
enable us to remove a lot of the complexity because we will be able to directly&lt;br /&gt;
control the parallelism in the formatting and dispatch of log buffers. This&lt;br /&gt;
implies that we may not need to be limited to a fixed pool of fixed sized log&lt;br /&gt;
buffers for writing transactions to disk.&lt;br /&gt;
&lt;br /&gt;
However, it is probably best to leave consideration of such optimisations until&lt;br /&gt;
after the asynchronous transaction aggregation is implemented and we can&lt;br /&gt;
directly observe the pain points that become apparent as a result of such a&lt;br /&gt;
change.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reducing Recovery Time ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
With 2GB logs, recovery can take an awfully long time due to the need&lt;br /&gt;
to read each object synchronously as we process the journal. An obvious&lt;br /&gt;
way to avoid this is to add another pass to the processing to do asynchronous&lt;br /&gt;
readahead of all the objects in the log before doing the processing passes.&lt;br /&gt;
This will populate the cache as quickly as possible and hide any read latency&lt;br /&gt;
that could occur as we process commit records.&lt;br /&gt;
&lt;br /&gt;
A logical extension to this is to sort the objects in ascending offset order&lt;br /&gt;
before issuing I/O on them. That will further optimise the readahead I/O&lt;br /&gt;
to reduce seeking and hence should speed up the read phase of recovery&lt;br /&gt;
further.&lt;br /&gt;
&lt;br /&gt;
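A minimal sketch of that sorted read-ahead pass - illustrative only, with&lt;br /&gt;
read_ahead() standing in for whatever asynchronous buffer read primitive is&lt;br /&gt;
available - collects the object addresses referenced by the log, sorts them by&lt;br /&gt;
disk offset and issues read-ahead before the processing passes start:&lt;br /&gt;
&lt;br /&gt;
	#include &lt;stdint.h&gt;&lt;br /&gt;
	#include &lt;stdlib.h&gt;&lt;br /&gt;
	&lt;br /&gt;
	struct log_obj {&lt;br /&gt;
		uint64_t	daddr;	/* disk address of the object */&lt;br /&gt;
		uint32_t	len;	/* length in basic blocks */&lt;br /&gt;
	};&lt;br /&gt;
	&lt;br /&gt;
	static int cmp_daddr(const void *a, const void *b)&lt;br /&gt;
	{&lt;br /&gt;
		const struct log_obj *x = a, *y = b;&lt;br /&gt;
		return (x-&gt;daddr &gt; y-&gt;daddr) - (x-&gt;daddr &lt; y-&gt;daddr);&lt;br /&gt;
	}&lt;br /&gt;
	&lt;br /&gt;
	/* issue read-ahead in ascending offset order to minimise seeking */&lt;br /&gt;
	static void recovery_readahead(struct log_obj *objs, size_t nr,&lt;br /&gt;
			void (*read_ahead)(uint64_t daddr, uint32_t len))&lt;br /&gt;
	{&lt;br /&gt;
		qsort(objs, nr, sizeof(*objs), cmp_daddr);&lt;br /&gt;
		for (size_t i = 0; i &lt; nr; i++)&lt;br /&gt;
			read_ahead(objs[i].daddr, objs[i].len);&lt;br /&gt;
	}&lt;br /&gt;
&lt;br /&gt;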
== ToDo ==&lt;br /&gt;
Further investigation of recovery for future optimisation.&lt;/div&gt;</summary>
		<author><name>Cattelan</name></author>
	</entry>
</feed>