<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sandeen</id>
	<title>xfs.org - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sandeen"/>
	<link rel="alternate" type="text/html" href="https://xfs.org/index.php/Special:Contributions/Sandeen"/>
	<updated>2026-04-20T10:32:24Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.3</generator>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3007</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3007"/>
		<updated>2018-07-30T22:32:12Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: Where can I find information about the internals of XFS? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Darrick has updated Barry Naujok&#039;s documentation of the [https://www.kernel.org/pub/linux/utils/fs/xfs/docs/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Use the standard Linux native filesystem partition type (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems; refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt]).&lt;br /&gt;
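For illustration only (the device name, mount point and option choices below are examples, not recommendations), an /etc/fstab entry using a few common XFS mount options might look like this:&lt;br /&gt;

```shell
# /etc/fstab - hypothetical XFS entry; device and mount point are placeholders
/dev/sdb1  /data  xfs  noatime,inode64,logbsize=256k  0 0

# Equivalent one-off mount command:
# mount -t xfs -o noatime,inode64,logbsize=256k /dev/sdb1 /data
```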
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions lacked. New features are also added in a backward-compatible way: if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is most heavily tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI&#039;s Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
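As a sketch (device, mount point, user name and limits are all placeholders), enabling user quota and setting limits with &#039;&#039;&#039;xfs_quota&#039;&#039;&#039; might look like:&lt;br /&gt;

```shell
# Mount with user quota accounting and enforcement
mount -t xfs -o uquota /dev/sdb1 /home

# Give user "bob" a 500MB soft and 600MB hard block limit
xfs_quota -x -c 'limit bsoft=500m bhard=600m bob' /home

# Show current usage and limits
xfs_quota -x -c report /home
```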
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
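As a sketch (the paths, project name, ID and limit are hypothetical), setting up a project quota on a filesystem mounted with the prjquota option might look like:&lt;br /&gt;

```shell
# Associate project ID 42 with a directory tree, and give it a name
echo "42:/data/shared" >> /etc/projects
echo "shared:42" >> /etc/projid

# Initialize the tree and cap it at 10GB (filesystem mounted with -o prjquota)
xfs_quota -x -c 'project -s shared' /data
xfs_quota -x -c 'limit -p bhard=10g shared' /data
```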
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota and group quota cannot be used at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem mounted with prjquota (project quota) and remounting it with grpquota (group quota) remove previously set project quota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
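As a sketch of what such an init ramdisk must do (the device names and staging mount point here are hypothetical):&lt;br /&gt;

```shell
# Inside the initramfs: mount the root filesystem with its external log device
# specified explicitly, since "rootflags=logdev=..." is not honored by the kernel.
mount -t xfs -o logdev=/dev/sdc1 /dev/sda2 /sysroot
```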
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; (i.e. unmounted) when moved. If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink one is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition to do so. Remove the partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to grow the filesystem into the new space. Note: editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
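For example, with LVM the whole procedure reduces to two commands (the volume group, logical volume and mount point names are hypothetical):&lt;br /&gt;

```shell
# Grow the logical volume by 50GB...
lvextend -L +50G /dev/vg0/data

# ...then grow the mounted filesystem to fill it.
# Note: xfs_growfs takes the mount point, and works while mounted.
xfs_growfs /data
```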
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. First, describe your machine hardware and storage configuration. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
The resulting dmesg output will list all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to preserve ACLs and EAs, you will need &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
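In other words, applications that need the data itself on disk must flush it explicitly. A minimal shell illustration (file names are arbitrary):&lt;br /&gt;

```shell
# Write data and force it to stable storage before relying on it.
# dd's conv=fsync calls fsync(2) on the output file before exiting,
# so the file's data blocks (not just its metadata) reach the disk.
echo "important data" > /tmp/xfs_faq_example.txt
dd if=/tmp/xfs_faq_example.txt of=/tmp/xfs_faq_example.out conv=fsync 2>/dev/null

# Alternatively, 'sync' flushes all dirty data system-wide.
sync
```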
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB currently (Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage carries a very high risk of large data losses.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with a battery-backed cache running in write-back mode, you should turn off barriers: they are unnecessary in this case, and if the controller honors the cache flushes, they will harm performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller; see the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8), which is a little tedious:&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this disabling is persistent. For a SATA/PATA disk, however, it needs to be redone after every reset, as the drive will revert to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with the &amp;quot;nobarrier&amp;quot; option. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device, then flushing both the data and log devices is currently not supported (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing while the write cache is enabled, barriers are reported as unsupported by the device. And finally, XFS will actually attempt a trial barrier write on the superblock and check its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own, but leave the hard disk write caches on. That can lead to the bad situation where, after a power failure with RAID-1 when only parts of the disk caches have been written, the controller doesn&#039;t even notice that the disks are out of sync: the disks can reorder cached blocks, and might both have saved the superblock info but lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drives&#039; caches:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back (write cache on); wt = write through (write cache off). So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: if a BBM (battery backup module, which you really should use if you care about your data) is attached, the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you don&#039;t care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot;, or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; This means you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system, even with a RAID controller with battery-backed cache and the hard disk caches turned off (which is safe on a normal host), can destroy a database within the virtual machine (guest, domU, or whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option defining the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing an XFS filesystem], df(1) shows enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present until it was fixed in v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on xfs (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID and a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support) but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
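As a quick sanity check, the arithmetic behind these examples can be sketched in shell (the geometry values below are taken from the worked examples above):&lt;br /&gt;

```shell
# Sketch only: verify the su/sw and sector-to-block conversions above.

# RAID-6 over 8 disks: 2 parity disks, so 6 data disks.
disks=8
parity=2
sw=$((disks - parity))                    # number of data disks

# sunit/swidth are specified in 512B sectors, but mkfs.xfs and
# xfs_info report them in filesystem blocks (bsize).
swidth_sectors=1024
bsize=4096
swidth_blocks=$((swidth_sectors * 512 / bsize))

echo "sw=$sw swidth_blocks=$swidth_blocks"
```

The last conversion matches the worked example: 128 * 4096 == 1024 * 512.&lt;br /&gt;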
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so recent distributions should be fine.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations are consuming about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
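For illustration only, a hypothetical &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry using these mount options might look like the fragment below. The device and mount point are made up; note that &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; became the default in kernel 2.6.39 and the option was later removed, so it only needs to be specified on older kernels:&lt;br /&gt;

```
# hypothetical example - adjust device and mount point to your system
/dev/sdb1  /data  xfs  logbsize=256k,delaylog  0 0
```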
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use ony 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
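The relationship between the numbers xfs_repair prints (in KB) and the megabyte figures it asks for can be sketched with simple shell arithmetic, using the values from the two runs above:&lt;br /&gt;

```shell
# Sketch only: dmem and imem are reported in KB; the -m option takes MB.
dmem_kb=2097152          # free space tracking, from both runs above
imem_kb=196882           # inode tracking, from the second run

dmem_mb=$((dmem_kb / 1024))   # 2048 MB for free space tracking
imem_mb=$((imem_kb / 1024))   # roughly 192 MB more once inodes exist

echo "dmem=${dmem_mb}MB imem=${imem_mb}MB"
```

The totals xfs_repair reports (2096 and 2289 MB) are these figures plus a small amount of fixed overhead.&lt;br /&gt;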
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely your filesystem needs to be mounted with inode64:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
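The formula above can be checked with a couple of hypothetical per-file extent averages (sketch only):&lt;br /&gt;

```shell
# frag factor = (actual extents - ideal extents) / actual extents
# 2 extents per file where 1 is ideal -> 50%; 4 extents -> 75%.
frag_pct() {
    actual=$1
    ideal=$2
    echo $(( (actual - ideal) * 100 / actual ))
}

frag_pct 2 1    # prints 50
frag_pct 4 1    # prints 75
```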
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
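To make a changed interval persistent across reboots, a sysctl configuration fragment along these lines could be used (illustrative only; the file name and the value 60 are arbitrary, the default being 300 seconds):&lt;br /&gt;

```
# e.g. /etc/sysctl.d/90-xfs.conf (hypothetical file name)
# scan for lingering post-EOF preallocations every 60 seconds
fs.xfs.speculative_prealloc_lifetime = 60
```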
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation cannot be disabled, but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set, and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the xfs log (journal) can become quite big, for example if it accumulates many entries and doesn&#039;t get a chance to apply them to disk (due to a lockup, crash, hard reset, etc). xfs will try to reapply these at mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
With a big log to reapply, that process can take a very long time (minutes or even hours). A similar problem can happen with unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk,&lt;br /&gt;
              it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: good option, doesn&#039;t have such problem&lt;br /&gt;
&lt;br /&gt;
Note that some kernels have block multiqueue enabled, which (currently - 08/2016) doesn&#039;t support I/O schedulers at all, so there is no optimisation or reordering of IO for best seek order; disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter for the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware RAID card can be smart enough to cache and reorder I/O requests, so an additional layer of reordering (like the Linux I/O scheduler) can potentially conflict and make performance worse. If you have such a RAID card, try the method described below.&lt;br /&gt;
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Block multiqueue&#039;&#039; enabled (and thus no I/O scheduler at all), or block multiqueue disabled with the &#039;&#039;noop&#039;&#039; or &#039;&#039;deadline&#039;&#039; I/O scheduler activated, is a good solution. SSD disks don&#039;t really need I/O schedulers, while smart RAID cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your RAID is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests and thus could benefit from an I/O scheduler.&lt;br /&gt;
&lt;br /&gt;
== Q: Why does userspace say &amp;quot;filesystem uses v1 dirs, limited functionality provided?&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Either you have a very old or a rather new filesystem coupled with too-old userspace.&lt;br /&gt;
&lt;br /&gt;
Very old filesystems used a format called &amp;quot;directory version 1&amp;quot; and in this case the error is correct and hopefully self explanatory.&lt;br /&gt;
&lt;br /&gt;
However, if you have a newer filesystem with version 5 superblocks and the metadata CRC feature enabled, older releases of xfsprogs may incorrectly issue the &amp;quot;v1 dir&amp;quot; message.  In this case, get newer xfsprogs; at least v3.2.0, but preferably the latest release.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=3006</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=3006"/>
		<updated>2018-07-30T22:20:51Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Current XFS kernel source */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS Released/Stable source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Note: as of September 2016, the XFS project is moving away from oss.sgi.com infrastructure. As we move to other infrastructure the links below will be updated to point to the new locations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mainline kernels&#039;&#039;&#039;&lt;br /&gt;
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Vendor kernels&#039;&#039;&#039;&lt;br /&gt;
:All modern Linux distributions include support for XFS. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XFS userspace&#039;&#039;&#039;&lt;br /&gt;
:[https://kernel.org/pub/linux/utils/fs/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Development and bleeding edge Development &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
* [[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
=== Current XFS kernel source ===&lt;br /&gt;
&lt;br /&gt;
* [https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git xfs]&lt;br /&gt;
&lt;br /&gt;
 $ git clone https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git&lt;br /&gt;
&lt;br /&gt;
=== XFS user space tools ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfsprogs-dev.git/ xfsprogs ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git&lt;br /&gt;
&lt;br /&gt;
A few packages are needed to compile &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;, depending on your package manager:&lt;br /&gt;
&lt;br /&gt;
 apt-get install libtool automake gettext libblkid-dev uuid-dev pkg-config&lt;br /&gt;
 yum     install libtool automake gettext libblkid-devel libuuid-devel&lt;br /&gt;
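With the dependencies in place, a from-scratch build might look like the following sketch (it assumes network access and that the top-level Makefile drives the configure step, which can vary between releases; check the INSTALL file in the tree):

```shell
# Sketch: cloning and building xfsprogs (requires network access and the
# packages listed above; exact build steps may vary between releases).
git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git
cd xfsprogs-dev
make              # the top-level Makefile runs the configure step as needed
sudo make install
```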
&lt;br /&gt;
=== XFS dump ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfsdump-dev.git/ xfsdump ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfsdump-dev.git&lt;br /&gt;
&lt;br /&gt;
=== XFS tests ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/ xfstests ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git&lt;br /&gt;
&lt;br /&gt;
=== DMAPI user space tools ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/dmapi-dev.git/ dmapi ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/dmapi-dev.git&lt;br /&gt;
&lt;br /&gt;
=== git-cvsimport generated trees ===&lt;br /&gt;
&lt;br /&gt;
The Git trees are automatically mirrored copies of the CVS trees, generated with [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].&lt;br /&gt;
Since git-cvsimport used the [http://www.cobite.com/cvsps/ cvsps] tool to recreate the atomic commits of a ptools &amp;quot;mod&amp;quot;, it is easier to see the entire change that was committed using git.&lt;br /&gt;
&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
Before building in the &amp;lt;tt&amp;gt;xfsdump&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;dmapi&amp;lt;/tt&amp;gt; directories (after building &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;), you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-dev&lt;br /&gt;
to create &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt; and install appropriate files there.&lt;br /&gt;
&lt;br /&gt;
Before building in the xfstests directory, you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-qa&lt;br /&gt;
to install a somewhat larger set of files in &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees were created using a script that converted SGI&#039;s internal&lt;br /&gt;
ptools repository to a cvs repository, so the cvs trees were considered read only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is managed in the git trees; the cvs trees&lt;br /&gt;
are no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
* [[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=3005</id>
		<title>Getting the latest source code</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Getting_the_latest_source_code&amp;diff=3005"/>
		<updated>2018-07-24T06:32:03Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* XFS user space tools */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; XFS Released/Stable source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
Note: as of September 2016, the XFS project is moving away from oss.sgi.com infrastructure. As we move to other infrastructure the links below will be updated to point to the new locations.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Mainline kernels&#039;&#039;&#039;&lt;br /&gt;
:XFS has been maintained in the official Linux kernel [http://www.kernel.org/ kernel trees] starting with [http://lkml.org/lkml/2003/12/8/35 Linux 2.4] and is frequently updated with the latest stable fixes and features from the XFS development team.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Vendor kernels&#039;&#039;&#039;&lt;br /&gt;
:All modern Linux distributions include support for XFS. &lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;XFS userspace&#039;&#039;&#039;&lt;br /&gt;
:[https://kernel.org/pub/linux/utils/fs/xfs source code tarballs] of the xfs userspace tools. These tarballs form the basis of the xfsprogs packages found in Linux distributions.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt; Development and bleeding edge source &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
* [[XFS git howto]]&lt;br /&gt;
&lt;br /&gt;
=== Current XFS kernel source ===&lt;br /&gt;
&lt;br /&gt;
* [https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/ xfs]&lt;br /&gt;
&lt;br /&gt;
 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git &lt;br /&gt;
&lt;br /&gt;
Note: the old kernel tree on [http://oss.sgi.com/cgi-bin/gitweb.cgi oss.sgi.com] is no longer kept up to date with the master tree on kernel.org.&lt;br /&gt;
&lt;br /&gt;
=== XFS user space tools ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfsprogs-dev.git/ xfsprogs ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git&lt;br /&gt;
&lt;br /&gt;
A few packages are needed to compile &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;, depending on your package manager:&lt;br /&gt;
&lt;br /&gt;
 apt-get install libtool automake gettext libblkid-dev uuid-dev pkg-config&lt;br /&gt;
 yum     install libtool automake gettext libblkid-devel libuuid-devel&lt;br /&gt;
&lt;br /&gt;
=== XFS dump ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfsdump-dev.git/ xfsdump ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfsdump-dev.git&lt;br /&gt;
&lt;br /&gt;
=== XFS tests ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/ xfstests ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git&lt;br /&gt;
&lt;br /&gt;
=== DMAPI user space tools ===&lt;br /&gt;
* [https://git.kernel.org/cgit/fs/xfs/dmapi-dev.git/ dmapi ]&lt;br /&gt;
&lt;br /&gt;
 git clone git://git.kernel.org/pub/scm/fs/xfs/dmapi-dev.git&lt;br /&gt;
&lt;br /&gt;
=== git-cvsimport generated trees ===&lt;br /&gt;
&lt;br /&gt;
The Git trees are automatically mirrored copies of the CVS trees, generated with [http://www.kernel.org/pub/software/scm/git/docs/git-cvsimport.html git-cvsimport].&lt;br /&gt;
Since git-cvsimport used the [http://www.cobite.com/cvsps/ cvsps] tool to recreate the atomic commits of a ptools &amp;quot;mod&amp;quot;, it is easier to see the entire change that was committed using git.&lt;br /&gt;
&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=summary linux-2.6-xfs-from-cvs]&lt;br /&gt;
* [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-cmds.git;a=summary xfs-cmds]&lt;br /&gt;
&lt;br /&gt;
Before building in the &amp;lt;tt&amp;gt;xfsdump&amp;lt;/tt&amp;gt; or &amp;lt;tt&amp;gt;dmapi&amp;lt;/tt&amp;gt; directories (after building &amp;lt;tt&amp;gt;xfsprogs&amp;lt;/tt&amp;gt;), you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-dev&lt;br /&gt;
to create &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt; and install appropriate files there.&lt;br /&gt;
&lt;br /&gt;
Before building in the xfstests directory, you will need to run:&lt;br /&gt;
  # cd xfsprogs&lt;br /&gt;
  # make install-qa&lt;br /&gt;
to install a somewhat larger set of files in &amp;lt;tt&amp;gt;/usr/include/xfs&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== &amp;lt;font face=&amp;quot;ARIAL NARROW,HELVETICA&amp;quot;&amp;gt;XFS cvs trees &amp;lt;/font&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The cvs trees were created using a script that converted SGI&#039;s internal&lt;br /&gt;
ptools repository to a cvs repository, so the cvs trees were considered read only.&lt;br /&gt;
&lt;br /&gt;
At this point all new development is managed in the git trees; the cvs trees&lt;br /&gt;
are no longer active in terms of current development and should only be used&lt;br /&gt;
for reference.&lt;br /&gt;
&lt;br /&gt;
* [[XFS CVS howto]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3004</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3004"/>
		<updated>2016-12-05T21:04:01Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: Why does userspace say &amp;quot;filesystem uses v1 dirs, limited functionality provided?&amp;quot; */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS on-disk format], which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly tend to contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
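As a sketch, enabling user quota looks like the following (the device name and mount point are illustrative; the fstab entry is written to /tmp here only to show the format):

```shell
# Enabling user quota on an XFS filesystem. At mount time:
#   mount -o uquota /dev/sdb1 /mnt/xfs
# Or persistently via an /etc/fstab entry like this one:
printf '/dev/sdb1  /mnt/xfs  xfs  defaults,uquota  0 0\n' > /tmp/fstab.example
cat /tmp/fstab.example
```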
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
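A minimal setup sketch follows; the project id (42), project name (scratch), and paths are all illustrative, and the configuration files are written to /tmp here only to show their format (the real files are /etc/projects and /etc/projid):

```shell
# Map a directory tree to a project id, and give the project a name.
printf '42:/mnt/xfs/scratch\n' > /tmp/projects   # normally /etc/projects
printf 'scratch:42\n' > /tmp/projid              # normally /etc/projid
# With the filesystem mounted with the "prjquota" option, the project is
# then initialised and given a space limit (requires root, shown as comments):
#   xfs_quota -x -c 'project -s scratch' /mnt/xfs
#   xfs_quota -x -c 'limit -p bhard=10g scratch' /mnt/xfs
cat /tmp/projects /tmp/projid
```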
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the filesystem (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... are possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after the partition to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to grow the filesystem into the new space. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
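As a sketch, growing a filesystem that lives directly on a partition might look like this (device names are illustrative, `resizepart` assumes a reasonably recent parted, and you should back up first):

```shell
# Grow partition 1 of /dev/sdb to the end of the disk, keeping its start
# sector unchanged, then grow the mounted filesystem into the new space.
parted /dev/sdb resizepart 1 100%
mount /dev/sdb1 /mnt/xfs        # xfs_growfs operates on a mounted filesystem
xfs_growfs /mnt/xfs             # with no size given, grows to the maximum
```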
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. To also back up ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. These messages contain important information giving developers hints as to the earliest point that a problem was detected. The shutdown is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
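A typical repair run looks like the following sketch (the device name is illustrative):

```shell
# The filesystem must not be mounted while xfs_repair runs.
umount /dev/sdb1
xfs_repair -n /dev/sdb1   # -n: inspect and report only, modify nothing
xfs_repair /dev/sdb1      # perform the actual repair
```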
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB currently (Jan 2009), that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk, disabling the write cache is persistent. However, a SATA/PATA disk resets back to its default of write cache enabled, so the cache needs to be disabled again after every reset. A reset can happen after a reboot or during error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and a message reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those integrated onto mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache ensures that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types that it&#039;s hard to generalize. Usually these controllers have no cache of their own but leave the hard disk write caches on. After a power failure on a RAID-1 where only parts of the disk caches have been written, this can leave the mirrors silently out of sync: each disk may reorder its cached blocks, so both disks might have saved the superblock info but lost different data contents, and the controller cannot even detect that the disks disagree. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting each drive&#039;s cache individually:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back, meaning write cache on; wt = write through, meaning write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe anyway, so it assumes you don&#039;t care about your data and just want high speed (which you get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the unit cache can only be set together. To protect your data, turn the write cache off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk&lt;br /&gt;
writes in a way that even barriers no longer work, which means even an&lt;br /&gt;
fsync is not reliable. Tests confirm that unplugging the power from&lt;br /&gt;
such a system can destroy a database within the virtual machine (client,&lt;br /&gt;
domU, or whatever you call it), even with a RAID controller with battery&lt;br /&gt;
backed cache and the hard disk caches turned off (a configuration that is&lt;br /&gt;
safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option defining the virtual&lt;br /&gt;
disk. For other products, reliable information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot, at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
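&lt;br /&gt;
As an illustration, a disk can be given a GPT label with parted before creating the large partition. This is only a sketch: /dev/sdX is a placeholder for your device, and mklabel destroys the existing partition table.&lt;br /&gt;

```shell
# show the current partition table type (look for 'Partition Table: gpt')
parted -s /dev/sdX print
# relabel with GPT and create one partition spanning the disk (DESTRUCTIVE)
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart primary 0% 100%
```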
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing an XFS filesystem], df(1) may show enough free space, but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present from kernel v3.7 to v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows the filesystem to be optimized for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
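&lt;br /&gt;
For the RAID-6 example above, the equivalent sunit/swidth values can be computed directly (a sketch; the 64 KiB stripe and 6 data disks are the figures from that example):&lt;br /&gt;

```shell
# su=64k,sw=6 expressed as sunit/swidth in 512-byte sectors
su_bytes=$((64 * 1024))      # RAID stripe unit: 64 KiB
sunit=$((su_bytes / 512))    # 128 sectors
swidth=$((sunit * 6))        # 768 sectors (sunit times 6 data disks)
echo "sunit=$sunit swidth=$swidth"
```

So &amp;quot;-d su=64k,sw=6&amp;quot; and &amp;quot;-d sunit=128,swidth=768&amp;quot; describe the same geometry.&lt;br /&gt;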
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interprets sunit and swidth arguments as being specified in units of 512B sectors; unfortunately, that is not the unit they are reported in.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
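&lt;br /&gt;
That worked example can be checked with a one-liner (numbers taken from the example above):&lt;br /&gt;

```shell
# 1024 sectors of 512 bytes expressed in 4096-byte filesystem blocks
awk 'BEGIN { print (1024 * 512) / 4096 }'
```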
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inode numbers, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so those distributions appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you later mount without inode64 again: for example, you can no longer access files &amp;amp; dirs that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt;. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
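&lt;br /&gt;
As a sketch, such options go in the fourth field of an /etc/fstab entry. The device and mount point below are placeholders, and on kernels where delayed logging is already the default the delaylog option is unnecessary:&lt;br /&gt;

```shell
# /etc/fstab entry trading crash-time operation loss for metadata performance
/dev/sdb1  /data  xfs  logbsize=256k,delaylog  0 0
```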
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem must be mounted with inode64:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (For instance, a 400GB file in four 100GB extents would hardly be considered badly fragmented.)  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
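&lt;br /&gt;
The formula above can be evaluated for a range of average extent counts, assuming the ideal is one extent per file:&lt;br /&gt;

```shell
# fragmentation factor for 1, 2, 4 and 8 extents per file (ideal = 1)
awk 'BEGIN {
    for (i = 0; i != 4; i++) {
        extents = 2 ^ i
        printf "%d extents/file: %.1f%%\n", extents, (extents - 1) / extents * 100
    }
}'
```

which shows the factor climbing from 0% through 50% and 75% to 87.5%.&lt;br /&gt;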
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
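&lt;br /&gt;
The two figures can be compared with stat(1). A sketch using a scratch file: on most filesystems the numbers roughly agree, while on XFS the allocated figure may temporarily be larger while post-EOF preallocation lingers.&lt;br /&gt;

```shell
# compare apparent size (st_size) with allocated space (st_blocks)
f="${TMPDIR:-/tmp}/prealloc_demo"
dd if=/dev/zero of="$f" bs=1M count=1 status=none
stat -c 'size=%s bytes  allocated=%b blocks of 512 bytes' "$f"
rm -f "$f"
```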
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file-extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and filesystem. Generally speaking, preallocation is disabled for very small files, and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (common in file server and log file workloads) can cause fragmentation, so this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
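&lt;br /&gt;
Reading the tunable can be scripted; this is a sketch that assumes the proc file may be absent (kernels older than 3.8, or XFS not loaded):&lt;br /&gt;

```python
import os

PREALLOC_LIFETIME = "/proc/sys/fs/xfs/speculative_prealloc_lifetime"

def get_prealloc_lifetime(path=PREALLOC_LIFETIME):
    """Return the background scan interval in seconds, or None if the
    tunable is not available on this kernel."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return int(f.read().strip())

print(get_prealloc_lifetime())
```

Writing a new value (in seconds) to the same file as root adjusts the interval.&lt;br /&gt;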
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation cannot be disabled, but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set, so the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
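&lt;br /&gt;
Whether a mounted XFS filesystem has a fixed allocsize set can be checked by parsing /proc/mounts (field layout per proc(5)). A sketch; the device and mount point below are hypothetical examples:&lt;br /&gt;

```python
def xfs_allocsize(mounts_line):
    """Extract the allocsize= option from one /proc/mounts line.

    Returns the option value (e.g. '64k'), or None when the line is not
    an XFS mount or the filesystem uses dynamic speculative
    preallocation (no allocsize set).
    """
    fields = mounts_line.split()
    if len(fields) < 4 or fields[2] != "xfs":
        return None
    for opt in fields[3].split(","):
        if opt.startswith("allocsize="):
            return opt.split("=", 1)[1]
    return None

# Example line as it might appear in /proc/mounts (hypothetical device):
line = "/dev/sdb1 /data xfs rw,relatime,attr2,allocsize=64k,noquota 0 0"
print(xfs_allocsize(line))  # -> 64k
```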
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and does not get a chance to apply them to disk (due to a lockup, crash, hard reset, etc.). XFS will try to reapply these entries at mount time (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
Replaying a big log can take a very long time (minutes or even hours). A similar problem can occur with unmount, which can take hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk,&lt;br /&gt;
              it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: a good option that does not have this problem&lt;br /&gt;
&lt;br /&gt;
Note that some kernels have block multiqueue (blk-mq) enabled, which (currently - 08/2016) does not support I/O schedulers at all; there is therefore no optimisation or reordering of I/O for best seek order, so disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter of the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware RAID controller can be smart enough to cache and reorder I/O requests itself, so an additional layer of reordering&lt;br /&gt;
(like a Linux I/O scheduler) can potentially conflict with it and make performance worse. If you have such a RAID card,&lt;br /&gt;
try the method described below.&lt;br /&gt;
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Block multiqueue&#039;&#039; enabled (and thus no I/O scheduler at all), or block multiqueue disabled with the &#039;&#039;noop&#039;&#039; or &#039;&#039;deadline&#039;&#039; I/O scheduler activated, is a good solution. SSDs do not really need an I/O scheduler, and smart RAID cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your RAID is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests and thus could benefit from an I/O scheduler.&lt;br /&gt;
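&lt;br /&gt;
The active scheduler for a device is visible through sysfs; the kernel puts the active choice in brackets. A sketch of parsing that file:&lt;br /&gt;

```python
def active_scheduler(contents):
    """Parse the contents of /sys/block/<dev>/queue/scheduler.

    The kernel marks the active scheduler with brackets, e.g.
    'noop [deadline] cfq'; with blk-mq and no scheduler support the
    file may just contain 'none'.
    """
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return contents.strip() or None

print(active_scheduler("noop [deadline] cfq"))  # -> deadline
```

To switch a rotational disk to deadline at runtime (as root): echo deadline &amp;gt; /sys/block/sda/queue/scheduler (sda is an example device).&lt;br /&gt;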
&lt;br /&gt;
== Q: Why does userspace say &amp;quot;filesystem uses v1 dirs, limited functionality provided?&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Either you have a very old filesystem, or a rather new filesystem coupled with too-old userspace.&lt;br /&gt;
&lt;br /&gt;
Very old filesystems used a format called &amp;quot;directory version 1&amp;quot; and in this case the error is correct and hopefully self explanatory.&lt;br /&gt;
&lt;br /&gt;
However, if you have a newer filesystem with version 5 superblocks and the metadata CRC feature enabled, older releases of xfsprogs may incorrectly issue the &amp;quot;v1 dir&amp;quot; message.  In this case, get newer xfsprogs; at least v3.2.0, but preferably the latest release.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3003</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=3003"/>
		<updated>2016-12-05T21:02:57Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
The project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove previously set prjquota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then run the following and capture the output of the dmesg command:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will list all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
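&lt;br /&gt;
Applications can recognize the shutdown condition by checking the errno of a failed system call; a sketch (the numeric value 117 is Linux-specific, and the getattr fallback is a defensive assumption for platforms whose errno module lacks the constant):&lt;br /&gt;

```python
import errno
import os

# EUCLEAN ("Structure needs cleaning") is errno 117 on Linux.
EUCLEAN = getattr(errno, "EUCLEAN", 117)

def is_fs_corrupted(exc):
    """Return True when an OSError indicates an XFS shutdown/corruption."""
    return isinstance(exc, OSError) and exc.errno == EUCLEAN

err = OSError(EUCLEAN, os.strerror(EUCLEAN))
print(is_fs_corrupted(err))  # -> True
```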
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
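&lt;br /&gt;
Applications that need data on stable storage at a known point must ask for it explicitly. A minimal sketch using fsync (the file path is just an example):&lt;br /&gt;

```python
import os

def durable_write(path, data):
    """Write data and do not return until it has been pushed to the device.

    fsync() flushes both the file data and its metadata; without it, a
    crash can leave a correctly-sized file whose blocks were never
    written, which reads back as NULLs.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # push the kernel page cache to the disk

durable_write("/tmp/important.dat", b"must survive a crash\n")
```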
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you run a very high risk of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
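&lt;br /&gt;
On reasonably recent kernels the block layer exposes the cache behaviour it assumes for a device through sysfs; a sketch (sda is an example device, and the attribute may be absent on older kernels):&lt;br /&gt;

```python
import os

def write_cache_mode(dev="sda"):
    """Report the cache mode the kernel assumes for a block device.

    Returns 'write back', 'write through', or None if the sysfs
    attribute is not available (older kernels or missing device).
    """
    path = "/sys/block/%s/queue/write_cache" % dev
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip()

print(write_cache_mode())
```

If this reports &amp;quot;write back&amp;quot;, the kernel will issue cache flushes for that device; if &amp;quot;write through&amp;quot;, it assumes no volatile cache needs flushing.&lt;br /&gt;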
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is persistent for a SCSI disk. For a SATA/PATA disk, however, it needs to be done after every reset, as the drive resets back to its default of the write cache enabled. A reset can happen after a reboot or on error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to a bad situation after a powerfail with RAID-1: when only parts of the disk caches have been written, the controller does not even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled; this is fast, but not safe for your data&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If a BBM (battery backup module, which you really should use if you care about your data) is attached, the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;: since neither the controller cache nor the disk cache is safe in that case, you apparently don&#039;t care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting the cache of individual disks:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the unit cache can only be set together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to virtualize disk &lt;br /&gt;
writes in a way that even barriers no longer work, which means even an &lt;br /&gt;
fsync is not reliable. Tests confirm that by unplugging the power from &lt;br /&gt;
such a system, even one with a RAID controller with battery backed cache and &lt;br /&gt;
hard disk caches turned off (which is safe on a normal host), you can &lt;br /&gt;
destroy a database within the virtual machine (guest, domU, whatever you &lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual &lt;br /&gt;
disk. For the other products this information is still missing.&lt;br /&gt;
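As an illustration, a hypothetical qemu invocation might look like this (the image name and machine type are assumptions, and newer qemu versions spell the option cache=none rather than cache=off):&lt;br /&gt;

```shell
# Hypothetical example: boot a guest with host-side caching disabled for its disk.
# Older qemu releases used cache=off; current releases spell it cache=none.
qemu-system-x86_64 -drive file=guest.img,format=raw,cache=none
```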
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to the lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present from kernel v3.7 to v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID and a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controller&#039;s stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interprets sunit and swidth arguments as being specified in units of 512B sectors; unfortunately, that is not the unit they are reported in:&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume, for example, swidth=1024 specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and a block size of 4096 (the bsize reported in the mkfs.xfs output). You should then see swidth=128 reported by mkfs.xfs, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
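As a sanity check on this arithmetic, here is a small shell sketch using the 64k / RAID-6-of-8 geometry from the example above (the numbers are illustrative only):&lt;br /&gt;

```shell
# Stripe unit 64 KiB, RAID-6 over 8 disks = 6 data disks (example geometry above).
su_bytes=$((64 * 1024))
sw=6
# Equivalent sunit/swidth values in 512-byte sectors:
sunit=$((su_bytes / 512))                # 128 sectors
swidth=$((sunit * sw))                   # 768 sectors
# What mkfs.xfs/xfs_info report, in filesystem blocks of 4096 bytes:
bsize=4096
sunit_blocks=$((sunit * 512 / bsize))    # 16 blocks
swidth_blocks=$((swidth * 512 / bsize))  # 96 blocks
echo "sunit=${sunit} swidth=${swidth} (reported: ${sunit_blocks}/${swidth_blocks} blocks)"
```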
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
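For illustration, hypothetical /etc/exports entries (paths and client addresses are made up) might look like:&lt;br /&gt;

```shell
# Hypothetical /etc/exports fragment; paths and client spec are assumptions.
# Exporting the root of the filesystem works with the default fsid type:
#   /srv/xfs        192.168.0.0/24(rw,no_subtree_check)
# A subdirectory export can pin a non-default fsid instead:
#   /srv/xfs/data   192.168.0.0/24(rw,no_subtree_check,fsid=7513)
```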
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruption, so those might be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again; for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
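For illustration, a hypothetical mount invocation using these options might look like this (the device and mountpoint are assumptions; on newer kernels delayed logging became the default and the delaylog option was later removed, so check your kernel first):&lt;br /&gt;

```shell
# Hypothetical example; device and mountpoint are assumptions.
# delaylog applies to older kernels only - it later became the default
# behaviour and the mount option was removed.
mount -o logbsize=256k,delaylog /dev/sdb1 /data
```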
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
then it is very likely that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
This should make it work again.&lt;br /&gt;
If it does, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, depending especially on the size of the files in question (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
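To make the formula concrete, here is a small shell sketch (the extent counts are made-up examples, not taken from a real filesystem):&lt;br /&gt;

```shell
# Fragmentation factor as xfs_db "frag" computes it, as an integer percentage:
#   (actual extents - ideal extents) / actual extents
# Example numbers: 4 actual extents where 1 would be ideal.
actual=4
ideal=1
frag_pct=$(( (actual - ideal) * 100 / actual ))
echo "${frag_pct}%"    # prints 75%
```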
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, emitting the&lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
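For example (run as root; the path is as given above):&lt;br /&gt;

```shell
# Show the current background-trim interval in seconds (300 is the default):
cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
# Shorten it to 60 seconds; requires root:
echo 60 > /proc/sys/fs/xfs/speculative_prealloc_lifetime
```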
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation cannot be disabled, but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set, and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and doesn&#039;t get a chance to apply them to disk (due to a lockup, crash, hard reset, etc.). XFS will try to reapply these at mount time (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
With a big log to reapply, that process can take a very long time (minutes or even hours). A similar problem can happen at unmount, which can take hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;br /&gt;
&lt;br /&gt;
== Q: Which I/O scheduler for XFS? ==&lt;br /&gt;
&lt;br /&gt;
=== On rotational disks without hardware raid ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;CFQ&#039;&#039;: not great for XFS parallelism:&lt;br /&gt;
&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it doesn&#039;t allow other threads to get IO issued immediately after the first one&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; it waits, instead, for a timeslice to expire before moving to the IO of a different process.&lt;br /&gt;
  &amp;lt; dchinner&amp;gt; so instead of interleaving the IO of multiple jobs in a single sweep across the disk,&lt;br /&gt;
              it enforces single threaded access to the disk&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;deadline&#039;&#039;: a good option; it does not have this problem&lt;br /&gt;
&lt;br /&gt;
Note that some kernels have block multiqueue (blk-mq) enabled, which (currently - 08/2016) does not support I/O schedulers at all, so there is no optimisation or reordering of I/O into the best seek order. Disable blk-mq for rotational disks (see the CONFIG_SCSI_MQ_DEFAULT and CONFIG_DM_MQ_DEFAULT options and the use_blk_mq parameter of the scsi-mod/dm-mod kernel modules).&lt;br /&gt;
&lt;br /&gt;
Also, a hardware RAID controller can be smart enough to cache and reorder I/O requests, so an additional layer of reordering&lt;br /&gt;
(like the Linux I/O scheduler) can potentially conflict with it and make performance worse. If you have such a RAID card,&lt;br /&gt;
then try the method described below.&lt;br /&gt;
&lt;br /&gt;
=== SSD disks or rotational disks but with hardware raid card that has cache enabled ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Block multiqueue&#039;&#039; enabled (and thus no I/O scheduler at all), or block multiqueue disabled with the &#039;&#039;noop&#039;&#039; or &#039;&#039;deadline&#039;&#039; I/O scheduler activated, is a good solution. SSDs don&#039;t really need an I/O scheduler, and smart RAID cards do I/O ordering on their own.&lt;br /&gt;
&lt;br /&gt;
Note that if your RAID controller is very dumb and/or has no cache enabled, then it likely cannot reorder I/O requests and could therefore benefit from an I/O scheduler.&lt;br /&gt;
&lt;br /&gt;
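As a sketch, the active scheduler for a block device can be checked and changed at runtime via sysfs; sda is a placeholder device name, and writing requires root.&lt;br /&gt;

```shell
# The scheduler currently in use is shown in [brackets].
cat /sys/block/sda/queue/scheduler

# Switch a rotational disk to the deadline scheduler (needs root).
echo deadline > /sys/block/sda/queue/scheduler
```
&lt;br /&gt;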
== Q: Why does userspace say &amp;quot;filesystem uses v1 dirs, limited functionality provided?&amp;quot; ==&lt;br /&gt;
&lt;br /&gt;
Either you have a very old or a very new filesystem.  Very old filesystems used a format called &amp;quot;directory version 1&amp;quot; and in this case the error is self-explanatory.&lt;br /&gt;
&lt;br /&gt;
However, if you have a new filesystem with version 5 superblocks and the metadata CRC feature enabled, older releases of xfsprogs may incorrectly issue the &amp;quot;v1 dir&amp;quot; message.  In this case, get newer xfsprogs; at least v3.2.0, but preferably the latest release.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=3001</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=3001"/>
		<updated>2016-09-15T14:33:28Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= XFS mailing list =&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
As of October 1 2016, the XFS mailing list will move away from the long standing address of xfs@oss.sgi.com due to the prospective shutdown of the oss.sgi.com infrastructure.&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Archives ==&lt;br /&gt;
&lt;br /&gt;
Archives of both the old and new lists can be found at&lt;br /&gt;
&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
&lt;br /&gt;
== Other / Older XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://marc.info/?l=linux-xfs Marc] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber] (old oss.sgi.com list)&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=3000</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=3000"/>
		<updated>2016-09-15T14:31:08Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__FORCETOC__&lt;br /&gt;
&lt;br /&gt;
== XFS mailing list ==&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
As of October 1 2016, the XFS mailing list will move away from the long standing address of xfs@oss.sgi.com due to the prospective shutdown of the oss.sgi.com infrastructure.&lt;br /&gt;
&lt;br /&gt;
Archives of both the old and new lists can be found at&lt;br /&gt;
&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Other / Older XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://marc.info/?l=linux-xfs Marc] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber] (old oss.sgi.com list)&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2999</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2999"/>
		<updated>2016-09-15T14:23:54Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* XFS email list */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
Current archives of the linux-xfs@vger.kernel.org list can be found at&lt;br /&gt;
&lt;br /&gt;
* [http://www.spinics.net/lists/linux-xfs/ Spinics]&lt;br /&gt;
&lt;br /&gt;
(As of October 1 2016, the XFS mailing list will move away from the long standing address of xfs@oss.sgi.com because of the prospective shutdown of the oss.sgi.com infrastructure. See below for links to the old archives.)&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Old XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://marc.info/?l=linux-xfs Marc] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2998</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2998"/>
		<updated>2016-08-30T15:50:19Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Old XFS archives */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
As of September 2016, the XFS mailing list will move from the long-standing address of xfs@oss.sgi.com because of the prospective shutdown of the oss.sgi.com infrastructure. See below for links to the old archives.&lt;br /&gt;
&lt;br /&gt;
Current archives of the linux-xfs@vger.kernel.org list can be found at&lt;br /&gt;
&lt;br /&gt;
* [http://www.spinics.net/lists/linux-xfs/ Spinics]&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Old XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://marc.info/?l=linux-xfs Marc] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2997</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2997"/>
		<updated>2016-08-30T14:54:17Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Old XFS archives */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
As of September 2016, the XFS mailing list will move from the long-standing address of xfs@oss.sgi.com because of the prospective shutdown of the oss.sgi.com infrastructure. See below for links to the old archives.&lt;br /&gt;
&lt;br /&gt;
Current archives of the linux-xfs@vger.kernel.org list can be found at&lt;br /&gt;
&lt;br /&gt;
* [http://www.spinics.net/lists/linux-xfs/ Spinics]&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Old XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://marc.info/?l=linux-xfs Marc] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2996</id>
		<title>XFS email list and archives</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_email_list_and_archives&amp;diff=2996"/>
		<updated>2016-08-30T14:48:08Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Old XFS archives */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== XFS email list ==&lt;br /&gt;
Patches, comments, requests and questions should go to [mailto:linux-xfs@vger.kernel.org linux-xfs@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
As of September 2016, the XFS mailing list will move from the long-standing address of xfs@oss.sgi.com because of the prospective shutdown of the oss.sgi.com infrastructure. See below for links to the old archives.&lt;br /&gt;
&lt;br /&gt;
Current archives of the linux-xfs@vger.kernel.org list can be found at&lt;br /&gt;
&lt;br /&gt;
* [http://www.spinics.net/lists/linux-xfs/ Spinics]&lt;br /&gt;
&lt;br /&gt;
== Subscribing to the list ==&lt;br /&gt;
&lt;br /&gt;
Details for subscribing to the list can be found at the [http://vger.kernel.org/vger-lists.html#linux-xfs vger list info page].&lt;br /&gt;
&lt;br /&gt;
Subscribing is *only* possible by sending an email with the body:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;subscribe linux-xfs&amp;lt;/pre&amp;gt; &lt;br /&gt;
&lt;br /&gt;
to [mailto:majordomo@vger.kernel.org majordomo@vger.kernel.org]&lt;br /&gt;
&lt;br /&gt;
== Old XFS archives ==&lt;br /&gt;
&lt;br /&gt;
The list archives on oss.sgi.com are available [http://oss.sgi.com/archives/xfs here] (MHonArc) and [http://oss.sgi.com/pipermail/xfs here] (mailman).  Note that as we transition away from oss.sgi.com infrastructure, these archives will go stale or disappear.&lt;br /&gt;
&lt;br /&gt;
Other archives include:&lt;br /&gt;
* [https://www.spinics.net/lists/linux-xfs/ Spinics] (new vger.kernel.org list)&lt;br /&gt;
* [https://www.spinics.net/lists/xfs/ Spinics] (old oss.sgi.com list)&lt;br /&gt;
* [http://www.opensubscriber.com/messages/xfs@oss.sgi.com/topic.html OpenSubscriber]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=2989</id>
		<title>Improving Metadata Performance By Reducing Journal Overhead</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=2989"/>
		<updated>2016-04-12T16:07:24Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Future Directions - Journalling =&lt;br /&gt;
&lt;br /&gt;
From http://oss.sgi.com/archives/xfs/2008-09/msg00800.html&lt;br /&gt;
&lt;br /&gt;
== Improving Metadata Performance By Reducing Journal Overhead ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
XFS currently uses asynchronous write-ahead logging to ensure that changes to&lt;br /&gt;
the filesystem structure are preserved on crash.  It does this by logging&lt;br /&gt;
detailed records of the changes being made to each object on disk during a&lt;br /&gt;
transaction. Every byte that is modified needs to be recorded in the journal.&lt;br /&gt;
&lt;br /&gt;
There are two issues with this approach. The first is that transactions can&lt;br /&gt;
modify a *lot* of metadata to complete a single operation. Worse is the fact&lt;br /&gt;
that the average size of a transaction grows as structures get larger and&lt;br /&gt;
deeper, so performance on larger, fuller filesystems drops off as log bandwidth&lt;br /&gt;
is consumed by fewer, larger transactions.&lt;br /&gt;
&lt;br /&gt;
The second is that we re-log previous changes that are active in the journal&lt;br /&gt;
if the object is modified again. Hence if an object is modified repeatedly, the&lt;br /&gt;
dirty parts of the object get rewritten over and over again. In the worst case,&lt;br /&gt;
frequently logged buffers will be entirely dirty and so even if we only change&lt;br /&gt;
a single byte in the buffer we&#039;ll log the entire buffer.&lt;br /&gt;
&lt;br /&gt;
The problem can be approached along two different axes:&lt;br /&gt;
&lt;br /&gt;
	- reduce the amount we log in a given transaction&lt;br /&gt;
	- change the way we re-log objects.&lt;br /&gt;
&lt;br /&gt;
Both of these things give the same end result - we require less bandwidth to&lt;br /&gt;
the journal to log changes that are happening in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Asynchronous Transaction Aggregation ==&lt;br /&gt;
&lt;br /&gt;
Status: Done, known as delayed logging.&lt;br /&gt;
&lt;br /&gt;
Experimental in 2.6.35, stable for production in 2.6.37, planned for default&lt;br /&gt;
in 2.6.39.&lt;br /&gt;
&lt;br /&gt;
Design documentation can be found here:&lt;br /&gt;
&lt;br /&gt;
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs-delayed-logging-design.txt&lt;br /&gt;
&lt;br /&gt;
== Atomic Multi-Transaction Operations ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A feature asynchronous transaction aggregation makes possible is atomic&lt;br /&gt;
multi-transaction operations.  On the first transaction we hold the queue in&lt;br /&gt;
memory, preventing it from being committed. We can then do further transactions&lt;br /&gt;
that will end up in the same commit record, and on the final transaction we&lt;br /&gt;
unlock the async transaction queue. This will allow all those transactions to be&lt;br /&gt;
applied atomically. This is far simpler than any other method I&#039;ve been looking&lt;br /&gt;
at to do this.&lt;br /&gt;
&lt;br /&gt;
After a bit of reflection, I think this feature may be necessary for correct&lt;br /&gt;
implementation of existing logging techniques. The way we currently implement&lt;br /&gt;
rolling transactions (with permanent log reservations and rolling&lt;br /&gt;
dup/commit/re-reserve sequences) would seem to require all the commits in a&lt;br /&gt;
rolling transaction to be included in a single commit record.  If I understand&lt;br /&gt;
history and the original design correctly, these rolling transactions were&lt;br /&gt;
implemented so that large, complex transactions would not pin the tail of the&lt;br /&gt;
log as they progressed.  IOWs, they implicitly use re-logging to keep the tail&lt;br /&gt;
of the log moving forward as they progress and continue to modify items in the&lt;br /&gt;
transaction.&lt;br /&gt;
&lt;br /&gt;
Given we are using asynchronous transaction aggregation as a method of reducing&lt;br /&gt;
re-logging, it would make sense to prevent these sorts of transactions from&lt;br /&gt;
pinning the tail of the log at all. Further, because we are effectively&lt;br /&gt;
disturbing the concept of unique transactions, I don&#039;t think that allowing a&lt;br /&gt;
rolling transaction to span aggregated commits is valid as we are going to be&lt;br /&gt;
ignoring the transaction IDs that are used to identify individual transactions.&lt;br /&gt;
&lt;br /&gt;
Hence I think it is a good idea to simply replace rolling transactions with&lt;br /&gt;
atomic multi-transaction operations. This may also allow us to split some of&lt;br /&gt;
the large compound transactions into smaller, more self contained transactions.&lt;br /&gt;
This would reduce reservation pressure on log space in the common case where&lt;br /&gt;
all the corner cases in the transactions are not taken. In terms of&lt;br /&gt;
implementation, I think we can initially augment the permanent transaction&lt;br /&gt;
reservation/release interface to achieve this. With a working implementation,&lt;br /&gt;
we can then look to changing to a more explicit interface and slowly work to&lt;br /&gt;
remove the &#039;permanent log transaction&#039; concept entirely. This should simplify&lt;br /&gt;
the log code somewhat....&lt;br /&gt;
&lt;br /&gt;
Note: This asynchronous transaction aggregation is originally based on a&lt;br /&gt;
concept floated by Nathan Scott called &#039;Delayed Logging&#039; after observing how&lt;br /&gt;
ext3 implemented journalling.  This never passed more than a concept&lt;br /&gt;
description phase....&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Operation Based Logging ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The second approach to reducing log traffic is to change exactly what we&lt;br /&gt;
log in the transactions. At the moment, what we log is the exact change to&lt;br /&gt;
the item that is being made. For things like inodes and dquots, this isn&#039;t&lt;br /&gt;
particularly expensive because it is already a very compact form. The issue&lt;br /&gt;
comes with changes that are logged in buffers.&lt;br /&gt;
&lt;br /&gt;
The prime example of this is a btree modification that involves either removing&lt;br /&gt;
or inserting a record into a buffer. The records are kept in compact form, so an&lt;br /&gt;
insert or remove will also move other records around in the buffer. In the worst&lt;br /&gt;
case, a single insert or remove of a 16 byte record can dirty an entire block&lt;br /&gt;
(4k generally, but could be up to 64k). In this case, if we were to log the&lt;br /&gt;
btree operation (e.g. insert {record, index}) rather than the resultant change&lt;br /&gt;
on the buffer the overhead of a btree operation is fixed. Such logging also&lt;br /&gt;
allows us to avoid needing to log the changes due to splits and merges - we just&lt;br /&gt;
replay the operation and subsequent splits/merges get done as part of replay.&lt;br /&gt;
&lt;br /&gt;
The result of this is that complex transactions no longer need as much log space&lt;br /&gt;
as all possible change they can cause - we only log the basic operations that&lt;br /&gt;
are occurring and their result. Hence transaction end up being much smaller,&lt;br /&gt;
vary less in size between empty and full filesystems, etc. An example set of&lt;br /&gt;
operations describing all the changes made by an extent allocation on an inode&lt;br /&gt;
would be:&lt;br /&gt;
&lt;br /&gt;
	- inode X intent to allocate extent {off, len}&lt;br /&gt;
	- AGCNT btree update record in AG X {old rec} {new rec values}&lt;br /&gt;
	- AGBNO btree delete record in AG X {block, len}&lt;br /&gt;
	- inode X BMBT btree insert record {off, block, len}&lt;br /&gt;
	- inode X delta&lt;br /&gt;
&lt;br /&gt;
This comes down to a relatively small, bound amount of space which is close to the&lt;br /&gt;
minimum an existing allocation transaction would consume.  However, with this&lt;br /&gt;
method of logging the transaction size does not increase with the size of&lt;br /&gt;
structures or the amount of updates necessary to complete the operations.&lt;br /&gt;
&lt;br /&gt;
A major difference to the existing transaction system is that re-logging&lt;br /&gt;
of items doesn&#039;t fit very neatly with operation based logging. &lt;br /&gt;
&lt;br /&gt;
There are three main disadvantages to this approach:&lt;br /&gt;
&lt;br /&gt;
	- recovery becomes more complex - it will need to change substantially&lt;br /&gt;
	  to accommodate operation replay rather than just reading from disk&lt;br /&gt;
	  and applying deltas.&lt;br /&gt;
	- we have to create a whole new set of item types and add the necessary&lt;br /&gt;
	  hooks into the code to log all the operations correctly.&lt;br /&gt;
	- re-logging is probably not possible, and that introduces &lt;br /&gt;
	  differences to the way we&#039;ll need to track objects for flushing. It&lt;br /&gt;
	  may, in fact, require transaction IDs in all objects to allow us&lt;br /&gt;
	  to determine what the last transaction that modified the item&lt;br /&gt;
	  on disk was during recovery.&lt;br /&gt;
&lt;br /&gt;
Changing the logging strategy as described is a much more fundamental change to&lt;br /&gt;
XFS than asynchronous transaction aggregation. It will be difficult to change&lt;br /&gt;
to such a model in an evolutionary manner; it is more of a &#039;flag day&#039; style&lt;br /&gt;
change where the entire functionality needs to be added in one hit. Given that&lt;br /&gt;
we will also still have to support the old log format, it doesn&#039;t enable us to&lt;br /&gt;
remove any code, either.&lt;br /&gt;
&lt;br /&gt;
Given that we are likely to see major benefits in the problem workloads as a&lt;br /&gt;
result of asynchronous transaction aggregation, it may not be necessary to&lt;br /&gt;
completely rework the transaction subsystem. Combining aggregation with an&lt;br /&gt;
ongoing process of targeted reduction of transaction size will provide benefits&lt;br /&gt;
out to at least the medium term. It is unclear whether this direction will be&lt;br /&gt;
sufficient in the long run until we can measure the benefit that aggregation&lt;br /&gt;
will provide.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reducing Transaction Overhead ==&lt;br /&gt;
&lt;br /&gt;
Per iclog callback list locks: Done&lt;br /&gt;
&lt;br /&gt;
AIL tail pushing in its own thread: Done&lt;br /&gt;
&lt;br /&gt;
Bulk AIL insert and delete operations: Done&lt;br /&gt;
&lt;br /&gt;
Log grant lock split-up: Done&lt;br /&gt;
&lt;br /&gt;
Lock free transaction reserve path: Done&lt;br /&gt;
&lt;br /&gt;
Moving all of the log interfacing out of the direct transaction commit path may provide similar benefits to moving the AIL pushing into its own thread. This will mean that there will typically only be a single thread formatting and writing to iclog buffers. This will remove much of the parallelism that puts excessive pressure on many of these locks.&lt;br /&gt;
&lt;br /&gt;
== Reducing Recovery Time ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
With 2GB logs, recovery can take an awfully long time due to the need&lt;br /&gt;
to read each object synchronously as we process the journal. An obvious&lt;br /&gt;
way to avoid this is to add another pass to the processing to do asynchronous&lt;br /&gt;
readahead of all the objects in the log before doing the processing passes.&lt;br /&gt;
This will populate the cache as quickly as possible and hide any read latency&lt;br /&gt;
that could occur as we process commit records.&lt;br /&gt;
&lt;br /&gt;
A logical extension to this is to sort the objects in ascending offset order&lt;br /&gt;
before issuing I/O on them. That will further optimise the readahead I/O&lt;br /&gt;
to reduce seeking and hence should speed up the read phase of recovery&lt;br /&gt;
further.&lt;br /&gt;
&lt;br /&gt;
== ToDo ==&lt;br /&gt;
Further investigation of recovery for future optimisation.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Improving_inode_Caching&amp;diff=2988</id>
		<title>Improving inode Caching</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Improving_inode_Caching&amp;diff=2988"/>
		<updated>2016-04-12T16:06:37Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Future Directions for XFS - Inode Subsystems =&lt;br /&gt;
&lt;br /&gt;
From http://oss.sgi.com/archives/xfs/2008-09/msg00799.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Improving Inode Caching and Operation in XFS ==&lt;br /&gt;
&lt;br /&gt;
Thousand foot view:&lt;br /&gt;
&lt;br /&gt;
We want to drive inode lookup in a manner that is as parallel, scalable and low&lt;br /&gt;
overhead as possible. This means efficient indexing, lowering memory&lt;br /&gt;
consumption, simplifying the caching hierarchy, removing duplication and&lt;br /&gt;
reducing/removing lock traffic.&lt;br /&gt;
&lt;br /&gt;
In addition, we want to provide a good foundation for simplifying inode I/O,&lt;br /&gt;
improving writeback clustering, preventing RMW of inode buffers under memory&lt;br /&gt;
pressure, reducing creation and deletion overhead and removing writeback of&lt;br /&gt;
unlogged changes completely.&lt;br /&gt;
&lt;br /&gt;
There are a variety of features in disconnected trees and patch sets that need&lt;br /&gt;
to be combined to achieve this - the basic structure needed to implement this is&lt;br /&gt;
already in mainline and that is the radix tree inode indexing.  Further&lt;br /&gt;
improvements are going to be based around this structure and using it&lt;br /&gt;
effectively to avoid needing other indexing mechanisms.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Discussion:&lt;br /&gt;
&lt;br /&gt;
== Combining XFS and VFS inodes ==&lt;br /&gt;
&lt;br /&gt;
Status: Done (October 2008)&lt;br /&gt;
&lt;br /&gt;
== Compressed Inode Cache ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The XFS inode cache uses a lot of memory. We can avoid this problem by making&lt;br /&gt;
use of the compressed inode cache - only the active inodes are held in a&lt;br /&gt;
non-compressed form, hence most inodes will end up being cached in compressed&lt;br /&gt;
form rather than in the XFS/linux inode form.  The compressed form can reduce&lt;br /&gt;
the cached inode footprint to 200-300 bytes per inode instead of 1-1.1k that&lt;br /&gt;
they currently take on a 64bit system. Hence by moving to a compressed cache we&lt;br /&gt;
can greatly increase the number of inodes cached in a given amount of memory&lt;br /&gt;
which more than offsets any comparative increase we will see from inodes in&lt;br /&gt;
reclaim. The compressed cache should really have an LRU and a shrinker as well&lt;br /&gt;
so that memory pressure will slowly trim it as memory demands occur. [Note:&lt;br /&gt;
this compressed cache is discussed further later on in the reclaim context.]&lt;br /&gt;
&lt;br /&gt;
== Fixed Inode Cache Size ==&lt;br /&gt;
&lt;br /&gt;
It is worth noting that for embedded systems and appliances it may be worthwhile allowing&lt;br /&gt;
the size of the caches to be fixed. Also, to prevent memory fragmentation&lt;br /&gt;
problems, we could simply allocate that memory to the compressed cache slab.&lt;br /&gt;
In effect, this would become a &#039;static slab&#039; in that it has a bound maximum&lt;br /&gt;
size and never frees and memory. When the cache is full, we reclaim an&lt;br /&gt;
object out of it for reuse - this could be done by triggering the shrinker&lt;br /&gt;
to reclaim from the LRU. This would prevent the compressed inode cache from&lt;br /&gt;
consuming excessive amounts of memory in tightly constrained evironments.&lt;br /&gt;
Such an extension to the slab caches does not look difficult to implement,&lt;br /&gt;
and would allow such customisation with minimal deviation from mainline code.&lt;br /&gt;
&lt;br /&gt;
== Bypassing the Linux Inode Cache ==&lt;br /&gt;
&lt;br /&gt;
Lookups: Done (October 2008)&lt;br /&gt;
&lt;br /&gt;
Tracking dirty inodes: Done&lt;br /&gt;
&lt;br /&gt;
Writeback of dirty inodes: Done&lt;br /&gt;
&lt;br /&gt;
Writeback of dirty pages: still executed by the VFS&lt;br /&gt;
&lt;br /&gt;
Now that we can track dirty inodes ourselves, we can pretty much isolate&lt;br /&gt;
writeback of both data and inodes from the generic pdflush code. If we add a&lt;br /&gt;
hook high up in the pdflush path that simply passes us a writeback control&lt;br /&gt;
structure with the current writeback guidelines, we can do writeback within&lt;br /&gt;
those guidelines in the most optimal fashion for XFS.&lt;br /&gt;
&lt;br /&gt;
== Avoiding the Generic pdflush Code ==&lt;br /&gt;
&lt;br /&gt;
Writeback of inodes via AIL: Done&lt;br /&gt;
&lt;br /&gt;
For pdflush driven writeback, we only want to write back data; all other inode&lt;br /&gt;
writeback should be driven from the AIL (our time ordered dirty metadata list)&lt;br /&gt;
or xfssyncd in a manner that is most optimal for XFS.&lt;br /&gt;
&lt;br /&gt;
Furthermore, if we implement our own pdflush method, we can parallelise it in&lt;br /&gt;
several ways. We can ensure that each filesystem has its own flush thread or&lt;br /&gt;
thread pool, we can have a thread pool shared by all filesystems (like pdflush&lt;br /&gt;
currently operates), we can have a flush thread per inode radix tree, and so&lt;br /&gt;
on. The method of parallelisation is open for interpretation, but enabling&lt;br /&gt;
multiple flush threads to operate on a single filesystem is one of the necessary&lt;br /&gt;
requirements to avoid data writeback (and hence delayed allocation) being&lt;br /&gt;
limited to the throughput of a single CPU per filesystem.&lt;br /&gt;
&lt;br /&gt;
== Improving Inode Writeback == &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To optimise inode writeback, we really need to reduce the impact of inode&lt;br /&gt;
buffer read-modify-write cycles. XFS is capable of caching far more inodes in&lt;br /&gt;
memory than it has buffer space available for, so RMW cycles during inode&lt;br /&gt;
writeback under memory pressure are quite common. Firstly, we want to avoid&lt;br /&gt;
blocking pdflush at all costs.  Secondly, we want to issue as much localised&lt;br /&gt;
readahead as possible in ascending offset order to allow both elevator merging&lt;br /&gt;
of readahead and as little seeking as possible. Finally, we want to issue all&lt;br /&gt;
the write cycles as close together as possible to allow the same elevator and&lt;br /&gt;
I/O optimisations to take place.&lt;br /&gt;
&lt;br /&gt;
To do this, firstly we need the non-blocking inode flush semantics to issue&lt;br /&gt;
readahead on buffers that are not up-to-date rather than reading them&lt;br /&gt;
synchronously. Inode writeback already has the interface to handle inodes that&lt;br /&gt;
weren&#039;t flushed - we return EAGAIN from xfs_iflush() and the higher inode&lt;br /&gt;
writeback layers handle this appropriately. It would be easy to add another&lt;br /&gt;
flag to pass down to the buffer layer to say &#039;issue but don&#039;t wait for any&lt;br /&gt;
read&#039;. If we use a radix tree traversal to issue readahead in such a manner,&lt;br /&gt;
we&#039;ll get ascending offset readahead being issued.&lt;br /&gt;
&lt;br /&gt;
One problem with this is that we can issue too much readahead and thrash the&lt;br /&gt;
cache. A possible solution to this is to make the readahead a &#039;delayed read&#039;&lt;br /&gt;
and on I/O completion add it to a queue that holds a reference on the buffer.&lt;br /&gt;
If a followup read occurs soon after, we remove it from the queue and drop that&lt;br /&gt;
reference. This prevents the buffer from being reclaimed between the&lt;br /&gt;
readahead completing and the real read being issued. We should also issue this&lt;br /&gt;
delayed read on buffers that are in the cache so that they don&#039;t get reclaimed&lt;br /&gt;
to make room for the readahead.&lt;br /&gt;
&lt;br /&gt;
To prevent buildup of delayed read buffers, we can periodically purge them -&lt;br /&gt;
those that are older than a given age (say 5 seconds) can be removed from the&lt;br /&gt;
list and their reference dropped. This will free the buffer and allow its&lt;br /&gt;
pages to be reclaimed.&lt;br /&gt;
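&lt;br /&gt;
The delayed read queue described above might look something like the&lt;br /&gt;
following sketch; all structure and function names here are hypothetical,&lt;br /&gt;
not existing XFS interfaces:&lt;br /&gt;

```c
/* Sketch of the proposed delayed-read queue: buffers stay referenced for
 * a short window after readahead completes so they cannot be reclaimed
 * before the follow-up real read arrives; a periodic purge drops stale
 * entries. Locking is omitted for brevity. */
#include <assert.h>
#include <stddef.h>

#define DELRD_MAX_AGE 5	/* seconds, the age suggested in the text */

struct xfs_buf_stub {
	int nref;		/* reference count; queue holds one ref */
	long queued_at;		/* seconds timestamp when queued */
	struct xfs_buf_stub *next;
};

/* Readahead I/O completion: take a reference and queue the buffer. */
static void delrd_queue(struct xfs_buf_stub **head, struct xfs_buf_stub *bp,
			long now)
{
	bp->nref++;
	bp->queued_at = now;
	bp->next = *head;
	*head = bp;
}

/* Periodic purge: drop the queue's reference on entries older than
 * DELRD_MAX_AGE, allowing the buffer (and its pages) to be reclaimed.
 * Returns the number of buffers released. */
static int delrd_purge(struct xfs_buf_stub **head, long now)
{
	struct xfs_buf_stub **pp = head;
	int freed = 0;

	while (*pp) {
		struct xfs_buf_stub *bp = *pp;

		if (now - bp->queued_at >= DELRD_MAX_AGE) {
			*pp = bp->next;	/* unlink from the queue */
			bp->nref--;	/* last ref drop frees the buffer */
			freed++;
		} else {
			pp = &bp->next;
		}
	}
	return freed;
}
```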
&lt;br /&gt;
Once we have done the readahead pass, we can then do a modify and writeback&lt;br /&gt;
pass over all the inodes, knowing that there will be no read cycles to delay&lt;br /&gt;
this step. Once again, a radix tree traversal gives us ascending order&lt;br /&gt;
writeback and hence the modified buffers we send to the device will be in&lt;br /&gt;
optimal order for merging and minimal seek overhead.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Contiguous Inode Allocation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To make optimal use of the radix tree cache and enable wide-scale clustering of&lt;br /&gt;
inode writeback across multiple clusters, we really need to ensure that inode&lt;br /&gt;
allocation occurs in large contiguous chunks on disk. Right now we only&lt;br /&gt;
allocate chunks of 64 inodes at a time; ideally we want to allocate a stripe&lt;br /&gt;
unit (or a multiple of it) full of inodes at a time. This would allow inode&lt;br /&gt;
writeback clustering to do full stripe writes to the underlying RAID if there&lt;br /&gt;
are dirty inodes spanning the entire stripe unit.&lt;br /&gt;
&lt;br /&gt;
The problem with doing this is that we don&#039;t want to introduce the latency of&lt;br /&gt;
creating megabytes of inodes when only one is needed for the current operation.&lt;br /&gt;
Hence we need to push the inode creation into a background thread and use that&lt;br /&gt;
to create contiguous inode chunks asynchronously. This moves the actual on-disk&lt;br /&gt;
allocation of inodes out of the normal create path; it should always be able to&lt;br /&gt;
find a free inode without doing on disk allocation. This will simplify the&lt;br /&gt;
create path by removing the allocate-on-disk-then-retry-the-create double&lt;br /&gt;
transaction that currently occurs.&lt;br /&gt;
&lt;br /&gt;
As an aside, we could preallocate a small number of inodes in each AG (10-20MB&lt;br /&gt;
of inodes per AG?) without impacting mkfs time too greatly. This would allow&lt;br /&gt;
the filesystem to be used immediately on the first mount without triggering&lt;br /&gt;
lots of background allocation. This could alsobe done after the first mount&lt;br /&gt;
occurs, but that could interfere with typical benchmarking situations. Another&lt;br /&gt;
good reason for this preallocation is that it will help reduce xfs_repair&lt;br /&gt;
runtime for most common filesystem usages.&lt;br /&gt;
&lt;br /&gt;
One of the issues that the background create will cause is a substantial amount&lt;br /&gt;
of log traffic - every inode buffer initialised will be logged in whole. Hence&lt;br /&gt;
if we create a megabyte of inodes, we&#039;ll be causing a megabyte of log traffic&lt;br /&gt;
just for the inode buffers we&#039;ve initialised.  This is relatively simple to fix&lt;br /&gt;
- we don&#039;t log the buffer, we just log the fact that we need to initialise&lt;br /&gt;
inodes in a given range.  In recovery, when we see this transaction, then we&lt;br /&gt;
build the buffers, initialise them and write them out. Hence, we don&#039;t need to&lt;br /&gt;
log the buffers used to initialise the inodes.&lt;br /&gt;
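&lt;br /&gt;
The &#039;log the intent, not the buffers&#039; idea could be sketched as follows;&lt;br /&gt;
the field layout is illustrative only, not a proposed on-disk format:&lt;br /&gt;

```c
/* Sketch of logging inode-chunk initialisation by intent rather than by
 * buffer contents: the transaction records only the range to initialise,
 * and recovery rebuilds and writes the buffers itself. */
#include <assert.h>
#include <stdint.h>

struct icreate_intent {
	uint32_t agno;		/* allocation group holding the new chunk */
	uint32_t agbno;		/* first block of the chunk within the AG */
	uint32_t length;	/* blocks covered by the chunk */
	uint32_t count;		/* number of inodes to initialise */
	uint32_t isize;		/* inode size in bytes */
	uint32_t gen;		/* generation to stamp into each inode */
};

/* Log traffic comparison: the intent is a fixed couple of dozen bytes... */
static uint64_t logged_bytes_intent(void)
{
	return sizeof(struct icreate_intent);
}

/* ...while logging the initialised buffers costs the full chunk size,
 * e.g. a megabyte of log traffic for a megabyte of new inodes. */
static uint64_t logged_bytes_buffers(uint64_t count, uint64_t isize)
{
	return count * isize;
}
```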
&lt;br /&gt;
Also, we can use the background allocations to keep track of recently allocated&lt;br /&gt;
inode regions in the per-AG structure. Using that information to select the next inode to&lt;br /&gt;
be used rather than requiring btree searches on every create will greatly reduce&lt;br /&gt;
the CPU overhead of workloads that create lots of new inodes. It is not clear&lt;br /&gt;
whether a single background thread will be able to allocate enough inodes&lt;br /&gt;
to keep up with demand from the rest of the system - we may need multiple&lt;br /&gt;
threads for large configurations.&lt;br /&gt;
&lt;br /&gt;
== Single Block Inode Allocation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One of the big problems we have with filesystems that are approaching&lt;br /&gt;
full is that it can be hard to find a large enough extent to hold 64 inodes.&lt;br /&gt;
We&#039;ve had ENOSPC errors on inode allocation reported on filesystems that&lt;br /&gt;
are only 85% full. This is a sign of free space fragmentation, and it&lt;br /&gt;
prevents inode allocation from succeeding. We could (and should) write&lt;br /&gt;
a free space defragmenter, but that does not solve the problem - it&#039;s&lt;br /&gt;
reactive, not preventative.&lt;br /&gt;
&lt;br /&gt;
The main problem we have is that XFS uses inode chunk size and alignment&lt;br /&gt;
to optimise inode number to disk location conversion. That is, the conversion&lt;br /&gt;
becomes a single set of shifts and masks instead of an AGI btree lookup.&lt;br /&gt;
This optimisation substantially reduces the CPU and I/O overhead of&lt;br /&gt;
inode lookups, but it does limit our flexibility. If we break the&lt;br /&gt;
alignment restriction, every lookup has to go back to a btree search.&lt;br /&gt;
Hence we really want to avoid breaking chunk alignment and size&lt;br /&gt;
rules.&lt;br /&gt;
&lt;br /&gt;
An approach to avoiding violation of this rule is to be able to determine which&lt;br /&gt;
index to look up when parsing the inode number. For example, we could use the&lt;br /&gt;
high bit of the inode number to indicate that it is located in a non-aligned&lt;br /&gt;
inode chunk and hence needs to be looked up in the btree. This would avoid&lt;br /&gt;
the lookup penalty for correctly aligned inode chunks.&lt;br /&gt;
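&lt;br /&gt;
A sketch of this flag-bit dispatch, with the bit choice and the&lt;br /&gt;
64-inodes-per-chunk geometry assumed purely for illustration:&lt;br /&gt;

```c
/* Sketch of dispatching inode location lookups on a flag bit in the
 * inode number: aligned chunks keep the shift-and-mask fast path,
 * unaligned chunks fall back to an AGI btree search. */
#include <assert.h>
#include <stdint.h>

#define INO_UNALIGNED (1ULL << 63)	/* hypothetical "needs btree" flag */

enum lookup_path { FAST_MASK, BTREE };

static enum lookup_path ino_lookup_path(uint64_t ino)
{
	return (ino & INO_UNALIGNED) ? BTREE : FAST_MASK;
}

/* Fast path: with chunk size and alignment fixed at 64 inodes, the
 * offset of an inode within its chunk is a mask... */
static uint32_t ino_chunk_offset(uint64_t ino)
{
	return (uint32_t)(ino & 63);
}

/* ...and the chunk start is the flag stripped and the low bits masked. */
static uint64_t ino_chunk_start(uint64_t ino)
{
	return (ino & ~INO_UNALIGNED) & ~63ULL;
}
```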
&lt;br /&gt;
If we then redefine the meaning of the contents of the AGI btree record for&lt;br /&gt;
such inode chunks, we do not need a new index to keep these in. Effectively,&lt;br /&gt;
we need to add a bitmask to the record to indicate which blocks inside&lt;br /&gt;
the chunk can actually contain inodes. We still use aligned/sized records,&lt;br /&gt;
but mask out the sections that we are not allowed to allocate inodes in.&lt;br /&gt;
Effectively, this would allow sparse inode chunks. There may be limitations&lt;br /&gt;
on the resolution of sparseness depending on inode size and block size,&lt;br /&gt;
but for the common cases of 4k block size and 256 or 512 byte inodes I&lt;br /&gt;
think we can run a fully sparse mapping for each inode chunk.&lt;br /&gt;
&lt;br /&gt;
This would allow us to allocate inode extents of any alignment and size&lt;br /&gt;
that fits *inside* the existing alignment/size limitations. That is,&lt;br /&gt;
a single extent allocation could not span two btree records, but can&lt;br /&gt;
lie anywhere inside a single record. It also means that we can do&lt;br /&gt;
multiple extent allocations within one btree record to make optimal&lt;br /&gt;
use of the fragmented free space.&lt;br /&gt;
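&lt;br /&gt;
A sparse chunk record along these lines might look like the following&lt;br /&gt;
sketch; the one-bit-per-four-inodes hole granularity is just one plausible&lt;br /&gt;
choice, and the real resolution would depend on inode and block size:&lt;br /&gt;

```c
/* Sketch of a sparse inode chunk record: the record still describes an
 * aligned 64-inode chunk, but a hole mask marks the sub-regions that
 * contain no inodes at all. */
#include <assert.h>
#include <stdint.h>

struct sparse_chunk_rec {
	uint64_t startino;	/* first inode number of the aligned chunk */
	uint16_t holemask;	/* bit set = this 4-inode group is a hole */
	uint8_t  count;		/* inodes actually allocated on disk */
	uint64_t freemask;	/* existing free-inode bitmap, 1 bit/inode */
};

/* Can inode number 'ino' exist in this chunk at all, or does it fall
 * in a masked-out hole? */
static int chunk_ino_present(const struct sparse_chunk_rec *rec,
			     uint64_t ino)
{
	uint32_t off = (uint32_t)(ino - rec->startino);

	if (off >= 64)
		return 0;	/* outside the chunk entirely */
	return !(rec->holemask & (1u << (off / 4)));
}
```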
&lt;br /&gt;
It should be noted that this will probably have impact on some of the&lt;br /&gt;
inode cluster buffer mapping and clustering algorithms. It is not clear&lt;br /&gt;
exactly what the impact will be yet, but certainly write clustering will be affected.&lt;br /&gt;
Fortunately we&#039;ll be able to detect the inodes that will have this problem&lt;br /&gt;
by the high bit in the inode number.&lt;br /&gt;
&lt;br /&gt;
== Inode Unlink ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If we turn to look at unlink and reclaim interactions, there are a few&lt;br /&gt;
optimisations that can be made.  Firstly, we don&#039;t need to do inode inactivation&lt;br /&gt;
in reclaim threads - these transactions can easily be pushed to a background&lt;br /&gt;
thread. This means that xfs_inactive would be little more than a vmtruncate()&lt;br /&gt;
call and queuing to a workqueue. This will substantially speed up the processing&lt;br /&gt;
of prune_icache() - we&#039;ll get inodes moved into reclaim much faster than we do&lt;br /&gt;
right now.&lt;br /&gt;
&lt;br /&gt;
This will have a noticeable effect, though. When inodes are unlinked, the space&lt;br /&gt;
consumed by those inodes may not be immediately freed - it will be returned as&lt;br /&gt;
the inodes are processed through the reclaim threads. This means that userspace&lt;br /&gt;
monitoring tools such as &#039;df&#039; may not immediately reflect the result of a&lt;br /&gt;
completed unlink operation. This will be a user visible change in behaviour,&lt;br /&gt;
though in most cases should not affect anyone and for those that it does affect&lt;br /&gt;
a &#039;sync&#039; should be sufficient to wait for the space to be returned.&lt;br /&gt;
&lt;br /&gt;
Now that inodes to be unlinked are out of general circulation, we can make the&lt;br /&gt;
unlinked path more complex. It is desirable to move the unlinked list from the&lt;br /&gt;
inode buffer to the inode core, but that has locking implications for the incore&lt;br /&gt;
unlinked list. Hence we really need background thread processing to enable this to&lt;br /&gt;
work (i.e. being able to requeue inodes for later processing). To ensure that&lt;br /&gt;
the overhead of this work is not a limiting factor, we will probably need&lt;br /&gt;
multiple workqueue processing threads for this.&lt;br /&gt;
&lt;br /&gt;
Moving the logging to the inode core enables two things - it allows us to keep&lt;br /&gt;
an in-memory copy of the unlinked list off the per-AG structure, and that allows us to remove&lt;br /&gt;
xfs_inotobp(). The in-memory unlinked list means we don&#039;t have to read and&lt;br /&gt;
traverse the buffers every time we need to find the previous buffer to remove an&lt;br /&gt;
inode from the list, but it does mean we have to take the inode lock. If the&lt;br /&gt;
previous inode is locked, then we can&#039;t remove the inode from the unlinked list&lt;br /&gt;
so we must requeue it for this to occur at a later time.&lt;br /&gt;
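&lt;br /&gt;
The trylock-and-requeue handling for in-memory unlinked list removal could&lt;br /&gt;
be sketched as follows; names are illustrative, and the inode lock is&lt;br /&gt;
reduced to a flag for brevity:&lt;br /&gt;

```c
/* Sketch of removing an inode from an in-memory unlinked list: the
 * predecessor's forward pointer must be updated under its lock, and if
 * that lock cannot be taken the removal is requeued for later. */
#include <assert.h>
#include <stddef.h>

struct ino_stub {
	int locked;			/* stand-in for the inode lock */
	struct ino_stub *next_unlinked;
};

enum { REMOVED, REQUEUE };

static int unlinked_remove(struct ino_stub **head, struct ino_stub *ip)
{
	struct ino_stub **pp = head;
	struct ino_stub *prev = NULL;

	while (*pp && *pp != ip) {
		prev = *pp;
		pp = &prev->next_unlinked;
	}
	if (!*pp)
		return REMOVED;		/* not on the list: nothing to do */
	if (prev) {
		if (prev->locked)
			return REQUEUE;	/* trylock failed: retry later */
		prev->locked = 1;	/* take, update, drop the lock */
		prev->next_unlinked = ip->next_unlinked;
		prev->locked = 0;
	} else {
		*head = ip->next_unlinked;
	}
	ip->next_unlinked = NULL;
	return REMOVED;
}
```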
&lt;br /&gt;
Combined with the changes to inode create, we effectively will only use the&lt;br /&gt;
inode buffer in the transaction subsystem for marking the region stale when&lt;br /&gt;
freeing an inode chunk from disk (i.e. the default noikeep configuration). If&lt;br /&gt;
we are using large inode allocation, we don&#039;t want to be freeing random inode&lt;br /&gt;
chunks - this will just leave us with fragmented inode regions and undo all the&lt;br /&gt;
good work that was done originally.&lt;br /&gt;
&lt;br /&gt;
To avoid this, we should not be freeing inode chunks as soon as they no longer&lt;br /&gt;
contain any allocated inodes. We should periodically scan the AGI btree&lt;br /&gt;
looking for contiguous chunks that have no inodes allocated in them, and then&lt;br /&gt;
freeing the large contiguous regions we find in one go. It is likely this can&lt;br /&gt;
be done in a single transaction; it&#039;s one extent to be freed, along with a&lt;br /&gt;
contiguous set of records to be removed from the AGI btree so should not&lt;br /&gt;
require logging much at all. Also, the background scanning could be triggered&lt;br /&gt;
by a number of different events - low space in an AG, a large number of free&lt;br /&gt;
inodes in an AG, etc - as it doesn&#039;t need to be done frequently. As a result&lt;br /&gt;
of the lack of frequency that this needs to be done, it can probably be&lt;br /&gt;
handled by a single thread or delayed workqueue.&lt;br /&gt;
&lt;br /&gt;
Further optimisations are possible here - if we rule that the AGI btree is the&lt;br /&gt;
sole place that inodes are marked free or in-use (with the exception of&lt;br /&gt;
unlinked inodes attached to the AGI lists), then we can avoid the need to&lt;br /&gt;
write back unlinked inodes or read newly created inodes from disk.  This would&lt;br /&gt;
require all inodes to effectively use a random generation number assigned at&lt;br /&gt;
create time as we would not be reading it from disk - writing/reading the current&lt;br /&gt;
generation number appears to be the only real reason for doing this I/O. This&lt;br /&gt;
would require extra checks to determine if an inode is unlinked - if the&lt;br /&gt;
inode is not already in memory, we need to do an imap lookup rather than&lt;br /&gt;
reading it from disk and checking that it is valid. Avoiding the I/O, however, will greatly speed&lt;br /&gt;
up create and remove workloads. Note: the impact of this on the bulkstat algorithm&lt;br /&gt;
has not been determined yet.&lt;br /&gt;
&lt;br /&gt;
One of the issues we need to consider with this background inactivation is that&lt;br /&gt;
we will be able to defer a large quantity of inactivation transactions so we are&lt;br /&gt;
going to need to be careful about how much we allow to be queued. Simple queue&lt;br /&gt;
depth throttling should be all that is needed to keep this under control.&lt;br /&gt;
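&lt;br /&gt;
A minimal sketch of such queue depth throttling, with an arbitrary bound&lt;br /&gt;
chosen purely for illustration:&lt;br /&gt;

```c
/* Sketch of simple queue-depth throttling for background inactivation:
 * a producer that finds the queue at its bound must throttle (e.g. run
 * the inactivation inline or wait for the workqueue to drain) instead
 * of deferring more work. */
#include <assert.h>

#define INACTIVE_MAX_DEPTH 256	/* arbitrary illustrative bound */

struct inactive_queue {
	int depth;
};

/* Returns 1 if the caller may defer the transaction, 0 if it must
 * throttle. */
static int inactive_try_queue(struct inactive_queue *q)
{
	if (q->depth >= INACTIVE_MAX_DEPTH)
		return 0;
	q->depth++;
	return 1;
}

/* Worker side: an inactivation completed, so open a queue slot. */
static void inactive_complete(struct inactive_queue *q)
{
	q->depth--;
}
```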
&lt;br /&gt;
&lt;br /&gt;
== Reclaim Optimizations ==&lt;br /&gt;
&lt;br /&gt;
Tracking inodes for reclaim in radix tree: Done&lt;br /&gt;
&lt;br /&gt;
Using RCU for radix tree reclaim walks: Done&lt;br /&gt;
&lt;br /&gt;
Non-blocking background reclaim: Done&lt;br /&gt;
&lt;br /&gt;
Parallelised shrinker based reclaim: Done&lt;br /&gt;
&lt;br /&gt;
Now that we have efficient unlink, we&#039;ve got to handle the reclaim of all the&lt;br /&gt;
inodes that are now dead or simply not referenced. For inodes that are dirty,&lt;br /&gt;
we need to write them out to clean them. For inodes that are clean and not&lt;br /&gt;
unlinked, we need to compress them down for more compact storage. This involves&lt;br /&gt;
some CPU overhead, but it is worth noting that reclaiming of clean inodes&lt;br /&gt;
typically only occurs when we are under memory pressure.&lt;br /&gt;
&lt;br /&gt;
By compressing the XFS inode in this case, we are effectively reducing the&lt;br /&gt;
memory usage of the inode rather than freeing it directly. If we then get&lt;br /&gt;
another operation on that inode (e.g. the working set is slightly larger than&lt;br /&gt;
can be held in linux+XFS inode pairs), we avoid having to read the inode off&lt;br /&gt;
disk again - it simply gets uncompressed out of the cache. In essence we use&lt;br /&gt;
the compressed inode cache as an exclusive second level cache - it has higher&lt;br /&gt;
density than the primary cache and higher load latency and CPU overhead,&lt;br /&gt;
but it still avoids I/O in exactly the same manner as the primary cache.&lt;br /&gt;
&lt;br /&gt;
We cannot allow unrestricted build-up of reclaimable inodes - the memory they&lt;br /&gt;
consume will be large, so we should be aiming to compress reclaimable inodes as&lt;br /&gt;
soon as they are clean.  This will prevent buildup of memory consuming&lt;br /&gt;
uncompressed inodes that are not likely to be referenced again immediately.&lt;br /&gt;
&lt;br /&gt;
This clean inode reclamation process can be accelerated by triggering reclaim&lt;br /&gt;
on inode I/O completion. If the inode is clean and reclaimable we should&lt;br /&gt;
trigger immediate reclaim processing of that inode.  This will mean that&lt;br /&gt;
reclaim of newly cleaned inodes will not get held up behind reclaim of dirty&lt;br /&gt;
inodes.&lt;br /&gt;
&lt;br /&gt;
For inodes that are unlinked, we can simply free them in reclaim as they&lt;br /&gt;
are no longer in use. We don&#039;t want to poison the compressed cache with&lt;br /&gt;
unlinked inodes, nor do we need to because we can allocate new inodes&lt;br /&gt;
without incurring I/O.&lt;br /&gt;
&lt;br /&gt;
Still, we may end up with lots of inodes queued for reclaim. We may need&lt;br /&gt;
to implement a throttle mechanism to slow down the rate at which inodes&lt;br /&gt;
are queued for reclamation in the situation where the reclaim process&lt;br /&gt;
is not able to keep up. It should be noted that if we parallelise inode&lt;br /&gt;
writeback we should also be able to parallelise inode reclaim via&lt;br /&gt;
the same mechanism, so the need for throttling may be relatively low&lt;br /&gt;
if we can have multiple inodes under reclaim at once.&lt;br /&gt;
&lt;br /&gt;
It should be noted that complexity is exposed by interactions with concurrent&lt;br /&gt;
lookups, especially if we move to RCU locking on the radix tree. Firstly, we&lt;br /&gt;
need to be able to do an atomic swap of the compressed inode for the&lt;br /&gt;
uncompressed inode in the radix tree (and vice versa), to be able to tell them&lt;br /&gt;
apart (magic #), and to have atomic reference counts to ensure we can avoid use&lt;br /&gt;
after free situations when lookups race with compression or freeing.&lt;br /&gt;
&lt;br /&gt;
Secondly, with the complex unlink/reclaim interactions we will need to be&lt;br /&gt;
careful to detect inodes in the process of reclaim - the lookup process&lt;br /&gt;
will need to do different things depending on the state of reclaim. Indeed,&lt;br /&gt;
we will need to be able to cancel reclaim of an unlinked inode if we try&lt;br /&gt;
to allocate it before it has been fully unlinked or reclaimed. The same&lt;br /&gt;
can be said for an inode in the process of being compressed - if we get&lt;br /&gt;
a lookup during the compression process, we want to return the existing&lt;br /&gt;
inode, not have to wait, re-allocate and uncompress it again. These&lt;br /&gt;
are all solvable issues - they just add complexity.&lt;br /&gt;
&lt;br /&gt;
== Accelerated Reclaim of buftarg Page Cache for Inodes ==&lt;br /&gt;
&lt;br /&gt;
Per-buftarg buffer LRU reclaim: Done&lt;br /&gt;
&lt;br /&gt;
Per-buftarg shrinker: Done&lt;br /&gt;
&lt;br /&gt;
Per-buffer type reclaim prioritisation: Done&lt;br /&gt;
&lt;br /&gt;
For single use inodes or even read-only inodes, we read them in, use them, then&lt;br /&gt;
reclaim them. With the compressed cache, they&#039;ll get compressed and live a lot&lt;br /&gt;
longer in memory. However, we also will have the inode cluster buffer pages&lt;br /&gt;
sitting in memory for some length of time after the inode was read in. This can&lt;br /&gt;
consume a large amount of memory that will never be used again, and does not&lt;br /&gt;
get reclaimed until they are purged from the LRU by the VM.  It would be&lt;br /&gt;
advantageous to accelerate the reclaim of these pages so that they do not build&lt;br /&gt;
up unnecessarily.&lt;br /&gt;
&lt;br /&gt;
A better method would appear to be to leverage the delayed read queue&lt;br /&gt;
mechanism. This delayed read queue pins read buffers for a short period of&lt;br /&gt;
time, and then if they have not been referenced they get torn down.  If, as&lt;br /&gt;
part of this delayed read buffer teardown procedure we also free the backing&lt;br /&gt;
pages completely, we achieve the exact same result as having our own LRUs to&lt;br /&gt;
manage the page cache. This seems much simpler and a much more holistic&lt;br /&gt;
approach to solving the problem than implementing page LRUs.&lt;br /&gt;
&lt;br /&gt;
== Killing Bufferheads (a.k.a &amp;quot;Die, buggerheads, Die!&amp;quot;) ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[This is not strictly about inode caching, but doesn&#039;t fit into&lt;br /&gt;
other areas of development as closely as it does to inode caching&lt;br /&gt;
optimisations.]&lt;br /&gt;
&lt;br /&gt;
XFS is extent based. The Linux page cache is block based. Hence for&lt;br /&gt;
every cached page in memory, we have to attach a structure for mapping&lt;br /&gt;
the blocks on that page back to the on-disk location. In XFS, we also&lt;br /&gt;
use this to hold state for delayed allocation and unwritten extent blocks&lt;br /&gt;
so the generic code can do the right thing when necessary. We also&lt;br /&gt;
use it to avoid extent lookups at various times within the XFS I/O&lt;br /&gt;
path.&lt;br /&gt;
&lt;br /&gt;
However, this has a massive cost. While XFS might represent the&lt;br /&gt;
disk mapping of a 1GB extent in 24 bytes of memory, the page cache&lt;br /&gt;
requires 262,144 bufferheads (assuming 4k block size) to represent the&lt;br /&gt;
same mapping. That&#039;s roughly 14MB of memory needed to represent that.&lt;br /&gt;
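&lt;br /&gt;
The arithmetic behind these numbers, assuming roughly 56 bytes per&lt;br /&gt;
bufferhead (the size implied by the 14MB figure; the real structure size&lt;br /&gt;
varies by kernel version and configuration):&lt;br /&gt;

```c
/* Worked version of the bufferhead memory overhead calculation: one
 * bufferhead per filesystem block cached in the page cache. */
#include <assert.h>
#include <stdint.h>

static uint64_t bufferheads_for(uint64_t extent_bytes, uint64_t block_size)
{
	return extent_bytes / block_size;
}

static uint64_t bufferhead_overhead(uint64_t extent_bytes,
				    uint64_t block_size, uint64_t bh_size)
{
	return bufferheads_for(extent_bytes, block_size) * bh_size;
}

/* A 1GB extent at 4k block size needs 262,144 bufferheads; at ~56 bytes
 * each that is ~14MB, versus ~24 bytes for the XFS extent record. */
```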
&lt;br /&gt;
Chris Mason wrote an extent map representation for page cache state&lt;br /&gt;
and mappings for BTRFS; that code is mostly generic and could be&lt;br /&gt;
adapted to XFS. This would allow us to hold all the page cache state&lt;br /&gt;
in extent format and greatly reduce the memory overhead that it currently&lt;br /&gt;
has. The tradeoff is increased CPU overhead due to tree lookups where&lt;br /&gt;
structure lookups currently are used. Still, this has much lower&lt;br /&gt;
overhead than xfs_bmapi() based lookups, so the penalty is going to&lt;br /&gt;
be lower than if we did these lookups right now.&lt;br /&gt;
&lt;br /&gt;
If we make this change, we would then have three levels of extent&lt;br /&gt;
caching:&lt;br /&gt;
&lt;br /&gt;
	- the BMBT buffers&lt;br /&gt;
	- the XFS incore inode extent tree (iext*)&lt;br /&gt;
	- the page cache extent map tree&lt;br /&gt;
&lt;br /&gt;
Effectively, the XFS incore inode extent tree becomes redundant - all&lt;br /&gt;
the extent state it holds can be moved to the generic page cache tree&lt;br /&gt;
and we can do all our incore operations there. Our logging of changes&lt;br /&gt;
is based on the BMBT buffers, so getting rid of the iext layer would&lt;br /&gt;
not impact the transaction subsystem at all.&lt;br /&gt;
&lt;br /&gt;
Such integration with the generic code will also allow development&lt;br /&gt;
of generic writeback routines for delayed allocation, unwritten&lt;br /&gt;
extents, etc that are not specific to a given filesystem.&lt;br /&gt;
&lt;br /&gt;
== Demand Paging of Large Inode Extent Maps ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Currently the inode extent map is pinned in memory until the inode is&lt;br /&gt;
reclaimed. Hence an inode with millions of extents will pin a large&lt;br /&gt;
amount of memory and this can cause serious issues in low memory&lt;br /&gt;
situations. Ideally we would like to be able to page the extent&lt;br /&gt;
map in and out once they get to a certain size to avoid this&lt;br /&gt;
problem. This feature requires more investigation before an overall&lt;br /&gt;
approach can be detailed here.&lt;br /&gt;
&lt;br /&gt;
It should be noted that if we move to an extent-based page cache mapping&lt;br /&gt;
tree, the associated extent state tree can be used to track sparse&lt;br /&gt;
regions. That is, regions of the extent map that are not in memory&lt;br /&gt;
can be easily represented, and accesses to an unread region can then&lt;br /&gt;
be used to trigger demand loading.&lt;br /&gt;
&lt;br /&gt;
== Food For Thought (Crazy Ideas) ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If we are not using inode buffers for logging changes to inodes, we should&lt;br /&gt;
consider whether we need them at all. What benefit do the buffers bring us when&lt;br /&gt;
all we will use them for is read or write I/O?  Would it be better to go&lt;br /&gt;
straight to the buftarg page cache and do page based I/O via submit_bio()?&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Reliable_Detection_and_Repair_of_Metadata_Corruption&amp;diff=2987</id>
		<title>Reliable Detection and Repair of Metadata Corruption</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Reliable_Detection_and_Repair_of_Metadata_Corruption&amp;diff=2987"/>
		<updated>2016-04-12T16:05:00Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Future Directions - Reliability =&lt;br /&gt;
&lt;br /&gt;
From http://oss.sgi.com/archives/xfs/2008-09/msg00802.html&lt;br /&gt;
&lt;br /&gt;
== Reliable Detection and Repair of Metadata Corruption ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This can be broken down into specific phases. Firstly, we cannot repair a&lt;br /&gt;
corruption we have not detected. Hence the first thing we need to do is&lt;br /&gt;
reliable detection of errors and corruption. Once we can reliably detect errors&lt;br /&gt;
in structures and have verified that we are propagating all the errors reported from&lt;br /&gt;
lower layers into XFS correctly, we can look at ways of handling them more&lt;br /&gt;
robustly. In many cases, the same type of error needs to be handled differently&lt;br /&gt;
due to the context in which the error occurs.  This introduces extra complexity&lt;br /&gt;
into this problem.&lt;br /&gt;
&lt;br /&gt;
Rather than continually referring to specific types of problems (such as&lt;br /&gt;
corruption or error handling) I&#039;ll refer to them as &#039;exceptions&#039;. This avoids&lt;br /&gt;
thinking about specific error conditions through specific paths and so helps us&lt;br /&gt;
to look at the issues from a more general or abstract point of view.&lt;br /&gt;
&lt;br /&gt;
== Exception Detection ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Our current approach to exception detection is entirely reactive and rather&lt;br /&gt;
slapdash - we read a metadata block from disk and check certain aspects of it&lt;br /&gt;
(e.g. the magic number) to determine if it is the block we wanted. We have no&lt;br /&gt;
way of verifying that it is the correct block of metadata of the type&lt;br /&gt;
we were trying to read; just that it is one of that specific type. We&lt;br /&gt;
do bounds checking on critical fields, but this can&#039;t detect bit errors&lt;br /&gt;
in those fields. There are many fields we don&#039;t even bother to check because&lt;br /&gt;
their range of valid values is not limited.&lt;br /&gt;
&lt;br /&gt;
Effectively, this can be broken down into three separate areas:&lt;br /&gt;
&lt;br /&gt;
	- ensuring what we&#039;ve read is exactly what we wrote&lt;br /&gt;
	- ensuring what we&#039;ve read is the block we were supposed to read&lt;br /&gt;
	- robust contents checking&lt;br /&gt;
&lt;br /&gt;
Firstly, if we introduce a mechanism that we can use to ensure what we read is&lt;br /&gt;
something that the filesystem wrote, we can detect a whole range of exceptions&lt;br /&gt;
that are caused in layers below the filesystem (software and hardware). The&lt;br /&gt;
best method for this is to use a guard value that travels with the metadata it&lt;br /&gt;
is guarding. The guard value needs to be derived from the contents of the&lt;br /&gt;
block being guarded. Any event that changes the guard or the contents it is&lt;br /&gt;
guarding will immediately trigger an exception handling process when the&lt;br /&gt;
metadata is read in. Some examples of what this will detect are:&lt;br /&gt;
&lt;br /&gt;
	- bit errors in media/busses/memory after guard is calculated&lt;br /&gt;
	- uninitialised blocks being returned from lower layers (dmcrypt&lt;br /&gt;
	  had a readahead cancelling bug that could do this)&lt;br /&gt;
	- zeroed sectors as a result of double sector failures&lt;br /&gt;
	  in RAID5 systems&lt;br /&gt;
	- overwrite by data blocks&lt;br /&gt;
	- partial overwrites (e.g. due to power failure)&lt;br /&gt;
&lt;br /&gt;
The simplest method for doing this is introducing a checksum or CRC into each&lt;br /&gt;
block. We can calculate this for each different type of metadata being written&lt;br /&gt;
just before they are written to disk, hence we are able to provide a guard that&lt;br /&gt;
travels all the way to and from disk with the metadata itself. Given that&lt;br /&gt;
metadata blocks can be a maximum of 64k in size, we don&#039;t need a hugely complex&lt;br /&gt;
CRC or number of bits to protect blocks of this size. A 32 bit CRC will allow&lt;br /&gt;
us to reliably detect 15 bit errors on a 64k block, so this would catch almost&lt;br /&gt;
all types of bit error exceptions that occur. It will also detect almost all&lt;br /&gt;
other types of major content change that might occur due to an exception.&lt;br /&gt;
It has been noted that we should select the guard algorithm to be one that&lt;br /&gt;
has (or is targeted for) widespread hardware acceleration support.&lt;br /&gt;
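As a concrete illustration, the guard could be a CRC32c computed over the block with the stored CRC field zeroed. The following is a minimal sketch under an assumed header layout; the bitwise loop stands in for the table-driven or hardware-accelerated versions a real implementation would use:&lt;br /&gt;

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC32c (Castagnoli polynomial, reflected form 0x82F63B78). */
static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
	}
	return ~crc;
}

/* Hypothetical metadata block header; only the crc field matters here. */
struct meta_hdr {
	uint32_t magic;
	uint32_t crc;
};

/* Compute the guard over the whole block with the stored crc zeroed,
 * so the guard does not depend on itself. */
static uint32_t block_guard(void *block, size_t size)
{
	struct meta_hdr *hdr = block;
	uint32_t saved = hdr->crc;
	uint32_t guard;

	hdr->crc = 0;
	guard = crc32c(0, block, size);
	hdr->crc = saved;
	return guard;
}
```

On write, the result of block_guard() is stored in the header; on read, a mismatch with a freshly computed guard triggers the exception path before any contents are used.&lt;br /&gt;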
&lt;br /&gt;
The other advantage this provides us with is a very fast method of determining&lt;br /&gt;
if a corrupted btree is a result of a lower layer problem or indeed an XFS&lt;br /&gt;
problem. That is, instead of always getting a WANT_CORRUPTED_GOTO btree&lt;br /&gt;
exception and shutdown, we&#039;ll get a &#039;bad CRC&#039; exception before we even start&lt;br /&gt;
processing the contents. This will save us much time when triaging corrupt&lt;br /&gt;
btrees - we won&#039;t spend time chasing problems that result from (potentially&lt;br /&gt;
silent or unhandled) lower layer exceptions.&lt;br /&gt;
&lt;br /&gt;
While a metadata block guard will protect us against content change, it won&#039;t&lt;br /&gt;
protect us against blocks that are written to the wrong location on disk. This,&lt;br /&gt;
unfortunately, happens more often than anyone would like and can be very&lt;br /&gt;
difficult to track down when it does occur. To protect against this problem,&lt;br /&gt;
metadata needs to be self-describing on disk. That is, if we read a block&lt;br /&gt;
on disk, there needs to be enough information in that block to determine&lt;br /&gt;
that it is the correct block for that location.&lt;br /&gt;
&lt;br /&gt;
Currently we have a very simplistic method of determining that we really have&lt;br /&gt;
read the correct block - the magic numbers in each metadata structure.  This&lt;br /&gt;
only enables us to identify type - we still need location and filesystem to&lt;br /&gt;
really determine if the block we&#039;ve read is the correct one. We need the&lt;br /&gt;
filesystem identifier because misdirected writes can cross filesystem&lt;br /&gt;
boundaries.  This is easily done by including the UUID of the filesystem in&lt;br /&gt;
every individually referencable metadata structure on disk.&lt;br /&gt;
&lt;br /&gt;
For block based metadata structures such as btrees, AG headers, etc, we&lt;br /&gt;
can add the block number directly to the header structures hence enabling&lt;br /&gt;
easy checking. e.g. for btree blocks, we already have sibling pointers in the&lt;br /&gt;
header, so adding a long &#039;self&#039; pointer makes a great deal of sense.&lt;br /&gt;
For inodes, adding the inode number into the inode core will provide exactly&lt;br /&gt;
the same protection - we&#039;ll now know that the inode we are reading is the&lt;br /&gt;
one we are supposed to have read. We can make similar modifications to dquots&lt;br /&gt;
to make them self identifying as well.&lt;br /&gt;
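A sketch of what such a self-describing check might look like; the structure layout and names here are assumptions for illustration, not the actual on-disk format:&lt;br /&gt;

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical self-describing metadata header. */
struct self_desc_hdr {
	uint32_t magic;     /* structure type */
	uint64_t blkno;     /* where this block believes it lives */
	uint8_t  uuid[16];  /* owning filesystem */
};

/* Returns 0 only if the block read matches type, location and filesystem. */
static int verify_self_desc(const struct self_desc_hdr *hdr, uint32_t magic,
			    uint64_t blkno, const uint8_t fs_uuid[16])
{
	if (hdr->magic != magic)
		return -1;	/* wrong structure type */
	if (hdr->blkno != blkno)
		return -2;	/* misdirected write: wrong location */
	if (memcmp(hdr->uuid, fs_uuid, 16))
		return -3;	/* block belongs to another filesystem */
	return 0;
}
```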
&lt;br /&gt;
Now that we are able to verify that the metadata we read from disk is what we&lt;br /&gt;
wrote and that it is the correct metadata block, the only thing remaining is more&lt;br /&gt;
robust checking of the content. In many cases we already do this in DEBUG&lt;br /&gt;
code but not in runtime code. For example, when we read an inode cluster&lt;br /&gt;
in we only check the first inode for a matching magic number, whereas in&lt;br /&gt;
debug code we check every inode in the cluster.&lt;br /&gt;
&lt;br /&gt;
In some cases, there is not much point in doing this sort of detailed checking;&lt;br /&gt;
it&#039;s pretty hard to check the validity of the contents of a btree block without&lt;br /&gt;
doing a full walk of the tree and that is prohibitive overhead for production&lt;br /&gt;
systems. The added block guards and self identifiers should be sufficient to&lt;br /&gt;
catch all non-filesystem based exceptions in this case, whilst the existing&lt;br /&gt;
exception detection should catch all others. With the btree factoring that&lt;br /&gt;
is being done for this work, all of the btrees should end up protected by&lt;br /&gt;
WANT_CORRUPTED_GOTO runtime exception checking.&lt;br /&gt;
&lt;br /&gt;
We also need to verify that metadata is sane before we use it. For example, if&lt;br /&gt;
we pull a block number out of a btree record in a block that has passed all&lt;br /&gt;
other validity checks, it may still be invalid due to corruption prior to writing&lt;br /&gt;
it to disk. In these cases we need to ensure the block number lands&lt;br /&gt;
within the filesystem and/or within the bounds of the specific AG.&lt;br /&gt;
&lt;br /&gt;
Similar checking is needed for pretty much any forward or backwards reference&lt;br /&gt;
we are going to follow or use in an algorithm somewhere. This will help&lt;br /&gt;
prevent kernel panics caused by out-of-bounds references (e.g. using an unchecked AG&lt;br /&gt;
number to index the per-AG array) by turning them into a handled exception&lt;br /&gt;
(which will initially be a shutdown). That is, we will turn a total system&lt;br /&gt;
failure into a (potentially recoverable) filesystem failure.&lt;br /&gt;
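For example, a reference check of this kind could be as simple as the following; the geometry structure and field names are illustrative:&lt;br /&gt;

```c
#include <stdint.h>

/* Illustrative filesystem geometry. */
struct fs_geom {
	uint32_t agcount;   /* number of allocation groups */
	uint32_t agblocks;  /* blocks per allocation group */
};

/* Validate an (AG number, AG block) reference before using it as an
 * index or following it, turning an out-of-bounds pointer into a
 * handled exception instead of a kernel panic. */
static int agbno_valid(const struct fs_geom *g, uint32_t agno, uint32_t agbno)
{
	return agno < g->agcount && agbno < g->agblocks;
}
```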
&lt;br /&gt;
Another failure that we often have reported is that XFS has &#039;hung&#039; and&lt;br /&gt;
triage indicates that the filesystem appears to be waiting for a metadata&lt;br /&gt;
I/O completion to occur. We have seen in the past I/O errors not being&lt;br /&gt;
propagated from the lower layers back into the filesystem causing these&lt;br /&gt;
sort of problems. We have also seen cases where there have been silent&lt;br /&gt;
I/O errors and the first thing to go wrong is &#039;XFS has hung&#039;.&lt;br /&gt;
&lt;br /&gt;
To catch situations like this, we need to track all I/O we have in flight and&lt;br /&gt;
have some method of timing them out.  That is, if we haven&#039;t completed the I/O&lt;br /&gt;
in N seconds, issue a warning and enter an exception handling process that&lt;br /&gt;
attempts to deal with the problem.&lt;br /&gt;
&lt;br /&gt;
My initial thought on this is that it could be implemented via the MRU cache&lt;br /&gt;
without much extra code being needed.  The complexity with this is that we&lt;br /&gt;
can&#039;t catch data read I/O because we use the generic I/O path for read. We do&lt;br /&gt;
our own data write and metadata read/write, so we can easily add hooks to track&lt;br /&gt;
all these types of I/O. Hence we will initially target just metadata I/O as&lt;br /&gt;
this would only need to hook into the xfs_buf I/O submission layer.&lt;br /&gt;
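To make the idea concrete, here is a minimal userspace sketch of tracking in-flight I/O and scanning for timeouts; the fixed table and names are assumptions, not the MRU cache implementation discussed above:&lt;br /&gt;

```c
#include <stdint.h>
#include <time.h>

#define MAX_INFLIGHT	64
#define IO_TIMEOUT_SECS	30

struct inflight_io {
	uint64_t blkno;     /* block being read/written */
	time_t   submitted; /* when the I/O was issued */
	int      live;      /* still waiting for completion? */
};

static struct inflight_io inflight[MAX_INFLIGHT];

/* Record a submitted I/O; returns a slot index, or -1 if table is full. */
static int io_track(uint64_t blkno, time_t now)
{
	for (int i = 0; i < MAX_INFLIGHT; i++) {
		if (!inflight[i].live) {
			inflight[i].blkno = blkno;
			inflight[i].submitted = now;
			inflight[i].live = 1;
			return i;
		}
	}
	return -1;
}

static void io_done(int slot)
{
	inflight[slot].live = 0;
}

/* Count I/Os outstanding for too long; the real system would warn and
 * start an exception handling process for each one. */
static int io_scan_stuck(time_t now)
{
	int stuck = 0;

	for (int i = 0; i < MAX_INFLIGHT; i++)
		if (inflight[i].live &&
		    now - inflight[i].submitted >= IO_TIMEOUT_SECS)
			stuck++;
	return stuck;
}
```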
&lt;br /&gt;
To further improve exception detection, once guards and self-describing&lt;br /&gt;
structures are on disk, we can add filesystem scrubbing daemons that can verify&lt;br /&gt;
the structure of the filesystem pro-actively. That is, we can use background&lt;br /&gt;
processes to discover degradation in the filesystem before it is found by a&lt;br /&gt;
user-initiated operation. This gives us the ability to do exception handling in&lt;br /&gt;
a context that enables further checking and potential repair of the exception.&lt;br /&gt;
This sort of exception handling may not be possible if we are in a&lt;br /&gt;
user-initiated I/O context, and certainly not if we are in a transaction&lt;br /&gt;
context.&lt;br /&gt;
&lt;br /&gt;
This will also allow us to detect errors in rarely referenced parts of&lt;br /&gt;
the filesystem, thereby giving us advance warning of degradation in filesystems&lt;br /&gt;
that we might not otherwise get (e.g. in systems without media scrubbing).&lt;br /&gt;
Ideally, data scrubbing would need to be done as well, but without data guards&lt;br /&gt;
it is rather hard to detect that there&#039;s been a change in the data....&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Exception Handling ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Once we can detect exceptions, we need to handle them in a sane manner.&lt;br /&gt;
The method of exception handling is two-fold:&lt;br /&gt;
&lt;br /&gt;
	- retry (write) or cancel (read) asynchronous I/O&lt;br /&gt;
	- shut down the filesystem (fatal).&lt;br /&gt;
&lt;br /&gt;
Effectively, we either defer non-critical failures to a later point in&lt;br /&gt;
time or we come to a complete halt and prevent the filesystem from being&lt;br /&gt;
accessed further. We have no other methods of handling exceptions.&lt;br /&gt;
&lt;br /&gt;
If we look at the different types of exceptions we can have, they&lt;br /&gt;
broadly fall into:&lt;br /&gt;
&lt;br /&gt;
	- media read errors&lt;br /&gt;
	- media write errors&lt;br /&gt;
	- successful media read, corrupted contents&lt;br /&gt;
&lt;br /&gt;
The context in which the errors occur also influences the exception processing&lt;br /&gt;
that is required. For example, an unrecoverable metadata read error within a&lt;br /&gt;
dirty transaction is a fatal error, whilst the same error during a read-only&lt;br /&gt;
operation will simply log the error to syslog and return an error to userspace.&lt;br /&gt;
&lt;br /&gt;
Furthermore, the storage subsystem plays a part in deciding how to handle&lt;br /&gt;
errors. The reason is that in many storage configurations I/O errors can be&lt;br /&gt;
transient. For example, in a SAN a broken fibre can cause a failover to a&lt;br /&gt;
redundant path, however the inflight I/O on the failed path is usually timed out and&lt;br /&gt;
an error returned. We don&#039;t want to shut down the filesystem on such an error -&lt;br /&gt;
we want to wait for failover to a redundant path and then retry the I/O. If the&lt;br /&gt;
failover succeeds, then the I/O will succeed. Hence any robust method of&lt;br /&gt;
exception handling needs to consider that I/O exceptions may be transient.&lt;br /&gt;
&lt;br /&gt;
In the absence of redundant metadata, there is little we can do right now&lt;br /&gt;
on a permanent media read error. There are a number of approaches we&lt;br /&gt;
can take for handling the exception:&lt;br /&gt;
&lt;br /&gt;
	- try reading the block again. Normally we don&#039;t get an error&lt;br /&gt;
	  returned until the device has given up on trying to recover it.&lt;br /&gt;
	  If it&#039;s a transient failure, then we should eventually get a&lt;br /&gt;
	  good block back. If a retry fails, then:&lt;br /&gt;
&lt;br /&gt;
	- inform the lower layer that it needs to perform recovery on that&lt;br /&gt;
	  block before trying to read it again. For path failover situations,&lt;br /&gt;
	  this should block until a redundant path is brought online. If no&lt;br /&gt;
	  redundant path exists or recovery from parity/error coding blocks&lt;br /&gt;
	  fails, then we cannot recover the block and we have a fatal error&lt;br /&gt;
	  situation.&lt;br /&gt;
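The escalation above can be sketched as follows; the read and recovery callbacks, and the simulated flaky device used to exercise them, are assumptions standing in for the lower layers, not a real block layer API:&lt;br /&gt;

```c
#include <stdint.h>

enum read_status { READ_OK, READ_ERR };

typedef enum read_status (*read_fn)(uint64_t blkno, void *buf);
typedef int (*recover_fn)(uint64_t blkno); /* e.g. wait for path failover */

/* Escalating read recovery: plain retry first, then ask the lower layer
 * to recover the block, then give up. */
static enum read_status read_with_recovery(uint64_t blkno, void *buf,
					   read_fn do_read, recover_fn recover)
{
	if (do_read(blkno, buf) == READ_OK)
		return READ_OK;
	if (do_read(blkno, buf) == READ_OK)	/* transient error? */
		return READ_OK;
	if (recover && recover(blkno) == 0 &&
	    do_read(blkno, buf) == READ_OK)	/* recovered path/parity */
		return READ_OK;
	return READ_ERR;			/* fatal: repair or shutdown */
}

/* Simulated flaky device for demonstration. */
static int sim_failures;

static enum read_status sim_read(uint64_t blkno, void *buf)
{
	(void)blkno; (void)buf;
	if (sim_failures > 0) {
		sim_failures--;
		return READ_ERR;
	}
	return READ_OK;
}

static int sim_recover(uint64_t blkno)
{
	(void)blkno;
	sim_failures = 0;		/* failover fixed the path */
	return 0;
}
```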
&lt;br /&gt;
Ultimately, however, we reach a point where we have to give up - the metadata&lt;br /&gt;
no longer exists on disk and we have to enter a repair process to fix the&lt;br /&gt;
problem. That is, shut down the filesystem and get a human to intervene&lt;br /&gt;
and fix the problem.&lt;br /&gt;
&lt;br /&gt;
At this point, the only way we can prevent a shutdown situation from occurring&lt;br /&gt;
is to have redundant metadata on disk. That is, whenever we get an error&lt;br /&gt;
reported, we can immediately retry by reading from an alternate metadata block.&lt;br /&gt;
If we can read from the alternate block, we can continue onwards without&lt;br /&gt;
the user even knowing there is a bad block in the filesystem. Of course, we&#039;d&lt;br /&gt;
need to log the event for the administrator to take action on at some point&lt;br /&gt;
in the future.&lt;br /&gt;
&lt;br /&gt;
Even better, we can mostly avoid this intervention if we have alternate&lt;br /&gt;
metadata blocks. That is, we can repair blocks that are returning read errors&lt;br /&gt;
during the exception processing. In the case of media errors, they can&lt;br /&gt;
generally be corrected simply by re-writing the block that was returning the&lt;br /&gt;
error. This will force drives to remap the bad blocks internally so the next&lt;br /&gt;
read from that location will return valid data. This, if my understanding is&lt;br /&gt;
correct, is the same process that ZFS and BTRFS use to recover from and correct&lt;br /&gt;
such errors.&lt;br /&gt;
&lt;br /&gt;
NOTE: Adding redundant metadata can be done in several different ways. I&#039;m not&lt;br /&gt;
going to address that here as it is a topic all to itself. The focus of this&lt;br /&gt;
document is to outline how the redundant metadata could be used to enhance&lt;br /&gt;
exception processing and prevent a large number of cases where we currently&lt;br /&gt;
shut down the filesystem.&lt;br /&gt;
&lt;br /&gt;
TODO:&lt;br /&gt;
	Transient write error&lt;br /&gt;
	Permanent write error&lt;br /&gt;
	Corrupted data on read&lt;br /&gt;
	Corrupted data on write (detected during guard calculation)&lt;br /&gt;
	I/O timeouts&lt;br /&gt;
	Memory corruption&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reverse Mapping ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
It is worth noting that even redundant metadata doesn&#039;t solve all our&lt;br /&gt;
problems. Realistically, all that redundant metadata gives us is the ability&lt;br /&gt;
to recover from top-down traversal exceptions. It does not help exception&lt;br /&gt;
handling of occurrences such as double sector failures (i.e. loss of redundancy&lt;br /&gt;
and a metadata block). Double sector failures are the most common cause&lt;br /&gt;
of RAID5 data loss - loss of a disk followed by a sector read error during&lt;br /&gt;
rebuild on one of the remaining disks.&lt;br /&gt;
&lt;br /&gt;
In this case, we&#039;ve got a block on disk that is corrupt. We know what block it&lt;br /&gt;
is, but we have no idea who the owner of the block is. If it is a metadata&lt;br /&gt;
block, then we can recover it if we have redundant metadata.  Even if this is&lt;br /&gt;
user data, we still want to be able to tell them what file got corrupted by the&lt;br /&gt;
failure event.  However, without doing a top-down traverse of the filesystem we&lt;br /&gt;
cannot find the owner of the block that was corrupted.&lt;br /&gt;
&lt;br /&gt;
This is where we need a reverse block map. Every time we do an allocation of&lt;br /&gt;
an extent we know who the owner of the block is. If we record this information&lt;br /&gt;
in a separate tree then we can do a simple lookup to find the owner of any&lt;br /&gt;
block and start an exception handling process to repair the damage. Ideally&lt;br /&gt;
we also need to include information about the type of block as well. For&lt;br /&gt;
example, an inode can own:&lt;br /&gt;
&lt;br /&gt;
	- data blocks&lt;br /&gt;
	- data fork BMBT blocks&lt;br /&gt;
	- attribute blocks&lt;br /&gt;
	- attribute fork BMBT blocks&lt;br /&gt;
&lt;br /&gt;
So keeping track of owner + type would help indicate what sort of exception&lt;br /&gt;
handling needs to take place. For example, a missing data fork BMBT block means there&lt;br /&gt;
will be unreferenced extents across the filesystem. These &#039;lost extents&#039;&lt;br /&gt;
could be recovered by reverse map traversal to find all the BMBT and data&lt;br /&gt;
blocks owned by that inode and finding the ones that are not referenced.&lt;br /&gt;
If the reverse map held sufficient extra metadata - such as the offset within the&lt;br /&gt;
file for the extent - the exception handling process could rebuild the BMBT&lt;br /&gt;
tree completely without needing any external help.&lt;br /&gt;
&lt;br /&gt;
It would seem to me that the reverse map needs to be a long-pointer format&lt;br /&gt;
btree and held per-AG. It needs long pointers because the owner of an extent&lt;br /&gt;
can be anywhere in the filesystem, and it needs to be per-AG to avoid adverse&lt;br /&gt;
effect on allocation parallelism.&lt;br /&gt;
&lt;br /&gt;
The format of the reverse map record will be dependent on the amount of&lt;br /&gt;
metadata we need to store. We need:&lt;br /&gt;
&lt;br /&gt;
	- owner (64 bit, primary record)&lt;br /&gt;
	- {block, len} extent descriptor&lt;br /&gt;
	- type&lt;br /&gt;
	- per-type specific metadata (e.g. offset for data types).&lt;br /&gt;
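A 32-byte record covering these fields might look like the following; the layout and names are purely illustrative, not a proposed on-disk format:&lt;br /&gt;

```c
#include <stdint.h>

/* One possible 32-byte reverse-map record. */
struct rmap_rec {
	uint64_t owner;       /* primary record: owning inode/metadata */
	uint64_t startblock;  /* first block of the extent */
	uint32_t blockcount;  /* extent length in blocks */
	uint32_t type;        /* data, BMBT, attr, attr-fork BMBT, ... */
	uint64_t offset;      /* per-type metadata: file offset for data */
};
```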
&lt;br /&gt;
Looking at the worst case here, say we have 32 bytes per record, the worst-case&lt;br /&gt;
space usage of the reverse map btree would be roughly 62 records per 4k&lt;br /&gt;
block. With a 1TB allocation group, we have 2^28 4k blocks in the AG&lt;br /&gt;
that could require unique reverse mappings. That gives us roughly 2^22&lt;br /&gt;
4k blocks for the reverse map, or 2^34 bytes - roughly 16GB per 1TB&lt;br /&gt;
of space.&lt;br /&gt;
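The estimate above can be checked directly; this is a rough worst-case calculation using the assumed 62-records-per-block figure:&lt;br /&gt;

```c
#include <stdint.h>

/* Worst-case reverse-map footprint for a 1TB AG, 4k blocks,
 * 32-byte records, roughly 62 records per 4k btree block. */
static uint64_t rmap_worst_case_bytes(void)
{
	uint64_t blksz = 4096;
	uint64_t ag_blocks = (1ULL << 40) / blksz;	/* 2^28 blocks */
	uint64_t recs_per_block = 62;
	uint64_t map_blocks = (ag_blocks + recs_per_block - 1)
						/ recs_per_block; /* ~2^22 */
	return map_blocks * blksz;	/* ~2^34 bytes, roughly 16GB */
}
```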
&lt;br /&gt;
It may be a good idea to allocate this space at mkfs time (tagged as unwritten&lt;br /&gt;
so it doesn&#039;t need zeroing) to avoid allocation overhead and potential free&lt;br /&gt;
space fragmentation as the reverse map index grows and shrinks. If we do&lt;br /&gt;
this we could even treat this as an array/skip list where a given block in the&lt;br /&gt;
AG has a fixed location in the map. This will require more study to determine&lt;br /&gt;
the advantages and disadvantages of such approaches.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Recovering From Errors During Transactions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
One of the big problems we face with exception recovery is what to do&lt;br /&gt;
when we take an exception inside a dirty transaction. At present, any&lt;br /&gt;
error is treated as a fatal error, the transaction is cancelled and&lt;br /&gt;
the filesystem is shut down. Even though we may have a context which&lt;br /&gt;
can return an error, we are unable to revert the changes we have&lt;br /&gt;
made during the transaction and so cannot back out.&lt;br /&gt;
&lt;br /&gt;
Effectively, a cancelled dirty transaction looks exactly like in-memory&lt;br /&gt;
structure corruption. That is, what is in memory differs from what is&lt;br /&gt;
on disk, in the log or in asynchronous transactions yet to be written&lt;br /&gt;
to the log. Hence we cannot simply return an error and continue.&lt;br /&gt;
&lt;br /&gt;
To be able to do this, we need to be able to undo changes made in a given&lt;br /&gt;
transaction. The method XFS uses for journalling - write-ahead logging -&lt;br /&gt;
makes this difficult to do. A transaction proceeds in the following&lt;br /&gt;
order:&lt;br /&gt;
&lt;br /&gt;
	- allocate transaction&lt;br /&gt;
	- reserve space in the journal for transaction&lt;br /&gt;
	- repeat until change is complete:&lt;br /&gt;
		- lock item&lt;br /&gt;
		- join item to transaction&lt;br /&gt;
		- modify item&lt;br /&gt;
		- record region of change to item&lt;br /&gt;
	- transaction commit&lt;br /&gt;
&lt;br /&gt;
Effectively, we modify structures in memory then record where we&lt;br /&gt;
changed them for the transaction commit to write to disk. Unfortunately,&lt;br /&gt;
this means we overwrite the original state of the items in memory,&lt;br /&gt;
leaving us with no way to back out those changes from memory if&lt;br /&gt;
something goes wrong.&lt;br /&gt;
&lt;br /&gt;
However, based on the observation that we are supposed to join an item to the&lt;br /&gt;
transaction *before* we start modifying it, it is possible to record the state&lt;br /&gt;
of the item before we start changing it. That is, we have a hook that can&lt;br /&gt;
allow us to take a copy of the unmodified item when we join it to the&lt;br /&gt;
transaction.&lt;br /&gt;
&lt;br /&gt;
If we have an unmodified copy of the item in memory, then if the transaction&lt;br /&gt;
is cancelled when dirty, we have the information necessary to undo, or roll&lt;br /&gt;
back, the changes made in the transaction. This would allow us to return&lt;br /&gt;
the in-memory state to that prior to the transaction starting, thereby&lt;br /&gt;
ensuring that the in-memory state matches the rest of the filesystem and&lt;br /&gt;
allowing us to return an error to the calling context.&lt;br /&gt;
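A minimal sketch of this join-time snapshot and rollback, using illustrative names rather than the actual log item API:&lt;br /&gt;

```c
#include <stdlib.h>
#include <string.h>

/* Copy an item's unmodified state when it joins the transaction so a
 * dirty cancel can roll the in-memory copy back. */
struct xact_item {
	void   *data;    /* live in-memory metadata */
	size_t  size;
	void   *shadow;  /* pristine copy taken at join time */
};

static int item_join(struct xact_item *it)
{
	it->shadow = malloc(it->size);
	if (!it->shadow)
		return -1;
	memcpy(it->shadow, it->data, it->size);
	return 0;
}

/* Commit: changes are kept, the snapshot is discarded. */
static void item_commit(struct xact_item *it)
{
	free(it->shadow);
	it->shadow = NULL;
}

/* Cancel: restore the in-memory state to match the rest of the
 * filesystem, so an error can be returned instead of shutting down. */
static void item_cancel(struct xact_item *it)
{
	memcpy(it->data, it->shadow, it->size);
	free(it->shadow);
	it->shadow = NULL;
}
```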
&lt;br /&gt;
This is not without overhead. We would have to copy every metadata item&lt;br /&gt;
entirely in every transaction. This will increase the CPU overhead&lt;br /&gt;
of each transaction as well as the memory required. It is the memory&lt;br /&gt;
requirement more than the CPU overhead that concerns me - we may need&lt;br /&gt;
to ensure we have a memory pool associated with transaction reservation&lt;br /&gt;
that guarantees us enough memory is available to complete the transaction.&lt;br /&gt;
However, given that we could roll back transactions, we could now *fail&lt;br /&gt;
transactions* with ENOMEM and not have to shut down the filesystem, so this&lt;br /&gt;
may be an acceptable trade-off.&lt;br /&gt;
&lt;br /&gt;
In terms of implementation, it is worth noting that there is debug code in&lt;br /&gt;
the xfs_buf_log_item for checking that all the modified regions of a buffer&lt;br /&gt;
were logged. Importantly, this is implemented by copying the original buffer&lt;br /&gt;
in the item initialisation when it is first attached to a transaction. In&lt;br /&gt;
other words, this debug code implements the mechanism we need to be able&lt;br /&gt;
to roll back changes made in a transaction. Other item types would require&lt;br /&gt;
similar changes to be made.&lt;br /&gt;
&lt;br /&gt;
Overall, this doesn&#039;t look like a particularly complex change to make; the&lt;br /&gt;
only real question is how much overhead it is going to introduce. With CPUs&lt;br /&gt;
growing more cores all the time, and XFS being aimed at extremely&lt;br /&gt;
multi-threaded workloads, this overhead may not be a concern for long.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Failure Domains ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If we plan to have redundant metadata, or even try to provide fault isolation&lt;br /&gt;
between different parts of the filesystem namespace, we need to know about&lt;br /&gt;
independent regions of the filesystem. &#039;Independent Regions&#039; (IR) are ranges&lt;br /&gt;
of the filesystem block address space that don&#039;t share resources with&lt;br /&gt;
any other range.&lt;br /&gt;
&lt;br /&gt;
A classic example of a filesystem made up of multiple IRs is a linear&lt;br /&gt;
concatenation of multiple drives into a larger address space.  The address&lt;br /&gt;
space associated with each drive can operate independently from the other&lt;br /&gt;
drives, and a failure of one drive will not affect the operation of the address&lt;br /&gt;
spaces associated with other drives in the linear concatenation.&lt;br /&gt;
&lt;br /&gt;
A Failure Domain (FD) is made up of one or more IRs. IRs cannot be shared&lt;br /&gt;
between FDs - IRs are not independent if they are shared! Effectively, an&lt;br /&gt;
IR is an encoding of the address space within the filesystem that lower level&lt;br /&gt;
failures (from below the filesystem) will not propagate outside. The geometry&lt;br /&gt;
and redundancy in the underlying storage will determine the nature of the&lt;br /&gt;
IRs available to the filesystem.&lt;br /&gt;
&lt;br /&gt;
To use redundant metadata effectively for recovering from fatal lower layer&lt;br /&gt;
loss or corruption, we really need to be able to place said redundant&lt;br /&gt;
metadata in different FDs. That way a loss in one domain can be recovered&lt;br /&gt;
from a domain that is still intact. It also means that it is extremely&lt;br /&gt;
difficult to lose or corrupt all copies of a given piece of metadata;&lt;br /&gt;
that would require multiple independent faults to occur in a localised&lt;br /&gt;
temporal window. Concurrent multiple component failure in multiple&lt;br /&gt;
IRs is considered to be quite unlikely - if such an event were to&lt;br /&gt;
occur, it is likely that there is more to worry about than filesystem&lt;br /&gt;
consistency (like putting out the fire in the data center).&lt;br /&gt;
&lt;br /&gt;
Another use of FDs is to try to minimise the number of domain boundaries&lt;br /&gt;
each object in the filesystem crosses. If an object is wholly contained&lt;br /&gt;
within a FD, and that object is corrupted, then the repair problem is&lt;br /&gt;
isolated to that FD, not the entire filesystem. That is, by making&lt;br /&gt;
allocation strategies and placement decisions aware of failure domain&lt;br /&gt;
boundaries we can constrain the location of related data and metadata.&lt;br /&gt;
Once locality is constrained, the scope of repairing an object if&lt;br /&gt;
it becomes corrupted is reduced to that of ensuring the FD is consistent.&lt;br /&gt;
&lt;br /&gt;
There are many ways of limiting cross-domain dependencies; I will&lt;br /&gt;
not try to detail them here. Likewise, there are many ways of introducing&lt;br /&gt;
such information into XFS - mkfs, dynamically via allocation policies,&lt;br /&gt;
etc - so I won&#039;t try to detail them, either. The main point to be&lt;br /&gt;
made is that to make full use of redundant metadata and to reduce&lt;br /&gt;
the scope of common repair problems, we need to pay attention to&lt;br /&gt;
how the system can fail, so that we can recover from failures&lt;br /&gt;
as quickly as possible.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2984</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2984"/>
		<updated>2015-12-14T16:39:32Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q. Should barriers be enabled with storage which has a persistent write cache? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward-compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and mounting it again with grpquota (group quota) remove the previously set prjquota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; (i.e. unmounted) when moved. If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the page size of the architecture are supported: 4k for i386, ppc, etc.; 8k for alpha, sparc, etc.). Make sure that the directory format is version 2 on the IRIX filesystems (the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition in question to do so. Delete the partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, and then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
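&lt;br /&gt;
As a sketch, here is what the grow procedure looks like on top of LVM. The volume group, logical volume and mount point names below are hypothetical, and the commands are only echoed as a dry run rather than executed:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical names - substitute your own before running anything for real.
lv=/dev/vg0/data    # assumed LVM logical volume backing the filesystem
mnt=/srv/data       # assumed mount point of the XFS filesystem
echo "lvextend -L +100G $lv"    # step 1: grow the underlying block device
echo "xfs_growfs $mnt"          # step 2: grow XFS to fill the new space
```
&lt;br /&gt;
Note that &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; takes the mount point, not the device, and operates on the mounted filesystem.&lt;br /&gt;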
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. First, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
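A minimal sketch that collects the version and configuration items above (it only prints to stdout, and skips &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; if xfsprogs is not installed; the RAID/LVM details and xfs_info output still need to be gathered by hand):&lt;br /&gt;
&lt;br /&gt;
```shell
# Print basic system information for an XFS bug report.
uname -a                              # kernel version
if command -v xfs_repair; then        # prints the binary path if installed
    xfs_repair -V                     # xfsprogs version
fi
nproc                                 # number of CPUs
head -n 3 /proc/meminfo               # memory summary (attach the full file)
cat /proc/mounts                      # mounted filesystems and mount options
cat /proc/partitions                  # block device layout
```
&lt;br /&gt;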
Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30 second to 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
The dmesg output will list all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
then either you do not have XFS support compiled into the kernel (or you forgot to load the module), or you did not use the &amp;quot;-t xfs&amp;quot; option with mount or the &amp;quot;xfs&amp;quot; filesystem type in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks are immediately re-used and overwritten, there is a small chance of recovering the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. To preserve ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
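&lt;br /&gt;
To illustrate, here is a minimal sketch of writing a file so that its data - not just its metadata - reaches stable storage before a crash; the file path is just an example. GNU &#039;&#039;&#039;dd&#039;&#039;&#039; with &amp;lt;tt&amp;gt;conv=fsync&amp;lt;/tt&amp;gt; calls fsync() on the output file before exiting:&lt;br /&gt;
&lt;br /&gt;
```shell
# Write data and fsync() it before dd exits, so the file contents survive
# a crash or power loss that happens right afterwards.
printf 'important data\n' | dd of=/tmp/xfs_fsync_demo.txt conv=fsync status=none
cat /tmp/xfs_fsync_demo.txt
```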
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage carries a very high risk of big data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued.  A power failure then &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller whose battery-backed cache is in write-back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with the names from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
Disabling the write cache is persistent for a SCSI disk. For a SATA/PATA disk, however, it needs to be done after every reset, as the drive will revert to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled, keeping the rest of the filesystem on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem: for example, the tail of the log is moved when we are notified that a metadata write has completed to disk, and that cannot be guaranteed if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAIDs have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc.  The same may be true of some SSD devices.  This sort of hardware should report to the operating system that no flushes are required, and in that case barriers will not be issued, even without the &amp;quot;nobarrier&amp;quot; option.  Quoting Christoph Hellwig [http://oss.sgi.com/archives/xfs/2015-12/msg00281.html on the xfs list],&lt;br /&gt;
  If the device does not need cache flushes it should not report requiring&lt;br /&gt;
  flushes, in which case nobarrier will be a noop.  Or to phrase it&lt;br /&gt;
  differently:  If nobarrier makes a difference skipping it is not safe.&lt;br /&gt;
On modern kernels with hardware which properly reports write cache behavior, there is no need to change barrier options at mount time.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to give a definitive answer because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery- or flash-backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery-backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected against a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery- or flash-backed cache, you should seriously consider disabling the write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types that it is hard to generalize. Typically these controllers have no cache of their own, but leave the hard disk write caches on. That can lead to a bad situation: after a power failure with RAID-1, when only parts of the disk caches have been written out, the controller does not even notice that the disks are out of sync, because the disks can reorder cached blocks and may have saved the superblock information while losing different data contents. So turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drives&#039; caches:&lt;br /&gt;
 arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back, which means write cache on; wt = write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module - which you really should, if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you do not care about your data and just want high speed (which is what you get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the unit cache can only be set together. To protect your data, turn it off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to also virtualize disk writes, in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system can destroy a database within the virtual machine (client, domU, or whatever you call it), even with a RAID controller with battery-backed cache and the hard disk caches turned off - a configuration which is safe on a normal host.&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option defining the virtual disk. For other products, this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/archives/xfs/2009-01/msg01023.html growing an XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to lower filesystem blocks.  To fix this, [http://oss.sgi.com/archives/xfs/2009-01/msg01031.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
example:[https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present from kernel v3.7 to v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt; in kernel v3.17.&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the remove process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
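&lt;br /&gt;
The stripe-width arithmetic in the examples above can be sketched as a tiny helper (a hypothetical illustration only, not part of any XFS tool):&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical helper: number of data disks (sw) for common RAID levels.
# Parity and mirror disks do not count toward the stripe width.
def data_disks(raid_level, total_disks):
    if raid_level == 5:
        return total_disks - 1   # one parity disk per stripe
    if raid_level == 6:
        return total_disks - 2   # two parity disks per stripe
    if raid_level == 10:
        return total_disks // 2  # half the disks hold mirror copies
    raise ValueError(raid_level)

print(data_disks(6, 8))    # RAID-6 of 8 disks: 6 data disks
print(data_disks(10, 16))  # RAID-10 of 16 disks: 8 data disks
```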
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume, for example, swidth 1024 (specified at the mkfs.xfs command line, so 1024 512B sectors) and a block size of 4096 (the bsize reported in mkfs.xfs output). You should then see swidth 128 reported by mkfs.xfs: 128 * 4096 == 1024 * 512.&lt;br /&gt;
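&lt;br /&gt;
As a sanity check, the unit conversions described here can be expressed in a few lines of Python (an illustrative sketch, not an XFS tool):&lt;br /&gt;
&lt;br /&gt;
```python
# Convert mkfs.xfs su (bytes) and sw (data disks) to sunit/swidth in
# 512B sectors, plus the bsize multiples that mkfs.xfs and xfs_info report.
def stripe_units(su_bytes, sw_disks, bsize=4096):
    sunit_sectors = su_bytes // 512
    swidth_sectors = sunit_sectors * sw_disks
    sunit_bsize = sunit_sectors * 512 // bsize
    swidth_bsize = swidth_sectors * 512 // bsize
    return sunit_sectors, swidth_sectors, sunit_bsize, swidth_bsize

# su=64k, sw=6 gives sunit=128 sectors and swidth=768 sectors,
# reported by mkfs.xfs as 16 and 96 blocks with bsize=4096.
print(stripe_units(64 * 1024, 6))
```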
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32 bits of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, and using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so those releases appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that were created with an inode &amp;gt;32 bits anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
When asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented.)  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
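&lt;br /&gt;
The formula is easy to evaluate by hand; a quick sketch of the arithmetic:&lt;br /&gt;
&lt;br /&gt;
```python
# xfs_db frag factor: percentage of extents beyond the ideal count.
def frag_factor(actual_extents, ideal_extents):
    return 100.0 * (actual_extents - ideal_extents) / actual_extents

print(frag_factor(2, 1))  # 2 extents per file: 50.0 percent
print(frag_factor(4, 1))  # 4 extents per file: 75.0 percent
```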
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emitting the&lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation can not be disabled but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set and thus the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;br /&gt;
&lt;br /&gt;
== Q: mount (or umount) takes minutes or even hours - what could be the reason ? ==&lt;br /&gt;
&lt;br /&gt;
In some cases the XFS log (journal) can become quite big, for example if it accumulates many entries and doesn&#039;t get a chance to apply them to disk (due to a lockup, crash, hard reset, etc.). XFS will try to reapply these at mount (in dmesg: &amp;quot;Starting recovery (logdev: internal)&amp;quot;).&lt;br /&gt;
&lt;br /&gt;
With a big log to reapply, that process can take a very long time (minutes or even hours). A similar problem can happen with unmount taking hours when there are hundreds of thousands of dirty inodes in memory that need to be flushed to disk.&lt;br /&gt;
&lt;br /&gt;
(http://oss.sgi.com/pipermail/xfs/2015-October/044457.html)&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2948</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2948"/>
		<updated>2014-10-07T22:54:54Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: Why do I receive No space left on device after xfs_growfs? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Is umounting prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) removing prjquota limits previously set from fs (and vice versa) ? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after the partition to do so. Remove the partition, and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to grow the filesystem into the larger partition. Note that editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. First, your machine hardware and storage configuration need to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
then refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. To preserve ACLs and extended attributes you will need &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4), or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED], which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we have [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this first happens. These messages contain important information that gives developers hints about the earliest point at which a problem was detected. The shutdown itself is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLs in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of filesystem metadata sitting in those caches is high enough that a power outage risks significant data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued. A power failure &amp;quot;only&amp;quot; loses data still in the cache; no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with a battery backed cache running in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes they will harm performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this differs for each RAID controller; see the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because of the [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because of the [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8), which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with the names from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which writes the modified values back, allowing you to reset the cache attributes (note the one changed value, which turns the write cache off).&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it must be redone after every reset, since the drive reverts to its default of write cache enabled; a reset can happen on reboot or during error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17; it can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this reported in the log, if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance; turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and does not reset randomly like some common ones do. But take care that the individual hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not the ones found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used to buffer writes for speed. This backed cache should ensure that if power fails or a PSU dies, the contents of the cache are written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to generalize. Usually these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that after a power failure with RAID-1, when only parts of the disk caches have been written, the controller does not even see that the disks are out of sync: the disks can reorder cached blocks, and may have saved the superblock info but then lost different data contents. So turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drives&#039; caches:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back (write cache on), wt=write through (write cache off), so &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe anyway, so it assumes you just want high speed (which is what you get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks&#039; caches:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk&lt;br /&gt;
writes in a way that defeats barriers, which means even an fsync is not&lt;br /&gt;
reliable. Tests confirm that unplugging the power from such a system can&lt;br /&gt;
destroy a database within the virtual machine (client, domU, whatever you&lt;br /&gt;
call it), even with a RAID controller with battery backed cache and the&lt;br /&gt;
hard disk caches turned off (a setup which is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option specifying the virtual&lt;br /&gt;
disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can use the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the first example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will then remove the directory and move all of its contents into &amp;quot;lost+found&amp;quot;, named by inode number (see the second example for how to map inode numbers to directory entry names, which must be done &#039;&#039;before&#039;&#039; removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel when it detects the corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19). The jumps in the first field (start offset) mark the transition between the three types. For recovering file names we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) shows enough free space, but attempts to write to the filesystem fail with -ENOSPC. This was an issue with the older &amp;quot;inode32&amp;quot; inode allocation mode, where inode allocation is restricted to the lower filesystem blocks. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Alternately, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
However, &#039;inode64&#039; has been the default behavior since kernel v3.7...&lt;br /&gt;
&lt;br /&gt;
Unfortunately, v3.7 also introduced a bug, present until v3.17, which caused new allocation groups added by growfs to be unavailable for inode allocation.  This was fixed in kernel v3.17 by commit &amp;lt;tt&amp;gt;[http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=9de67c3ba9ea961ba420573d56479d09d33a7587 9de67c3b xfs: allow inode allocations in post-growfs disk space.]&amp;lt;/tt&amp;gt;&lt;br /&gt;
Without that commit, the problem can be worked around by doing a &amp;quot;mount -o remount,inode64&amp;quot; after the growfs operation.&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These values can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
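&lt;br /&gt;
The stripe-width rule above can also be written down as a tiny shell helper. This is just an illustrative sketch (the &#039;&#039;data_disks&#039;&#039; function is ours, not part of xfsprogs):&lt;br /&gt;
&lt;br /&gt;
```shell
# data_disks LEVEL TOTAL: number of data-bearing disks in the array,
# i.e. the sw value for mkfs.xfs -d su=...,sw=...
data_disks() {
    case $1 in
        raid0)  echo $2 ;;            # striping only, no redundancy
        raid1)  echo 1 ;;             # pure mirror, one data disk
        raid5)  echo $(($2 - 1)) ;;   # one disk worth of parity
        raid6)  echo $(($2 - 2)) ;;   # two disks worth of parity
        raid10) echo $(($2 / 2)) ;;   # half the disks hold mirror copies
    esac
}

# The two examples from the text:
data_disks raid6 8     # -> 6, so: mkfs.xfs -d su=64k,sw=6
data_disks raid10 16   # -> 8, so: mkfs.xfs -d su=256k,sw=8
```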
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
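&lt;br /&gt;
The unit bookkeeping in that example can be checked with a few lines of shell arithmetic (a sketch using the numbers from the example above):&lt;br /&gt;
&lt;br /&gt;
```shell
# mkfs.xfs takes sunit/swidth in 512-byte sectors, but reports them
# back in filesystem blocks (bsize). Both describe the same byte count.
swidth_sectors=1024      # swidth as passed on the mkfs.xfs command line
bsize=4096               # filesystem block size reported by mkfs.xfs

# Convert 512-byte sectors to filesystem blocks:
swidth_blocks=$((swidth_sectors * 512 / bsize))
echo $swidth_blocks      # -> 128

# Sanity check: 128 * 4096 == 1024 * 512
[ $((swidth_blocks * bsize)) -eq $((swidth_sectors * 512)) ] && echo consistent
```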
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware raid, use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) serving NFS and Samba without any corruption, so those appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35 you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again: for example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
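For example, a mount using a larger log buffer size might look like this (device and mount point are illustrative; 256k is the maximum &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; value):&lt;br /&gt;

```
# mount -o logbsize=256k /dev/sdb1 /mnt/xfs
```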
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
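As a rough cross-check of the two example runs above (the &amp;lt;tt&amp;gt;imem&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;dmem&amp;lt;/tt&amp;gt; figures are reported in KB; the minimum xfs_repair actually prints adds roughly another 50MB of fixed overhead on top of this sum):&lt;br /&gt;

```shell
# Convert the imem + dmem figures from the example xfs_repair runs to MB.
awk 'BEGIN {
  printf "empty fs:   ~%d MB\n", (0 + 2097152) / 1024       # dmem only
  printf "50M inodes: ~%d MB\n", (196882 + 2097152) / 1024  # imem + dmem
}'
```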
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows a listing like:&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again. If it does, add the option to &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
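A matching &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry might look like this (device and mount point are illustrative):&lt;br /&gt;

```
/dev/diskpart  /mnt/xfs  xfs  inode64  0  0
```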
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
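The formula can be checked with a quick calculation (the extent counts here are hypothetical; real numbers come from the xfs_db &amp;quot;frag&amp;quot; output):&lt;br /&gt;

```shell
# Fragmentation factor = (actual extents - ideal extents) / actual extents.
# Example: 1 million files averaging 4 extents each, ideally 1 extent each.
actual=4000000
ideal=1000000
awk -v a="$actual" -v i="$ideal" 'BEGIN { printf "%.0f%%\n", (a - i) * 100 / a }'
```

which prints 75%, matching the 4-extents-per-file case described above.&lt;br /&gt;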
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later detect this non-zero part of the superblock as corruption, and emit the&lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
&lt;br /&gt;
== Q: Why do files on XFS use more data blocks than expected? ==&lt;br /&gt;
&lt;br /&gt;
The XFS speculative preallocation algorithm allocates extra blocks beyond end of file (EOF) to minimize file fragmentation during buffered write workloads. Workloads that benefit from this behaviour include slowly growing files, concurrent writers and mixed reader/writer workloads. It also provides fragmentation resistance in situations where memory pressure prevents adequate buffering of dirty data to allow formation of large contiguous regions of data in memory.&lt;br /&gt;
&lt;br /&gt;
This post-EOF block allocation is accounted identically to blocks within EOF. It is visible in &#039;st_blocks&#039; counts via stat() system calls, accounted as globally allocated space and against quotas that apply to the associated file. The space is reported by various userspace utilities (stat, du, df, ls) and thus provides a common source of confusion for administrators. Post-EOF blocks are temporary in most situations and are usually reclaimed via several possible mechanisms in XFS.&lt;br /&gt;
&lt;br /&gt;
See the FAQ entry on speculative preallocation for details.&lt;br /&gt;
&lt;br /&gt;
== Q: What is speculative preallocation? ==&lt;br /&gt;
&lt;br /&gt;
XFS speculatively preallocates post-EOF blocks on file extending writes in anticipation of future extending writes. The size of a preallocation is dynamic and depends on the runtime state of the file and fs. Generally speaking, preallocation is disabled for very small files and preallocation sizes grow as files grow larger.&lt;br /&gt;
&lt;br /&gt;
Preallocations are capped to the maximum extent size supported by the filesystem. Preallocation size is throttled automatically as the filesystem approaches low free space conditions or other allocation limits on a file (such as a quota).&lt;br /&gt;
&lt;br /&gt;
In most cases, speculative preallocation is automatically reclaimed when a file is closed. Applications that repeatedly trigger preallocation and reclaim cycles (e.g., this is common in file server or log file workloads) can cause fragmentation. Therefore, this pattern is detected and causes the preallocation to persist beyond the lifecycle of the file descriptor.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I speed up or avoid delayed removal of speculative preallocation?  ==&lt;br /&gt;
&lt;br /&gt;
Linux 3.8 (and later) includes a scanner to perform background trimming of files with lingering post-EOF preallocations. The scanner bypasses dirty files to avoid interference with ongoing writes. A 5 minute scan interval is used by default and can be adjusted via the following file (value in seconds):&lt;br /&gt;
&lt;br /&gt;
        /proc/sys/fs/xfs/speculative_prealloc_lifetime&lt;br /&gt;
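For example (the first command shows the default of 300 seconds; the second shortens the interval to one minute):&lt;br /&gt;

```
# cat /proc/sys/fs/xfs/speculative_prealloc_lifetime
300
# echo 60 > /proc/sys/fs/xfs/speculative_prealloc_lifetime
```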
&lt;br /&gt;
== Q: Is speculative preallocation permanent? ==&lt;br /&gt;
&lt;br /&gt;
Preallocated blocks are normally reclaimed on file close, inode reclaim, unmount or in the background once file write activity subsides. They can be explicitly made permanent via fallocate or a similar interface. They can be implicitly made permanent in situations where file size is extended beyond a range of post-EOF blocks (i.e., via an extending truncate) or following a crash. In the event of a crash, the in-memory state used to track and reclaim the speculative preallocation is lost.&lt;br /&gt;
&lt;br /&gt;
== Q: My workload has known characteristics - can I disable speculative preallocation or tune it to an optimal fixed size? ==&lt;br /&gt;
&lt;br /&gt;
Speculative preallocation cannot be disabled, but XFS can be tuned to a fixed allocation size with the &#039;allocsize=&#039; mount option. Speculative preallocation is not dynamically resized when the allocsize mount option is set, so the potential for fragmentation is increased. Use &#039;allocsize=64k&#039; to revert to the default XFS behavior prior to support for dynamic speculative preallocation.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2936</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2936"/>
		<updated>2014-01-21T01:44:56Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: I&amp;#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
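As a sketch of how a directory tree quota is typically set up (the project name, ID, paths and limit are all hypothetical, and the filesystem must be mounted with the &amp;lt;tt&amp;gt;prjquota&amp;lt;/tt&amp;gt; option):&lt;br /&gt;

```
# echo "42:/mnt/data/project1" >> /etc/projects
# echo "project1:42" >> /etc/projid
# xfs_quota -x -c 'project -s project1' /mnt/data
# xfs_quota -x -c 'limit -p bhard=10g project1' /mnt/data
```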
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota and group quota cannot be used at the same time. User quota and project quota, on the other hand, can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem mounted with prjquota (project quota) and remounting it with grpquota (group quota) remove previously set prjquota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; (i.e. unmounted) when moved. If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition in question. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. First, describe your machine hardware and storage configuration. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
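The list above can be gathered with a short script along these lines (a convenience sketch only - the file name is arbitrary, and the xfsprogs section is skipped if the tools are not installed):&lt;br /&gt;

```shell
# Collect basic system information for an XFS bug report into one file.
out=xfs_bug_report.txt
{
  echo "== uname ==";      uname -a
  echo "== xfsprogs ==";   xfs_repair -V 2>/dev/null || echo "xfs_repair not installed"
  echo "== cpus ==";       grep -c ^processor /proc/cpuinfo
  echo "== meminfo ==";    cat /proc/meminfo
  echo "== mounts ==";     cat /proc/mounts
  echo "== partitions =="; cat /proc/partitions
} > "$out"
echo "wrote $out"
```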
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; or standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. If you want to preserve ACLs and EAs, use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you risk major data loss on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued.  A power failure &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with a battery-backed cache running in write-back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller; see the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it must be redone after every reset, since the drive reverts to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
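&lt;br /&gt;
One way to reapply the setting automatically after every reset is a udev rule that re-runs hdparm whenever a disk is (re)detected. This is only a sketch: the rule file name and the device match pattern are assumptions, and the hdparm path may differ on your distribution.&lt;br /&gt;

```shell
# /etc/udev/rules.d/85-disable-wcache.rules  (hypothetical file name)
# Re-disable the write cache every time a SATA/PATA disk appears,
# since the drive resets back to "write cache enabled" by default.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W0 /dev/%k"
```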
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and reported in the log, if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot; - assuming your RAID controller is reliable and not resetting randomly like some common ones do.  But take care that the individual hard disk write caches are off.&lt;br /&gt;
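&lt;br /&gt;
A minimal sketch of what that looks like; the device name and mount point are placeholders:&lt;br /&gt;

```shell
# Disable barriers on an already-mounted filesystem backed by a
# battery-protected RAID controller cache:
mount -o remount,nobarrier /dev/sdX /mnt/xfs

# Or make it permanent via an /etc/fstab entry:
# /dev/sdX  /mnt/xfs  xfs  nobarrier  0 0
```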
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used to buffer writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types that it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a powerfail with RAID-1 when only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks, and might both have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns disk writes off, to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither controller cache nor disk cache is safe so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the unit cache can only be set together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system, even one with a battery backed RAID controller cache and hard disk caches turned off (a configuration that is safe on a normal host), can destroy a database inside the virtual machine (the client, domU, or whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk. For other products this information is missing.&lt;br /&gt;
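&lt;br /&gt;
As an illustration (the image name and the virtio bus choice are assumptions, and the exact cache option spelling varies between qemu versions; newer qemu uses &amp;quot;cache=none&amp;quot; rather than &amp;quot;cache=off&amp;quot;):&lt;br /&gt;

```shell
# Start a guest with host-side write caching disabled for its disk,
# so that flushes from the guest actually reach stable storage.
qemu-system-x86_64 -m 1024 \
    -drive file=guest.img,if=virtio,cache=none
```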
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the first example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all its contents into &amp;quot;lost+found&amp;quot;, named by inode number. (See the second example for how to map inode numbers to directory entry names; this needs to be done _before_ removing the directory itself.) The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
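&lt;br /&gt;
As an illustration (the device name is a placeholder), a GPT label can be created with parted before making the filesystem, replacing the DOS partition table that cannot describe &amp;gt;2TB partitions:&lt;br /&gt;

```shell
# Replace the DOS label with GPT, create one partition spanning
# the whole disk, then make the filesystem on it.
parted /dev/sdX mklabel gpt
parted /dev/sdX mkpart primary 0% 100%
mkfs.xfs /dev/sdX1
```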
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that xfs_repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
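&lt;br /&gt;
Putting the values from the two examples above on the mkfs.xfs command line (the device name is a placeholder):&lt;br /&gt;

```shell
# RAID-6, 8 disks, 64KB stripe size: 6 data disks
mkfs.xfs -d su=64k,sw=6 /dev/sdX

# RAID-10, 16 disks, 256KB stripe size: 8 data disks
mkfs.xfs -d su=256k,sw=8 /dev/sdX

# The first example again, using sunit/swidth in 512B sectors:
# 64KB = 128 sectors, and swidth = 128 * 6 = 768 sectors
mkfs.xfs -d sunit=128,swidth=768 /dev/sdX
```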
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example swidth=1024, specified on the mkfs.xfs command line (i.e. 1024 512B sectors), and a block size of 4096 (the bsize reported in the mkfs.xfs output). You should then see swidth 128 reported by mkfs.xfs, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, please use the same sunit/swidth values as when creating the XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance sucks.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to the location of their data, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so a recent enough distribution should be fine.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can&#039;t access files &amp;amp; dirs that have been created with an inode &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
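&lt;br /&gt;
For example (device and mount point are placeholders; on newer kernels delaylog is already the default, so specifying it only matters on older ones):&lt;br /&gt;

```shell
# Larger in-memory log buffers reduce journal I/O on metadata-heavy
# workloads, at the cost of more lost operations after a crash.
mount -o logbsize=256k,delaylog /dev/sdX /mnt/xfs
```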
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16Tb, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing as&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. 400GB files in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
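&lt;br /&gt;
The formula above can be checked with a quick awk one-liner; the extent counts here are made-up numbers, not output from any real filesystem:&lt;br /&gt;

```shell
# Fragmentation factor for 4 actual extents where 1 would be ideal:
awk 'BEGIN { actual = 4; ideal = 1; printf "%.0f%%\n", (actual - ideal) / actual * 100 }'
# prints 75%
```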
&lt;br /&gt;
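As a quick worked example of the formula (the numbers here are made up), a file set with 800 actual extents where 400 would be ideal comes out at 50%:&lt;br /&gt;

```shell
# fragmentation factor = (actual - ideal) * 100 / actual, in percent
actual=800
ideal=400
echo $(( (actual - ideal) * 100 / actual ))   # prints 50
```
&lt;br /&gt;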
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired using xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or later.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2935</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2935"/>
		<updated>2014-01-21T01:44:21Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: I&amp;#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.9 through v3.12 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
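For example, quota enforcement is requested with mount options (the device and mount point below are placeholders; see mount(8) for the full list of quota options):&lt;br /&gt;

```shell
# enable user quota (uquota) and group quota (gquota) accounting and enforcement
mount -o uquota,gquota /dev/diskpart /mnt/xfs
```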
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
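As a rough sketch of how a directory tree quota is typically set up (the project name, id, and paths here are hypothetical, and the filesystem must be mounted with the prjquota option):&lt;br /&gt;

```shell
# map project id 42 to a directory tree and give the id a name
# (/etc/projects and /etc/projid; values are made up for illustration)
echo '42:/mnt/xfs/mytree' | tee -a /etc/projects
echo 'mytree:42'          | tee -a /etc/projid

# initialise the tree as a project, then set a hard 10GB space limit on it
xfs_quota -x -c 'project -s mytree' /mnt/xfs
xfs_quota -x -c 'limit -p bhard=10g mytree' /mnt/xfs
```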
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem mounted with prjquota (project quota) and remounting it with grpquota (group quota) remove the project quota limits previously set on it (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (for now, only blocksizes less than or equal to the pagesize of the architecture are possible: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition to do so. Remove the partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
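For instance, after enlarging the underlying device, partition, or LVM logical volume, growing the filesystem is a single command (the mount point below is a placeholder):&lt;br /&gt;

```shell
# grow a mounted XFS filesystem to fill its (already enlarged) device
xfs_growfs /mnt/xfs
```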
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then describe the workload that is causing the problem, and demonstrate the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message like:&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or forgot to load the module), or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; type in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. If you want to preserve ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
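As an illustration (all paths here are hypothetical), a full dump and restore might look like:&lt;br /&gt;

```shell
# level-0 (full) dump of the filesystem mounted at /mnt/xfs to a file,
# then restore it into another directory
xfsdump -l 0 -f /backup/xfs.dump /mnt/xfs
xfsrestore -f /backup/xfs.dump /mnt/restore
```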
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
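For example (the device name and mount point below are placeholders):&lt;br /&gt;

```shell
# the filesystem must be unmounted before repair
umount /mnt/xfs
xfs_repair /dev/diskpart
```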
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, barriers will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this disabling is persistent. However, for a SATA/PATA disk it needs to be redone after every reset, since the drive resets back to its default of write cache enabled - and a reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot; - assuming your RAID controller is infallible and not resetting randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
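For example (the device and mount point below are placeholders):&lt;br /&gt;

```shell
# disable barriers on storage with a battery backed (persistent) write cache
mount -o nobarrier /dev/diskpart /mnt/xfs
```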
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache, you should seriously consider disabling the write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation that, after a powerfail with RAID-1 when only parts of the disk cache have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on; wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. In case no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because neither the controller cache nor the disk cache is safe anyway, so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting the individual disk caches:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly, as the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to virtualize disk&lt;br /&gt;
writes in a way that even barriers no longer work, which means even&lt;br /&gt;
an fsync is not reliable. Tests confirm that by unplugging the power from&lt;br /&gt;
such a system - even with a RAID controller with battery backed cache and&lt;br /&gt;
the hard disk caches turned off (which is safe on a normal host) - you can&lt;br /&gt;
destroy a database within the virtual machine (client, domU, or whatever you&lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option specifying the virtual&lt;br /&gt;
disk. For the other products this information is missing.&lt;br /&gt;
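In newer qemu versions this option is spelled cache=none on the -drive specification; as a sketch (the image name below is hypothetical):&lt;br /&gt;

```shell
# bypass the host page cache for the guest disk image
qemu-system-x86_64 -drive file=disk.img,cache=none
```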
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can use the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the first example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all of its contents into &amp;quot;lost+found&amp;quot;, named by inode number (see the second example for how to map inode numbers to directory entry names, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show plenty of free space, but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth arguments as being specified in units of 512B sectors, but that is unfortunately not the unit they are reported in:&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume, for example, swidth=1024 specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and a block size of 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth=128, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
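&lt;br /&gt;
The unit conversion above can be sanity-checked with a little shell arithmetic. This is just a sketch; the 256KiB stripe unit and 8 data disks are the RAID-10 example values from this section:&lt;br /&gt;

```shell
# Convert su (bytes) and sw (number of data disks) into the two unit
# systems the XFS tools use: 512B sectors (how sunit/swidth arguments
# are interpreted) and filesystem blocks (how they are reported).
su_bytes=$((256 * 1024))   # 256KiB RAID stripe unit
sw=8                       # RAID-10 of 16 disks -> 8 data disks
bsize=4096                 # filesystem block size

sunit=$((su_bytes / 512))            # sunit in 512B sectors
swidth=$((sunit * sw))               # swidth in 512B sectors
sunit_blocks=$((su_bytes / bsize))   # as reported by mkfs.xfs/xfs_info
swidth_blocks=$((sunit_blocks * sw))

echo "sunit=$sunit swidth=$swidth sectors ($sunit_blocks/$swidth_blocks blocks)"
```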
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware raid, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there&#039;s no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruption, so those appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like:&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem must be mounted with inode64. Running&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%. 4 extents per file would give you 75%. This may or may not be a problem, depending especially on the size of the files in question (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented). The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
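&lt;br /&gt;
The formula is easy to evaluate by hand or in the shell; this sketch computes the factor for a hypothetical average of 4 extents per file:&lt;br /&gt;

```shell
# frag factor = (actual extents - ideal extents) / actual extents
actual=4   # every file stored in 4 extents
ideal=1    # 1 extent per file would be ideal
frag=$(( (actual - ideal) * 100 / actual ))
echo "${frag}% fragmentation factor"
```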
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.10 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
This may happen when running xfs_growfs under a v3.10-v3.12 kernel,&lt;br /&gt;
if the filesystem was previously grown under a kernel prior to v3.8.&lt;br /&gt;
&lt;br /&gt;
Old kernel versions prior to v3.8 did not zero the empty part of&lt;br /&gt;
new secondary superblocks when growing the filesystem with xfs_growfs.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.10 and later began detecting this non-zero part of the&lt;br /&gt;
superblock as corruption, and emit the &lt;br /&gt;
&lt;br /&gt;
    Internal error xfs_sb_read_verify&lt;br /&gt;
&lt;br /&gt;
error message.&lt;br /&gt;
&lt;br /&gt;
Kernels v3.13 and later are more forgiving about this - if the non-zero &lt;br /&gt;
data is found on a Version 4 superblock, it will not be flagged as&lt;br /&gt;
corruption.&lt;br /&gt;
&lt;br /&gt;
The problematic secondary superblocks may be repaired by using an xfs_repair&lt;br /&gt;
version 3.2.0-alpha1 or above.&lt;br /&gt;
&lt;br /&gt;
The relevant kernelspace commits are as follows:&lt;br /&gt;
&lt;br /&gt;
    v3.8  1375cb6 xfs: growfs: don&#039;t read garbage for new secondary superblocks &amp;lt;- fixed underlying problem &lt;br /&gt;
    v3.10 04a1e6c xfs: add CRC checks to the superblock &amp;lt;- detected old underlying problem&lt;br /&gt;
    v3.13 59e5a0e xfs: don&#039;t break from growfs ag update loop on error &amp;lt;- less consequential&lt;br /&gt;
    v3.13 10e6e65 xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields &amp;lt;- is more forgiving of old underlying problem&lt;br /&gt;
&lt;br /&gt;
This commit allows xfs_repair to detect and correct the problem:&lt;br /&gt;
&lt;br /&gt;
    v3.2.0-alpha1 cbd7508 xfs_repair: zero out unused parts of superblocks&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2934</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2934"/>
		<updated>2014-01-20T22:10:58Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: I&amp;#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.9 through v3.12 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more heavily tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set on the fs (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; (i.e. unmounted) when moved. If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs, and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then describe the workload that is causing the problem, and demonstrate the bad behaviour that is occurring. If it is a performance problem, then 30-60 second samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
will show all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
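For reference (the device and output file names are placeholders), a metadump is typically captured from the unmounted filesystem and then compressed before uploading:

```shell
# Capture obfuscated metadata from the (unmounted) device into a file:
# xfs_metadump /dev/sdXX /tmp/fs.metadump
# Compress the dump before making it available to developers:
# bzip2 /tmp/fs.metadump
```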
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS support compiled into the kernel (or you forgot to load the module), or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; fstype in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. To also back up ACLs and EAs, you will need &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
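As a hedged example (the paths are placeholders), a level-0 dump and a matching restore with xfsdump/xfsrestore look like:

```shell
# Full (level-0) dump of the filesystem mounted at /home to a file:
# xfsdump -l 0 -f /backup/home.dump /home
# Restore that dump into another directory:
# xfsrestore -f /backup/home.dump /mnt/restore
```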
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens; it contains important information giving developers hints as to the earliest point at which the problem was detected. The shutdown itself is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
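In the simplest case (the device name is a placeholder) that means:

```shell
# umount /dev/sdXX
# Dry run first - report problems without modifying the filesystem:
# xfs_repair -n /dev/sdXX
# Then the actual repair:
# xfs_repair /dev/sdXX
```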
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB currently (Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage carries a very high risk of big data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued. A power failure &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller; see the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For SCSI disks this setting is persistent. For SATA/PATA disks, however, it must be re-applied after every drive reset, since the drive reverts to its default of write cache enabled. A reset can happen on reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this will be reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot; - assuming your RAID controller is reliable and not resetting randomly like some common ones do. But take care that the hard disk write caches are off.&lt;br /&gt;
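For example (the device and mount point are placeholders), barriers can be disabled at mount time or via /etc/fstab:

```shell
# mount -o nobarrier /dev/sdXX /mnt
# or the equivalent /etc/fstab entry:
#   /dev/sdXX  /mnt  xfs  nobarrier  0 0
```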
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to say, because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but here is an overview:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to generalize. Usually these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a power failure with RAID-1 where only parts of the disk caches were written out, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks, so both might have saved the superblock info but then lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drive caches:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe anyway, so you apparently just want high speed (which you get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit cache together. To protect your data, turn it off - but write performance will suffer badly, since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system can destroy a database inside the virtual machine (client, domU, or whatever you call it), even with a battery-backed RAID controller cache and the hard disk caches turned off - a configuration that is safe on a normal host.&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option defining the virtual disk. For other products, this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here, we have the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions. The default DOS partition tables can&#039;t. The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without &amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt; you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot, at which point the partition will disappear. Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
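For example (the device name is a placeholder, and this destroys any existing partition table), a GPT label and one large partition can be created with parted:

```shell
# Write a GPT label, then create a single partition spanning the disk:
# parted -s /dev/sdX mklabel gpt
# parted -s /dev/sdX mkpart primary 0% 100%
```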
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) shows enough free space, but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Do the noatime and/or nodiratime mount options give any performance benefit on XFS? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that xfs_repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks), via mkfs.xfs or mount options.&lt;br /&gt;
&lt;br /&gt;
These values can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
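Passing the values from the first example above (64KB stripe size, RAID-6 of 8 disks) to mkfs.xfs looks like this (the device name is a placeholder):

```shell
# mkfs.xfs -d su=64k,sw=6 /dev/sdXX
```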
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; accept sunit and swidth in units of 512B sectors, but unfortunately report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
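The conversion in that example can be checked with a little shell arithmetic:

```shell
# Convert swidth from 512B sectors (as specified at mkfs time)
# to filesystem blocks (as reported by mkfs.xfs / xfs_info).
sectors=1024                          # swidth given as 1024 * 512B sectors
bsize=4096                            # filesystem block size in bytes
blocks=$(( sectors * 512 / bsize ))
echo "$blocks"                        # prints 128, since 128 * 4096 == 1024 * 512
```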
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, please use the same sunit/swidth values as when creating the XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; errors when you still have plenty of free space, but there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) serving NFS and Samba without any corruption, so distributions of that vintage or newer should generally be safe.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug that leads to strange problems if you mount without inode64 again; for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
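&lt;br /&gt;
For reference, the directory block size discussed above is set at mkfs time (the device name is illustrative):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -n size=64k /dev/sdX&lt;br /&gt;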
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt;. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
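&lt;br /&gt;
As a sketch (device and mount point are illustrative; delaylog is only accepted on kernels that support it):&lt;br /&gt;
&lt;br /&gt;
  # mount -o logbsize=256k,delaylog /dev/sdX1 /storage&lt;br /&gt;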
&lt;br /&gt;
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem, of which 2,097,152KB (2048MB) is needed for tracking free space. (The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required, and approximate at that; more RAM than this may be required to complete successfully. Also, if you only give xfs_repair the minimum required RAM, it will be slow; for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option. Running&lt;br /&gt;
  # mount -o remount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
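&lt;br /&gt;
For example (using the same illustrative device and mount point), the &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; line might look like:&lt;br /&gt;
&lt;br /&gt;
  /dev/diskpart  /mnt/xfs  xfs  defaults,inode64  0 0&lt;br /&gt;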
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, depending especially on the size of the files in question (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.9 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
Old kernel versions didn&#039;t zero the empty part of the secondary superblocks when growing the filesystem.  However, kernels v3.10 and newer have started flagging this as corruption.  (Kernels v3.13 and later are more forgiving about this if the garbage is found on a V4 (not V5) superblock.)&lt;br /&gt;
&lt;br /&gt;
This commit, which went into the 3.8 kernel, fixed the growfs code so that it no longer puts garbage in the new secondary superblocks:&lt;br /&gt;
&lt;br /&gt;
    commit 1375cb65e87b327a8dd4f920c3e3d837fb40e9c2&lt;br /&gt;
    Author: Dave Chinner &amp;lt;dchinner@redhat.com&amp;gt;&lt;br /&gt;
    Date:   Tue Oct 9 14:50:52 2012 +1100&lt;br /&gt;
    &lt;br /&gt;
    xfs: growfs: don&#039;t read garbage for new secondary superblocks&lt;br /&gt;
    &lt;br /&gt;
    When updating new secondary superblocks in a growfs operation, the&lt;br /&gt;
    superblock buffer is read from the newly grown region of the&lt;br /&gt;
    underlying device. This is not guaranteed to be zero, so violates&lt;br /&gt;
    the underlying assumption that the unused parts of superblocks are&lt;br /&gt;
    zero filled. Get a new buffer for these secondary superblocks to&lt;br /&gt;
    ensure that the unused regions are zero filled correctly.&lt;br /&gt;
&lt;br /&gt;
The only time the kernel reads secondary superblocks is during a growfs operation, so that&#039;s the only time the kernel will detect such an error. More extensive validity tests were added during 3.9 and 3.10, and these now throw corruption errors over secondary superblocks that have not been correctly zeroed.&lt;br /&gt;
&lt;br /&gt;
To fix this, you need to grab xfsprogs from the git repo (3.2.0-alpha1 or newer will do), as this commit to xfs_repair detects and fixes the corrupted superblocks:&lt;br /&gt;
&lt;br /&gt;
    commit cbd7508db4c9597889ad98d5f027542002e0e57c&lt;br /&gt;
    Author: Eric Sandeen &amp;lt;sandeen@redhat.com&amp;gt;&lt;br /&gt;
    Date:   Thu Aug 15 02:26:40 2013 +0000&lt;br /&gt;
    &lt;br /&gt;
    xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
        &lt;br /&gt;
    Prior to:&lt;br /&gt;
    1375cb65 xfs: growfs: don&#039;t read garbage for new secondary superblocks&lt;br /&gt;
    &lt;br /&gt;
    we ran the risk of allowing garbage in secondary superblocks&lt;br /&gt;
    beyond the in-use sb fields.  With kernels 3.10 and beyond, the&lt;br /&gt;
    verifiers will kick these out as invalid, but xfs_repair does&lt;br /&gt;
    not detect or repair this condition.&lt;br /&gt;
    &lt;br /&gt;
    There is superblock stale-data zeroing code, but it is under a&lt;br /&gt;
    narrow conditional - the bug addressed in the above commit did not&lt;br /&gt;
    meet that conditional.  So change this to check unconditionally.&lt;br /&gt;
    &lt;br /&gt;
    Further, the checking code was looking at the in-memory&lt;br /&gt;
    superblock buffer, which was zeroed prior to population, and&lt;br /&gt;
    would therefore never possibly show any stale data beyond the&lt;br /&gt;
    last up-rev superblock field.&lt;br /&gt;
    &lt;br /&gt;
    So instead, check the disk buffer for this garbage condition.&lt;br /&gt;
    &lt;br /&gt;
    If we detect garbage, we must zero out both the in-memory sb&lt;br /&gt;
    and the disk buffer; the former may contain unused data&lt;br /&gt;
    in up-rev sb fields which will be written back out; the latter&lt;br /&gt;
    may contain garbage beyond all fields, which won&#039;t be updated&lt;br /&gt;
    when we translate the in-memory sb back to disk.&lt;br /&gt;
    &lt;br /&gt;
    The V4 superblock case was zeroing out the sb_bad_features2&lt;br /&gt;
    field; we also fix that to leave that field alone.&lt;br /&gt;
&lt;br /&gt;
In kernels v3.13 and newer, this commit:&lt;br /&gt;
&lt;br /&gt;
    commit 10e6e65dfcedff63275c3d649d329c044caa8e26&lt;br /&gt;
    Author: Eric Sandeen &amp;lt;sandeen@sandeen.net&amp;gt;&lt;br /&gt;
    Date:   Mon Sep 9 15:33:29 2013 -0500&lt;br /&gt;
    &lt;br /&gt;
    xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields&lt;br /&gt;
&lt;br /&gt;
will cause the kernel to be more forgiving of this situation.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2933</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2933"/>
		<updated>2014-01-20T21:51:32Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
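&lt;br /&gt;
A minimal sketch (device and mount point are illustrative):&lt;br /&gt;
&lt;br /&gt;
  # mount -o uquota,gquota /dev/sdX1 /home&lt;br /&gt;
  # xfs_quota -x -c &#039;report -h&#039; /home&lt;br /&gt;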
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
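&lt;br /&gt;
A hedged sketch of setting up a directory tree quota (the device, paths, project ID and limit are illustrative):&lt;br /&gt;
&lt;br /&gt;
  # mount -o prjquota /dev/sdX1 /srv&lt;br /&gt;
  # xfs_quota -x -c &#039;project -s -p /srv/tree 42&#039; /srv&lt;br /&gt;
  # xfs_quota -x -c &#039;limit -p bhard=10g 42&#039; /srv&lt;br /&gt;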
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem mounted with prjquota (project quota) and remounting it with grpquota (group quota) remove previously set prjquota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ..., 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
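&lt;br /&gt;
For example, after enlarging the underlying device or logical volume (the mount point is illustrative), run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; on the mounted filesystem:&lt;br /&gt;
&lt;br /&gt;
  # xfs_growfs /mnt/xfs&lt;br /&gt;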
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem, and demonstrate the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
The output will list all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I back up an XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. If you want to preserve ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
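&lt;br /&gt;
A minimal &#039;&#039;&#039;xfsdump&#039;&#039;&#039;/&#039;&#039;&#039;xfsrestore&#039;&#039;&#039; sketch (the paths are illustrative):&lt;br /&gt;
&lt;br /&gt;
  # xfsdump -l 0 -f /backup/home.dump /home&lt;br /&gt;
  # xfsrestore -f /backup/home.dump /home&lt;br /&gt;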
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that a power outage carries a very high risk of large data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued.  A power failure &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery-backed cache running in write-back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
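The checks above can be scripted; this is a sketch that parses hdparm -I style output (the sample text below is hypothetical - on a real system capture it with hdparm -I /dev/sda):&lt;br /&gt;

```shell
# Sketch: look for the "Write cache" flag in hdparm -I output.
# The sample text is hypothetical; on a real system use:
#   sample=$(hdparm -I /dev/sda)
sample='Commands/features:
   Enabled Supported:
      *    SMART feature set
      *    Write cache
      *    Look-ahead'
# An enabled feature is marked with "*" in the Enabled column.
if echo "$sample" | grep -q '\*[[:space:]]*Write cache'; then
  echo 'write cache: enabled'
else
  echo 'write cache: disabled'
fi
```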
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8), which is a little tedious&amp;lt;br /&amp;gt; It takes three steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values that you must match up with the names from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;# sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which allows you to reset the values of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it needs to be reapplied after every reset, as the drive reverts to its default of write cache enabled. A reset can happen on reboot or during error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
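One way to reapply the setting automatically after each device reset event is a udev rule. This is only a hedged sketch - the rule file name, match keys and hdparm path are assumptions, so adjust them to your distribution:&lt;br /&gt;

```
# /etc/udev/rules.d/99-disable-write-cache.rules (hypothetical path)
# Run "hdparm -W0" whenever a whole-disk sd* device appears.
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W0 /dev/%k"
```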
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled, with the rest of the filesystem on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write has completed to disk, and that notification cannot be trusted if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device, then we currently don&#039;t support flushing both the data and log devices (this may change in the future). If the device has its write cache enabled but the driver tells the block layer that the device does not support cache flushing, the kernel reports that the device doesn&#039;t support barriers. And finally, we actually test a barrier write on the superblock and check its error state afterwards, reporting if it fails.&lt;br /&gt;
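A quick way to verify is to search the kernel log for those messages after mounting. The log line below is a hypothetical sample; on a live system use dmesg or /var/log/messages instead:&lt;br /&gt;

```shell
# Sample kernel log line; on a real system use: log=$(dmesg)
log='Filesystem sdb1: Disabling barriers, trial barrier write failed'
# A non-zero count means barriers were turned off behind your back.
echo "$log" | grep -c 'Disabling barriers'
```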
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache whose contents are preserved across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and not resetting randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to give a definitive answer because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those integrated on mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache ensures that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache, you should seriously consider disabling the write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a powerfail on RAID-1 when only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks, and both might have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the write cache of individual drives:&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back (write cache on); wt = write through (write cache off). So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so you apparently don&#039;t care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk &lt;br /&gt;
writes in a way that even barriers no longer work, which means even &lt;br /&gt;
an fsync is not reliable. Tests confirm that by unplugging the power from &lt;br /&gt;
such a system you can destroy a database within the virtual machine &lt;br /&gt;
(guest, domU, whatever you call it), even with a RAID controller with &lt;br /&gt;
battery backed cache and the hard disk caches turned off (which is safe &lt;br /&gt;
on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual &lt;br /&gt;
disk. For the other products, such information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupted filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem as if other kernels were affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate the transition between the three types. For recovering file names, we are only interested in the data blocks, so we can feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
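A minimal kernel configuration fragment covering the options mentioned above:&lt;br /&gt;

```
CONFIG_LBD=y
CONFIG_PARTITION_ADVANCED=y
CONFIG_EFI_PARTITION=y
```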
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space, but attempts to write to the filesystem fail with -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs (and mount) options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors, but that is unfortunately not the unit they are reported in:&lt;br /&gt;
both report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
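The unit conversions above can be checked with a little shell arithmetic. The values are taken from the RAID-6 example above, and bsize is assumed to be 4096:&lt;br /&gt;

```shell
# su=64KiB stripe unit, sw=6 data disks (RAID-6 of 8 disks, as above)
su_bytes=$((64 * 1024))
sw=6
sunit=$((su_bytes / 512))              # sunit in 512B sectors
swidth=$((sunit * sw))                 # swidth in 512B sectors
bsize=4096                             # filesystem block size (assumed)
echo "sunit=$sunit swidth=$swidth in 512B sectors"
echo "reported as $((sunit * 512 / bsize)) and $((swidth * 512 / bsize)) bsize blocks"
```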
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, use the same sunit/swidth values as when creating the filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruption, so a recent enough distro should be fine.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again; for example, you can&#039;t access files &amp;amp; dirs that have been created with an inode number &amp;gt;32bit anymore.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
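As an illustration only, a hedged /etc/fstab sketch using those mount options - the device, mount point and values here are placeholders, not recommendations:&lt;br /&gt;

```
# hypothetical fstab entry; tune logbsize for your own workload
/dev/sdb1  /data  xfs  logbsize=256k,delaylog  0 0
```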
&lt;br /&gt;
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space.&lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
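As a rough cross-check of the numbers above (a sketch, not an exact formula - xfs_repair adds overhead of its own on top of these figures):&lt;br /&gt;

```shell
# Figures reported by the two xfs_repair runs above, in KB
dmem=2097152     # free-space tracking memory
imem=196882      # inode tracking memory (second run)
echo "dmem is about $((dmem / 1024)) MB"
echo "imem is about $((imem / 1024)) MB"
# The jump in the required minimum roughly matches imem:
echo "increase in required minimum: $((2289 - 2096)) MB"
```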
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
then it is very probable that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -o remount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
This should make it work correctly again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
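To check whether a filesystem actually contains inode numbers that need more than 32 bits, you can compare inode numbers against 2^32 - 1. The find invocation in the comment is the real-world form; the sample data below is hypothetical:&lt;br /&gt;

```shell
# On a real filesystem:
#   find /mnt/xfs -printf '%i %p\n' | awk '$1 > 4294967295 { print $2 }'
# Hypothetical sample of "inode path" pairs:
printf '%s\n' \
  '133143986176 /mnt/xfs/bigdir/file1' \
  '1234 /mnt/xfs/file2' |
awk '$1 > 4294967295 { print $2 }'   # prints only paths with 64bit inode numbers
```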
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;br /&gt;
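As a quick check of the formula above, the factor for a given average number of extents per file can be computed directly (this is plain awk arithmetic, not an xfs_db feature; it assumes one ideal extent per file):

```shell
# fragmentation factor = (actual - ideal) / actual, as a percentage;
# with one ideal extent per file this is (avg - 1) / avg * 100
for avg in 1 2 4 10 100; do
  awk -v a="$avg" 'BEGIN { printf "%d extents/file avg: %.1f%%\n", a, (a - 1) / a * 100 }'
done
# 2 extents/file avg -> 50.0%, 4 -> 75.0%, 100 -> 99.0%
```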
&lt;br /&gt;
== Q: I&#039;m getting &amp;quot;Internal error xfs_sb_read_verify&amp;quot; errors when I try to run xfs_growfs under kernels v3.9 through v3.12 ==&lt;br /&gt;
&lt;br /&gt;
Old kernel versions didn&#039;t zero the empty part of the secondary&lt;br /&gt;
superblocks when growing the filesystem.  However, kernels v3.10 and&lt;br /&gt;
newer have started flagging this as corruption.  (Kernels v3.13 and later&lt;br /&gt;
are more forgiving about this, if the garbage is found on a V4 (not V5)&lt;br /&gt;
superblock).&lt;br /&gt;
&lt;br /&gt;
This commit, merged for v3.8, fixed the kernel growfs code so that&lt;br /&gt;
it no longer puts garbage in the new secondary&lt;br /&gt;
superblocks:&lt;br /&gt;
&lt;br /&gt;
commit 1375cb65e87b327a8dd4f920c3e3d837fb40e9c2&lt;br /&gt;
Author: Dave Chinner &amp;lt;dchinner@redhat.com&amp;gt;&lt;br /&gt;
Date:   Tue Oct 9 14:50:52 2012 +1100&lt;br /&gt;
&lt;br /&gt;
    xfs: growfs: don&#039;t read garbage for new secondary superblocks&lt;br /&gt;
&lt;br /&gt;
    When updating new secondary superblocks in a growfs operation, the&lt;br /&gt;
    superblock buffer is read from the newly grown region of the&lt;br /&gt;
    underlying device. This is not guaranteed to be zero, so violates&lt;br /&gt;
    the underlying assumption that the unused parts of superblocks are&lt;br /&gt;
    zero filled. Get a new buffer for these secondary superblocks to&lt;br /&gt;
    ensure that the unused regions are zero filled correctly.&lt;br /&gt;
&lt;br /&gt;
The only time the kernel reads secondary superblocks is during a&lt;br /&gt;
growfs operation, so that&#039;s the only time the kernel will detect&lt;br /&gt;
such an error. More extensive validity tests were added during 3.9&lt;br /&gt;
and 3.10, and these now throw corruption errors over secondary&lt;br /&gt;
superblocks that have not been correctly zeroed.&lt;br /&gt;
&lt;br /&gt;
To fix this, you need to grab xfsprogs from the git repo&lt;br /&gt;
(3.2.0-alpha1 or newer will do) as this commit to xfs_repair detects and fixes&lt;br /&gt;
the corrupted superblocks:&lt;br /&gt;
&lt;br /&gt;
commit cbd7508db4c9597889ad98d5f027542002e0e57c&lt;br /&gt;
Author: Eric Sandeen &amp;lt;sandeen@redhat.com&amp;gt;&lt;br /&gt;
Date:   Thu Aug 15 02:26:40 2013 +0000&lt;br /&gt;
&lt;br /&gt;
    xfs_repair: zero out unused parts of superblocks&lt;br /&gt;
    &lt;br /&gt;
    Prior to:&lt;br /&gt;
    1375cb65 xfs: growfs: don&#039;t read garbage for new secondary superblocks&lt;br /&gt;
    &lt;br /&gt;
    we ran the risk of allowing garbage in secondary superblocks&lt;br /&gt;
    beyond the in-use sb fields.  With kernels 3.10 and beyond, the&lt;br /&gt;
    verifiers will kick these out as invalid, but xfs_repair does&lt;br /&gt;
    not detect or repair this condition.&lt;br /&gt;
    &lt;br /&gt;
    There is superblock stale-data zeroing code, but it is under a&lt;br /&gt;
    narrow conditional - the bug addressed in the above commit did not&lt;br /&gt;
    meet that conditional.  So change this to check unconditionally.&lt;br /&gt;
    &lt;br /&gt;
    Further, the checking code was looking at the in-memory&lt;br /&gt;
    superblock buffer, which was zeroed prior to population, and&lt;br /&gt;
    would therefore never possibly show any stale data beyond the&lt;br /&gt;
    last up-rev superblock field.&lt;br /&gt;
    &lt;br /&gt;
    So instead, check the disk buffer for this garbage condition.&lt;br /&gt;
    &lt;br /&gt;
    If we detect garbage, we must zero out both the in-memory sb&lt;br /&gt;
    and the disk buffer; the former may contain unused data&lt;br /&gt;
    in up-rev sb fields which will be written back out; the latter&lt;br /&gt;
    may contain garbage beyond all fields, which won&#039;t be updated&lt;br /&gt;
    when we translate the in-memory sb back to disk.&lt;br /&gt;
    &lt;br /&gt;
    The V4 superblock case was zeroing out the sb_bad_features2&lt;br /&gt;
    field; we also fix that to leave that field alone.&lt;br /&gt;
&lt;br /&gt;
In kernels v3.13 and newer, this commit:&lt;br /&gt;
&lt;br /&gt;
commit 10e6e65dfcedff63275c3d649d329c044caa8e26&lt;br /&gt;
Author: Eric Sandeen &amp;lt;sandeen@sandeen.net&amp;gt;&lt;br /&gt;
Date:   Mon Sep 9 15:33:29 2013 -0500&lt;br /&gt;
&lt;br /&gt;
    xfs: be more forgiving of a v4 secondary sb w/ junk in v5 fields&lt;br /&gt;
&lt;br /&gt;
will cause the kernel to be more forgiving of this situation.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2829</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2829"/>
		<updated>2012-11-02T17:36:16Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: Does the filesystem have an undelete capability? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
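As a sketch of how this is typically set up with xfs_quota(8) (the device, mount point, directory, project ID, and limit below are all hypothetical):

```shell
# mount with project quota accounting enabled
mount -o prjquota /dev/sdb1 /mnt/xfs

# tag the directory tree as project ID 42 (-s sets it recursively)
xfs_quota -x -c 'project -s -p /mnt/xfs/data 42' /mnt/xfs

# cap the tree at a 10GB hard block limit, then verify
xfs_quota -x -c 'limit -p bhard=10g 42' /mnt/xfs
xfs_quota -x -c 'report -p' /mnt/xfs
```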
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota and group quota cannot be used at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled filesystem and remounting it with grpquota (group quota) remove previously set prjquota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make an XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to grow the filesystem into the new space. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
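For the common volume-manager case, the grow sequence might look like this (the volume group, size, and mount point are placeholders):

```shell
# enlarge the logical volume under the filesystem
lvextend -L +50G /dev/vg0/data
# grow the mounted filesystem to fill the device; xfs_growfs takes the mount point
xfs_growfs /mnt/data
# verify the new data block count
xfs_info /mnt/data
```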
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem and provide a demonstration of the bad behaviour that occurs. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, run:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
The dmesg output will then list all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to keep good backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. To back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4), or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
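A minimal xfsdump/xfsrestore round trip might look like this (the labels and paths are placeholders):

```shell
# level-0 (full) dump to a file; -L and -M supply session and media
# labels so xfsdump does not prompt for them interactively
xfsdump -l 0 -L weekly -M archive01 -f /backup/data.xfsdump /mnt/data

# restore the dump into a directory on another filesystem
xfsrestore -f /backup/data.xfsdump /mnt/restore
```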
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of filesystem metadata sitting in the caches is so high that a power outage carries a very high risk of major data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, it will be harmful to performance.  But then you *must* disable the individual hard disk write cache in order to ensure to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This setting is persistent for a SCSI disk. For a SATA/PATA disk, however, it must be reapplied after every reset, as the drive reverts to its default of write cache enabled; a reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution, however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care that the individual hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own, but leave the hard disk write caches on. That can lead to a bad situation: after a power failure with RAID-1, when only parts of the disk caches have been written, the controller does not even notice that the disks are out of sync, because the disks can reorder cached blocks and might have saved the superblock info while losing different data contents on each disk. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because in that case neither the controller cache nor the disk cache is safe anyway, so it favors speed over safety.&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
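&lt;br /&gt;
For example, a hedged sequence for disabling the physical disk write caches behind one logical drive, following the command forms above (the adapter and logical drive numbers are placeholders; verify with the get commands first):&lt;br /&gt;

```shell
# Hypothetical adapter 0 / logical drive 0: disable the physical disk
# write caches behind the logical drive, then verify the setting.
MegaCli -LDSetProp DisDskCache -L0 -a0
MegaCli -LDGetProp -DskCache -L0 -a0
```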
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means the drive caches and the unit cache can only be set together. To protect your data, turn it off; but write performance will suffer badly, because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to virtualize disk &lt;br /&gt;
writes in a way that even barriers no longer work, which means even &lt;br /&gt;
an fsync is not reliable. Tests confirm that unplugging the power from &lt;br /&gt;
such a system can destroy a database within the virtual machine (guest, &lt;br /&gt;
domU, or whatever you call it), even with a RAID controller with &lt;br /&gt;
battery-backed cache and the hard disk caches turned off (a setup that &lt;br /&gt;
is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual &lt;br /&gt;
disk. For other products this information is still missing.&lt;br /&gt;
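&lt;br /&gt;
A concrete example for qemu (the image path and interface type are placeholders; newer qemu versions spell this option cache=none):&lt;br /&gt;

```shell
# Hypothetical disk image; cache=off bypasses the host page cache so
# that guest flushes actually reach the storage device.
qemu -drive file=disk.img,if=virtio,cache=off
```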
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupted filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem as if other kernels were affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can use the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the first example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will then remove the directory and move all of its contents into &amp;quot;lost+found&amp;quot;, named by inode number. (See the second example for how to map inode numbers to directory entry names; this needs to be done _before_ removing the directory itself.) The inode number of the corrupt directory is included in the shutdown report issued by the kernel when it detects the directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing a XFS filesystem], df(1) would show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs options (or mount options).&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
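&lt;br /&gt;
Putting the first example together, the corresponding mkfs.xfs invocation might look like this (the device name is a placeholder):&lt;br /&gt;

```shell
# 64KiB stripe unit, 6 data disks (8-disk RAID-6); /dev/sdX is hypothetical.
mkfs.xfs -d su=64k,sw=6 /dev/sdX
```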
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors, but that is unfortunately not the unit they are reported in:&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
For example, assume swidth=1024 was specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and the block size is 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth 128, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
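&lt;br /&gt;
The sector-to-block conversion can be checked with a little shell arithmetic, using the example values above:&lt;br /&gt;

```shell
# swidth as given to mkfs.xfs, in 512-byte sectors (example value)
sectors=1024
# filesystem block size in bytes, the bsize reported by mkfs.xfs
bsize=4096
# swidth as later reported by xfs_info/mkfs.xfs, in filesystem blocks
blocks=$(( sectors * 512 / bsize ))
echo "$blocks"
```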
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware raid, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware raid.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions, using NFS and Samba, without any corruption, so those appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again. For example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; reduces them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16Tb, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
Most likely your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again.&lt;br /&gt;
If it does, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%. 4 extents per file would give you 75%. This may or may not be a problem, depending especially on the size of the files in question (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented). The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
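&lt;br /&gt;
The arithmetic above can be reproduced directly; here for the 4-extents-per-file case:&lt;br /&gt;

```shell
# xfs_db frag formula: (actual extents - ideal extents) / actual extents,
# shown as a percentage for an average of 4 extents per file (ideal is 1)
actual=4
ideal=1
frag=$(( (actual - ideal) * 100 / actual ))
echo "$frag"
```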
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2828</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2828"/>
		<updated>2012-11-02T17:35:39Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: Does the filesystem have an undelete capability? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
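&lt;br /&gt;
For example, a hedged mount invocation enabling user quota (the device and mount point are placeholders; see &#039;&#039;&#039;mount(8)&#039;&#039;&#039; for the full set of XFS quota options such as uquota, gquota and pquota):&lt;br /&gt;

```shell
# Hypothetical device and mount point; uquota turns on user quota
# accounting and enforcement for this XFS mount.
mount -o uquota /dev/sdX1 /mnt/xfs
```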
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota and group quota cannot be used at the same time. User quota and project quota, on the other hand, can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove prjquota limits previously set on the fs (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ..., 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after the partition to do so. Remove the partition and recreate it larger, with the &#039;&#039;exact same&#039;&#039; starting point. Then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
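&lt;br /&gt;
Once the underlying device or partition has been enlarged, growing the filesystem itself is a single command (the mount point is a placeholder; xfs_growfs operates on a mounted filesystem):&lt;br /&gt;

```shell
# Grow the XFS filesystem to fill the enlarged underlying device.
xfs_growfs /mnt/xfs
```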
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
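Under a volume manager such as LVM, growing is just two commands. A minimal sketch (the volume group, logical volume, and mount point names are hypothetical):&lt;br /&gt;

```shell
# Grow the logical volume by 10GiB, then grow the mounted XFS
# filesystem to fill the new space. All names are examples only.
lvextend -L +10G /dev/vg0/data
xfs_growfs /mnt/data     # xfs_growfs takes the mount point; runs online
```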
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
&lt;br /&gt;
Then describe the workload that is causing the problem, and include a demonstration of the bad behaviour that occurs. If it is a performance problem, then 30 second to 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
&lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, run:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
The dmesg output will list all the hung processes in the machine, often pointing directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS.&lt;br /&gt;
&lt;br /&gt;
However, if an inode is unlinked but neither it nor its associated data blocks get immediately re-used and overwritten, there is some small chance to recover the file from the disk.  &lt;br /&gt;
&lt;br /&gt;
&#039;&#039;photorec&#039;&#039;, &#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; are some tools which attempt to do this, with varying success.&lt;br /&gt;
&lt;br /&gt;
There are also commercial data recovery services and closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS] which claims to recover data, although this has not been tested by the XFS developers.&lt;br /&gt;
&lt;br /&gt;
As always, the best advice is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; or standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. To preserve ACLs and extended attributes, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4), or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
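For example, a sketch of a level-0 dump and restore with &#039;&#039;&#039;xfsdump&#039;&#039;&#039; (the labels and paths here are placeholders):&lt;br /&gt;

```shell
# Level-0 (full) dump of /home to a file, preserving ACLs and EAs;
# -L is the session label, -M the media label. Paths are examples.
xfsdump -l 0 -L home-full -M media0 -f /backup/home.xfsdump /home

# Restore the dump into an existing target directory.
xfsrestore -f /backup/home.xfsdump /mnt/restore
```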
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. These messages contain important information that gives developers hints as to the earliest point at which a problem was detected. The shutdown is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
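A typical repair sequence looks like this (the device name is a placeholder):&lt;br /&gt;

```shell
umount /dev/sdX          # xfs_repair must run on an unmounted filesystem
xfs_repair -n /dev/sdX   # dry run first: report problems, change nothing
xfs_repair /dev/sdX      # actual repair; capture the full output
```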
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of filesystem metadata sitting in the cache is high enough that a power outage carries a serious risk of major data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with a battery backed cache running in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it must be re-applied after every drive reset, because the drive reverts to its default of write cache enabled. A reset can happen after a reboot or during error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
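One possible workaround (not from this FAQ, just a sketch) is a udev rule that re-runs hdparm whenever a disk appears, so the cache is disabled again after a reset or reboot. The rule file path and name here are examples:&lt;br /&gt;

```shell
# Write a udev rule that disables the write cache on every SATA/PATA
# disk each time it is (re)added. Rule file name is an example.
echo 'ACTION=="add", KERNEL=="sd[a-z]", RUN+="/sbin/hdparm -W0 /dev/%k"' \
    > /etc/udev/rules.d/60-wcache-off.rules
udevadm control --reload-rules   # make udev pick up the new rule
```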
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and one of the following messages logged, if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is reliable and does not reset randomly like some common ones do. But take care that the individual hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used to buffer writes and improve speed. This battery backed cache ensures that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a power failure with RAID-1 where only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks, so both may have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drives cache&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on, wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you don&#039;t care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit cache together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to also virtualize disk writes in a way that breaks even barriers, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system can destroy a database within the virtual machine (client, domU, whatever you call it), even with a RAID controller with battery backed cache and the hard disk caches turned off (a configuration which is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here, we have the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions. The default DOS partition tables can&#039;t. The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
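A sketch of setting up such a partition with parted (the device name is a placeholder; this destroys any existing partition table):&lt;br /&gt;

```shell
parted /dev/sdX mklabel gpt                 # create a GPT label
parted /dev/sdX mkpart primary xfs 0% 100%  # one partition spanning the disk
mkfs.xfs /dev/sdX1                          # then create the filesystem
```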
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space, but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
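For example (the device and mount point are placeholders):&lt;br /&gt;

```shell
# Mount with 64-bit inode numbers allowed anywhere on the device
mount -o inode64 /dev/sdX /srv/bigfs

# or make it persistent via an /etc/fstab line such as:
# /dev/sdX  /srv/bigfs  xfs  inode64  0 2
```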
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Is using noatime or/and nodiratime at mount time giving any performance benefits in xfs (or not using them performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; and zero out the inode&#039;s extent count and size, which will cause repair to clean it up and finish the removal process:&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md raid, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware raids.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
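The corresponding mkfs.xfs invocations for the two examples would be (device names are placeholders):&lt;br /&gt;

```shell
# 64k stripe unit, RAID-6 of 8 disks (6 data disks)
mkfs.xfs -d su=64k,sw=6 /dev/sdX

# 256k stripe unit, RAID-10 of 16 disks (8 data disks)
mkfs.xfs -d su=256k,sw=8 /dev/sdY
```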
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; that&#039;s unfortunately not the unit they&#039;re reported in, however.&lt;br /&gt;
&amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; report them in multiples of your basic block size (bsize) and not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
For example, assume swidth=1024 was specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and the block size is 4096 (the bsize shown in the mkfs.xfs output). You should then see swidth=128 reported in the mkfs.xfs output: 128 * 4096 == 1024 * 512.&lt;br /&gt;
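The conversion between the two units is easy to check with a little shell arithmetic:&lt;br /&gt;

```shell
swidth_sectors=1024    # swidth as given on the mkfs.xfs command line (512B sectors)
bsize=4096             # filesystem block size reported by mkfs.xfs
swidth_blocks=$(( swidth_sectors * 512 / bsize ))
echo "$swidth_blocks"  # the value xfs_info and mkfs.xfs report
```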
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as you would when creating the filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32 bits of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, and using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a 100TB disk, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; errors when you still have plenty of free space, because there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so distributions of that vintage or newer should be safe.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug that leads to strange problems if you later mount without inode64 again; for example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
When asked about the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
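If, after measuring, a larger directory block size does help your workload, it can only be chosen at mkfs time (the device name below is a placeholder):&lt;br /&gt;

```shell
# Directory block size is fixed at mkfs time and cannot be changed later.
mkfs.xfs -n size=64k /dev/sdX
```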
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values are already optimised for best performance. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
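A sketch of such a mount, for kernels where delaylog is a mount option (roughly 2.6.35 through 3.2; it became the only behaviour later). Device and mount point are placeholders:&lt;br /&gt;

```shell
# Larger log buffers plus delayed logging; trades post-crash recency for speed.
mount -o logbsize=256k,delaylog /dev/sdX /mnt
```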
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
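You can check and change the scheduler per block device at runtime; &amp;quot;sda&amp;quot; below is a placeholder:&lt;br /&gt;

```shell
cat /sys/block/sda/queue/scheduler       # current scheduler is shown in [brackets]
echo deadline > /sys/block/sda/queue/scheduler
```

This setting does not persist across reboots; set it via your boot loader or an init script if it helps.&lt;br /&gt;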
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option. Remounting with&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
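A hypothetical fstab entry with the option added might look like:&lt;br /&gt;

```
/dev/diskpart  /mnt/xfs  xfs  inode64  0  0
```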
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, depending especially on the size of the files in question (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
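The formula is easy to sanity-check with shell arithmetic:&lt;br /&gt;

```shell
actual=4; ideal=1                             # 4 extents where 1 would suffice
frag=$(( (actual - ideal) * 100 / actual ))   # (actual - ideal) / actual, as a percentage
echo "${frag}%"
```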
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2754</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2754"/>
		<updated>2012-07-19T15:51:03Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: I want to tune my XFS filesystems for  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
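A minimal setup sketch (the project name, project ID, and paths are hypothetical, and the filesystem must be mounted with the prjquota option):&lt;br /&gt;

```shell
# Register project ID 42 and its directory tree, then apply it and set a limit.
echo "42:/mnt/projects/webdata" >> /etc/projects
echo "webdata:42" >> /etc/projid
xfs_quota -x -c 'project -s webdata' /mnt
xfs_quota -x -c 'limit -p bhard=10g webdata' /mnt
```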
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
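A simple level-0 dump to a file and a restore might look like this (all paths are placeholders):&lt;br /&gt;

```shell
xfsdump -l 0 -f /backup/home.dump /home        # full (level 0) dump of /home
xfsrestore -f /backup/home.dump /mnt/restore   # restore into /mnt/restore
```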
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture - 4k for i386, ppc, ... 8k for alpha, sparc, ... - are possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after the partition to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to grow the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
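For example, with LVM the whole grow operation is just two commands (volume group, logical volume, and mount point names are hypothetical):&lt;br /&gt;

```shell
lvextend -L +50G /dev/vg0/data   # grow the logical volume
xfs_growfs /mnt/data             # grow the mounted XFS filesystem to fill it
```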
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS (so far).&lt;br /&gt;
&lt;br /&gt;
However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance of getting back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
In any case, it is best to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; or standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for regular files. To back up ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4), or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information.  In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A power failure &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller whose battery backed cache is running in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
Disabling the write cache is persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be redone after every reset, as the drive will revert to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It can be disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support flushes the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and a message reported in the log, if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device, flushing to both the data and log devices is currently not supported (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing while the write cache is enabled, then XFS reports that the device doesn&#039;t support barriers. And finally, XFS actually tests a barrier write on the superblock and checks its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and not resetting randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
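As a sketch (the device and mount point here are hypothetical), such a filesystem could be mounted with:&lt;br /&gt;

```shell
# Only safe when the controller cache is battery/flash backed
# and the individual disk write caches are off.
mount -o nobarrier /dev/sdb1 /data

# Or for an already-mounted filesystem:
mount -o remount,nobarrier /dev/sdb1 /data
```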
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation where, after a power failure with RAID-1 in which only parts of the disk caches have been written, the controller does not even see that the disks are out of sync: the disks can reorder cached blocks, so each might have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the cache of individual drives&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on; wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so you apparently don&#039;t care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting the cache of individual disks:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system you can destroy a database within the virtual machine (the client, domU, or whatever you call it), even with a RAID controller with battery backed cache and the hard disk caches turned off (a setup which is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option specifying the virtual disk. For the other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all its contents into &amp;quot;lost+found&amp;quot;, named by inode number (see the second example for how to map inode numbers to directory entry names, which must be done &#039;&#039;before&#039;&#039; removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
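For instance, a filesystem for the second configuration above could be created like this (/dev/sdX is a placeholder for the RAID volume):&lt;br /&gt;

```shell
# RAID-10 over 16 disks with a 256KB stripe size: 8 data disks.
mkfs.xfs -d su=256k,sw=8 /dev/sdX
```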
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth input as being specified in units of 512B sectors, but unfortunately that is not the unit they are reported in: both tools report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
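The unit conversion in that example can be verified with simple shell arithmetic:&lt;br /&gt;

```shell
# swidth passed to mkfs.xfs:   1024, in 512-byte sectors
# swidth reported by mkfs.xfs:  128, in 4096-byte blocks
echo $((1024 * 512))   # stripe width in bytes: 524288
echo $((128 * 4096))   # the same width in bytes: 524288
```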
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of a hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more place in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year on recent distributions (openSUSE 11.1 and higher) with NFS and Samba without any corruption, so those are probably recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again; for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Premature optimization is the root of all evil.&#039;&#039; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt;. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
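A sketch of how to check and change the scheduler at runtime (sda is a placeholder; the schedulers available depend on your kernel configuration):&lt;br /&gt;

```shell
# Show the available schedulers; the active one is shown in brackets.
cat /sys/block/sda/queue/scheduler

# Switch this device to the deadline scheduler.
echo deadline > /sys/block/sda/queue/scheduler
```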
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16Tb, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space.&lt;br /&gt;
(The -m 1 argument told xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
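From the two runs above one can estimate the per-inode memory overhead; a rough back-of-the-envelope using only the numbers xfs_repair printed:&lt;br /&gt;

```shell
# imem = 196882 KB for icount = 50401792 inodes (second run above)
awk 'BEGIN { printf "%.1f bytes per inode\n", 196882 * 1024 / 50401792 }'
# prints "4.0 bytes per inode"
```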
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with the inode64 option:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
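Those percentages follow directly from the formula; a quick check with shell arithmetic (illustrative numbers only):&lt;br /&gt;

```shell
# frag % = (actual extents - ideal extents) / actual extents * 100
awk 'BEGIN { printf "%d%%\n", (2 - 1) / 2 * 100 }'   # 2 extents per file -> 50%
awk 'BEGIN { printf "%d%%\n", (4 - 1) / 4 * 100 }'   # 4 extents per file -> 75%
```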
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2753</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2753"/>
		<updated>2012-07-19T15:50:34Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: /* Q: I want to tune my XFS filesystems for  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities mainly contain fixes and checks that previous versions might not have. New features are also added in a backward-compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI&#039;s Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
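&lt;br /&gt;
For example, assuming a filesystem on &#039;&#039;/dev/sda7&#039;&#039; mounted at &#039;&#039;/home&#039;&#039; (the device, mount point and user name here are illustrative), user quota could be enabled and inspected like this:&lt;br /&gt;
&lt;br /&gt;
 # mount -o uquota /dev/sda7 /home&lt;br /&gt;
 # xfs_quota -x -c &#039;limit bsoft=500m bhard=550m someuser&#039; /home&lt;br /&gt;
 # xfs_quota -x -c report /home&lt;br /&gt;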
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
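&lt;br /&gt;
As a sketch, setting up a project quota involves assigning a project ID to a directory tree and then applying a limit to it (the project name &#039;&#039;foo&#039;&#039;, the ID and the paths are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # echo &amp;quot;42:/home/projects/foo&amp;quot; &amp;gt;&amp;gt; /etc/projects&lt;br /&gt;
 # echo &amp;quot;foo:42&amp;quot; &amp;gt;&amp;gt; /etc/projid&lt;br /&gt;
 # mount -o prjquota /dev/sda7 /home&lt;br /&gt;
 # xfs_quota -x -c &#039;project -s foo&#039; /home&lt;br /&gt;
 # xfs_quota -x -c &#039;limit -p bhard=10g foo&#039; /home&lt;br /&gt;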
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota and group quota cannot be used at the same time. User quota and project quota, on the other hand, can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem with prjquota (project quota) enabled and remounting it with grpquota (group quota) remove the previously set project quota limits (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
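&lt;br /&gt;
For example, a level-0 dump to tape and the corresponding restore, or a direct filesystem-to-filesystem copy through a pipe, could look like this (the devices and paths are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # xfsdump -l 0 -f /dev/st0 /home&lt;br /&gt;
 # xfsrestore -f /dev/st0 /home&lt;br /&gt;
 # xfsdump -J - /home | xfsrestore -J - /new/home&lt;br /&gt;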
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
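&lt;br /&gt;
Within such an init ramdisk, the root mount would look something like this (the device names are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # mount -o logdev=/dev/sdb1 /dev/sda2 /sysroot&lt;br /&gt;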
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5) - Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition in question to do so: remove the partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
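&lt;br /&gt;
For example, with LVM the sequence is simply (the volume and mount point names are illustrative; note that &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; operates on the &#039;&#039;mounted&#039;&#039; filesystem):&lt;br /&gt;
&lt;br /&gt;
 # lvextend -L +10G /dev/vg0/data&lt;br /&gt;
 # xfs_growfs /data&lt;br /&gt;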
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, your machine hardware and storage configuration needs to be described. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem, and demonstrate the bad behaviour that is occurring. If it is a performance problem, then 30-second to 1-minute samples of:&lt;br /&gt;
&lt;br /&gt;
 # iostat -x -d -m 5&lt;br /&gt;
 # vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or forgot to load the module), or you did not use the &amp;quot;-t xfs&amp;quot; option with mount or the &amp;quot;xfs&amp;quot; filesystem type in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
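&lt;br /&gt;
For example, an explicit mount and the corresponding &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry would look like this (the device and mount point are illustrative; the first line is the command, the second the fstab entry):&lt;br /&gt;
&lt;br /&gt;
 # mount -t xfs /dev/hda5 /mnt&lt;br /&gt;
 /dev/hda5  /mnt  xfs  defaults  0 0&lt;br /&gt;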
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS (so far).&lt;br /&gt;
&lt;br /&gt;
However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed-source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
&lt;br /&gt;
Such XFS driver implementations do not re-use directory entries immediately, so there is a chance to get back recently deleted files, even with their real names.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
In any case, the best approach is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs as well, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
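&lt;br /&gt;
For instance, an &#039;&#039;&#039;rsync&#039;&#039;&#039; invocation that preserves ACLs (-A) and extended attributes (-X) alongside the usual archive options would be (the paths are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # rsync -aAX /home/ /backup/home/&lt;br /&gt;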
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; to remedy the problem (with the filesystem unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
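&lt;br /&gt;
For example, to inspect such a file (the path is illustrative):&lt;br /&gt;
&lt;br /&gt;
 # xfs_bmap -v /path/to/file&lt;br /&gt;
&lt;br /&gt;
A file that shows a non-zero size in &#039;&#039;&#039;ls(1)&#039;&#039;&#039; but only holes in the &#039;&#039;&#039;xfs_bmap&#039;&#039;&#039; output is exactly the case described above.&lt;br /&gt;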
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you run a very high risk of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with a battery-backed controller cache running in write-back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This setting is persistent for a SCSI disk. For a SATA/PATA disk, however, it needs to be redone after every reset, as the drive will revert to its default of write cache enabled. A reset can happen after a reboot or on error recovery of the drive, which makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
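&lt;br /&gt;
For reference, an external log is specified at mkfs time and again at mount time (the device names and log size are illustrative):&lt;br /&gt;
&lt;br /&gt;
 # mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/sda1&lt;br /&gt;
 # mount -o logdev=/dev/sdb1 /dev/sda1 /mnt&lt;br /&gt;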
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this reported in the log, if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot; - assuming your RAID controller is infallible and not resetting randomly like some common ones do. But take care that the hard disk write caches are off.&lt;br /&gt;
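&lt;br /&gt;
As an illustration, such a filesystem could be listed in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; like this (the device and mount point are examples):&lt;br /&gt;
&lt;br /&gt;
 /dev/sdb1  /data  xfs  nobarrier  0 0&lt;br /&gt;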
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery- or flash-backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery-backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all of their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write cache on. That can lead to the bad situation after a power failure with RAID-1, when only parts of the disk caches have been written: the controller does not even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drives&#039; caches:&lt;br /&gt;
 arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back, which means write cache on; wt = write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe anyway - you apparently do not care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting the individual disks&#039; caches:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; That means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly, since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that by unplugging the power from such a system - even with a RAID controller with battery-backed cache and the hard disk caches turned off (which is safe on a normal host) - you can destroy a database within the virtual machine (client, domU, whatever you call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; currently does not correct these directories when it detects this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the first example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all of its contents into &amp;quot;lost+found&amp;quot;, named by inode number. The second example shows how to map inode numbers to directory entry names, which needs to be done &#039;&#039;before&#039;&#039; removing the directory itself. The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), then the leaf blocks (extents 14 through 18), then the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking, this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables cannot.  The best partition table format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without &amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt; you can&#039;t even create the filesystem; without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; everything works fine until you reboot, at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
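&lt;br /&gt;
For example, a GPT label and a single full-disk partition can be created with &#039;&#039;&#039;parted(8)&#039;&#039;&#039; (&amp;lt;tt&amp;gt;/dev/sdX&amp;lt;/tt&amp;gt; is a placeholder for your disk; this destroys the existing partition table, so back up first):&lt;br /&gt;
&lt;br /&gt;
  # parted /dev/sdX mklabel gpt&lt;br /&gt;
  # parted /dev/sdX mkpart primary xfs 0% 100%&lt;br /&gt;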
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show plenty of free space, but attempts to write to the filesystem fail with ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to be allocated above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows the filesystem layout to be optimized for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32) and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controller&#039;s stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
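&lt;br /&gt;
These values can be passed straight to &#039;&#039;&#039;mkfs.xfs&#039;&#039;&#039; at filesystem creation time; for the RAID-10 example above this would be (&amp;lt;tt&amp;gt;/dev/sdX&amp;lt;/tt&amp;gt; is a placeholder for the RAID device):&lt;br /&gt;
&lt;br /&gt;
  # mkfs.xfs -d su=256k,sw=8 /dev/sdX&lt;br /&gt;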
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; accept sunit and swidth in units of 512B sectors, but unfortunately report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
For example, assume swidth=1024 is specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and the block size is 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth=128, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, use the same sunit/swidth values as when creating the XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32 bits of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, and using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of free space, but there&#039;s no more room in the first TB to create a new inode. Performance also suffers, because inodes end up far away from their data.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent distributions (openSUSE 11.1 and higher) using NFS and Samba without any corruption, so a similarly recent distro should be fine.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug that leads to strange problems if you then mount without inode64 again; for example, you can no longer access files and directories that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Premature optimization is the root of all evil.&amp;quot; - Donald Knuth&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options make much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to  configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default I/O scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space.&lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very probable that your filesystem needs to be mounted with the inode64 option. Running&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
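&lt;br /&gt;
A corresponding &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry might look like this (device and mount point are placeholders matching the example above):&lt;br /&gt;
&lt;br /&gt;
  /dev/diskpart  /mnt/xfs  xfs  inode64  0  0&lt;br /&gt;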
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
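&lt;br /&gt;
For example, to see the extent layout of a single file (the path is a placeholder):&lt;br /&gt;
&lt;br /&gt;
  # xfs_bmap -v /path/to/file&lt;br /&gt;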
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2638</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2638"/>
		<updated>2012-06-21T00:44:00Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: DIE xfs_check DIE!&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
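&lt;br /&gt;
For example, user and group quota can be enabled at mount time and then inspected with &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039; (device and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o uquota,gquota /dev/sdX /mnt&lt;br /&gt;
  # xfs_quota -x -c &#039;report -h&#039; /mnt&lt;br /&gt;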
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
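&lt;br /&gt;
A minimal project quota setup might look like this, using &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039; (the project name &amp;quot;myproj&amp;quot;, project ID 42, device, directory and limit are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # mount -o prjquota /dev/sdX /mnt&lt;br /&gt;
  # echo &amp;quot;42:/mnt/tree&amp;quot; &amp;gt;&amp;gt; /etc/projects&lt;br /&gt;
  # echo &amp;quot;myproj:42&amp;quot; &amp;gt;&amp;gt; /etc/projid&lt;br /&gt;
  # xfs_quota -x -c &#039;project -s myproj&#039; /mnt&lt;br /&gt;
  # xfs_quota -x -c &#039;limit -p bhard=10g myproj&#039; /mnt&lt;br /&gt;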
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used together with group quota. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Is umounting prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) removing prjquota limits previously set from fs (and vice versa) ? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
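&lt;br /&gt;
For example, to dump a mounted filesystem to a file and restore it elsewhere (the dump file and mount points are placeholders; a tape device can be given instead of a file):&lt;br /&gt;
&lt;br /&gt;
  # xfsdump -f /backup/fs.dump /mnt/xfs&lt;br /&gt;
  # xfsrestore -f /backup/fs.dump /mnt/restored&lt;br /&gt;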
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also, not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture are possible for now: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller online. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition in question to do so. Remove the partition and recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
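&lt;br /&gt;
For example, with an XFS filesystem on an LVM logical volume, growing takes just two commands (the volume name, size and mount point are placeholders):&lt;br /&gt;
&lt;br /&gt;
  # lvextend -L +100G /dev/vg0/lv0&lt;br /&gt;
  # xfs_growfs /mount/point&lt;br /&gt;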
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
What you need to report depends on the problem you are seeing. Firstly, describe your machine hardware and storage configuration. That includes:&lt;br /&gt;
&lt;br /&gt;
* kernel version (uname -a)&lt;br /&gt;
* xfsprogs version (xfs_repair -V)&lt;br /&gt;
* number of CPUs&lt;br /&gt;
* contents of /proc/meminfo&lt;br /&gt;
* contents of /proc/mounts&lt;br /&gt;
* contents of /proc/partitions&lt;br /&gt;
* RAID layout (hardware and/or software)&lt;br /&gt;
* LVM configuration&lt;br /&gt;
* type of disks you are using&lt;br /&gt;
* write cache status of drives&lt;br /&gt;
* size of BBWC and mode it is running in&lt;br /&gt;
* xfs_info output on the filesystem in question&lt;br /&gt;
* dmesg output showing all error messages and stack traces&lt;br /&gt;
 &lt;br /&gt;
Then you need to describe the workload that is causing the problem, and provide a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s to 1 minute samples of:&lt;br /&gt;
&lt;br /&gt;
# iostat -x -d -m 5&lt;br /&gt;
# vmstat 5&lt;br /&gt;
 &lt;br /&gt;
can give us insight into the IO and memory utilisation of your machine at the time of the problem.&lt;br /&gt;
&lt;br /&gt;
If the filesystem is hanging, then capture the output of the dmesg command after running:&lt;br /&gt;
&lt;br /&gt;
 # echo w &amp;gt; /proc/sysrq-trigger&lt;br /&gt;
 # dmesg&lt;br /&gt;
&lt;br /&gt;
This will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.&lt;br /&gt;
&lt;br /&gt;
And for advanced users, capturing an event trace using &#039;&#039;&#039;trace-cmd&#039;&#039;&#039; (git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git) will be very helpful. In many cases the XFS developers will ask for this information anyway, so it&#039;s a good idea to be ready with it in advance. Start the trace with this command, either from a directory not on an XFS filesystem or with an output file destination on a non-XFS filesystem:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd record -e xfs\*&lt;br /&gt;
&lt;br /&gt;
before the problem occurs, and once it has occurred, kill the trace-cmd with ctrl-C, and then run:&lt;br /&gt;
&lt;br /&gt;
 # trace-cmd report &amp;gt; trace_report.txt&lt;br /&gt;
&lt;br /&gt;
Compress the trace_report.txt file and include that with the bug report. The reason for trying to host the output of the record command on a different filesystem is so that the writing of the output file does not pollute the trace of the problem we are trying to diagnose.&lt;br /&gt;
&lt;br /&gt;
If you have a problem with &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039;, make sure that you save the entire output of the problematic run so that the developers can see exactly where it encountered the problem. You might be asked to capture the metadata in the filesystem using &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
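&lt;br /&gt;
For example (device and output file are placeholders; ideally run this with the filesystem unmounted):&lt;br /&gt;
&lt;br /&gt;
  # xfs_metadump /dev/sdX fs_metadump.img&lt;br /&gt;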
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS (so far).&lt;br /&gt;
&lt;br /&gt;
However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed-source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded linux and XFS file system.&lt;br /&gt;
&lt;br /&gt;
Anyway, the best is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to back up ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
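&lt;br /&gt;
If an application needs its data (and not just metadata) on disk at a known point, it must request that explicitly. A minimal sketch using GNU dd&#039;s conv=fsync flag (the temporary file is just an illustration):&lt;br /&gt;

```shell
# conv=fsync makes dd call fsync() on the output file before exiting,
# so the file's data blocks are flushed to disk, not just its metadata.
tmp=$(mktemp)
printf 'important data' | dd of="$tmp" conv=fsync status=none
data=$(cat "$tmp")
echo "$data"
rm -f "$tmp"
```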
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of filesystem metadata sitting in those caches is high enough that a power outage risks major data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A power failure then &amp;quot;only&amp;quot; loses data still in the cache; no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery backed controller cache and cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
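&lt;br /&gt;
For example, a hypothetical &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry for such a battery-backed setup (device and mount point are placeholders):&lt;br /&gt;

```
# RAID volume with battery-backed controller cache (disk caches disabled):
/dev/sdb1  /data  xfs  nobarrier  0  2
```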
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE) (for SATA this requires at least kernel 2.6.15 because [http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b095518ef51c37658c58367bd19240b8a113f25c ATA command passthrough support]):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk this setting is persistent. For a SATA/PATA disk, however, it needs to be redone after every reset, as the drive will revert to its default of write cache enabled; a reset can happen after a reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write has completed to disk, and we cannot guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of these three scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
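&lt;br /&gt;
A quick way to check for these messages after mounting (a sketch; the exact wording in your log may vary by kernel version):&lt;br /&gt;

```shell
# Scan the kernel log for any of the three "Disabling barriers" messages.
# If none are found (or dmesg is unreadable), report that barriers look OK.
out=$(dmesg 2>/dev/null | grep -i 'Disabling barriers' || echo "no barrier problems logged")
echo "$out"
```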
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support by mounting the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is reliable and not resetting randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache ensures that if power fails or a PSU dies, the contents of the cache will be written to disk on the next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all their contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a power failure with RAID-1 when only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks and might have saved the superblock info but lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting the individual drives&#039; caches&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back (write cache on), wt=write through (write cache off). So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so it assumes you don&#039;t care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it on &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So that means you can only set the drive caches and the unit caches together. To protect your data, turn them off, but write performance will suffer badly since the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk &lt;br /&gt;
writes in a way that even barriers no longer work, which means even &lt;br /&gt;
an fsync is not reliable. Tests confirm that unplugging the power from &lt;br /&gt;
such a system - even with a RAID controller with battery backed cache and &lt;br /&gt;
the hard disk caches turned off (which is safe on a normal host) - can &lt;br /&gt;
destroy a database within the virtual machine (guest, domU, or whatever you &lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual &lt;br /&gt;
disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shut down if the problem has not been rectified (on disk), making it seem as though other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039; should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate the progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit on XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that xfs_repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs.xfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; accept sunit and swidth in units of 512B sectors, but unfortunately report them in multiples of your basic block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume, for example, swidth 1024 specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and a block size of 4096 (the bsize reported in the mkfs.xfs output). mkfs.xfs will then report swidth 128, since 128 * 4096 == 1024 * 512.&lt;br /&gt;
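&lt;br /&gt;
The conversion between the two notations is simple arithmetic; for the 64KB stripe / RAID-6 example above (the device name in the printed command is a placeholder):&lt;br /&gt;

```shell
# Express a 64KiB stripe unit and 6 data disks as sunit/swidth in
# 512-byte sectors, the unit mkfs.xfs expects for -d sunit=N,swidth=M.
su_kib=64
ndata=6
sunit=$(( su_kib * 1024 / 512 ))   # 64KiB -> 128 sectors
swidth=$(( sunit * ndata ))        # 6 data disks -> 768 sectors
echo "mkfs.xfs -d sunit=$sunit,swidth=$swidth /dev/sdX"
```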
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; errors when you still have plenty of free space, but there is no more room in the first TB to create a new inode. Performance also suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
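&lt;br /&gt;
For example, a hypothetical &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt; entry (device and mount point are placeholders):&lt;br /&gt;

```
# large XFS volume; allow inodes to be placed anywhere on the device
/dev/sdc1  /bigdisk  xfs  inode64  0  2
```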
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruptions, so that might be a recent enough distro.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting from kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you mount without inode64 again. For example, you can no longer access files &amp;amp; dirs that were created with an inode number above 32 bits.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only ones that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
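The arithmetic behind these estimates can be sketched; a rough Python reconstruction, assuming the minimum is approximately imem + dmem converted to MB plus a fixed baseline of about 48MB inferred from the two examples above (the baseline is our assumption, not taken from the xfs_repair sources):&lt;br /&gt;

```python
import math

def estimate_repair_mem_mb(imem_kb, dmem_kb, baseline_mb=48):
    """Rough reconstruction of xfs_repair's minimum-memory estimate.

    imem_kb:  memory for inode tracking (scales with icount)
    dmem_kb:  memory for free-space tracking (scales with dblock)
    baseline_mb: fixed overhead inferred from the two examples above;
                 an assumption, not taken from the xfs_repair sources.
    """
    return math.ceil((imem_kb + dmem_kb) / 1024) + baseline_mb

# Empty 16TB filesystem: icount = 64, imem = 0, dmem = 2097152
print(estimate_repair_mem_mb(0, 2097152))       # 2096
# Same filesystem with ~50 million inodes: imem = 196882
print(estimate_repair_mem_mb(196882, 2097152))  # 2289
```

Both values match the xfs_repair output shown above, but treat this as an illustration of how icount and dblock drive the requirement, not as the exact formula.&lt;br /&gt;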
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows a listing like:&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option; remounting with&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again.&lt;br /&gt;
If it does, add the option to &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
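If the remount works, the option can be made persistent via &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;; a sketch, reusing the device and mount point from the example above as placeholders:&lt;br /&gt;

```
# /etc/fstab entry (device and mount point are placeholders)
/dev/diskpart  /mnt/xfs  xfs  defaults,inode64  0  0
```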
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (e.g. a 400GB file in four 100GB extents would hardly be considered badly fragmented).  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
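The formula above is easy to play with; a minimal sketch in Python (the function name is ours, not part of xfsprogs):&lt;br /&gt;

```python
def frag_factor(actual_extents, ideal_extents):
    """Fragmentation percentage as computed by xfs_db's "frag" command."""
    return 100 * (actual_extents - ideal_extents) / actual_extents

# An average of 2 extents per file where 1 would do: 50%
print(frag_factor(2, 1))    # 50.0
# 4 extents per file: 75%
print(frag_factor(4, 1))    # 75.0
# A few dozen extents per file already approaches 100%
print(frag_factor(100, 1))  # 99.0
```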
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2482</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2482"/>
		<updated>2012-04-12T04:06:37Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI&#039;s Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand, user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Does unmounting a filesystem mounted with prjquota (project quota) and remounting it with grpquota (group quota) remove the prjquota limits previously set (and vice versa)? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux, keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you cannot read IRIX filesystems which use the XLV volume manager; and not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ...; 8k for alpha, sparc, ...). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5); Linux can only read v2 directories.&lt;br /&gt;
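The blocksize constraint can be checked programmatically on the target machine; a minimal sketch in Python (the helper name is ours, not a real tool):&lt;br /&gt;

```python
import os

def blocksize_mountable(fs_blocksize):
    """Linux can only mount an XFS filesystem whose block size is less than
    or equal to the page size of the architecture (e.g. 4k on i386/ppc,
    8k on alpha/sparc)."""
    return fs_blocksize <= os.sysconf("SC_PAGE_SIZE")

# A 512-byte blocksize is fine on any Linux architecture
print(blocksize_mountable(512))
# A large IRIX blocksize may not mount on a small-pagesize machine
print(blocksize_mountable(65536))
```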
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS partition smaller. The only way to shrink is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need free space after the partition to grow it. Delete the partition and recreate it larger, with the &#039;&#039;exact same&#039;&#039; starting point, then run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to enlarge the filesystem. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
Things to include are the version of XFS you are using (if it is a CVS version, the checkout date) and the kernel version. If you have problems with userland packages, please report the version of the package you are using.&lt;br /&gt;
&lt;br /&gt;
If the problem relates to a particular filesystem, the output from the &#039;&#039;&#039;xfs_info(8)&#039;&#039;&#039; command and any &#039;&#039;&#039;mount(8)&#039;&#039;&#039; options in use will also be useful to the developers.&lt;br /&gt;
&lt;br /&gt;
If you experience an oops, please run it through &#039;&#039;&#039;ksymoops&#039;&#039;&#039; so that it can be interpreted.&lt;br /&gt;
&lt;br /&gt;
If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS (so far).&lt;br /&gt;
&lt;br /&gt;
However, at least some XFS driver implementations do not completely wipe inode information, so there is a chance of recovering files with specialized commercial closed-source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
&lt;br /&gt;
Such implementations also do not re-use directory entries immediately, so there is a chance of getting back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
In any case, the best protection is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039;, or with standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for ordinary files. To back up ACLs and EAs you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4), or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
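The window described above can be closed from the application side; a minimal sketch in Python of the fsync pattern the answer refers to:&lt;br /&gt;

```python
import os
import tempfile

# XFS journals metadata, not data: a crash can leave a file whose size was
# journalled but whose data blocks never reached the disk. Forcing the data
# out with fsync before relying on it closes that window.
path = os.path.join(tempfile.mkdtemp(), "important.dat")
with open(path, "wb") as f:
    f.write(b"critical data")
    f.flush()             # push userspace buffers into the kernel page cache
    os.fsync(f.fileno())  # ask the kernel to write the data out to the device

with open(path, "rb") as f:
    print(f.read())  # b'critical data'
```

Opening the file with O_SYNC or O_DIRECT, as mentioned above, achieves a similar guarantee per write instead of per explicit flush.&lt;br /&gt;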
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of up to 32MB (as of Jan 2009), that can be a lot of valuable information. In a RAID with 8 such disks, this adds up to 256MB, and the chance of having filesystem metadata in the cache is so high that you have a very high chance of big data losses on a power outage.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (on=default), the drive write cache is flushed before and after a barrier is issued.  A powerfail &amp;quot;only&amp;quot; loses data in the cache but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery-backed controller cache and the cache in write back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance. But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA(IDE): (although for SATA this only works on a recent kernel with ATA command passthrough):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8) which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
This disabling is kept persistent for a SCSI disk. However, for a SATA/PATA disk this needs to be done after every reset as it will reset back to the default of the write cache enabled. And a reset can happen after reboot or on error recovery of the drive. This makes it rather difficult to guarantee that the write cache is maintained as disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support is enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled and reported in the log if any of the following 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot;, assuming your RAID controller is infallible and not resetting randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard of mainboards) normally have a battery or flash backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. This battery backed cache should ensure that if power fails or a PSU dies, the contents of the cache will be written to disk on next boot. However, the individual hard disk write caches need to be turned off, as they are not protected from a powerfail and will just lose all contents.&lt;br /&gt;
&lt;br /&gt;
If you do not have a battery or flash backed cache you should seriously consider disabling write cache if you value your data.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, those controllers have no cache, but leave the hard disk write caches on. That can lead to a bad situation: after a powerfail with RAID-1, when only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync, as the disks can reorder cached blocks and might have saved the superblock info but then lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, this will disable the controller and disk cache (see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf, page 86); &lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drive caches&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb = write back, which means write cache on; wt = write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off, to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so you don&#039;t seem to care about your data and just want high speed (which you get then).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disks cache:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; So that means you can only set the drive caches and the unit caches together. To protect your data, turn it off, but write performance will suffer badly as the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that those products seem to also virtualize disk&lt;br /&gt;
writes in a way that even barriers don&#039;t work any more, which means even&lt;br /&gt;
an fsync is not reliable. Tests confirm that unplugging the power from&lt;br /&gt;
such a system, even with a RAID controller with battery-backed cache and&lt;br /&gt;
hard disk caches turned off (which is safe on a normal host), can&lt;br /&gt;
destroy a database within the virtual machine (client, domU, whatever you&lt;br /&gt;
call it).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the line specifying the virtual&lt;br /&gt;
disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;xfs_check&#039;&#039;&#039; tool, or &#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039;, should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19 above). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for&lt;br /&gt;
&amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without &amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt; you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot, at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
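As a sketch (the device name &amp;lt;tt&amp;gt;/dev/sdX&amp;lt;/tt&amp;gt; is a placeholder), a GPT disk label can be created with parted before making the filesystem:&lt;br /&gt;

```shell
# Placeholder device name - adjust for your system, and back up first.
# Create a GPT disk label (this replaces any existing partition table):
parted -s /dev/sdX mklabel gpt
# Create one partition spanning the whole disk:
parted -s /dev/sdX mkpart primary 0% 100%
# Then make the filesystem on the new partition:
mkfs.xfs /dev/sdX1
```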
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) shows enough free space but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit in XFS (or does not using them cause a performance decrease)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How do I get around a bad inode that xfs_repair is unable to clean up? ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mkfs and mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with MD RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
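To illustrate, these values are passed to &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; with the -d option (the device name below is a placeholder):&lt;br /&gt;

```shell
# RAID-6 of 8 disks with a 64KB stripe size: 6 data disks.
mkfs.xfs -d su=64k,sw=6 /dev/sdX
# The same geometry expressed in 512B sectors for the mount options:
# sunit = 64KB / 512B = 128, swidth = 128 * 6 = 768
mount -o sunit=128,swidth=768 /dev/sdX /mnt
```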
&lt;br /&gt;
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors, but that is unfortunately not the unit they are reported in: they report these values in multiples of your basic block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
Assume for example: swidth 1024 (specified at mkfs.xfs command line; so 1024 of 512B sectors) and block size of 4096 (bsize reported by mkfs.xfs at output). You should see swidth 128 (reported by mkfs.xfs at output). 128 * 4096 == 1024 * 512.&lt;br /&gt;
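The conversion in this example can be checked with shell arithmetic (a minimal sketch of the unit math described above):&lt;br /&gt;

```shell
# sunit/swidth given on the command line are counted in 512B sectors;
# xfs_info and mkfs.xfs report them in filesystem blocks (bsize).
swidth_sectors=1024
bsize=4096
swidth_blocks=$(( swidth_sectors * 512 / bsize ))
echo "$swidth_blocks"   # prints 128, matching the mkfs.xfs output
```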
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To work around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed in the location where their data is, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so distributions of that vintage appear to be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again. For example, you can no longer access files and directories that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a&lt;br /&gt;
directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block sized directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
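For example, a larger log buffer size can be requested at mount time (the device and mount point below are placeholders; on later kernels delayed logging became the default behaviour, so the explicit option is only needed on older ones):&lt;br /&gt;

```shell
# Fewer, larger journal writes at the cost of more operations
# potentially lost after a crash:
mount -o logbsize=256k,delaylog /dev/sdX /mnt
```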
&lt;br /&gt;
As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much of the parallelization in XFS.&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but basically empty (look at icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB is needed for tracking free space. &lt;br /&gt;
(The -m 1 argument was telling xfs_repair to use only 1 MB of memory.)&lt;br /&gt;
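Note the mixed units: imem and dmem are reported in KB, while -m and the suggested minimum are in MB. The numbers above line up as follows (a rough sketch; the small remainder is fixed overhead):&lt;br /&gt;

```shell
# Free space tracking from the first report, converted KB to MB:
dmem_kb=2097152
echo $(( dmem_kb / 1024 ))               # prints 2048, of the suggested 2096 MB
# The second report adds imem=196882 KB of inode tracking on top:
imem_kb=196882
echo $(( (dmem_kb + imem_kb) / 1024 ))   # prints 2240, of the suggested 2289 MB
```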
&lt;br /&gt;
Now if we add some inodes (50 million) to the filesystem (look at icount again), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs at least another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot; ? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows you a listing like&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with inode64:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work ok again.&lt;br /&gt;
If it works, add the option to fstab.&lt;br /&gt;
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (i.e. a 400GB file in four 100GB extents would hardly be considered badly fragmented.)  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;br /&gt;
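The formula can be sketched with integer shell arithmetic (assuming one extent per file is the ideal):&lt;br /&gt;

```shell
# Fragmentation factor in percent for an average of N extents per file:
frag() { echo $(( 100 * ($1 - 1) / $1 )); }
frag 2    # prints 50
frag 4    # prints 75
frag 20   # prints 95
```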
&lt;br /&gt;
Note that above a few average extents per file, the fragmentation factor rapidly approaches 100%:&lt;br /&gt;
[[Image:Frag_factor.png|500px]]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=File:Frag_factor.png&amp;diff=2481</id>
		<title>File:Frag factor.png</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=File:Frag_factor.png&amp;diff=2481"/>
		<updated>2012-04-12T04:02:26Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: Graph of fragmentation factor vs. avg extents per file&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Graph of fragmentation factor vs. avg extents per file&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2416</id>
		<title>XFS FAQ</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_FAQ&amp;diff=2416"/>
		<updated>2012-02-16T23:14:44Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Info from: [http://oss.sgi.com/projects/xfs/faq.html main XFS faq at SGI]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Many thanks to earlier maintainers of this document - Thomas Graichen and Seth Mos.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about XFS? ==&lt;br /&gt;
&lt;br /&gt;
The SGI XFS project page http://oss.sgi.com/projects/xfs/ is the definitive reference. It contains pointers to whitepapers, books, articles, etc.&lt;br /&gt;
&lt;br /&gt;
You could also join the [[XFS_email_list_and_archives|XFS mailing list]] or the &#039;&#039;&#039;&amp;lt;nowiki&amp;gt;#xfs&amp;lt;/nowiki&amp;gt;&#039;&#039;&#039; IRC channel on &#039;&#039;irc.freenode.net&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find documentation about ACLs? ==&lt;br /&gt;
&lt;br /&gt;
Andreas Gruenbacher maintains the Extended Attribute and POSIX ACL documentation for Linux at http://acl.bestbits.at/&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;acl(5)&#039;&#039;&#039; manual page is also quite extensive.&lt;br /&gt;
&lt;br /&gt;
== Q: Where can I find information about the internals of XFS? ==&lt;br /&gt;
&lt;br /&gt;
An [http://oss.sgi.com/projects/xfs/training/ SGI XFS Training course] aimed at developers, triage and support staff, and serious users has been in development. Parts of the course are clearly still incomplete, but there is enough content to be useful to a broad range of users.&lt;br /&gt;
&lt;br /&gt;
Barry Naujok has documented the [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS ondisk format] which is a very useful reference.&lt;br /&gt;
&lt;br /&gt;
== Q: What partition type should I use for XFS on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Linux native filesystem (83).&lt;br /&gt;
&lt;br /&gt;
== Q: What mount options does XFS have? ==&lt;br /&gt;
&lt;br /&gt;
There are a number of mount options influencing XFS filesystems - refer to the &#039;&#039;&#039;mount(8)&#039;&#039;&#039; manual page or the documentation in the kernel source tree itself ([http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Documentation/filesystems/xfs.txt;hb=HEAD Documentation/filesystems/xfs.txt])&lt;br /&gt;
&lt;br /&gt;
== Q: Is there any relation between the XFS utilities and the kernel version? ==&lt;br /&gt;
&lt;br /&gt;
No, there is no relation. Newer utilities tend to mainly have fixes and checks the previous versions might not have. New features are also added in a backward compatible way - if they are enabled via mkfs, an incapable (old) kernel will recognize that it does not understand the new feature, and refuse to mount the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Does it run on platforms other than i386? ==&lt;br /&gt;
&lt;br /&gt;
XFS runs on all of the platforms that Linux supports. It is more tested on the more common platforms, especially the i386 family. It&#039;s also well tested on the IA64 platform, since that&#039;s the platform SGI Linux products use.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Do quotas work on XFS? ==&lt;br /&gt;
&lt;br /&gt;
Yes.&lt;br /&gt;
&lt;br /&gt;
To use quotas with XFS, you need to enable XFS quota support when you configure your kernel. You also need to specify quota support when mounting. You can get the Linux quota utilities at their sourceforge website [http://sourceforge.net/projects/linuxquota/  http://sourceforge.net/projects/linuxquota/] or use &#039;&#039;&#039;xfs_quota(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: What&#039;s project quota? ==&lt;br /&gt;
&lt;br /&gt;
Project quota is a quota mechanism in XFS that can be used to implement a form of directory tree quota, where a specified directory and all of the files and subdirectories below it (i.e. a tree) can be restricted to using a subset of the available space in the filesystem.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Can group quota and project quota be used at the same time? ==&lt;br /&gt;
&lt;br /&gt;
No, project quota cannot be used with group quota at the same time. On the other hand user quota and project quota can be used simultaneously.&lt;br /&gt;
&lt;br /&gt;
== Q: Quota: Is umounting prjquota (project quota) enabled fs and mounting it again with grpquota (group quota) removing prjquota limits previously set from fs (and vice versa) ? ==&lt;br /&gt;
&lt;br /&gt;
To be answered.&lt;br /&gt;
&lt;br /&gt;
== Q: Are there any dump/restore tools for XFS? ==&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and &#039;&#039;&#039;xfsrestore(8)&#039;&#039;&#039; are fully supported. The tape format is the same as on IRIX, so tapes are interchangeable between operating systems.&lt;br /&gt;
&lt;br /&gt;
== Q: Does LILO work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
This depends on where you install LILO.&lt;br /&gt;
&lt;br /&gt;
Yes, for MBR (Master Boot Record) installations.&lt;br /&gt;
&lt;br /&gt;
No, for root partition installations because the XFS superblock is written at block zero, where LILO would be installed. This is to maintain compatibility with the IRIX on-disk format, and will not be changed.&lt;br /&gt;
&lt;br /&gt;
== Q: Does GRUB work with XFS? ==&lt;br /&gt;
&lt;br /&gt;
There is native XFS filesystem support for GRUB starting with version 0.91 and onward. Unfortunately, GRUB used to make incorrect assumptions about being able to read a block device image while a filesystem is mounted and actively being written to, which could cause intermittent problems when using XFS. This has reportedly since been fixed, and the 0.97 version (at least) of GRUB is apparently stable.&lt;br /&gt;
&lt;br /&gt;
== Q: Can XFS be used for a root filesystem? ==&lt;br /&gt;
&lt;br /&gt;
Yes, with one caveat: Linux does not support an external XFS journal for the root filesystem via the &amp;quot;rootflags=&amp;quot; kernel parameter. To use an external journal for the root filesystem in Linux, an init ramdisk must mount the root filesystem with explicit &amp;quot;logdev=&amp;quot; specified. [http://mindplusplus.wordpress.com/2008/07/27/scratching-an-i.html More information here.]&lt;br /&gt;
&lt;br /&gt;
== Q: Will I be able to use my IRIX XFS filesystems on Linux? ==&lt;br /&gt;
&lt;br /&gt;
Yes. The on-disk format of XFS is the same on IRIX and Linux. Obviously, you should back-up your data before trying to move it between systems. Filesystems must be &amp;quot;clean&amp;quot; when moved (i.e. unmounted). If you plan to use IRIX filesystems on Linux keep the following points in mind: the kernel needs to have SGI partition support enabled; there is no XLV support in Linux, so you are unable to read IRIX filesystems which use the XLV volume manager; also not all blocksizes available on IRIX are available on Linux (only blocksizes less than or equal to the pagesize of the architecture: 4k for i386, ppc, ... 8k for alpha, sparc, ... is possible for now). Make sure that the directory format is version 2 on the IRIX filesystems (this is the default since IRIX 6.5.5). Linux can only read v2 directories.&lt;br /&gt;
&lt;br /&gt;
== Q: Is there a way to make a XFS filesystem larger or smaller? ==&lt;br /&gt;
&lt;br /&gt;
You can &#039;&#039;NOT&#039;&#039; make an XFS filesystem smaller. The only way to shrink it is to do a complete dump, mkfs and restore.&lt;br /&gt;
&lt;br /&gt;
An XFS filesystem may be enlarged by using &#039;&#039;&#039;xfs_growfs(8)&#039;&#039;&#039;.&lt;br /&gt;
&lt;br /&gt;
If using partitions, you need to have free space after this partition to do so. Remove partition, recreate it larger with the &#039;&#039;exact same&#039;&#039; starting point. Run &#039;&#039;&#039;xfs_growfs&#039;&#039;&#039; to make the partition larger. Note - editing partition tables is a dangerous pastime, so back up your filesystem before doing so.&lt;br /&gt;
&lt;br /&gt;
Using XFS filesystems on top of a volume manager makes this a lot easier.&lt;br /&gt;
&lt;br /&gt;
== Q: What information should I include when reporting a problem? ==&lt;br /&gt;
&lt;br /&gt;
Things to include are which version of XFS you are using (if it is a CVS version, the date of the checkout) and the version of the kernel. If you have problems with userland packages, please report the version of the package you are using.&lt;br /&gt;
&lt;br /&gt;
If the problem relates to a particular filesystem, the output from the &#039;&#039;&#039;xfs_info(8)&#039;&#039;&#039; command and any &#039;&#039;&#039;mount(8)&#039;&#039;&#039; options in use will also be useful to the developers.&lt;br /&gt;
&lt;br /&gt;
If you experience an oops, please run it through &#039;&#039;&#039;ksymoops&#039;&#039;&#039; so that it can be interpreted.&lt;br /&gt;
&lt;br /&gt;
If you have a filesystem that cannot be repaired, make sure you have xfsprogs 2.9.0 or later and run &#039;&#039;&#039;xfs_metadump(8)&#039;&#039;&#039; to capture the metadata (which obfuscates filenames and attributes to protect your privacy) and make the dump available for someone to analyse.&lt;br /&gt;
&lt;br /&gt;
== Q: Mounting an XFS filesystem does not work - what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
If mount prints an error message something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
     mount: /dev/hda5 has wrong major or minor number&lt;br /&gt;
&lt;br /&gt;
you either do not have XFS compiled into the kernel (or you forgot to load the modules) or you did not use the &amp;quot;-t xfs&amp;quot; option on mount or the &amp;quot;xfs&amp;quot; option in &amp;lt;tt&amp;gt;/etc/fstab&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
If you get something like:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 mount: wrong fs type, bad option, bad superblock on /dev/sda1,&lt;br /&gt;
        or too many mounted file systems&lt;br /&gt;
&lt;br /&gt;
Refer to your system log file (&amp;lt;tt&amp;gt;/var/log/messages&amp;lt;/tt&amp;gt;) for a detailed diagnostic message from the kernel.&lt;br /&gt;
&lt;br /&gt;
== Q: Does the filesystem have an undelete capability? ==&lt;br /&gt;
&lt;br /&gt;
There is no undelete in XFS (so far).&lt;br /&gt;
&lt;br /&gt;
However, at least some XFS driver implementations do not wipe file information nodes completely, so there is a chance to recover files with specialized commercial closed-source software like [http://www.ufsexplorer.com/rdr_xfs.php Raise Data Recovery for XFS].&lt;br /&gt;
&lt;br /&gt;
Such an implementation also does not re-use directory entries immediately, so there is a chance to get back recently deleted files even with their real names.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;xfs_irecover&#039;&#039; or &#039;&#039;xfsr&#039;&#039; may help too, [http://www.who.is.free.fr/wiki/doku.php?id=recover this site] has a few links.&lt;br /&gt;
&lt;br /&gt;
This applies to most recent Linux distributions (versions?), as well as to most popular NAS boxes that use embedded Linux and the XFS filesystem.&lt;br /&gt;
&lt;br /&gt;
Anyway, the best is to always keep backups.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I backup a XFS filesystem and ACLs? ==&lt;br /&gt;
&lt;br /&gt;
You can back up an XFS filesystem with utilities like &#039;&#039;&#039;xfsdump(8)&#039;&#039;&#039; and standard &#039;&#039;&#039;tar(1)&#039;&#039;&#039; for standard files. If you want to preserve ACLs and EAs, you will need to use &#039;&#039;&#039;xfsdump&#039;&#039;&#039;, [http://www.bacula.org/en/dev-manual/Current_State_Bacula.html Bacula] (&amp;gt; version 3.1.4) or [http://rsync.samba.org/ rsync] (&amp;gt;= version 3.0.0). &#039;&#039;&#039;xfsdump&#039;&#039;&#039; can also be integrated with [http://www.amanda.org/ amanda(8)].&lt;br /&gt;
&lt;br /&gt;
== Q: I see applications returning error 990 or &amp;quot;Structure needs cleaning&amp;quot;, what is wrong? ==&lt;br /&gt;
&lt;br /&gt;
The error 990 stands for [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_linux.h#l145 EFSCORRUPTED] which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=commit;h=da2f4d679c8070ba5b6a920281e495917b293aa0 converted from EFSCORRUPTED/990 over to using EUCLEAN], &amp;quot;Structure needs cleaning.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
The cause can be pretty much anything, unfortunately - filesystem, virtual memory manager, volume manager, device driver, or hardware.&lt;br /&gt;
&lt;br /&gt;
There should be a detailed console message when this initially happens. The messages contain important information, giving developers hints about the earliest point at which the problem was detected. The shutdown is there to protect your data.&lt;br /&gt;
&lt;br /&gt;
You can use xfs_check and xfs_repair to remedy the problem (with the file system unmounted).&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I see binary NULLS in some files after recovery when I unplugged the power? ==&lt;br /&gt;
&lt;br /&gt;
Update: This issue has been addressed with a CVS fix on the 29th March 2007 and merged into mainline on 8th May 2007 for 2.6.22-rc1.&lt;br /&gt;
&lt;br /&gt;
XFS journals metadata updates, not data updates. After a crash you are supposed to get a consistent filesystem which looks like the state sometime shortly before the crash, NOT what the in memory image looked like the instant before the crash.&lt;br /&gt;
&lt;br /&gt;
Since XFS does not write data out immediately unless you tell it to with fsync, an O_SYNC or O_DIRECT open (the same is true of other filesystems), you are looking at an inode which was flushed out, but whose data was not. Typically you&#039;ll find that the inode is not taking any space since all it has is a size but no extents allocated (try examining the file with the &#039;&#039;&#039;xfs_bmap(8)&#039;&#039;&#039; command).&lt;br /&gt;
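To illustrate, an application or script that must not lose data across a power cut has to flush it explicitly. A minimal shell sketch (the file names here are just examples, not from the FAQ):&lt;br /&gt;

```shell
# Write some data, then force it to stable storage.
printf 'important data' > /tmp/xfs-sync-demo.txt

# conv=fsync makes dd call fsync(2) on the output file before exiting,
# so the data blocks (not just the inode) have reached the disk.
dd if=/tmp/xfs-sync-demo.txt of=/tmp/xfs-sync-demo.out conv=fsync status=none

# coreutils >= 8.24 can also fsync a single named file:
sync /tmp/xfs-sync-demo.out
```

Without such an explicit flush, the data may still live only in the page cache when power is cut, producing exactly the NULL-filled files described above.&lt;br /&gt;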
&lt;br /&gt;
== Q: What is the problem with the write cache on journaled filesystems? ==&lt;br /&gt;
&lt;br /&gt;
Many drives use a write back cache in order to speed up the performance of writes.  However, there are conditions such as power failure when the write cache memory is never flushed to the actual disk.  Further, the drive can de-stage data from the write cache to the platters in any order that it chooses.  This causes problems for XFS and journaled filesystems in general because they rely on knowing when a write has completed to the disk. They need to know that the log information has made it to disk before allowing metadata to go to disk.  When the metadata makes it to disk then the transaction can effectively be deleted from the log resulting in movement of the tail of the log and thus freeing up some log space. So if the writes never make it to the physical disk, then the ordering is violated and the log and metadata can be lost, resulting in filesystem corruption.&lt;br /&gt;
&lt;br /&gt;
With hard disk cache sizes of currently (Jan 2009) up to 32MB, that can be a lot of valuable information. In a RAID with 8 such disks this adds up to 256MB, and the chance of having filesystem metadata in the cache is high enough that a power outage carries a very high risk of major data loss.&lt;br /&gt;
&lt;br /&gt;
With a single hard disk and barriers turned on (the default), the drive write cache is flushed before and after a barrier is issued.  A power failure &amp;quot;only&amp;quot; loses data in the cache, but no essential ordering is violated, and corruption will not occur.&lt;br /&gt;
&lt;br /&gt;
With a RAID controller with battery-backed controller cache in write-back mode, you should turn off barriers - they are unnecessary in this case, and if the controller honors the cache flushes, they will be harmful to performance.  But then you *must* disable the individual hard disk write caches in order to keep the filesystem intact after a power failure. The method for doing this is different for each RAID controller. See the section about RAID controllers below.&lt;br /&gt;
&lt;br /&gt;
== Q: How can I tell if I have the disk write cache enabled? ==&lt;br /&gt;
&lt;br /&gt;
For SCSI/SATA:&lt;br /&gt;
&lt;br /&gt;
* Look in dmesg(8) output for a driver line, such as:&amp;lt;br /&amp;gt; &amp;quot;SCSI device sda: drive cache: write back&amp;quot;&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# sginfo -c /dev/sda | grep -i &#039;write cache&#039; &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For PATA/SATA (although for SATA this only works on a recent kernel with ATA command passthrough):&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -I /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; and look under &amp;quot;Enabled Supported&amp;quot; for &amp;quot;Write cache&amp;quot;&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
== Q: How can I address the problem with the disk write cache? ==&lt;br /&gt;
&lt;br /&gt;
=== Disabling the disk write back cache. ===&lt;br /&gt;
&lt;br /&gt;
For SATA/PATA (IDE) (although for SATA this only works on a recent kernel with ATA command passthrough):&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# hdparm -W0 /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # hdparm -W0 /dev/hda&lt;br /&gt;
* &amp;lt;nowiki&amp;gt;# blktool /dev/sda wcache off&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; # blktool /dev/hda wcache off&lt;br /&gt;
&lt;br /&gt;
For SCSI:&lt;br /&gt;
&lt;br /&gt;
* Using sginfo(8), which is a little tedious&amp;lt;br /&amp;gt; It takes 3 steps. For example:&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -c /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives a list of attribute names and values&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cX /dev/sda&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; which gives an array of cache values which you must match up with from step 1, e.g.&amp;lt;br /&amp;gt; 0 0 0 1 0 1 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&lt;br /&gt;
*# &amp;lt;nowiki&amp;gt;#sginfo -cXR /dev/sda 0 0 0 1 0 0 0 0 0 0 65535 0 65535 65535 1 0 0 0 3 0 0&amp;lt;/nowiki&amp;gt;&amp;lt;br /&amp;gt; allows you to reset the value of the cache attributes.&lt;br /&gt;
&lt;br /&gt;
For RAID controllers:&lt;br /&gt;
&lt;br /&gt;
* See the section about RAID controllers below&lt;br /&gt;
&lt;br /&gt;
For a SCSI disk, this setting is persistent. For a SATA/PATA disk, however, it must be redone after every reset, since the drive reverts to its default of write cache enabled. A reset can happen after a reboot or during error recovery of the drive. This makes it rather difficult to guarantee that the write cache stays disabled.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Using an external log. ===&lt;br /&gt;
&lt;br /&gt;
Some people have considered the idea of using an external log on a separate drive with the write cache disabled and the rest of the file system on another disk with the write cache enabled. However, that will &#039;&#039;&#039;not&#039;&#039;&#039; solve the problem. For example, the tail of the log is moved when we are notified that a metadata write is completed to disk and we won&#039;t be able to guarantee that if the metadata is on a drive with the write cache enabled.&lt;br /&gt;
&lt;br /&gt;
In fact using an external log will disable XFS&#039; write barrier support.&lt;br /&gt;
&lt;br /&gt;
=== Write barrier support. ===&lt;br /&gt;
&lt;br /&gt;
Write barrier support has been enabled by default in XFS since kernel version 2.6.17. It is disabled by mounting the filesystem with &amp;quot;nobarrier&amp;quot;. Barrier support will flush the write back cache at the appropriate times (such as on XFS log writes). This is generally the recommended solution; however, you should check the system logs to ensure it was successful. Barriers will be disabled, and this reported in the log, if any of these 3 scenarios occurs:&lt;br /&gt;
&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported with external log device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, not supported by the underlying device&amp;quot;&lt;br /&gt;
* &amp;quot;Disabling barriers, trial barrier write failed&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If the filesystem is mounted with an external log device then we currently don&#039;t support flushing to the data and log devices (this may change in the future). If the driver tells the block layer that the device does not support write cache flushing with the write cache enabled then it will report that the device doesn&#039;t support it. And finally we will actually test out a barrier write on the superblock and test its error state afterwards, reporting if it fails.&lt;br /&gt;
&lt;br /&gt;
== Q. Should barriers be enabled with storage which has a persistent write cache? ==&lt;br /&gt;
&lt;br /&gt;
Many hardware RAID controllers have a persistent write cache which is preserved across power failures, interface resets, system crashes, etc. Using write barriers in this case is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with &amp;quot;nobarrier&amp;quot; - assuming your RAID controller is infallible and does not reset randomly like some common ones do.  But take care that the hard disk write caches are off.&lt;br /&gt;
&lt;br /&gt;
== Q. Which settings does my RAID controller need ? ==&lt;br /&gt;
&lt;br /&gt;
It&#039;s hard to tell because there are so many controllers. Please consult your RAID controller documentation to determine how to change these settings, but we try to give an overview here:&lt;br /&gt;
&lt;br /&gt;
Real RAID controllers (not those found onboard mainboards) normally have a battery-backed cache (or an [http://en.wikipedia.org/wiki/Electric_double-layer_capacitor ultracapacitor] + flash memory &amp;quot;[http://www.tweaktown.com/articles/2800/adaptec_zero_maintenance_cache_protection_explained/ zero maintenance cache]&amp;quot;) which is used for buffering writes to improve speed. Even if the controller cache is battery backed, the individual hard disk write caches need to be turned off, as they are not protected from a power failure and will simply lose all contents in that case.&lt;br /&gt;
&lt;br /&gt;
* onboard RAID controllers: there are so many different types it&#039;s hard to tell. Generally, these controllers have no cache of their own but leave the hard disk write caches on. That can lead to the bad situation that, after a power failure with RAID-1 when only parts of the disk caches have been written, the controller doesn&#039;t even see that the disks are out of sync: the disks can reorder cached blocks, so both might have saved the superblock info but then lost different data contents. So, turn off the disk write caches before using the RAID function.&lt;br /&gt;
&lt;br /&gt;
* 3ware: /cX/uX set cache=off, see http://www.3ware.com/support/UserDocs/CLIGuide-9.5.1.1.pdf , page 86&lt;br /&gt;
&lt;br /&gt;
* Adaptec: allows setting individual drive caches&lt;br /&gt;
arcconf setcache &amp;lt;disk&amp;gt; wb|wt&lt;br /&gt;
wb=write back, which means write cache on; wt=write through, which means write cache off. So &amp;quot;wt&amp;quot; should be chosen.&lt;br /&gt;
&lt;br /&gt;
* Areca: In archttp under &amp;quot;System Controls&amp;quot; -&amp;gt; &amp;quot;System Configuration&amp;quot; there&#039;s the option &amp;quot;Disk Write Cache Mode&amp;quot; (defaults &amp;quot;Auto&amp;quot;)&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Off&amp;quot;: disk write cache is turned off&lt;br /&gt;
&lt;br /&gt;
&amp;quot;On&amp;quot;: disk write cache is enabled, this is not safe for your data but fast&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Auto&amp;quot;: If you use a BBM (battery backup module, which you really should use if you care about your data), the controller automatically turns the disk write caches off to protect your data. If no BBM is attached, the controller switches to &amp;quot;On&amp;quot;, because then neither the controller cache nor the disk cache is safe, so you apparently don&#039;t care about your data and just want high speed (which you then get).&lt;br /&gt;
&lt;br /&gt;
That&#039;s a very sensible default, so you can leave it at &amp;quot;Auto&amp;quot; or enforce &amp;quot;Off&amp;quot; to be sure.&lt;br /&gt;
&lt;br /&gt;
* LSI MegaRAID: allows setting individual disk caches:&lt;br /&gt;
 MegaCli -AdpCacheFlush -aN|-a0,1,2|-aALL                          # flushes the controller cache&lt;br /&gt;
 MegaCli -LDGetProp -Cache    -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the controller cache settings&lt;br /&gt;
 MegaCli -LDGetProp -DskCache -LN|-L0,1,2|-LAll -aN|-a0,1,2|-aALL  # shows the disk cache settings (for all phys. disks in logical disk)&lt;br /&gt;
 MegaCli -LDSetProp -EnDskCache|DisDskCache  -LN|-L0,1,2|-LAll  -aN|-a0,1,2|-aALL # set disk cache setting&lt;br /&gt;
&lt;br /&gt;
* Xyratex: from the docs: &amp;quot;Write cache includes the disk drive cache and controller cache.&amp;quot; This means you can only set the drive caches and the unit caches together. To protect your data, turn them off, but write performance will suffer badly because the controller write cache is disabled as well.&lt;br /&gt;
&lt;br /&gt;
== Q: Which settings are best with virtualization like VMware, XEN, qemu? ==&lt;br /&gt;
&lt;br /&gt;
The biggest problem is that these products seem to virtualize disk writes in a way that even barriers no longer work, which means even an fsync is not reliable. Tests confirm that unplugging the power from such a system can destroy a database within the virtual machine (guest, domU, or whatever you call it), even with a RAID controller with battery-backed cache and the hard disk caches turned off (a configuration that is safe on a normal host).&lt;br /&gt;
&lt;br /&gt;
In qemu you can specify cache=off on the option that defines the virtual disk. For other products this information is missing.&lt;br /&gt;
&lt;br /&gt;
== Q: What is the issue with directory corruption in Linux 2.6.17? ==&lt;br /&gt;
&lt;br /&gt;
In the Linux kernel 2.6.17 release a subtle bug was accidentally introduced into the XFS directory code by some &amp;quot;sparse&amp;quot; endian annotations. This bug was sufficiently uncommon (it only affects a certain type of format change, in Node or B-Tree format directories, and only in certain situations) that it was not detected during our regular regression testing, but it has been observed in the wild by a number of people now.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: the fix is included in 2.6.17.7 and later kernels.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
To add insult to injury, &#039;&#039;&#039;xfs_repair(8)&#039;&#039;&#039; is currently not correcting these directories on detection of this corrupt state either. This &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; issue is actively being worked on, and a fixed version will be available shortly.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Update: a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; is now available; version 2.8.10 or later of the xfsprogs package contains the fixed version.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
No other kernel versions are affected. However, using a corrupt filesystem on other kernels can still result in the filesystem being shutdown if the problem has not been rectified (on disk), making it seem like other kernels are affected.&lt;br /&gt;
&lt;br /&gt;
The &#039;&#039;&#039;xfs_check&#039;&#039;&#039; tool, or &#039;&#039;&#039;xfs_repair -n&#039;&#039;&#039;, should be able to detect any directory corruption.&lt;br /&gt;
&lt;br /&gt;
Until a fixed &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; binary is available, one can make use of the &#039;&#039;&#039;xfs_db(8)&#039;&#039;&#039; command to mark the problem directory for removal (see the example below). A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; invocation will remove the directory and move all contents into &amp;quot;lost+found&amp;quot;, named by inode number (see second example on how to map inode number to directory entry name, which needs to be done _before_ removing the directory itself). The inode number of the corrupt directory is included in the shutdown report issued by the kernel on detection of directory corruption. Using that inode number, this is how one would ensure it is removed:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 core.mode = 040755&lt;br /&gt;
 core.version = 2&lt;br /&gt;
 core.format = 3 (btree)&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; write core.mode 0&lt;br /&gt;
 xfs_db&amp;amp;gt; quit&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A subsequent &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; will clear the directory, and add new entries (named by inode number) in lost+found.&lt;br /&gt;
&lt;br /&gt;
The easiest way to map inode numbers to full paths is via &#039;&#039;&#039;xfs_ncheck(8)&#039;&#039;&#039;&amp;lt;nowiki&amp;gt;: &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_ncheck -i 14101 -i 14102 /dev/sdXXX&lt;br /&gt;
       14101 full/path/mumble_fratz_foo_bar_1495&lt;br /&gt;
       14102 full/path/mumble_fratz_foo_bar_1494&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Should this not work, we can manually map inode numbers in a B-Tree format directory by taking the following steps:&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;&lt;br /&gt;
 # xfs_db -x /dev/sdXXX&lt;br /&gt;
 xfs_db&amp;amp;gt; inode NNN&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 core.magic = 0x494e&lt;br /&gt;
 ...&lt;br /&gt;
 next_unlinked = null&lt;br /&gt;
 u.bmbt.level = 1&lt;br /&gt;
 u.bmbt.numrecs = 1&lt;br /&gt;
 u.bmbt.keys[1] = [startoff] 1:[0]&lt;br /&gt;
 u.bmbt.ptrs[1] = 1:3628&lt;br /&gt;
 xfs_db&amp;amp;gt; fsblock 3628&lt;br /&gt;
 xfs_db&amp;amp;gt; type bmapbtd&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 magic = 0x424d4150&lt;br /&gt;
 level = 0&lt;br /&gt;
 numrecs = 19&lt;br /&gt;
 leftsib = null&lt;br /&gt;
 rightsib = null&lt;br /&gt;
 recs[1-19] = [startoff,startblock,blockcount,extentflag]&lt;br /&gt;
        1:[0,3088,4,0] 2:[4,3128,8,0] 3:[12,3308,4,0] 4:[16,3360,4,0]&lt;br /&gt;
        5:[20,3496,8,0] 6:[28,3552,8,0] 7:[36,3624,4,0] 8:[40,3633,4,0]&lt;br /&gt;
        9:[44,3688,8,0] 10:[52,3744,4,0] 11:[56,3784,8,0]&lt;br /&gt;
        12:[64,3840,8,0] 13:[72,3896,4,0] 14:[33554432,3092,4,0]&lt;br /&gt;
        15:[33554436,3488,8,0] 16:[33554444,3629,4,0]&lt;br /&gt;
        17:[33554448,3748,4,0] 18:[33554452,3900,4,0]&lt;br /&gt;
        19:[67108864,3364,4,0]&lt;br /&gt;
 &amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
At this point we are looking at the extents that hold all of the directory information. There are three types of extent here: the data blocks (extents 1 through 13 above), the leaf blocks (extents 14 through 18), and the freelist blocks (extent 19). The jumps in the first field (start offset) indicate our progression through each of the three types. For recovering file names, we are only interested in the data blocks, so we can now feed those offset numbers into the &#039;&#039;&#039;xfs_db&#039;&#039;&#039; dblock command. So, for the fifth extent - 5:[20,3496,8,0] - listed above:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 ...&lt;br /&gt;
 xfs_db&amp;amp;gt; dblock 20&lt;br /&gt;
 xfs_db&amp;amp;gt; print&lt;br /&gt;
 dhdr.magic = 0x58443244&lt;br /&gt;
 dhdr.bestfree[0].offset = 0&lt;br /&gt;
 dhdr.bestfree[0].length = 0&lt;br /&gt;
 dhdr.bestfree[1].offset = 0&lt;br /&gt;
 dhdr.bestfree[1].length = 0&lt;br /&gt;
 dhdr.bestfree[2].offset = 0&lt;br /&gt;
 dhdr.bestfree[2].length = 0&lt;br /&gt;
 du[0].inumber = 13937&lt;br /&gt;
 du[0].namelen = 25&lt;br /&gt;
 du[0].name = &amp;quot;mumble_fratz_foo_bar_1595&amp;quot;&lt;br /&gt;
 du[0].tag = 0x10&lt;br /&gt;
 du[1].inumber = 13938&lt;br /&gt;
 du[1].namelen = 25&lt;br /&gt;
 du[1].name = &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;&lt;br /&gt;
 du[1].tag = 0x38&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
So, here we can see that inode number 13938 matches up with name &amp;quot;mumble_fratz_foo_bar_1594&amp;quot;. Iterate through all the extents, and extract all the name-to-inode-number mappings you can, as these will be useful when looking at &amp;quot;lost+found&amp;quot; (once &#039;&#039;&#039;xfs_repair&#039;&#039;&#039; has removed the corrupt directory).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why does my &amp;gt; 2TB XFS partition disappear when I reboot ? ==&lt;br /&gt;
&lt;br /&gt;
Strictly speaking this is not an XFS problem.&lt;br /&gt;
&lt;br /&gt;
To support &amp;gt; 2TB partitions you need two things: a kernel that supports large block devices (&amp;lt;tt&amp;gt;CONFIG_LBD=y&amp;lt;/tt&amp;gt;) and a partition table format that can hold large partitions.  The default DOS partition tables don&#039;t.  The best partition format for &amp;gt; 2TB partitions is the EFI GPT format (&amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Without CONFIG_LBD=y you can&#039;t even create the filesystem, but without &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt; it works fine until you reboot at which point the partition will disappear.  Note that you need to enable the &amp;lt;tt&amp;gt;CONFIG_PARTITION_ADVANCED&amp;lt;/tt&amp;gt; option before you can set &amp;lt;tt&amp;gt;CONFIG_EFI_PARTITION=y&amp;lt;/tt&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Q: Why do I receive &amp;lt;tt&amp;gt;No space left on device&amp;lt;/tt&amp;gt; after &amp;lt;tt&amp;gt;xfs_growfs&amp;lt;/tt&amp;gt;? ==&lt;br /&gt;
&lt;br /&gt;
After [http://oss.sgi.com/pipermail/xfs/2009-January/039828.html growing an XFS filesystem], df(1) may show enough free space, but attempts to write to the filesystem result in -ENOSPC. To fix this, [http://oss.sgi.com/pipermail/xfs/2009-January/039835.html Dave Chinner advised]:&lt;br /&gt;
&lt;br /&gt;
  The only way to fix this is to move data around to free up space&lt;br /&gt;
  below 1TB. Find your oldest data (i.e. that was around before even&lt;br /&gt;
  the first grow) and move it off the filesystem (move, not copy).&lt;br /&gt;
  Then if you copy it back on, the data blocks will end up above 1TB&lt;br /&gt;
  and that should leave you with plenty of space for inodes below 1TB.&lt;br /&gt;
  &lt;br /&gt;
  A complete dump and restore will also fix the problem ;)&lt;br /&gt;
&lt;br /&gt;
Also, you can add &#039;inode64&#039; to your mount options to allow inodes to live above 1TB.&lt;br /&gt;
&lt;br /&gt;
Example: [https://www.centos.org/modules/newbb/viewtopic.php?topic_id=30703&amp;amp;forum=38 No space left on device on xfs filesystem with 7.7TB free]&lt;br /&gt;
&lt;br /&gt;
== Q: Does using noatime and/or nodiratime at mount time give any performance benefit in XFS (or does not using them decrease performance)? ==&lt;br /&gt;
&lt;br /&gt;
The default atime behaviour is relatime, which has almost no overhead compared to noatime but still maintains sane atime values. All Linux filesystems use this as the default now (since around 2.6.30), but XFS has used relatime-like behaviour since 2006, so no-one should really need to ever use noatime on XFS for performance reasons. &lt;br /&gt;
&lt;br /&gt;
Also, noatime implies nodiratime, so there is never a need to specify nodiratime when noatime is also specified.&lt;br /&gt;
&lt;br /&gt;
== Q: How to get around a bad inode that repair is unable to clean up ==&lt;br /&gt;
&lt;br /&gt;
The trick is to go in with xfs_db and mark the inode as deleted, which will cause repair to clean it up and finish the removal process.&lt;br /&gt;
&lt;br /&gt;
  xfs_db -x -c &#039;inode XXX&#039; -c &#039;write core.nextents 0&#039; -c &#039;write core.size 0&#039; /dev/hdXX&lt;br /&gt;
&lt;br /&gt;
== Q: How to calculate the correct sunit,swidth values for optimal performance ==&lt;br /&gt;
&lt;br /&gt;
XFS allows you to optimize for a given RAID stripe unit (stripe size) and stripe width (number of data disks) via mount options.&lt;br /&gt;
&lt;br /&gt;
These options can sometimes be autodetected (for example with md RAID, a recent enough kernel (&amp;gt;= 2.6.32), and xfsprogs (&amp;gt;= 3.1.1) built with libblkid support), but manual calculation is needed for most hardware RAIDs.&lt;br /&gt;
&lt;br /&gt;
The calculation of these values is quite simple:&lt;br /&gt;
&lt;br /&gt;
  su = &amp;lt;RAID controllers stripe size in BYTES (or KiBytes when used with k)&amp;gt;&lt;br /&gt;
  sw = &amp;lt;# of data disks (don&#039;t count parity disks)&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So if your RAID controller has a stripe size of 64KB, and you have a RAID-6 with 8 disks, use&lt;br /&gt;
&lt;br /&gt;
  su = 64k&lt;br /&gt;
  sw = 6 (RAID-6 of 8 disks has 6 data disks)&lt;br /&gt;
&lt;br /&gt;
A RAID stripe size of 256KB with a RAID-10 over 16 disks should use&lt;br /&gt;
&lt;br /&gt;
  su = 256k&lt;br /&gt;
  sw = 8 (RAID-10 of 16 disks has 8 data disks)&lt;br /&gt;
&lt;br /&gt;
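The data-disk counts in these examples follow a simple rule per RAID level (RAID-5: n-1, RAID-6: n-2, RAID-10: n/2). A quick shell sanity check of the two examples above:&lt;br /&gt;

```shell
# Number of data disks for the examples above.
raid6_disks=8
raid10_disks=16
echo "RAID-6 of $raid6_disks disks:   $((raid6_disks - 2)) data disks"   # 6
echo "RAID-10 of $raid10_disks disks: $((raid10_disks / 2)) data disks"  # 8
```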
Alternatively, you can use &amp;quot;sunit&amp;quot; instead of &amp;quot;su&amp;quot; and &amp;quot;swidth&amp;quot; instead of &amp;quot;sw&amp;quot; but then sunit/swidth values need to be specified in &amp;quot;number of 512B sectors&amp;quot;!&lt;br /&gt;
&lt;br /&gt;
Note that &amp;lt;tt&amp;gt;xfs_info&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; interpret sunit and swidth as being specified in units of 512B sectors; unfortunately, that is not the unit they are reported in: both tools report them in multiples of the filesystem block size (bsize), not in 512B sectors.&lt;br /&gt;
&lt;br /&gt;
For example, assume swidth=1024 is specified on the mkfs.xfs command line (i.e. 1024 512B sectors) and the block size is 4096 (the bsize reported in the mkfs.xfs output). You should then see swidth 128 reported by mkfs.xfs: 128 * 4096 == 1024 * 512.&lt;br /&gt;
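The unit conversions in this section can be checked with plain shell arithmetic; the values below reuse the 64k stripe unit RAID-6 example and the swidth=1024 example from above:&lt;br /&gt;

```shell
# su/sw from the RAID-6 example: 64KB stripe unit, 6 data disks.
su_bytes=$((64 * 1024))
sw=6

# sunit/swidth as mkfs/mount options are given in 512B sectors:
sunit=$((su_bytes / 512))   # 65536 / 512 = 128 sectors
swidth=$((sunit * sw))      # 128 * 6    = 768 sectors
echo "sunit=$sunit swidth=$swidth (in 512B sectors)"

# xfs_info/mkfs.xfs REPORT the values in filesystem blocks (bsize):
bsize=4096
echo "reported swidth for 1024 sectors: $((1024 * 512 / bsize)) blocks"  # 128
```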
&lt;br /&gt;
When creating an XFS filesystem on top of LVM on top of hardware RAID, please use the same sunit/swidth values as when creating an XFS filesystem directly on top of the hardware RAID.&lt;br /&gt;
&lt;br /&gt;
== Q: Why doesn&#039;t NFS-exporting subdirectories of inode64-mounted filesystem work? ==&lt;br /&gt;
&lt;br /&gt;
The default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; type encodes only 32-bit of the inode number for subdirectory exports.  However, exporting the root of the filesystem works, or using one of the non-default &amp;lt;tt&amp;gt;fsid&amp;lt;/tt&amp;gt; types (&amp;lt;tt&amp;gt;fsid=uuid&amp;lt;/tt&amp;gt; in &amp;lt;tt&amp;gt;/etc/exports&amp;lt;/tt&amp;gt; with recent &amp;lt;tt&amp;gt;nfs-utils&amp;lt;/tt&amp;gt;) should work as well. (Thanks, Christoph!)&lt;br /&gt;
&lt;br /&gt;
== Q: What is the inode64 mount option for? ==&lt;br /&gt;
&lt;br /&gt;
By default, with 32bit inodes, XFS places inodes only in the first 1TB of a disk. If you have a disk with 100TB, all inodes will be stuck in the first TB. This can lead to strange things like &amp;quot;disk full&amp;quot; when you still have plenty of space free, but there&#039;s no more room in the first TB to create a new inode. Also, performance suffers.&lt;br /&gt;
&lt;br /&gt;
To get around this, use the inode64 mount option for filesystems &amp;gt;1TB. Inodes will then be placed close to the location of their data, minimizing disk seeks.&lt;br /&gt;
&lt;br /&gt;
Beware that some old programs might have problems reading 64bit inodes, especially over NFS. Your editor has used inode64 for over a year with recent (openSUSE 11.1 and higher) distributions using NFS and Samba without any corruption, so those may be recent enough.&lt;br /&gt;
&lt;br /&gt;
== Q: Can I just try the inode64 option to see if it helps me? ==&lt;br /&gt;
&lt;br /&gt;
Starting with kernel 2.6.35, you can try it and then switch back. Older kernels have a bug leading to strange problems if you then mount without inode64 again; for example, you can no longer access files &amp;amp; dirs that were created with an inode number &amp;gt;32bit.&lt;br /&gt;
&lt;br /&gt;
== Q: Performance: mkfs.xfs -n size=64k option ==&lt;br /&gt;
&lt;br /&gt;
Asking the implications of that mkfs option on the XFS mailing list, Dave Chinner explained it this way:&lt;br /&gt;
&lt;br /&gt;
Inodes are not stored in the directory structure, only the directory entry name and the inode number. Hence the amount of space used by a directory entry is determined by the length of the name.&lt;br /&gt;
&lt;br /&gt;
There is extra overhead to allocate large directory blocks (16 pages instead of one, to begin with, then there&#039;s the vmap overhead, etc), so for small directories smaller block sizes are faster for create and unlink operations.&lt;br /&gt;
&lt;br /&gt;
For empty directories, operations on 4k block size directories consume roughly 50% less CPU than 64k block size directories. The 4k block size directories consume less CPU out to roughly 1.5 million entries, where the two are roughly equal. At directory sizes of 10 million entries, 64k directory block operations consume about 15% of the CPU that 4k directory block operations consume.&lt;br /&gt;
&lt;br /&gt;
In terms of lookups, the 64k block directory will take less IO but consume more CPU for a given lookup. Hence it depends on your IO latency and whether directory readahead can hide that latency as to which will be faster. e.g. For SSDs, CPU usage might be the limiting factor, not the IO. Right now I don&#039;t have any numbers on what the difference might be - I&#039;m getting 1 billion inode population issues worked out first before I start on measuring cold cache lookup times on 1 billion files....&lt;br /&gt;
&lt;br /&gt;
== Q: I want to tune my XFS filesystems for &amp;lt;something&amp;gt; ==&lt;br /&gt;
&lt;br /&gt;
The standard answer you will get to this question is this: use the defaults.&lt;br /&gt;
&lt;br /&gt;
There are few workloads where using non-default mkfs.xfs or mount options makes much sense. In general, the default values already used are optimised for best performance in the first place. mkfs.xfs will detect the difference between single disk and MD/DM RAID setups and change the default values it uses to configure the filesystem appropriately.&lt;br /&gt;
&lt;br /&gt;
There are a lot of &amp;quot;XFS tuning guides&amp;quot; that Google will find for you - most are old, out of date and full of misleading or just plain incorrect information. Don&#039;t expect that tuning your filesystem for optimal bonnie++ numbers will mean your workload will go faster. You should only consider changing the defaults if either: a) you know from experience that your workload causes XFS a specific problem that can be worked around via a configuration change, or b) your workload is demonstrating bad performance when using the default configurations. In this case, you need to understand why your application is causing bad performance before you start tweaking XFS configurations.&lt;br /&gt;
&lt;br /&gt;
In most cases, the only thing you need to consider for &amp;lt;tt&amp;gt;mkfs.xfs&amp;lt;/tt&amp;gt; is specifying the stripe unit and width for hardware RAID devices. For mount options, the only things that will change metadata performance considerably are the &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; mount options. Increasing &amp;lt;tt&amp;gt;logbsize&amp;lt;/tt&amp;gt; reduces the number of journal IOs for a given workload, and &amp;lt;tt&amp;gt;delaylog&amp;lt;/tt&amp;gt; will reduce them even further. The trade-off for this increase in metadata performance is that more operations may be &amp;quot;missing&amp;quot; after recovery if the system crashes while actively making modifications.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Which factors influence the memory usage of xfs_repair? ==&lt;br /&gt;
&lt;br /&gt;
This is best explained with an example. The example filesystem is 16TB, but essentially empty (note the low icount).&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -n -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 64, imem = 0, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2096.&lt;br /&gt;
  #&lt;br /&gt;
&lt;br /&gt;
xfs_repair is saying it needs at least 2096MB of RAM to repair the filesystem,&lt;br /&gt;
of which 2,097,152KB (2048MB) is needed for tracking free space.&lt;br /&gt;
(The -m 1 argument told xfs_repair to use only 1MB of memory.)&lt;br /&gt;
&lt;br /&gt;
Now we add some inodes (50 million) to the filesystem (note the larger icount), and the result is:&lt;br /&gt;
&lt;br /&gt;
  # xfs_repair -vv -m 1 /dev/vda&lt;br /&gt;
  Phase 1 - find and verify superblock...&lt;br /&gt;
          - max_mem = 1024, icount = 50401792, imem = 196882, dblock = 4294967296, dmem = 2097152&lt;br /&gt;
  Required memory for repair is greater that the maximum specified&lt;br /&gt;
  with the -m option. Please increase it to at least 2289.&lt;br /&gt;
&lt;br /&gt;
That is, it now needs roughly another 200MB of RAM to run.&lt;br /&gt;
&lt;br /&gt;
The numbers reported by xfs_repair are the absolute minimum required and approximate at that;&lt;br /&gt;
more RAM than this may be required to complete successfully.&lt;br /&gt;
Also, if you only give xfs_repair the minimum required RAM, it will be slow;&lt;br /&gt;
for best repair performance, the more RAM you can give it the better.&lt;br /&gt;
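The arithmetic behind these figures can be checked by hand: dmem and imem are reported in KiB, so converting them to MiB shows where the minimum comes from (the small remainder is fixed overhead):&lt;br /&gt;

```shell
# dmem and imem as reported by xfs_repair in the examples above, in KiB.
dmem_kib=2097152
imem_kib=196882

# Free-space tracking accounts for 2048 of the 2096 MiB minimum...
echo $(( dmem_kib / 1024 ))   # 2048
# ...and inode tracking for roughly the 193 MiB jump from 2096 to 2289.
echo $(( imem_kib / 1024 ))   # 192
```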
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Q: Why do some files on my filesystem show up as &amp;quot;?????????? ? ?      ?          ?                ? filename&amp;quot;? ==&lt;br /&gt;
&lt;br /&gt;
If ls -l shows a listing like:&lt;br /&gt;
&lt;br /&gt;
  # ?????????? ? ?      ?          ?                ? file1&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file2&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file3&lt;br /&gt;
    ?????????? ? ?      ?          ?                ? file4&lt;br /&gt;
&lt;br /&gt;
and errors like:&lt;br /&gt;
  # ls /pathtodir/&lt;br /&gt;
    ls: cannot access /pathtodir/file1: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file2: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file3: Invalid argument&lt;br /&gt;
    ls: cannot access /pathtodir/file4: Invalid argument&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
or even:&lt;br /&gt;
  # failed to stat /pathtodir/file1&lt;br /&gt;
&lt;br /&gt;
It is very likely that your filesystem needs to be mounted with the inode64 option. Remounting it:&lt;br /&gt;
  # mount -oremount,inode64 /dev/diskpart /mnt/xfs&lt;br /&gt;
&lt;br /&gt;
should make it work again.&lt;br /&gt;
If it does, add the inode64 option to fstab.&lt;br /&gt;
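One way to confirm this diagnosis is to look for inode numbers that do not fit in 32 bits, since those are what break 32-bit stat() calls when the filesystem is not mounted with inode64. A sketch (the paths are hypothetical):&lt;br /&gt;

```shell
# On a real system you would feed this from the filesystem itself, e.g.:
#   find /mnt/xfs -printf '%i %p\n' | awk '$1 > 4294967295 {print $2}'
# Here we use two synthetic "inode path" lines instead: one inode that
# fits in 32 bits (4294967295 = 2^32 - 1) and one that does not.
printf '123456 /mnt/xfs/small\n17179869201 /mnt/xfs/big\n' |
    awk '$1 > 4294967295 {print $2}'
# prints only: /mnt/xfs/big
```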
&lt;br /&gt;
== Q: The xfs_db &amp;quot;frag&amp;quot; command says I&#039;m over 50%.  Is that bad? ==&lt;br /&gt;
&lt;br /&gt;
It depends.  It&#039;s important to know how the value is calculated.  xfs_db looks at the extents in all files, and returns:&lt;br /&gt;
&lt;br /&gt;
  (actual extents - ideal extents) / actual extents&lt;br /&gt;
&lt;br /&gt;
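As a quick sketch of that formula in shell arithmetic (integer division, so the percentage rounds down; the extent counts are illustrative):&lt;br /&gt;

```shell
# frag factor = (actual extents - ideal extents) / actual extents
actual=4
ideal=1
echo $(( (actual - ideal) * 100 / actual ))%   # 75%
```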
This means that if, for example, you have an average of 2 extents per file, you&#039;ll get an answer of 50%.  4 extents per file would give you 75%.  This may or may not be a problem, especially depending on the size of the files in question.  (For example, a 400GB file in four 100GB extents would hardly be considered badly fragmented.)  The xfs_bmap command can be useful for displaying the actual fragmentation/layout of individual files.&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Companies&amp;diff=2412</id>
		<title>XFS Companies</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Companies&amp;diff=2412"/>
		<updated>2012-02-03T19:00:32Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= These are companies that either use XFS or have a product that utilizes XFS. =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Info gathered from: [http://oss.sgi.com/projects/xfs/users.html XFS Users] on [http://oss.sgi.com/ oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
== [http://www.dell.com/ Dell&#039;s HPC NFS Storage Solution] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Dell NFS Storage Solution (NSS) is a unique new storage solution providing cost-effective NFS storage as an appliance. Designed to scale from 20 TB installations up to 80 TB of usable space, the NSS is delivered as a fully configured, ready-to-go storage solution and is available with full hardware and software support from Dell. ... XFS was chosen for the NSS because XFS is capable of scaling beyond 16 TB and provides good performance for a broad range of applications.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
More information is available in [http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-NSS-NFS-Storage-solution-final.pdf this solution guide].&lt;br /&gt;
&lt;br /&gt;
== [http://www.kernel.org/ The Linux Kernel Archives] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;A bit more than a year ago (as of October 2008) kernel.org, in an ever increasing need to squeeze more performance out of it&#039;s machines, made the leap of migrating the primary mirror machines (mirrors.kernel.org) to XFS.  We site a number of reasons including fscking 5.5T of disk is long and painful, we were hitting various cache issues, and we were seeking better performance out of our file system.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;After initial tests looked positive we made the jump, and have been quite happy with the results.  With an instant increase in performance and throughput, as well as the worst xfs_check we&#039;ve ever seen taking 10 minutes, we were quite happy.  Subsequently we&#039;ve moved all primary mirroring file-systems to XFS, including www.kernel.org , and mirrors.kernel.org.  With an average constant movement of about 400mbps around the world, and with peaks into the 3.1gbps range serving thousands of users simultaneously it&#039;s been a file system that has taken the brunt we can throw at it and held up spectacularly.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.sdss.org/ The Sloan Digital Sky Survey] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Sloan Digital Sky Survey is an ambitious effort to map one-quarter of the sky at optical and very-near infrared wavelengths and take spectra of 1 million extra-galactic objects. The estimated amount of data that will be acquired over the 5 year lifespan of the project is 15TB, however, the total amount of storage space required for object informational databases, corrected frames, and reduced spectra will be several factors more than this. The goal is to have all the data online and available to the collaborators at all times. To accomplish this goal we are using commodity, off the shelf (COTS) Intel servers with EIDE disks configured as RAID50 arrays using XFS. Currently, 14 machines are in production accounting for over 18TB. By the scheduled end of the survey in 2005, 50TB of XFS disks will be online serving SDSS data to collaborators and the public.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;For complete details and status of the project please see [http://www.sdss.org/ http://www.sdss.org]. For details of the storage systems, see the [http://home.fnal.gov/~yocum/storageServerTechnicalNote.html SDSS Storage Server Technical Note].&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www-d0.fnal.gov/  The DØ Experiment at Fermilab] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At the DØ experiment at the Fermi National Accelerator Laboratory we have a ~150 node cluster of desktop machines all using the SGI-patched kernel. Every large disk (&amp;amp;gt;40Gb) or disk array in the cluster uses XFS including 4x640Gb disk servers and several 60-120Gb disks/arrays. Originally we chose reiserfs as our journaling filesystem, however, this was a disaster. We need to export these disks via NFS and this seemed perpetually broken in 2.4 series kernels. We switched to XFS and have been very happy. The only inconvenience is that it is not included in the standard kernel. The SGI guys are very prompt in their support of new kernels, but it is still an extra step which should not be necessary.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.ciprico.com/pDiMeda.shtml  Ciprico DiMeda NAS Solutions] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Ciprico DiMeda line of Network Attached Storage solutions combine the ease of connectivity of NAS with the SAN like performance levels required for digital media applications. The DiMeda 3600 provides high availability and high performance through dual NAS servers and redundant, scalable Fibre Channel RAID storage. The DiMeda 1700 provides high performance files services at a low price by using the latest Serial ATA RAID technology. All DiMeda systems are based on Linux and use XFS as the filesystem. We tested a number of filesystem alternatives and XFS was chosen because it provided the highest performance in digital media applications and the journaling feature ensures rapid failover in our dual node fault tolerant configurations.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.quantum.com/Products/NAS+Servers/Guardian+14000/Default.htm  The Quantum Guardian™ 14000] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Quantum Guardian™ 14000, the latest Network Attached Storage (NAS) solution from Quantum, delivers 1.4TB of enterprise-class storage for less than $25,000. The Guardian 14000 is a Linux-based device which utilizes XFS to provide a highly reliable journaling filesystem with simultaneous support for Windows, UNIX, Linux and Macintosh environments. As dedicated appliance optimized for fast, reliable file sharing, the Guardian 14000 combines the simplicity of NAS with a robust feature set designed for the most demanding enterprise environments. Support for tools such as Active Directory Service (ADS), UNIX Network Information Service (NIS) and Simple Network Management Protocol (SNMP) provides ease of management and seamless integration. Hardware redundancy, Snapshots and StorageCare™ on-site service ensure security for business-critical data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.bigstorage.com/products_approach_overview.html  BigStorage K2~NAS] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At BigStorage we pride ourselves on tailoring our NAS systems to meet our customer&#039;s needs, with the help of XFS we are able to provide them with the most reliable Journaling Filesystem technology available. Our open systems approach, which allows for cross-platform integration, gives our customers the flexibility to grow with their data requirements. In addition, BigStorage offers a variety of other features including total hardware redundancy, snapshotting, replication and backups directly from the unit. All of our products include BigStorage&#039;s 24/7 LiveResponse™ support. With LiveResponse™, we keep our team of experienced technical experts on call 24 hours a day, every day, to ensure that your storage investment remains online, all the time.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.echostar.com  Echostar DishPVR 721] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Echostar uses the XFS filesystem for its latest generation of satellite receivers, the DP721. Echostar chose XFS for its performance, stability and unique set of features.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS allowed us to meet a demanding requirement of recording two mpeg2 streams to the internal hard drive while simultaneously viewing a third pre-recorded stream. In addition, XFS allowed us to withstand unexpected power loss without filesystem corruption or user interaction.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We tested several other filesystems, but XFS emerged as the clear winner.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.sun.com/hardware/serverappliances/raq550/  Sun Cobalt RaQ™ 550] ==&lt;br /&gt;
&lt;br /&gt;
From the [http://www.sun.com/hardware/serverappliances/raq550/features.html features] page:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS is a journaling file system capable of quick fail over recovery after unexpected interruptions. XFS is an important feature for mission-critical applications as it ensures data integrity and dramatically reduces startup time by avoiding FSCK delay.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://pingu.salk.edu/  Center for Cytometry and Molecular Imaging at the Salk Institute] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I run the Center for Cytometry and Molecular Imaging at the Salk Institute in La Jolla, CA. We&#039;re a core facility for the Institute, offering flow cytometry, basic and deconvolution microscopy, phosphorimaging (radioactivity imaging) and fluorescent imaging.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;m currently in the process of migrating our data server to Linux/XFS. Our web server currently uses Linux/XFS. We have about 60 Gb on the data server which has a 100Gb SCSI RAID 5 array. This is a bit restrictive for our microscopists so in order that they can put more data online, I&#039;m adding another machine, also running Linux/XFS, with about 420 Gb of IDE-RAID5, based on Adaptec controllers....&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Servers are configured with quota and run Samba, NFS, and Netatalk for connectivity to the mixed bag of computers we have around here. I use the CVS XFS tree most of the time. I have not seen any problems in the several months I have been testing.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://coltex.nl/ Coltex Retail Group BV] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Coltex Retail group BV in the Netherlands uses Red Hat Linux with XFS for their main database server which collects the data from over 240 clothing retail stores throughout the Netherlands. Coltex depends on the availability of the server for over 100 hundred employees in the main office for retrieval of logistical and sales figures. The database size is roughly 10GB large containing both historical and current data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The entire production and logistical system depends on the availability of the system and downtime would mean a significant financial penalty. The speed and reliability of the XFS filesystem which has a proven track record and mature tools to go with it is fundamental to the availability of the system.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS has saved us a lot of time during testing and implementation. A long filesystems check is no longer needed when bad things happen when they do. The increased speed of our database system which is based on Progress 9.1C is also a nice benefit to this filesystem.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.dkp.com/ DKP Effects] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;re a 3D computer graphics/post-production house. We&#039;ve currently got four fileservers using XFS under Linux online - three 350GB servers and one 800GB server. The servers are under fairly heavy load - network load to and from the dual NICs on the box is basically maxed out 18 hours a day - and we do have occasional lockups and drive failures. Thanks to Linux SW RAID5 and XFS, though, we haven&#039;t had any data loss, or significant down time.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.epigenomics.com/ Epigenomics] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We currently have several IDE-to-SCSI-RAID systems with XFS in production. The largest has a capacity of 1.5TB, the other 2 have 430GB each.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Data stored on these filesystems is on the one hand &amp;quot;normal&amp;quot; home directories and corporate documents and on the other hand scientific data for our laboratory and IT department.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.incyte.com/ Incyte Genomics] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;m currently in the process of slowly converting 21 clusters totaling 2300+ processors over to XFS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;These machines are running a fairly stock RH7.1+XFS. The application is our own custom scheduler for doing genomic research. We have one of the worlds largest sequencing labs which generates a tremendous amount of raw data. Vast amounts of CPU cycles must be applied to it to turn it into useful data we can then sell access to.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Currently, a minority of these machines are running XFS, but as I can get downtime on the clusters I am upgrading them to 7.1+XFS. When I&#039;m done, it&#039;ll be about 10TB of XFS goodness... across 9G disks mostly.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.monmouth.edu/ Monmouth University] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve replaced our NetApp filer (80GB, $40,000). NetApp ONTAP software [runs on NetApp filers] is basically an NFS and CIFS server with their own proprietary filesystem. We were quickly running out of space and our annual budget almost depleted. What were we to do?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;With an off-the-shelf Dell 4400 series server and 300GB of disks ($8,000 total). We were able to run Linux and Samba to emulate a NetApp filer.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS allowed us to manage 300GB of data with absolutely no downtime (now going on 79 days) since implementation. Gone are the days of fearing the fsck of 300GB.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.astro.wisc.edu  The University of Wisconsin Astronomy Department] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At the University of Wisconsin Astronomy Department we have been using Linux XFS since the first release. We currently have 31 Linux boxes running XFS on all filesystems with about 2.6 TB of disk space on these machines. We use XFS primarily on our data reduction systems, but we also use it on our web server and on one of the remote observing machines at the WIYN 3.5m Telescope at Kitt Peak (http://www.noao.edu/wiyn/wiyn.html).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We will likely be using Linux XFS at least in part on the GLIMPSE program (http://www.astro.wisc.edu/sirtf/) which will likely require several TB of disk space to process the data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.amoa.org/ The Austin Museum of Art] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Austin Museum of Art has two file servers running RedHat 7.2_XFS upgraded from RedHat 7.1_XFS. Our webserver runs Domino on top of RedHat 7.3_XFS and we&#039;re getting about 70% better performance than the Domino server running on Windows 2000 Server. We&#039;re moving our workstations away from Windows and Microsoft Office to an LTSP server running on RedHat 7.3_XFS.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve become solely dependent on XFS for all of our data systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.tecmath.com/ tecmath AG] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We use a production server with a 270 GB RAID 5 (hardware) disk array. It is based on a Suse 7.2 distribution, but with a standard 2.4.12 kernel with XFS and LVM patches. The server provides NFS to 8 Unix clients as well as Samba to about 80 PCs. The machine also runs Bind 9, Apache, Exim, DHCP, POP3, MySQL. I have tried out different configurations with ReiserFS, but I didn&#039;t manage to find a stable configuration with respect to NFS. Since I converted all disks to XFS some 3 months ago, we never had any filesystem-related problems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.theiqgroup.com/ The IQ Group] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Here at the IQ Group, Inc. we use XFS for all our production and development servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Our OS of choice is Slackware Linux 8.0. Our hardware of choice is Dell and VALinux servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;As for applications, we run the standard Unix/Linux apps like Sendmail, Apache, BIND, DHCP, iptables, etc.; as well as Oracle 9i and Arkeia.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve been running XFS across the board for about 3 months now without a hitch (so far).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Size-wise, our biggest server is about 40 GB, but that will be increasing substantially in the near future.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Our production servers are collocated so a journaled FS was a must. Reboots are quick and no human interaction is required like with a bad fsck on ext2. Additionally, our database servers gain additional integrity and robustness.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We originally chose XFS over ReiserFS and ext3 because of it&#039;s age (it&#039;s been in production on SGI boxes for probably longer than all the other journaling FS&#039;s combined) and it&#039;s speed appeared comparable as well.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.artsit.usyd.edu.au  Arts IT Unit, Sydney University] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;ve got XFS on a &#039;production&#039; file server. The machine could have up to 500 people logged in, but typically less than 200. Most are Mac users, connected via NetAtalk for &#039;personal files&#039;, although there are shared areas for admin units. Probably about 30-40 windows users. (Samba) It&#039;s the file server for an Academic faculty at a University.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Hardware RAID, via Mylex dual channel controller with 4 drives, Intel Tupelo MB, Intel &#039;SC5000&#039; server chassis with redundant power and hot-swap scsi bays. The system boots off a non RAID single 9gb UW-scsi drive.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Only system &#039;crash&#039; was caused by some one accidentally unplugging it, just before we put it into production. It was back in full operation within 5 minutes. Without journaling, the fsck would have taken well over an hour. In day to day use it has run well.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://structbio.vanderbilt.edu/comp/  Vanderbilt University Center for Structural Biology] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I run a high-performance computing center for Structural Biology research at Vanderbilt University. We use XFS extensively, and have been since the late prerelease versions. I&#039;ve had nothing but good experiences with it.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We began using XFS in our search for a good solution for our RAID fileservers. We had such good experiences with it on these systems that we&#039;ve begun putting it on the root/usr/var partitions of every Linux system we run here. I even have it on my laptop these days. XFS in combination with the 2.4 NFS3 implementation performs very well for us, and we have great uptimes on these systems (Our 750GB ArenaII setup is at 143 days right now).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;All told, we&#039;ve got about 1.2TB of XFS filesystems spinning right now. It&#039;s spread out across maybe a dozen or so filesystems and will continue to increase as we are growing fast and that&#039;s all we use now. Next up is putting it on our 17-node Linux cluster, which will bring that up to 1.5TB spread across 30 filesystems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I, for one, would LOVE to see XFS make it into the kernel tree. From my perspectives, it&#039;s one of the best things to happen to Linux in the 7 years I&#039;ve been using/administering it.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== 2008 Update ====&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve since moved our main home directories to a proprietary NAS, but continue to use XFS on 10TB of LVM storage for doing backup-to-disk from the same NAS&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www-cdf.fnal.gov/  CDF Experiment at Fermi National Lab] ==&lt;br /&gt;
&lt;br /&gt;
CDF, an elementary particle physics experiment at Fermi National Lab, is using XFS for all our cache disks.&lt;br /&gt;
&lt;br /&gt;
The usage model is that we have a PB tape archive (2 STK silos) as permanent storage. In front of this archive we are deploying a roughly 100TB disk cache system. The cache is made up of 50 2TB file server based on cheap commodity hardware (3ware based hardware raid using IDE drives). The data is then processed by a cluster of 300 Dual CPU Linux PC&#039;s. The cache software is dCache, a DESY/FNAL product.&lt;br /&gt;
&lt;br /&gt;
The whole system is used by more than 300 active users from all over the world for batch processing for their physics data analysis.&lt;br /&gt;
&lt;br /&gt;
== [http://www.get2chip.com  Get2Chip, Inc.] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are using XFS on 3 production file servers with approximately 1.5T of data. Quite impressive especially when we had a power outage and all three servers shutdown. All servers came back up in minutes with no problems! We are looking at creating two more servers that would manage 2+ TB of data store.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.lando.co.za  Lando International Group Technologies] ==&lt;br /&gt;
&lt;br /&gt;
Lando International Group Technologies is the home of:&lt;br /&gt;
&lt;br /&gt;
* [http://www.lando.co.za Lando Technologies Africa (Pty) Ltd] - Internet Service Provider&lt;br /&gt;
* [http://www.lbsd.net Linux Based Systems Design] (Article 21). Not-For-Profit company established to provide free Linux distributions and programs.&lt;br /&gt;
* Cell Park South Africa (Pty) Ltd. RSA Pat Appln 2001/10406. Collecting parking fees by means of cell phone SMS or voice.&lt;br /&gt;
* Read Plus Education (Pty) Ltd. Software based reading skills training and testing for ages 4 to 100.&lt;br /&gt;
* Mobivan. Mobile office including Internet access, fax, copying, printing, telephone, collection and delivery services, legal services, pre-paid phone and electricity services, bill payment email, secretarial services, training facilities and management services.&lt;br /&gt;
* Lando International Marketing Agency. Direct marketing services, design and supply of promotional material, consulting, sourcing of capital and other funding.&lt;br /&gt;
* Illico. Software development and systems analysis on most platforms.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Throughout these companies, we use the XFS filesystem with [http://idms.lbsd.net IDMS Linux] on high-end Intel servers, with an average of 100 GB storage each. XFS stores our customer and user data, including credit card details, mail, routing tables, etc.. We have not had one problem since the release of the first XFS patch.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.fcb-wilkens.com  Foote, Cone, &amp;amp;amp; Belding] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are an advertisement company in Germany, and the use of the XFS filesystem is a story of success for us. In our Hamburg office, we have two file servers having a 420 Gig RAID in XFS format serving (almost) all our data to about 180 Macs and about 30 PCs using Samba and Netatalk. Some of the data is used in our offices in Frankfurt and Berlin, and in fact the Berlin office is just getting it&#039;s own 250 Gig fileserver (using XFS) right now.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The general success with XFS has led us to switch over all our Linux servers to run on XFS as well (with the exception of two systems that are tied to tight specifications configuration wise). XFS, even the old 1.0 version, has happily taken on various abuse - broken SCSI controllers, broken RAID systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.moving-picture.co.uk/  Moving Picture Company] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We here at MPC use XFS/RedHat 7.2 on all of our graphics-workstations and file-servers. More info can be found in an [http://www.linuxuser.co.uk/articles/issue20/lu20-Linux_at_work-In_the_picture.pdf  article] LinuxUser magazine did on us recently.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.coremetrics.com/  Coremetrics, Inc.] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are currently using XFS for 25+ production web-servers, ~900GB Oracle db servers, with potentially 15+ more servers by mid 2003, with ~900GB+ databases. All XFS installed.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Also, our dev environment, except for the Sun boxes which all are being migrated to X86 in the aforementioned server additions, plus the dev Sun boxes as well, are all x86 dual proc servers running Oracle, application servers, or web services as needed. All servers run XFS from images we&#039;ve got on our SystemImager servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;All production back-end servers are connected via FC1 or FC2 to a SAN containing ~13TB of raw storage, which, will soon be converted from VxFS to XFS with the migration of Oracle to our x86 platforms.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://evolt.org Evolt.org] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;evolt.org, a world community for web developers promoting the mutual free exchange of ideas, skills and experiences, has had a great deal of success using XFS. Our primary webserver which serves 100K hosts/month, primary Oracle database with ~25Gb of data, and free member hosting for 1000 users haven&#039;t had a minute of downtime since XFS has been installed. Performance has been spectacular and maintenance a breeze.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;font size=&amp;quot;-1&amp;quot;&amp;gt; &#039;&#039;All testimonials on this page represent the views of the submitters, and references to other products and companies should not be construed as an endorsement by either the organizations profiled, or by SGI. All trademarks (r) their respective owners.&#039;&#039; &amp;lt;/font&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Companies&amp;diff=2411</id>
		<title>XFS Companies</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Companies&amp;diff=2411"/>
		<updated>2012-02-03T18:59:02Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== These are companies that either use XFS or have a product that utilizes XFS. ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Info gathered from: [http://oss.sgi.com/projects/xfs/users.html XFS Users] on [http://oss.sgi.com/ oss.sgi.com]&lt;br /&gt;
&lt;br /&gt;
== [http://www.dell.com/ Dell&#039;s HPC NFS Storage Solution] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Dell NFS Storage Solution (NSS) is a unique new storage solution providing cost-effective NFS storage as an appliance. Designed to scale from 20 TB installations up to 80 TB of usable space, the NSS is delivered as a fully configured, ready-to-go storage solution and is available with full hardware and software support from Dell.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
More information is available in [http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-NSS-NFS-Storage-solution-final.pdf this solution guide].&lt;br /&gt;
&lt;br /&gt;
== [http://www.kernel.org/ The Linux Kernel Archives] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;A bit more than a year ago (as of October 2008) kernel.org, in an ever increasing need to squeeze more performance out of it&#039;s machines, made the leap of migrating the primary mirror machines (mirrors.kernel.org) to XFS.  We site a number of reasons including fscking 5.5T of disk is long and painful, we were hitting various cache issues, and we were seeking better performance out of our file system.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;After initial tests looked positive we made the jump, and have been quite happy with the results.  With an instant increase in performance and throughput, as well as the worst xfs_check we&#039;ve ever seen taking 10 minutes, we were quite happy.  Subsequently we&#039;ve moved all primary mirroring file-systems to XFS, including www.kernel.org , and mirrors.kernel.org.  With an average constant movement of about 400mbps around the world, and with peaks into the 3.1gbps range serving thousands of users simultaneously it&#039;s been a file system that has taken the brunt we can throw at it and held up spectacularly.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.sdss.org/ The Sloan Digital Sky Survey] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Sloan Digital Sky Survey is an ambitious effort to map one-quarter of the sky at optical and very-near infrared wavelengths and take spectra of 1 million extra-galactic objects. The estimated amount of data that will be acquired over the 5 year lifespan of the project is 15TB, however, the total amount of storage space required for object informational databases, corrected frames, and reduced spectra will be several factors more than this. The goal is to have all the data online and available to the collaborators at all times. To accomplish this goal we are using commodity, off the shelf (COTS) Intel servers with EIDE disks configured as RAID50 arrays using XFS. Currently, 14 machines are in production accounting for over 18TB. By the scheduled end of the survey in 2005, 50TB of XFS disks will be online serving SDSS data to collaborators and the public.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;For complete details and status of the project please see [http://www.sdss.org/ http://www.sdss.org]. For details of the storage systems, see the [http://home.fnal.gov/~yocum/storageServerTechnicalNote.html SDSS Storage Server Technical Note].&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www-d0.fnal.gov/  The DØ Experiment at Fermilab] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At the DØ experiment at the Fermi National Accelerator Laboratory we have a ~150 node cluster of desktop machines all using the SGI-patched kernel. Every large disk (&amp;amp;gt;40Gb) or disk array in the cluster uses XFS including 4x640Gb disk servers and several 60-120Gb disks/arrays. Originally we chose reiserfs as our journaling filesystem, however, this was a disaster. We need to export these disks via NFS and this seemed perpetually broken in 2.4 series kernels. We switched to XFS and have been very happy. The only inconvenience is that it is not included in the standard kernel. The SGI guys are very prompt in their support of new kernels, but it is still an extra step which should not be necessary.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.ciprico.com/pDiMeda.shtml  Ciprico DiMeda NAS Solutions] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Ciprico DiMeda line of Network Attached Storage solutions combine the ease of connectivity of NAS with the SAN like performance levels required for digital media applications. The DiMeda 3600 provides high availability and high performance through dual NAS servers and redundant, scalable Fibre Channel RAID storage. The DiMeda 1700 provides high performance files services at a low price by using the latest Serial ATA RAID technology. All DiMeda systems are based on Linux and use XFS as the filesystem. We tested a number of filesystem alternatives and XFS was chosen because it provided the highest performance in digital media applications and the journaling feature ensures rapid failover in our dual node fault tolerant configurations.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.quantum.com/Products/NAS+Servers/Guardian+14000/Default.htm  The Quantum Guardian™ 14000] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Quantum Guardian™ 14000, the latest Network Attached Storage (NAS) solution from Quantum, delivers 1.4TB of enterprise-class storage for less than $25,000. The Guardian 14000 is a Linux-based device which utilizes XFS to provide a highly reliable journaling filesystem with simultaneous support for Windows, UNIX, Linux and Macintosh environments. As a dedicated appliance optimized for fast, reliable file sharing, the Guardian 14000 combines the simplicity of NAS with a robust feature set designed for the most demanding enterprise environments. Support for tools such as Active Directory Service (ADS), UNIX Network Information Service (NIS) and Simple Network Management Protocol (SNMP) provides ease of management and seamless integration. Hardware redundancy, Snapshots and StorageCare™ on-site service ensure security for business-critical data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.bigstorage.com/products_approach_overview.html  BigStorage K2~NAS] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At BigStorage we pride ourselves on tailoring our NAS systems to meet our customer&#039;s needs, with the help of XFS we are able to provide them with the most reliable Journaling Filesystem technology available. Our open systems approach, which allows for cross-platform integration, gives our customers the flexibility to grow with their data requirements. In addition, BigStorage offers a variety of other features including total hardware redundancy, snapshotting, replication and backups directly from the unit. All of our products include BigStorage&#039;s 24/7 LiveResponse™ support. With LiveResponse™, we keep our team of experienced technical experts on call 24 hours a day, every day, to ensure that your storage investment remains online, all the time.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.echostar.com  Echostar DishPVR 721] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Echostar uses the XFS filesystem for its latest generation of satellite receivers, the DP721. Echostar chose XFS for its performance, stability and unique set of features.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS allowed us to meet a demanding requirement of recording two mpeg2 streams to the internal hard drive while simultaneously viewing a third pre-recorded stream. In addition, XFS allowed us to withstand unexpected power loss without filesystem corruption or user interaction.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We tested several other filesystems, but XFS emerged as the clear winner.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.sun.com/hardware/serverappliances/raq550/  Sun Cobalt RaQ™ 550] ==&lt;br /&gt;
&lt;br /&gt;
From the [http://www.sun.com/hardware/serverappliances/raq550/features.html features] page:&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS is a journaling file system capable of quick fail over recovery after unexpected interruptions. XFS is an important feature for mission-critical applications as it ensures data integrity and dramatically reduces startup time by avoiding FSCK delay.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://pingu.salk.edu/  Center for Cytometry and Molecular Imaging at the Salk Institute] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I run the Center for Cytometry and Molecular Imaging at the Salk Institute in La Jolla, CA. We&#039;re a core facility for the Institute, offering flow cytometry, basic and deconvolution microscopy, phosphorimaging (radioactivity imaging) and fluorescent imaging.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;m currently in the process of migrating our data server to Linux/XFS. Our web server currently uses Linux/XFS. We have about 60 Gb on the data server which has a 100Gb SCSI RAID 5 array. This is a bit restrictive for our microscopists so in order that they can put more data online, I&#039;m adding another machine, also running Linux/XFS, with about 420 Gb of IDE-RAID5, based on Adaptec controllers....&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Servers are configured with quota and run Samba, NFS, and Netatalk for connectivity to the mixed bag of computers we have around here. I use the CVS XFS tree most of the time. I have not seen any problems in the several months I have been testing.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://coltex.nl/ Coltex Retail Group BV] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Coltex Retail Group BV in the Netherlands uses Red Hat Linux with XFS for their main database server which collects the data from over 240 clothing retail stores throughout the Netherlands. Coltex depends on the availability of the server for over 100 employees in the main office for retrieval of logistical and sales figures. The database size is roughly 10GB, containing both historical and current data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The entire production and logistical system depends on the availability of the system and downtime would mean a significant financial penalty. The speed and reliability of the XFS filesystem which has a proven track record and mature tools to go with it is fundamental to the availability of the system.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS has saved us a lot of time during testing and implementation. A long filesystems check is no longer needed when bad things happen when they do. The increased speed of our database system which is based on Progress 9.1C is also a nice benefit to this filesystem.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.dkp.com/ DKP Effects] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;re a 3D computer graphics/post-production house. We&#039;ve currently got four fileservers using XFS under Linux online - three 350GB servers and one 800GB server. The servers are under fairly heavy load - network load to and from the dual NICs on the box is basically maxed out 18 hours a day - and we do have occasional lockups and drive failures. Thanks to Linux SW RAID5 and XFS, though, we haven&#039;t had any data loss, or significant down time.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.epigenomics.com/ Epigenomics] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We currently have several IDE-to-SCSI-RAID systems with XFS in production. The largest has a capacity of 1.5TB, the other 2 have 430GB each.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Data stored on these filesystems is on the one hand &amp;quot;normal&amp;quot; home directories and corporate documents and on the other hand scientific data for our laboratory and IT department.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.incyte.com/ Incyte Genomics] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;m currently in the process of slowly converting 21 clusters totaling 2300+ processors over to XFS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;These machines are running a fairly stock RH7.1+XFS. The application is our own custom scheduler for doing genomic research. We have one of the worlds largest sequencing labs which generates a tremendous amount of raw data. Vast amounts of CPU cycles must be applied to it to turn it into useful data we can then sell access to.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Currently, a minority of these machines are running XFS, but as I can get downtime on the clusters I am upgrading them to 7.1+XFS. When I&#039;m done, it&#039;ll be about 10TB of XFS goodness... across 9G disks mostly.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.monmouth.edu/ Monmouth University] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve replaced our NetApp filer (80GB, $40,000). NetApp ONTAP software [runs on NetApp filers] is basically an NFS and CIFS server with their own proprietary filesystem. We were quickly running out of space and our annual budget almost depleted. What were we to do?&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;With an off-the-shelf Dell 4400 series server and 300GB of disks ($8,000 total), we were able to run Linux and Samba to emulate a NetApp filer.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;XFS allowed us to manage 300GB of data with absolutely no downtime (now going on 79 days) since implementation. Gone are the days of fearing the fsck of 300GB.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.astro.wisc.edu  The University of Wisconsin Astronomy Department] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;At the University of Wisconsin Astronomy Department we have been using Linux XFS since the first release. We currently have 31 Linux boxes running XFS on all filesystems with about 2.6 TB of disk space on these machines. We use XFS primarily on our data reduction systems, but we also use it on our web server and on one of the remote observing machines at the WIYN 3.5m Telescope at Kitt Peak (http://www.noao.edu/wiyn/wiyn.html).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We will likely be using Linux XFS at least in part on the GLIMPSE program (http://www.astro.wisc.edu/sirtf/) which will likely require several TB of disk space to process the data.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.amoa.org/ The Austin Museum of Art] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The Austin Museum of Art has two file servers running RedHat 7.2_XFS upgraded from RedHat 7.1_XFS. Our webserver runs Domino on top of RedHat 7.3_XFS and we&#039;re getting about 70% better performance than the Domino server running on Windows 2000 Server. We&#039;re moving our workstations away from Windows and Microsoft Office to an LTSP server running on RedHat 7.3_XFS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve become solely dependent on XFS for all of our data systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.tecmath.com/ tecmath AG] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We use a production server with a 270 GB RAID 5 (hardware) disk array. It is based on a Suse 7.2 distribution, but with a standard 2.4.12 kernel with XFS and LVM patches. The server provides NFS to 8 Unix clients as well as Samba to about 80 PCs. The machine also runs Bind 9, Apache, Exim, DHCP, POP3, MySQL. I have tried out different configurations with ReiserFS, but I didn&#039;t manage to find a stable configuration with respect to NFS. Since I converted all disks to XFS some 3 months ago, we never had any filesystem-related problems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.theiqgroup.com/ The IQ Group] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Here at the IQ Group, Inc. we use XFS for all our production and development servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Our OS of choice is Slackware Linux 8.0. Our hardware of choice is Dell and VALinux servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;As for applications, we run the standard Unix/Linux apps like Sendmail, Apache, BIND, DHCP, iptables, etc.; as well as Oracle 9i and Arkeia.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve been running XFS across the board for about 3 months now without a hitch (so far).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Size-wise, our biggest server is about 40 GB, but that will be increasing substantially in the near future.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Our production servers are collocated so a journaled FS was a must. Reboots are quick and no human interaction is required like with a bad fsck on ext2. Additionally, our database servers gain additional integrity and robustness.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We originally chose XFS over ReiserFS and ext3 because of its age (it&#039;s been in production on SGI boxes for probably longer than all the other journaling FS&#039;s combined) and its speed appeared comparable as well.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.artsit.usyd.edu.au  Arts IT Unit, Sydney University] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I&#039;ve got XFS on a &#039;production&#039; file server. The machine could have up to 500 people logged in, but typically less than 200. Most are Mac users, connected via NetAtalk for &#039;personal files&#039;, although there are shared areas for admin units. Probably about 30-40 windows users. (Samba) It&#039;s the file server for an Academic faculty at a University.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Hardware RAID, via Mylex dual channel controller with 4 drives, Intel Tupelo MB, Intel &#039;SC5000&#039; server chassis with redundant power and hot-swap scsi bays. The system boots off a non RAID single 9gb UW-scsi drive.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Only system &#039;crash&#039; was caused by someone accidentally unplugging it, just before we put it into production. It was back in full operation within 5 minutes. Without journaling, the fsck would have taken well over an hour. In day to day use it has run well.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://structbio.vanderbilt.edu/comp/  Vanderbilt University Center for Structural Biology] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I run a high-performance computing center for Structural Biology research at Vanderbilt University. We use XFS extensively, and have been since the late prerelease versions. I&#039;ve had nothing but good experiences with it.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We began using XFS in our search for a good solution for our RAID fileservers. We had such good experiences with it on these systems that we&#039;ve begun putting it on the root/usr/var partitions of every Linux system we run here. I even have it on my laptop these days. XFS in combination with the 2.4 NFS3 implementation performs very well for us, and we have great uptimes on these systems (Our 750GB ArenaII setup is at 143 days right now).&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;All told, we&#039;ve got about 1.2TB of XFS filesystems spinning right now. It&#039;s spread out across maybe a dozen or so filesystems and will continue to increase as we are growing fast and that&#039;s all we use now. Next up is putting it on our 17-node Linux cluster, which will bring that up to 1.5TB spread across 30 filesystems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;I, for one, would LOVE to see XFS make it into the kernel tree. From my perspectives, it&#039;s one of the best things to happen to Linux in the 7 years I&#039;ve been using/administering it.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
==== 2008 Update ====&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We&#039;ve since moved our main home directories to a proprietary NAS, but continue to use XFS on 10TB of LVM storage for doing backup-to-disk from the same NAS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www-cdf.fnal.gov/  CDF Experiment at Fermi National Lab] ==&lt;br /&gt;
&lt;br /&gt;
CDF, an elementary particle physics experiment at Fermi National Lab, is using XFS for all our cache disks.&lt;br /&gt;
&lt;br /&gt;
The usage model is that we have a PB tape archive (2 STK silos) as permanent storage. In front of this archive we are deploying a roughly 100TB disk cache system. The cache is made up of 50 2TB file servers based on cheap commodity hardware (3ware-based hardware RAID using IDE drives). The data is then processed by a cluster of 300 dual-CPU Linux PCs. The cache software is dCache, a DESY/FNAL product.&lt;br /&gt;
&lt;br /&gt;
The whole system is used by more than 300 active users from all over the world for batch processing for their physics data analysis.&lt;br /&gt;
&lt;br /&gt;
== [http://www.get2chip.com  Get2Chip, Inc.] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are using XFS on 3 production file servers with approximately 1.5T of data. Quite impressive especially when we had a power outage and all three servers shutdown. All servers came back up in minutes with no problems! We are looking at creating two more servers that would manage 2+ TB of data store.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.lando.co.za  Lando International Group Technologies] ==&lt;br /&gt;
&lt;br /&gt;
Lando International Group Technologies is the home of:&lt;br /&gt;
&lt;br /&gt;
* [http://www.lando.co.za Lando Technologies Africa (Pty) Ltd] - Internet Service Provider&lt;br /&gt;
* [http://www.lbsd.net Linux Based Systems Design] (Article 21). Not-for-profit company established to provide free Linux distributions and programs.&lt;br /&gt;
* Cell Park South Africa (Pty) Ltd. RSA Pat Appln 2001/10406. Collecting parking fees by means of cell phone SMS or voice.&lt;br /&gt;
* Read Plus Education (Pty) Ltd. Software based reading skills training and testing for ages 4 to 100.&lt;br /&gt;
* Mobivan. Mobile office including Internet access, fax, copying, printing, telephone, collection and delivery services, legal services, pre-paid phone and electricity services, bill payment email, secretarial services, training facilities and management services.&lt;br /&gt;
* Lando International Marketing Agency. Direct marketing services, design and supply of promotional material, consulting, sourcing of capital and other funding.&lt;br /&gt;
* Illico. Software development and systems analysis on most platforms.&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Throughout these companies, we use the XFS filesystem with [http://idms.lbsd.net IDMS Linux] on high-end Intel servers, with an average of 100 GB storage each. XFS stores our customer and user data, including credit card details, mail, routing tables, etc.. We have not had one problem since the release of the first XFS patch.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.fcb-wilkens.com  Foote, Cone, &amp;amp;amp; Belding] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are an advertisement company in Germany, and the use of the XFS filesystem is a story of success for us. In our Hamburg office, we have two file servers having a 420 Gig RAID in XFS format serving (almost) all our data to about 180 Macs and about 30 PCs using Samba and Netatalk. Some of the data is used in our offices in Frankfurt and Berlin, and in fact the Berlin office is just getting its own 250 Gig fileserver (using XFS) right now.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;The general success with XFS has led us to switch over all our Linux servers to run on XFS as well (with the exception of two systems that are tied to tight specifications configuration wise). XFS, even the old 1.0 version, has happily taken on various abuse - broken SCSI controllers, broken RAID systems.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.moving-picture.co.uk/  Moving Picture Company] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We here at MPC use XFS/RedHat 7.2 on all of our graphics-workstations and file-servers. More info can be found in an [http://www.linuxuser.co.uk/articles/issue20/lu20-Linux_at_work-In_the_picture.pdf  article] LinuxUser magazine did on us recently.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://www.coremetrics.com/  Coremetrics, Inc.] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;We are currently using XFS for 25+ production web-servers, ~900GB Oracle db servers, with potentially 15+ more servers by mid 2003, with ~900GB+ databases. All XFS installed.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;Also, our dev environment, except for the Sun boxes which all are being migrated to X86 in the aforementioned server additions, plus the dev Sun boxes as well, are all x86 dual proc servers running Oracle, application servers, or web services as needed. All servers run XFS from images we&#039;ve got on our SystemImager servers.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;quot;All production back-end servers are connected via FC1 or FC2 to a SAN containing ~13TB of raw storage, which, will soon be converted from VxFS to XFS with the migration of Oracle to our x86 platforms.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
== [http://evolt.org Evolt.org] ==&lt;br /&gt;
&lt;br /&gt;
&amp;quot;evolt.org, a world community for web developers promoting the mutual free exchange of ideas, skills and experiences, has had a great deal of success using XFS. Our primary webserver which serves 100K hosts/month, primary Oracle database with ~25Gb of data, and free member hosting for 1000 users haven&#039;t had a minute of downtime since XFS has been installed. Performance has been spectacular and maintenance a breeze.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;font size=&amp;quot;-1&amp;quot;&amp;gt; &#039;&#039;All testimonials on this page represent the views of the submitters, and references to other products and companies should not be construed as an endorsement by either the organizations profiled, or by SGI. All trademarks (r) their respective owners.&#039;&#039; &amp;lt;/font&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2098</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2098"/>
		<updated>2010-08-23T15:23:04Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Primary XFS Documentation ===&lt;br /&gt;
&lt;br /&gt;
The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format.  The material is suitable for experienced users as well as developers and support staff.  The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-xml-dev.git;a=summary git repository] and builds of the documentation are available here:&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]&lt;br /&gt;
&lt;br /&gt;
* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)])&lt;br /&gt;
&lt;br /&gt;
The format of &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt; also has been documented:&lt;br /&gt;
* [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
=== Papers, Presentations, Etc ===&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Fixing XFS Filesystems Faster&#039;&#039; [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September, 1999, SGI had a few presentations including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who gives a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2097</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2097"/>
		<updated>2010-08-23T15:01:46Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Primary XFS Documentation ===&lt;br /&gt;
&lt;br /&gt;
The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format.  The material is suitable for experienced users as well as developers and support staff.  The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-dev.git;a=summary git repository] and builds of the documentation are available here:&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]&lt;br /&gt;
&lt;br /&gt;
* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)])&lt;br /&gt;
&lt;br /&gt;
The format of &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt; has also been documented:&lt;br /&gt;
* [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
=== Papers, Presentations, Etc ===&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Fixing XFS Filesystems Faster&#039;&#039; [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September 1999, SGI had a few presentations, including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who gives a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2096</id>
		<title>XFS Papers and Documentation</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=XFS_Papers_and_Documentation&amp;diff=2096"/>
		<updated>2010-08-23T15:00:39Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The XFS documentation started by SGI has been converted to docbook/[https://fedorahosted.org/publican/ Publican] format.  The material is suitable for experienced users as well as developers and support staff.  The XML source is available in a [http://git.kernel.org/?p=fs/xfs/xfsdocs-dev.git;a=summary git repository] and builds of the documentation are available here:&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide//tmp/en-US/html/index.html XFS User Guide]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html XFS File System Structure]&lt;br /&gt;
&lt;br /&gt;
* [http://xfs.org/docs/xfsdocs-xml-dev/XFS_Labs/tmp/en-US/html/index.html XFS Training Labs]&lt;br /&gt;
&lt;br /&gt;
* (Original versions of this material are still available at [http://oss.sgi.com/projects/xfs/training/index.html XFS Overview and Internals (html)] and [http://oss.sgi.com/projects/xfs/papers/xfs_filesystem_structure.pdf XFS Filesystem Structure (pdf)])&lt;br /&gt;
&lt;br /&gt;
The format of &amp;lt;tt&amp;gt;/proc/fs/xfs/stat&amp;lt;/tt&amp;gt; has also been documented:&lt;br /&gt;
* [[Runtime_Stats|Runtime_Stats]]&lt;br /&gt;
&lt;br /&gt;
The October 2009 issue of the USENIX ;login: magazine published an article about XFS targeted at system administrators:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: The big storage file system for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/hellwig.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium (July 2006), Dave Chinner presented a paper on filesystem scalability in Linux 2.6 kernels:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;High Bandwidth Filesystems on Large Systems&#039;&#039; (July 2006) [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf paper]] [[http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-presentation.pdf presentation]]&lt;br /&gt;
&lt;br /&gt;
At linux.conf.au 2008 Dave Chinner gave a presentation about xfs_repair that he co-authored with Barry Naujok:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Fixing XFS Filesystems Faster&#039;&#039; [[http://mirror.linux.org.au/pub/linux.conf.au/2008/slides/135-fixing_xfs_faster.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
In July 2006, SGI storage marketing updated the XFS datasheet:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Open Source XFS for Linux&#039;&#039; [[http://oss.sgi.com/projects/xfs/datasheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At UKUUG 2003, Christoph Hellwig presented a talk on XFS:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS for Linux&#039;&#039; (July 2003) [[http://oss.sgi.com/projects/xfs/papers/ukuug2003.pdf pdf]] [[http://verein.lst.de/~hch/talks/ukuug2003/ html]]&lt;br /&gt;
&lt;br /&gt;
Originally published in Proceedings of the FREENIX Track: 2002 Usenix Annual Technical Conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Filesystem Performance and Scalability in Linux 2.4.17&#039;&#039; (June 2002) [[http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the Ottawa Linux Symposium, an updated presentation on porting XFS to Linux was given:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting XFS to Linux&#039;&#039; (July 2000) [[http://oss.sgi.com/projects/xfs/papers/ols2000/ols-xfs.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the Atlanta Linux Showcase, SGI presented the following paper on the port of XFS to Linux:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Porting the SGI XFS File System to Linux&#039;&#039; (October 1999) [[http://oss.sgi.com/projects/xfs/papers/als/als.ps ps]] [[http://oss.sgi.com/projects/xfs/papers/als/als.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
At the 6th Linux Kongress &amp;amp; the Linux Storage Management Workshop (LSMW) in Germany in September 1999, SGI had a few presentations, including the following:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;SGI&#039;s port of XFS to Linux&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/linux_kongress/index.htm html]]&lt;br /&gt;
* &#039;&#039;Overview of DMF&#039;&#039; (September 1999) [[http://oss.sgi.com/projects/xfs/papers/DMF-over/index.htm html]]&lt;br /&gt;
&lt;br /&gt;
At the LinuxWorld Conference &amp;amp; Expo in August 1999, SGI published:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An Open Source XFS data sheet&#039;&#039; (August 1999) [[http://oss.sgi.com/projects/xfs/papers/xfs_GPL.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
From the 1996 USENIX conference:&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;An XFS white paper&#039;&#039; [[http://oss.sgi.com/projects/xfs/papers/xfs_usenix/index.html html]]&lt;br /&gt;
&lt;br /&gt;
=== Other historical articles, press-releases, etc ===&lt;br /&gt;
&lt;br /&gt;
* IBM&#039;s &#039;&#039;Advanced Filesystem Implementor&#039;s Guide&#039;&#039; has a chapter &#039;&#039;Introducing XFS&#039;&#039; [[http://www-106.ibm.com/developerworks/library/l-fs9.html html]]&lt;br /&gt;
&lt;br /&gt;
* An editorial titled &#039;&#039;Tired of fscking? Try a journaling filesystem!&#039;&#039;, Freshmeat (February 2001) [[http://freshmeat.net/articles/view/212/ html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Who gives a fsck about filesystems&#039;&#039; provides an overview of the Linux 2.4 filesystems [[http://www.linuxuser.co.uk/articles/issue6/lu6-All_you_need_to_know_about-Filesystems.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;Journal File Systems&#039;&#039; in issue 55 of &#039;&#039;Linux Gazette&#039;&#039; provides a comparison of journaled filesystems.&lt;br /&gt;
&lt;br /&gt;
* The original XFS beta release announcement was published in &#039;&#039;Linux Today&#039;&#039; (September 2000) [[http://linuxtoday.com/news_story.php3?ltsn=2000-09-26-017-04-OS-SW html]]&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;XFS: It&#039;s worth the wait&#039;&#039; was published on &#039;&#039;EarthWeb&#039;&#039; (July 2000) [[http://networking.earthweb.com/netos/oslin/article/0,,12284_623661,00.html html]]&lt;br /&gt;
&lt;br /&gt;
* An &#039;&#039;IRIX-XFS data sheet&#039;&#039; (July 1999) [[http://oss.sgi.com/projects/xfs/papers/IRIX_xfs_data_sheet.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* The &#039;&#039;Getting Started with XFS&#039;&#039; book (1994) [[http://oss.sgi.com/projects/xfs/papers/getting_started_with_xfs.pdf pdf]]&lt;br /&gt;
&lt;br /&gt;
* Original &#039;&#039;XFS design documents&#039;&#039; (1993) ([http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_ps/ ps], [http://oss.sgi.com/projects/xfs/design_docs/xfsdocs93_pdf/ pdf])&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Shrinking_Support&amp;diff=1919</id>
		<title>Shrinking Support</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Shrinking_Support&amp;diff=1919"/>
		<updated>2008-12-21T18:36:17Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: minor cleanup&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Currently, XFS filesystems can&#039;t be shrunk.&lt;br /&gt;
&lt;br /&gt;
To support shrinking XFS filesystems, a few things need to be implemented, based on a list by Dave Chinner [http://marc.info/?l=linux-xfs&amp;amp;m=118091640624488&amp;amp;w=2]:&lt;br /&gt;
&lt;br /&gt;
* A way to check that enough free space is available for the shrink&lt;br /&gt;
&lt;br /&gt;
* An ioctl or similar interface to prevent new allocations from a given allocation group.&lt;br /&gt;
&lt;br /&gt;
* A variant of the xfs_reno tool to support moving inodes out of filesystem areas that go away.&lt;br /&gt;
&lt;br /&gt;
* A variant of the xfs_fsr tool to support moving data out of the filesystem areas that go away.&lt;br /&gt;
&lt;br /&gt;
* Some way to move orphan metadata out of the AGs being truncated off&lt;br /&gt;
&lt;br /&gt;
* A transaction to shrink the filesystem.&lt;br /&gt;
&lt;br /&gt;
At that point, we&#039;ll have a &amp;quot;working&amp;quot; shrink that will allow&lt;br /&gt;
shrinking to only 50% of the original size because the log &lt;br /&gt;
(in the middle of the filesystem) will&lt;br /&gt;
get in the way.  To fix that, we&#039;ll need to implement transactions&lt;br /&gt;
to move the log...&lt;br /&gt;
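The first requirement above, checking that the data already present would fit inside the target size, can be sketched as follows. This is a minimal illustration using the portable statvfs interface, not the actual script from Ruben Porras linked below; a real XFS shrink check would also need to inspect per-AG free space (e.g. via xfs_db), which this sketch does not attempt.

```python
# Minimal sketch: decide whether a mounted filesystem could plausibly
# shrink to a target size, i.e. whether the space to be cut off is free.
# NOTE: this only compares aggregate used space against the target; a
# real XFS shrink check must examine per-AG free space as well.
import os

def can_shrink_to(mountpoint, target_bytes):
    st = os.statvfs(mountpoint)
    total = st.f_blocks * st.f_frsize  # current filesystem size in bytes
    free = st.f_bfree * st.f_frsize    # currently unallocated bytes
    used = total - free
    # The shrink is only conceivable if all data already present fits
    # inside the target, and the target is no larger than the filesystem.
    return target_bytes >= used and total >= target_bytes
```

For example, shrinking to the current total size is trivially possible, while any target smaller than the space already in use is rejected.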
&lt;br /&gt;
&lt;br /&gt;
== Available pieces ==&lt;br /&gt;
&lt;br /&gt;
* A script from Ruben Porras to check if enough free space is available to support shrinking [[http://marc.info/?l=linux-xfs&amp;amp;m=118581682117599&amp;amp;w=2]]&lt;br /&gt;
&lt;br /&gt;
* A patch from Ruben Porras to allow / disallow allocation from an allocation group [http://marc.info/?l=linux-xfs&amp;amp;m=118302806818420&amp;amp;w=2] plus userspace support for setting / clearing it [http://marc.info/?l=linux-xfs&amp;amp;m=118881137031101&amp;amp;w=2]&lt;br /&gt;
&lt;br /&gt;
* The xfs_fsr tool in xfsprogs&lt;br /&gt;
&lt;br /&gt;
* The xfs_reno tool; see [[Unfinished_work#The_xfs_reno_tool]]&lt;br /&gt;
&lt;br /&gt;
* An untested patch from Dave Chinner for an xfs_swap_inodes ioctl that allows not just defragmenting extents but also moving whole inodes [http://marc.info/?l=linux-xfs&amp;amp;m=119552278931942&amp;amp;w=2], and a patch from Ruben Porras to make xfs_reno use it [http://marc.info/?l=linux-xfs&amp;amp;m=119582841808985&amp;amp;w=2]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Shrinking_Support&amp;diff=1917</id>
		<title>Shrinking Support</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Shrinking_Support&amp;diff=1917"/>
		<updated>2008-12-21T18:34:34Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: remove dup bullet point&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Currently, XFS filesystems can&#039;t be shrunk.&lt;br /&gt;
&lt;br /&gt;
To support shrinking XFS filesystems, a few things need to be implemented, based on a list by Dave Chinner [http://marc.info/?l=linux-xfs&amp;amp;m=118091640624488&amp;amp;w=2]:&lt;br /&gt;
&lt;br /&gt;
* A way to check that enough free space is available for the shrink&lt;br /&gt;
&lt;br /&gt;
* An ioctl or similar interface to prevent new allocations from a given allocation group.&lt;br /&gt;
&lt;br /&gt;
* A variant of the xfs_reno tool to support moving inodes out of filesystem areas that go away.&lt;br /&gt;
&lt;br /&gt;
* A variant of the xfs_fsr tool to support moving data out of the filesystem areas that go away.&lt;br /&gt;
&lt;br /&gt;
* Some way to move orphan metadata out of the AGs being truncated off&lt;br /&gt;
&lt;br /&gt;
* A transaction to shrink the filesystem.&lt;br /&gt;
&lt;br /&gt;
At that point, we&#039;ll have a &amp;quot;working&amp;quot; shrink that will allow&lt;br /&gt;
shrinking to only 50% of the original size because the log will&lt;br /&gt;
get in the way. To fix that, we&#039;ll need to implement transactions&lt;br /&gt;
to move the log...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Available pieces ==&lt;br /&gt;
&lt;br /&gt;
* A script from Ruben Porras to check if enough free space is available to support shrinking [[http://marc.info/?l=linux-xfs&amp;amp;m=118581682117599&amp;amp;w=2]]&lt;br /&gt;
&lt;br /&gt;
* A patch from Ruben Porras to allow / disallow allocation from an allocation group [http://marc.info/?l=linux-xfs&amp;amp;m=118302806818420&amp;amp;w=2] plus userspace support for setting / clearing it [http://marc.info/?l=linux-xfs&amp;amp;m=118881137031101&amp;amp;w=2]&lt;br /&gt;
&lt;br /&gt;
* The xfs_fsr tool in xfsprogs&lt;br /&gt;
&lt;br /&gt;
* The xfs_reno tool; see [[Unfinished_work#The_xfs_reno_tool]]&lt;br /&gt;
&lt;br /&gt;
* An untested patch from Dave Chinner for an xfs_swap_inodes ioctl that allows not just defragmenting extents but also moving whole inodes [http://marc.info/?l=linux-xfs&amp;amp;m=119552278931942&amp;amp;w=2], and a patch from Ruben Porras to make xfs_reno use it [http://marc.info/?l=linux-xfs&amp;amp;m=119582841808985&amp;amp;w=2]&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
	<entry>
		<id>https://xfs.org/index.php?title=Xfs.org:About&amp;diff=1902</id>
		<title>Xfs.org:About</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Xfs.org:About&amp;diff=1902"/>
		<updated>2008-12-21T17:39:06Z</updated>

		<summary type="html">&lt;p&gt;Sandeen: Removing all content from page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Sandeen</name></author>
	</entry>
</feed>