= Future Directions - Reliability =

From http://oss.sgi.com/archives/xfs/2008-09/msg00802.html

== Reliable Detection and Repair of Metadata Corruption ==




This can be broken down into specific phases. Firstly, we cannot repair a
corruption we have not detected. Hence the first thing we need to do is
reliable detection of errors and corruption. Once we can reliably detect errors
in structures and have verified that we are propagating all the errors reported from
lower layers into XFS correctly, we can look at ways of handling them more
robustly. In many cases, the same type of error needs to be handled differently
due to the context in which the error occurs.  This introduces extra complexity
into this problem.

Rather than continually referring to specific types of problems (such as
corruption or error handling) I'll refer to them as 'exceptions'. This avoids
thinking about specific error conditions through specific paths and so helps us
to look at the issues from a more general or abstract point of view.

== Exception Detection ==


Our current approach to exception detection is entirely reactive and rather
slapdash - we read a metadata block from disk and check certain aspects of it
(e.g. the magic number) to determine if it is the block we wanted. We have no
way of verifying that it is the correct block of metadata of the type
we were trying to read; just that it is one of that specific type. We
do bounds checking on critical fields, but this can't detect bit errors
in those fields. There are many fields we don't even bother to check because
the range of valid values is not limited.

Effectively, this can be broken down into three separate areas:

	- ensuring what we've read is exactly what we wrote
	- ensuring what we've read is the block we were supposed to read
	- robust contents checking

Firstly, if we introduce a mechanism that we can use to ensure what we read is
something that the filesystem wrote, we can detect a whole range of exceptions
that are caused in layers below the filesystem (software and hardware). The
best method for this is to use a guard value that travels with the metadata it
is guarding. The guard value needs to be derived from the contents of the
block being guarded. Any event that changes the guard or the contents it is
guarding will immediately trigger an exception handling process when the
metadata is read in. Some examples of what this will detect are:

	- bit errors in media/busses/memory after guard is calculated
	- uninitialised blocks being returned from lower layers (dmcrypt
	  had a readahead cancelling bug that could do this)
	- zeroed sectors as a result of double sector failures
	  in RAID5 systems
	- overwrite by data blocks
	- partial overwrites (e.g. due to power failure)

The simplest method for doing this is introducing a checksum or CRC into each
block. We can calculate this for each different type of metadata being written
just before they are written to disk, hence we are able to provide a guard that
travels all the way to and from disk with the metadata itself. Given that
metadata blocks can be a maximum of 64k in size, we don't need a hugely complex
CRC or number of bits to protect blocks of this size. A 32 bit CRC will allow
us to reliably detect 15 bit errors on a 64k block, so this would catch almost
all types of bit error exceptions that occur. It will also detect almost all
other types of major content change that might occur due to an exception.
It has been noted that we should select the guard algorithm to be one that
has (or is targeted for) widespread hardware acceleration support.
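
As a rough illustration, a per-block guard check could look something like the sketch below. The header layout, field names and the crc32c() helper are assumptions for the purpose of the example, not the actual XFS on-disk format; the point is that the guard is calculated with its own field zeroed so that it travels with the block it protects.

<pre>
/*
 * Sketch of a block guard check. The header layout and the crc32c()
 * helper are assumptions for illustration, not the real on-disk format.
 */
#include <stddef.h>
#include <stdint.h>

struct meta_hdr {			/* hypothetical metadata header */
	uint32_t	magic;
	uint32_t	crc;		/* guard over the whole block */
	/* type-specific fields follow */
};

extern uint32_t crc32c(uint32_t seed, const void *buf, size_t len);	/* assumed helper */

static int
verify_block_guard(void *block, size_t len)
{
	struct meta_hdr *hdr = block;
	uint32_t ondisk = hdr->crc;
	uint32_t calc;

	hdr->crc = 0;			/* guard is calculated with the field zeroed */
	calc = crc32c(~0U, block, len);
	hdr->crc = ondisk;		/* restore the on-disk value */

	return calc == ondisk;		/* mismatch => 'bad CRC' exception */
}
</pre>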

The other advantage this provides us with is a very fast method of determining
if a corrupted btree is a result of a lower layer problem or indeed an XFS
problem. That is, instead of always getting a WANT_CORRUPTED_GOTO btree
exception and shutdown, we'll get a 'bad CRC' exception before we even start
processing the contents. This will save us much time when triaging corrupt
btrees - we won't spend time chasing problems that result from (potentially
silent or unhandled) lower layer exceptions.

While a metadata block guard will protect us against content change, it won't
protect us against blocks that are written to the wrong location on disk. This,
unfortunately, happens more often than anyone would like and can be very
difficult to track down when it does occur. To protect against this problem,
metadata needs to be self-describing on disk. That is, if we read a block
on disk, there needs to be enough information in that block to determine
that it is the correct block for that location.

Currently we have a very simplistic method of determining that we really have
read the correct block - the magic numbers in each metadata structure.  This
only enables us to identify type - we still need location and filesystem to
really determine if the block we've read is the correct one. We need the
filesystem identifier because misdirected writes can cross filesystem
boundaries.  This is easily done by including the UUID of the filesystem in
every individually referenceable metadata structure on disk.

For block based metadata structures such as btrees, AG headers, etc, we
can add the block number directly to the header structures hence enabling
easy checking. e.g. for btree blocks, we already have sibling pointers in the
header, so adding a long 'self' pointer makes a great deal of sense.
For inodes, adding the inode number into the inode core will provide exactly
the same protection - we'll now know that the inode we are reading is the
one we are supposed to have read. We can make similar modifications to dquots
to make them self identifying as well.
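
To make this concrete, a self-describing block header would carry the type, guard, location and filesystem identity together, and a read-side verifier would check all three. The layout and names below are a hypothetical sketch, not the existing on-disk format:

<pre>
#include <stdint.h>
#include <string.h>

#define BTREE_MAGIC	0x41425442U	/* hypothetical magic value */

struct fs_uuid { uint8_t b[16]; };	/* 128 bit filesystem UUID */

/* hypothetical self-describing btree block header */
struct btree_block_hdr {
	uint32_t	magic;		/* identifies the structure type */
	uint32_t	crc;		/* block guard, as above */
	uint64_t	blkno;		/* 'self' pointer: where this block lives */
	uint64_t	leftsib;	/* existing sibling pointers */
	uint64_t	rightsib;
	struct fs_uuid	uuid;		/* which filesystem this block belongs to */
};

/* right type, right place, right filesystem */
static int
verify_self_describing(const struct btree_block_hdr *hdr,
		       uint64_t expected_blkno, const struct fs_uuid *fs_uuid)
{
	return hdr->magic == BTREE_MAGIC &&
	       hdr->blkno == expected_blkno &&
	       memcmp(&hdr->uuid, fs_uuid, sizeof(*fs_uuid)) == 0;
}
</pre>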

Now that we are able to verify that the metadata we read from disk is what we
wrote and that it is the correct metadata block, the only thing that remains is more
robust checking of the content. In many cases we already do this in DEBUG
code but not in runtime code. For example, when we read an inode cluster
in we only check the first inode for a matching magic number, whereas in
debug code we check every inode in the cluster.

In some cases, there is not much point in doing this sort of detailed checking;
it's pretty hard to check the validity of the contents of a btree block without
doing a full walk of the tree and that is prohibitive overhead for production
systems. The added block guards and self identifiers should be sufficient to
catch all non-filesystem based exceptions in this case, whilst the existing
exception detection should catch all others. With the btree factoring that
is being done for this work, all of the btrees should end up protected by
WANT_CORRUPTED_GOTO runtime exception checking.

We also need to verify that metadata is sane before we use it. For example, if
we pull a block number out of a btree record in a block that has passed all
other validity checks, it may still be invalid due to corruption that occurred
before the block was written to disk. In these cases we need to ensure the block
number lands within the filesystem and/or within the bounds of the specific AG.
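
A minimal sketch of that kind of bounds check, using a hypothetical geometry structure rather than the real XFS mount and AG structures, might be:

<pre>
#include <stdbool.h>
#include <stdint.h>

/* hypothetical filesystem geometry description */
struct fs_geom {
	uint64_t	agcount;	/* number of allocation groups */
	uint64_t	agblocks;	/* blocks per AG */
};

/*
 * Check that an AG number and AG-relative block number pulled out of an
 * on-disk record land inside the filesystem before they are used to
 * index anything.
 */
static bool
verify_agbno(const struct fs_geom *geo, uint64_t agno, uint64_t agbno)
{
	if (agno >= geo->agcount)	/* would index off the end of the per-AG array */
		return false;
	if (agbno >= geo->agblocks)	/* outside the AG's block range */
		return false;
	return true;
}
</pre>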

Similar checking is needed for pretty much any forward or backwards reference
we are going to follow or use in an algorithm somewhere. This will help
prevent kernel panics from out-of-bounds references (e.g. using an unchecked AG
number to index the per-AG array) by turning them into a handled exception
(which will initially be a shutdown). That is, we will turn a total system
failure into a (potentially recoverable) filesystem failure.

Another failure that is often reported is that XFS has 'hung', and
triage indicates that the filesystem appears to be waiting for a metadata
I/O completion to occur. We have seen in the past that I/O errors were not
propagated from the lower layers back into the filesystem, causing this
sort of problem. We have also seen cases where there have been silent
I/O errors and the first thing to go wrong is 'XFS has hung'.

To catch situations like this, we need to track all I/O we have in flight and
have some method of timing them out.  That is, if we haven't completed the I/O
in N seconds, issue a warning and enter an exception handling process that
attempts to deal with the problem.

My initial thought is that this could be implemented via the MRU cache
without much extra code being needed.  The complexity with this is that we
can't catch data read I/O because we use the generic I/O path for read. We do
our own data write and metadata read/write, so we can easily add hooks to track
all these types of I/O. Hence we will initially target just metadata I/O as
this would only need to hook into the xfs_buf I/O submission layer.
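
One possible shape for such tracking - a sketch only, using a plain list and timestamps rather than the MRU cache suggested above, with hypothetical warning and exception hooks - is:

<pre>
#include <time.h>

/* hypothetical record of an in-flight metadata I/O */
struct inflight_io {
	struct inflight_io	*next;
	void			*bp;		/* buffer being read or written */
	time_t			submitted;	/* when the I/O was issued */
};

#define IO_TIMEOUT_SECS	30			/* the 'N seconds' above; tunable */

extern void warn_slow_io(void *bp);		/* hypothetical warning hook */
extern void handle_io_timeout(void *bp);	/* hypothetical exception handler */

/* periodic scan: anything older than the timeout enters exception handling */
static void
scan_inflight_ios(struct inflight_io *list, time_t now)
{
	struct inflight_io *io;

	for (io = list; io != NULL; io = io->next) {
		if (now - io->submitted > IO_TIMEOUT_SECS) {
			warn_slow_io(io->bp);
			handle_io_timeout(io->bp);
		}
	}
}
</pre>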

To further improve exception detection, once guards and self-describing
structures are on disk, we can add filesystem scrubbing daemons that can verify
the structure of the filesystem pro-actively. That is, we can use background
processes to discover degradation in the filesystem before it is found by a
user-initiated operation. This gives us the ability to do exception handling in
a context that enables further checking and potential repair of the exception.
This sort of exception handling may not be possible if we are in a
user-initiated I/O context, and certainly not if we are in a transaction
context.

This will also allow us to detect errors in rarely referenced parts of
the filesystem, thereby giving us advance warning of degradation in filesystems
that we might not otherwise get (e.g. in systems without media scrubbing).
Ideally, data scrubbing would need to be done as well, but without data guards
it is rather hard to detect that there's been a change in the data....


== Exception Handling ==


Once we can detect exceptions, we need to handle them in a sane manner.
The method of exception handling is two-fold:

	- retry (write) or cancel (read) asynchronous I/O
	- shut down the filesystem (fatal).

Effectively, we either defer non-critical failures to a later point in
time or we come to a complete halt and prevent the filesystem from being
accessed further. We have no other methods of handling exceptions.

If we look at the different types of exceptions we can have, they
broadly fall into:

	- media read errors
	- media write errors
	- successful media read, corrupted contents

The context in which the errors occur also influences the exception processing
that is required. For example, an unrecoverable metadata read error within a
dirty transaction is a fatal error, whilst the same error during a read-only
operation will simply log the error to syslog and return an error to userspace.

Furthermore, the storage subsystem plays a part in deciding how to handle
errors. The reason is that in many storage configurations I/O errors can be
transient. For example, in a SAN a broken fibre can cause a failover to a
redundant path; however, the in-flight I/O on the failed path is usually timed out and
an error returned. We don't want to shut down the filesystem on such an error -
we want to wait for failover to a redundant path and then retry the I/O. If the
failover succeeds, then the I/O will succeed. Hence any robust method of
exception handling needs to consider that I/O exceptions may be transient.

In the absence of redundant metadata, there is little we can do right now
on a permanent media read error. There are a number of approaches we
can take for handling the exception (a retry escalation along these lines is
sketched in code after the list):

	- try reading the block again. Normally we don't get an error
	  returned until the device has given up on trying to recover it.
	  If it's a transient failure, then we should eventually get a
	  good block back. If a retry fails, then:

	- inform the lower layer that it needs to perform recovery on that
	  block before trying to read it again. For path failover situations,
	  this should block until a redundant path is brought online. If no
	  redundant path exists or recovery from parity/error coding blocks
	  fails, then we cannot recover the block and we have a fatal error
	  situation.
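
A hypothetical outline of that escalation, with made-up helper names standing in for the real buffer and lower-layer interfaces, could look like:

<pre>
/* all helper names here are hypothetical */
extern int submit_read(void *bp);			/* returns 0 on success */
extern int request_lower_layer_recovery(void *bp);	/* path failover, parity rebuild, ... */

enum read_status { READ_OK, READ_FATAL };

#define MAX_READ_RETRIES	2

static enum read_status
read_metadata_with_recovery(void *bp)
{
	int i;

	/* 1. try reading the block again - the failure may be transient */
	for (i = 0; i < MAX_READ_RETRIES; i++) {
		if (submit_read(bp) == 0)
			return READ_OK;
	}

	/* 2. ask the lower layer to recover the block, then retry once more */
	if (request_lower_layer_recovery(bp) == 0 && submit_read(bp) == 0)
		return READ_OK;

	/* 3. no redundant path or recovery failed: fatal for this block */
	return READ_FATAL;
}
</pre>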

Ultimately, however, we reach a point where we have to give up - the metadata
no longer exists on disk and we have to enter a repair process to fix the
problem. That is, shut down the filesystem and get a human to intervene
and fix the problem.

At this point, the only way we can prevent a shutdown situation from occurring
is to have redundant metadata on disk. That is, whenever we get an error
reported, we can immediately retry by reading from an alternate metadata block.
If we can read from the alternate block, we can continue onwards without
the user even knowing there is a bad block in the filesystem. Of course, we'd
need to log the event for the administrator to take action on at some point
in the future.

Even better, we can mostly avoid this intervention if we have alternate
metadata blocks. That is, we can repair blocks that are returning read errors
during the exception processing. In the case of media errors, they can
generally be corrected simply by re-writing the block that was returning the
error. This will force drives to remap the bad blocks internally so the next
read from that location will return valid data. This, if my understanding is
correct, is the same process that ZFS and BTRFS use to recover from and correct
such errors.

NOTE: Adding redundant metadata can be done in several different ways. I'm not
going to address that here as it is a topic all to itself. The focus of this
document is to outline how the redundant metadata could be used to enhance
exception processing and prevent a large number of cases where we currently
shut down the filesystem.

TODO:
	Transient write error
	Permanent write error
	Corrupted data on read
	Corrupted data on write (detected during guard calculation)
	I/O timeouts
	Memory corruption


== Reverse Mapping ==


It is worth noting that even redundant metadata doesn't solve all our
problems. Realistically, all that redundant metadata gives us is the ability
to recover from top-down traversal exceptions. It does not help exception
handling of occurrences such as double sector failures (i.e. loss of redundancy
and a metadata block). Double sector failures are the most common cause
of RAID5 data loss - loss of a disk followed by a sector read error during
rebuild on one of the remaining disks.

In this case, we've got a block on disk that is corrupt. We know what block it
is, but we have no idea who the owner of the block is. If it is a metadata
block, then we can recover it if we have redundant metadata.  Even if this is
user data, we still want to be able to tell them what file got corrupted by the
failure event.  However, without doing a top-down traverse of the filesystem we
cannot find the owner of the block that was corrupted.

This is where we need a reverse block map. Every time we do an allocation of
an extent we know who the owner of the block is. If we record this information
in a separate tree then we can do a simple lookup to find the owner of any
block and start an exception handling process to repair the damage. Ideally
we also need to include information about the type of block as well. For
example, an inode can own:

	- data blocks
	- data fork BMBT blocks
	- attribute blocks
	- attribute fork BMBT blocks

So keeping track of owner + type would help indicate what sort of exception
handling needs to take place. For example, a missing data fork BMBT block means there
will be unreferenced extents across the filesystem. These 'lost extents'
could be recovered by reverse map traversal to find all the BMBT and data
blocks owned by that inode and finding the ones that are not referenced.
If the reverse map held sufficient extra metadata - such as the offset within the
file for the extent - the exception handling process could rebuild the BMBT
tree completely without needing any external help.

It would seem to me that the reverse map needs to be a long-pointer format
btree and held per-AG. It needs long pointers because the owner of an extent
can be anywhere in the filesystem, and it needs to be per-AG to avoid adverse
effect on allocation parallelism.

The format of the reverse map record will be dependent on the amount of
metadata we need to store. We need (a possible record layout is sketched after
this list):

	- owner (64 bit, primary record)
	- {block, len} extent descriptor
	- type
	- per-type specific metadata (e.g. offset for data types).
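
One possible record layout carrying those fields - illustrative only, with assumed field widths and type codes - is sketched below; it comes out at 32 bytes, matching the worst case used next.

<pre>
#include <stdint.h>

/* illustrative block types an owner can have */
enum rmap_type {
	RMAP_DATA,		/* data blocks */
	RMAP_DATA_BMBT,		/* data fork BMBT blocks */
	RMAP_ATTR,		/* attribute blocks */
	RMAP_ATTR_BMBT,		/* attribute fork BMBT blocks */
};

/* hypothetical reverse map record - 32 bytes */
struct rmap_record {
	uint64_t	owner;		/* primary record: owning inode or metadata */
	uint64_t	startblock;	/* {block, len} extent descriptor */
	uint32_t	blockcount;
	uint8_t		type;		/* enum rmap_type */
	uint8_t		pad[3];
	uint64_t	offset;		/* per-type metadata, e.g. file offset for data */
};
</pre>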

Looking at the worst case here, say we have 32 bytes per record, the worst case
space usage of the reverse map btree would be roughly 62 records per 4k
block. With a 1TB allocation group, we have 2<sup>28</sup> 4k blocks in the AG
that could require unique reverse mappings. That gives us roughly 2<sup>22</sup>
4k blocks for the reverse map, or 2<sup>34</sup> bytes - roughly 16GB per 1TB
of space.
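
The same arithmetic, spelled out as a trivial program (using the 62 records per 4k block worst case from the text):

<pre>
#include <stdio.h>

int main(void)
{
	unsigned long long ag_bytes     = 1ULL << 40;		/* 1TB allocation group */
	unsigned long long blk_size     = 4096;			/* 4k blocks */
	unsigned long long ag_blocks    = ag_bytes / blk_size;	/* 2^28 mappable blocks */
	unsigned long long recs_per_blk = 62;			/* worst case from above */
	unsigned long long rmap_blocks  = ag_blocks / recs_per_blk + 1;
	unsigned long long rmap_bytes   = rmap_blocks * blk_size;

	/* prints roughly 4.3 million rmap blocks, ~2^34 bytes (~16GB) per 1TB AG */
	printf("%llu rmap blocks, %llu bytes\n", rmap_blocks, rmap_bytes);
	return 0;
}
</pre>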

It may be a good idea to allocate this space at mkfs time (tagged as unwritten
so it doesn't need zeroing) to avoid allocation overhead and potential free
space fragmentation as the reverse map index grows and shrinks. If we do
this we could even treat this as an array/skip list where a given block in the
AG has a fixed location in the map. This will require more study to determine
the advantages and disadvantages of such approaches.


== Recovering From Errors During Transactions ==


One of the big problems we face with exception recovery is what to do
when we take an exception inside a dirty transaction. At present, any
error is treated as a fatal error, the transaction is cancelled and
the filesystem is shut down. Even though we may have a context which
can return an error, we are unable to revert the changes we have
made during the transaction and so cannot back out.

Effectively, a cancelled dirty transaction looks exactly like in-memory
structure corruption. That is, what is in memory is different to that
on disk, in the log or in asynchronous transactions yet to be written
to the log. Hence we cannot simply return an error and continue.

To be able to do this, we need to be able to undo changes made in a given
transaction. The method XFS uses for journalling - write-ahead logging -
makes this difficult to do. A transaction proceeds in the following
order:

	- allocate transaction
	- reserve space in the journal for transaction
	- repeat until change is complete:
		- lock item
		- join item to transaction
		- modify item
		- record region of change to item
	- transaction commit

Effectively, we modify structures in memory then record where we
changed them for the transaction commit to write to disk. Unfortunately,
this means we overwrite the original state of the items in memory,
leaving us with no way to back out those changes from memory if
something goes wrong.

However, based on the observation that we are supposed to join an item to the
transaction *before* we start modifying it, it is possible to record the state
of the item before we start changing it. That is, we have a hook that can
allow us to take a copy of the unmodified item when we join it to the
transaction.

If we have an unmodified copy of the item in memory, then if the transaction
is cancelled when dirty, we have the information necessary to undo, or roll
back, the changes made in the transaction. This would allow us to return
the in-memory state to that prior to the transaction starting, thereby
ensuring that the in-memory state matches the rest of the filesystem and
allowing us to return an error to the calling context.
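
A minimal sketch of that approach - hypothetical structures, simply snapshotting each item's memory at join time and copying it back on a dirty cancel - might be:

<pre>
#include <stdlib.h>
#include <string.h>

/* hypothetical in-memory log item with a pre-modification snapshot */
struct log_item {
	struct log_item	*next;
	void		*object;	/* the metadata being modified */
	size_t		size;
	void		*orig_copy;	/* copy taken at join time */
};

/* take the snapshot when the item joins the transaction, before any change */
static int
trans_join_item(struct log_item *lip)
{
	lip->orig_copy = malloc(lip->size);
	if (!lip->orig_copy)
		return -1;	/* could now fail the transaction instead of shutting down */
	memcpy(lip->orig_copy, lip->object, lip->size);
	return 0;
}

/* on a dirty cancel, roll every joined item back to its unmodified state */
static void
trans_rollback(struct log_item *items)
{
	struct log_item *lip;

	for (lip = items; lip != NULL; lip = lip->next) {
		memcpy(lip->object, lip->orig_copy, lip->size);
		free(lip->orig_copy);
		lip->orig_copy = NULL;
	}
}
</pre>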

This is not without overhead. We would have to copy every metadata item
entirely in every transaction. This will increase the CPU overhead
of each transaction as well as the memory required. It is the memory
requirement more than the CPU overhead that concerns me - we may need
to ensure we have a memory pool associated with transaction reservation
that guarantees us enough memory is available to complete the transaction.
However, given that we could roll back transactions, we could now *fail
transactions* with ENOMEM and not have to shut down the filesystem, so this
may be an acceptable trade-off.

In terms of implementation, it is worth noting that there is debug code in
the xfs_buf_log_item for checking that all the modified regions of a buffer
were logged. Importantly, this is implemented by copying the original buffer
in the item initialisation when it is first attached to a transaction. In
other words, this debug code implements the mechanism we need to be able
to rollback changes made in a transaction. Other item types would require
similar changes to be made.

Overall, this doesn't look like a particularly complex change to make; the
only real question is how much overhead is it going to introduce. With CPUs
growing more cores all the time, and XFS being aimed at extremely
multi-threaded workloads, this overhead may not be a concern for long.


== Failure Domains ==


If we plan to have redundant metadata, or even try to provide fault isolation
between different parts of the filesystem namespace, we need to know about
independent regions of the filesystem. 'Independent Regions' (IR) are ranges
of the filesystem block address space that don't share resources with
any other range.

A classic example of a filesystem made up of multiple IRs is a linear
concatenation of multiple drives into a larger address space.  The address
space associated with each drive can operate independently from the other
drives, and a failure of one drive will not affect the operation of the address
spaces associated with other drives in the linear concatenation.

A Failure Domain (FD) is made up of one or more IRs. IRs cannot be shared
between FDs - IRs are not independent if they are shared! Effectively, an
FD is an encoding of the address space within the filesystem that lower level
failures (from below the filesystem) will not propagate outside. The geometry
and redundancy in the underlying storage will determine the nature of the
IRs available to the filesystem.

To use redundant metadata effectively for recovering from fatal lower layer
loss or corruption, we really need to be able to place said redundant
metadata in a different FD. That way a loss in one domain can be recovered
from a domain that is still intact. It also means that it is extremely
difficult to lose or corrupt all copies of a given piece of metadata;
that would require multiple independent faults to occur in a localised
temporal window. Concurrent multiple component failure in multiple
IRs is considered to be quite unlikely - if such an event were to
occur, it is likely that there is more to worry about than filesystem
consistency (like putting out the fire in the data center).

Another use of FDs is to try to minimise the number of domain boundaries
each object in the filesystem crosses. If an object is wholly contained
within a FD, and that object is corrupted, then the repair problem is
isolated to that FD, not the entire filesystem. That is, by making
allocation strategies and placement decisions aware of failure domain
boundaries we can constrain the location of related data and metadata.
Once locality is constrained, the scope of repairing an object if
it becomes corrupted is reduced to that of ensuring the FD is consistent.

There are many ways of limiting cross-domain dependencies; I will
not try to detail them here. Likewise, there are many ways of introducing
such information into XFS - mkfs, dynamically via allocation policies,
etc - so I won't try to detail them, either. The main point to be
made is that to make full use of redundant metadata and to reduce
the scope of common repair problems, we need to pay attention to
how the system can fail, so that we can recover from failures
as quickly as possible.

= Runtime Stats =
This page describes the information available from <tt>/proc/fs/xfs/stat</tt>.

__TOC__ 


== Overview ==
As an advanced filesystem, XFS exposes a number of internal statistics to userspace. These can be helpful for debugging, understanding I/O characteristics, and optimizing performance. The data is available in /proc/fs/xfs/stat as a dump of variable values grouped by the type of information they hold.

== output example ==

<pre>
extent_alloc 4260849 125170297 4618726 131131897
abt 29491162 337391304 11257328 11133039
blk_map 381213360 115456141 10903633 69612322 7448401 507596777 0
bmbt 771328 6236258 602114 86646
dir 21253907 6921870 6969079 779205554
trans 126946406 38184616 6342392
ig 17754368 2019571 102 15734797 0 15672217 3962470
log 129491915 3992515264 458018 153771989 127040250
push_ail 171473415 0 6896837 3324292 8069877 65884 1289485 0 22535 7337
xstrat 4140059 0
rw 1595677950 1046884251
attr 194724197 0 7 0
icluster 20772185 2488203 13909520
vnodes 62578 15959666 0 0 15897088 15897088 15897088 0
buf 2090581631 1972536890 118044776 225145 9486625 0 0 2000152616 809762
xpc 6908312903680 67735504884757 19760115252482
debug 0
</pre>
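
A minimal userspace sketch for reading this file - each line is a group name followed by space-separated counter values - might look like the following:

<pre>
#include <stdio.h>
#include <string.h>

/* print each statistics group and its raw counter values */
int main(void)
{
	FILE *fp = fopen("/proc/fs/xfs/stat", "r");
	char line[1024];

	if (!fp) {
		perror("/proc/fs/xfs/stat");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		char *save = NULL;
		char *name = strtok_r(line, " \n", &save);	/* group, e.g. "rw" */
		char *val;

		if (!name)
			continue;
		printf("%s:", name);
		while ((val = strtok_r(NULL, " \n", &save)) != NULL)
			printf(" %s", val);			/* raw counter values */
		printf("\n");
	}
	fclose(fp);
	return 0;
}
</pre>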

== Fields table ==
The numbers shown in the output example above are presented below as a table of value names. Cells with a <span style="background-color: red;">red background</span> lack a meaningful description and should be edited.
{| style="white-space:nowrap; text-align:center; border-spacing: 2px; border: 1px solid #666666; font-family: Verdana, Cursor;font-size: 10px;font-weight: bold;"
 |-style="background-color: #f0f0f0;"
 | [[#extent_alloc|extent_alloc - Extent Allocation]]
 | xs_allocx
 | xs_allocb
 | xs_freex
 | xs_freeb
 |- style="background-color: #bfbfff;"
 | [[#abt|abt - Allocation Btree]]
 | xs_abt_lookup
 | xs_abt_compare
 | xs_abt_insrec
 | xs_abt_delrec
 |-style="background-color: #f0f0f0;"
 | [[#blk_map|blk_map - Block Mapping]]
 | xs_blk_mapr
 | xs_blk_mapw
 | xs_blk_unmap
 | xs_add_exlist
 | xs_del_exlist
 | xs_look_exlist
 | xs_cmp_exlist
 |-style="background-color: #bfbfff;"
 | [[#bmbt|bmbt - Block Map Btree]]
 | xs_bmbt_lookup
 | xs_bmbt_compare
 | xs_bmbt_insrec
 | xs_bmbt_delrec
 |-style="background-color: #f0f0f0;"
 | [[#dir|dir - Directory Operations]]
 | xs_dir_lookup
 | xs_dir_create
 | xs_dir_remove
 | xs_dir_getdents
 |-style="background-color: #bfbfff;"
 | [[#trans|trans - Transactions]]
 | xs_trans_sync
 | xs_trans_async
 | xs_trans_empty
 |-style="background-color: #f0f0f0;"
 | [[#ig|ig - Inode Operations]]
 | xs_ig_attempts
 | xs_ig_found
 | xs_ig_frecycle
 | xs_ig_missed
 | xs_ig_dup
 | xs_ig_reclaims
 | xs_ig_attrchg
 |-style="background-color: #bfbfff;"
 | [[#log|log - Log Operations]]
 | xs_log_writes
 | xs_log_blocks
 | xs_log_noiclogs
 | xs_log_force
 |  <span style="background-color: red;">xs_log_force_sleep</span>
 |-style="background-color: #f0f0f0;"
 | [[#push_ail|push_ail - Tail-Pushing Stats]]
 | <span style="background-color: red;">xs_try_logspace</span>
 | <span style="background-color: red;">xs_sleep_logspace</span>
 | <span style="background-color: red;">xs_push_ail</span>
 | <span style="background-color: red;">xs_push_ail_success</span>
 | <span style="background-color: red;">xs_push_ail_pushbuf</span>
 | <span style="background-color: red;">xs_push_ail_pinned</span>
 | <span style="background-color: red;">xs_push_ail_locked</span>
 | <span style="background-color: red;">xs_push_ail_flushing</span>
 | <span style="background-color: red;">xs_push_ail_restarts</span>
 | <span style="background-color: red;">xs_push_ail_flush</span>
 |-style="background-color: #bfbfff;"
 | [[#xstrat|xstrat - IoMap Write Convert]]
 | xs_xstrat_quick
 | xs_xstrat_split
 |-style="background-color: #f0f0f0;"
 | [[#rw|rw - Read/Write Stats]]
 | xs_write_calls
 | xs_read_calls
 |-style="background-color: #bfbfff;"
 | [[#attr|attr - Attribute Operations]]
 | xs_attr_get
 | xs_attr_set
 | xs_attr_remove
 | xs_attr_list
 |-style="background-color: #f0f0f0;"
 | [[#icluster|icluster - Inode Clustering]]
 | xs_iflush_count
 | <span style="background-color: red;">xs_icluster_flushcnt</span>
 | xs_icluster_flushinode
 |-style="background-color: #bfbfff;"
 | [[#vnodes|vnodes - Vnode Statistics]]
 | <span style="background-color: red;">vn_active</span>
 | <span style="background-color: red;">vn_alloc</span>
 | <span style="background-color: red;">vn_get</span>
 | <span style="background-color: red;">vn_hold</span>
 | <span style="background-color: red;">vn_rele</span>
 | <span style="background-color: red;">vn_reclaim</span>
 | <span style="background-color: red;">vn_remove</span>
 | <span style="background-color: red;">vn_free</span>
 |-style="background-color: #f0f0f0;"
 | [[#buf|buf - Buf Statistics]]
 | <span style="background-color: red;">xb_get</span>
 | <span style="background-color: red;">xb_create</span>
 | <span style="background-color: red;">xb_get_locked</span>
 | <span style="background-color: red;">xb_get_locked_waited</span>
 | <span style="background-color: red;">xb_busy_locked</span>
 | <span style="background-color: red;">xb_miss_locked</span>
 | <span style="background-color: red;">xb_page_retries</span>
 | <span style="background-color: red;">xb_page_found</span>
 | <span style="background-color: red;">xb_get_read</span>
 |-style="background-color: #bfbfff;"
 | [[#xpc|xpc - eXtended Precision Counters]]
 | xs_xstrat_bytes
 | xs_write_bytes
 | xs_read_bytes
 |}



== Fields description: ==

=== extent_alloc - Extent Allocation ===
<span id="extent_alloc">
* xs_allocx (xfs.allocs.alloc_extent) 
** Number of file system extents allocated over all XFS filesystems. 
* xs_allocb (xfs.allocs.alloc_block)
** Number of file system blocks allocated over all XFS filesystems. 
* xs_freex (xfs.allocs.free_extent) 
** Number of file system extents freed over all XFS filesystems. 
* xs_freeb (xfs.allocs.free_block) 
** Number of file system blocks freed over all XFS filesystems. 

=== abt - Allocation Btree ===
<span id="abt">
* xs_abt_lookup (xfs.alloc_btree.lookup)
** Number of lookup operations in XFS filesystem allocation btrees.
* xs_abt_compare (xfs.alloc_btree.compare)
** Number of compares in XFS filesystem allocation btree lookups.
* xs_abt_insrec (xfs.alloc_btree.insrec)
** Number of extent records inserted into XFS filesystem allocation btrees.
* xs_abt_delrec (xfs.alloc_btree.delrec)
** Number of extent records deleted from XFS filesystem allocation btrees.

=== blk_map - Block Mapping ===
<span id="blk_map">
* xs_blk_mapr (xfs.block_map.read_ops)
** Number of block map for read operations performed on XFS files.
* xs_blk_mapw (xfs.block_map.write_ops)
** Number of block map for write operations performed on XFS files.
* xs_blk_unmap (xfs.block_map.unmap)
** Number of block unmap (delete) operations performed on XFS files.
* xs_add_exlist (xfs.block_map.add_exlist)
** Number of extent list insertion operations for XFS files.
* xs_del_exlist (xfs.block_map.del_exlist)
** Number of extent list deletion operations for XFS files.
* xs_look_exlist (xfs.block_map.look_exlist)
** Number of extent list lookup operations for XFS files.
* xs_cmp_exlist (xfs.block_map.cmp_exlist)
** Number of extent list comparisons in XFS extent list lookups.

=== bmbt - Block Map Btree ===
<span id="bmbt">
* xs_bmbt_lookup (xfs.bmap_btree.lookup)
** Number of block map btree lookup operations on XFS files.
* xs_bmbt_compare (xfs.bmap_btree.compare)
** Number of block map btree compare operations in XFS block map lookups.
* xs_bmbt_insrec (xfs.bmap_btree.insrec)
** Number of block map btree records inserted for XFS files.
* xs_bmbt_delrec (xfs.bmap_btree.delrec)
** Number of block map btree records deleted for XFS files.

=== dir - Directory Operations ===
<span id="dir">
* xs_dir_lookup (xfs.dir_ops.lookup)
** This is a count of the number of file name directory lookups in XFS filesystems. It counts only those lookups which miss in the operating system's directory name lookup cache and must search the real directory structure for the name in question.  The count is incremented once for each level of a pathname search that results in a directory lookup.
* xs_dir_create (xfs.dir_ops.create)
** This is the number of times a new directory entry was created in XFS filesystems. Each time that a new file, directory, link, symbolic link, or special file is created in the directory hierarchy the count is incremented.
* xs_dir_remove (xfs.dir_ops.remove)
** This is the number of times an existing directory entry was removed in XFS filesystems. Each time that a file, directory, link, symbolic link, or special file is removed from the directory hierarchy the count is incremented.
* xs_dir_getdents (xfs.dir_ops.getdents)
** This is the number of times the XFS directory getdents operation was performed. The getdents operation is used by programs to read the contents of directories in a file system independent fashion.  This count corresponds exactly to the number of times the getdents(2) system call was successfully used on an XFS directory.

=== trans - Transactions ===
<span id="trans">
* xs_trans_sync (xfs.transactions.sync)
** This is the number of meta-data transactions which waited to be committed to the on-disk log before allowing the process performing the transaction to continue. These transactions are slower and more expensive than asynchronous transactions, because they force the in memory log buffers to be forced to disk more often and they wait for the completion of the log buffer writes. Synchronous transactions include file truncations and all directory updates when the file system is mounted with the 'wsync' option.
* xs_trans_async (xfs.transactions.async)
** This is the number of meta-data transactions which did not wait to be committed to the on-disk log before allowing the process performing the transaction to continue. These transactions are faster and more efficient than synchronous transactions, because they commit their data to the in memory log buffers without forcing those buffers to be written to disk. This allows multiple asynchronous transactions to be committed to disk in a single log buffer write. Most transactions used in XFS file systems are asynchronous.
* xs_trans_empty (xfs.transactions.empty)
** This is the number of meta-data transactions which did not actually change anything. These are transactions which were started for some purpose, but in the end it turned out that no change was necessary.

=== ig - Inode Operations ===
<span id="ig">
* xs_ig_attempts (xfs.inode_ops.ig_attempts)
** This is the number of times the operating system looked for an XFS inode in the inode cache. Whether the inode was found in the cache or needed to be read in from the disk is not indicated here, but this can be computed from the ig_found and ig_missed counts.
* xs_ig_found (xfs.inode_ops.ig_found)
** This is the number of times the operating system looked for an XFS inode in the inode cache and found it. The closer this count is to the ig_attempts count the better the inode cache is performing.
* xs_ig_frecycle (xfs.inode_ops.ig_frecycle)
** This is the number of times the operating system looked for an XFS inode in the inode cache and saw that it was there but was unable to use the in memory inode because it was being recycled by another process.
* xs_ig_missed (xfs.inode_ops.ig_missed)
** This is the number of times the operating system looked for an XFS inode in the inode cache and the inode was not there. The further this count is from the ig_attempts count the better.
* xs_ig_dup (xfs.inode_ops.ig_dup)
** This is the number of times the operating system looked for an XFS inode in the inode cache and found that it was not there but upon attempting to add the inode to the cache found that another process had already inserted it.
* xs_ig_reclaims (xfs.inode_ops.ig_reclaims)
** This is the number of times the operating system recycled an XFS inode from the inode cache in order to use the memory for that inode for another purpose. Inodes are recycled in order to keep the inode cache from growing without bound. If the reclaim rate is high it may be beneficial to raise the vnode_free_ratio kernel tunable variable to increase the size of the inode cache.
* xs_ig_attrchg (xfs.inode_ops.ig_attrchg)
** This is the number of times the operating system explicitly changed the attributes of an XFS inode. For example, this could be to change the inode's owner, the inode's size, or the inode's timestamps.

=== log - Log Operations ===
<span id="log">
* xs_log_writes (xfs.log.writes)
** This variable counts the number of log buffer writes going to the physical log partitions of all XFS filesystems. Log data traffic is proportional to the level of meta-data updating. Log buffer writes get generated when they fill up or external syncs occur.
* xs_log_blocks (xfs.log.blocks)
** This variable counts (in 512-byte units) the information being written to the physical log partitions of all XFS filesystems. Log data traffic is proportional to the level of meta-data updating. The rate with which log data gets written depends on the size of internal log buffers and disk write speed. Therefore, filesystems with very high meta-data updating may need to stripe the log partition or put the log partition on a separate drive.
* xs_log_noiclogs (xfs.log.noiclogs)
** This variable keeps track of times when a logged transaction can not get any log buffer space. When this occurs, all of the internal log buffers are busy flushing their data to the physical on-disk log.
* xs_log_force (xfs.log.force)
** The number of times the in-core log is forced to disk.  It is equivalent to the number of successful calls to the function xfs_log_force().
* xs_log_force_sleep (xfs.log.force_sleep)
** Value exported from the xs_log_force_sleep field of struct xfsstats.

=== push_ail - Tail-Pushing Stats ===
<span id="push_ail">
* xs_try_logspace (xfs.log_tail.try_logspace)
** Value from the xs_try_logspace field of struct xfsstats. 
* xs_sleep_logspace (xfs.log_tail.sleep_logspace)
** Value from the xs_sleep_logspace field of struct xfsstats.
* xs_push_ail (xfs.log_tail.push_ail.pushes)
** The number of times the tail of the AIL is moved forward.  It is equivalent to the number of successful calls to the function xfs_trans_push_ail(). 
* xs_push_ail_success (xfs.log_tail.push_ail.success)
** Value from xs_push_ail_success field of struct xfsstats.
* xs_push_ail_pushbuf (xfs.log_tail.push_ail.pushbuf)
** Value from xs_push_ail_pushbuf field of struct xfsstats.
* xs_push_ail_pinned (xfs.log_tail.push_ail.pinned)
** Value from xs_push_ail_pinned field of struct xfsstats.
* xs_push_ail_locked (xfs.log_tail.push_ail.locked)
** Value from xs_push_ail_locked field of struct xfsstats.
* xs_push_ail_flushing (xfs.log_tail.push_ail.flushing)
** Value from xs_push_ail_flushing field of struct xfsstats.
* xs_push_ail_restarts (xfs.log_tail.push_ail.restarts)
** Value from xs_push_ail_restarts field of struct xfsstats.
* xs_push_ail_flush (xfs.log_tail.push_ail.flush)
** Value from xs_push_ail_flush field of struct xfsstats.

=== xstrat - IoMap Write Convert ===
<span id="xstrat">
*xs_xstrat_quick (xfs.xstrat.quick)
** This is the number of buffers flushed out by the XFS flushing daemons which are written to contiguous space on disk. The buffers handled by the XFS daemons are delayed allocation buffers, so this count gives an indication of the success of the XFS daemons in allocating contiguous disk space for the data being flushed to disk.
*xs_xstrat_split (xfs.xstrat.split)
** This is the number of buffers flushed out by the XFS flushing daemons which are written to non-contiguous space on disk. The buffers handled by the XFS daemons are delayed allocation buffers, so this count gives an indication of the failure of the XFS daemons in allocating contiguous disk space for the data being flushed to disk. Large values in this counter indicate that the file system has become fragmented.

=== rw - Read/Write Stats ===
<span id="rw">
* xs_write_calls
** This is the number of write(2) system calls made to files in XFS file systems.
* xs_read_calls
** This is the number of read(2) system calls made to files in XFS file systems.

=== attr - Attribute Operations ===
<span id="attr">
* xs_attr_get
** The number of "get" operations performed on extended file attributes within XFS filesystems.  The "get" operation retrieves the value of an extended attribute.
* xs_attr_set
** The number of "set" operations performed on extended file attributes within XFS filesystems.  The "set" operation creates and sets the value of an extended attribute.
* xs_attr_remove
** The number of "remove" operations performed on extended file attributes within XFS filesystems.  The "remove" operation deletes an extended attribute.
* xs_attr_list
** The number of "list" operations performed on extended file attributes within XFS filesystems.  The "list" operation retrieves the set of extended attributes associated with a file.

=== icluster - Inode Clustering ===
<span id="icluster">
* xs_iflush_count
** This is the number of calls to xfs_iflush which gets called when an inode is being flushed (such as by bdflush or tail pushing). xfs_iflush searches for other inodes in the same cluster which are dirty and flushable.
* xs_icluster_flushcnt
** Value from xs_icluster_flushcnt field of struct xfsstats.
* xs_icluster_flushinode
** This is the number of times that the inode clustering was not able to flush anything but the one inode it was called with.

=== vnodes - Vnode Statistics ===
<span id="vnodes">
* vn_active
** Number of vnodes not on free lists.
* vn_alloc
** Number of times vn_alloc called.
* vn_get
** Number of times vn_get called.
* vn_hold
** Number of times vn_hold called.
* vn_rele
** Number of times vn_rele called.
* vn_reclaim
**  Number of times vn_reclaim called.
* vn_remove
** Number of times vn_remove called.

=== buf - Buf Statistics ===
<span id="buf">
* xb_get
* xb_create
* xb_get_locked
* xb_get_locked_waited
* xb_busy_locked
* xb_miss_locked
* xb_page_retries
* xb_page_found
* xb_get_read

=== xpc - eXtended Precision Counters ===
<span id="xpc">
* xs_xstrat_bytes
** This is a count of bytes of file data flushed out by the XFS flushing daemons.
* xs_write_bytes
** This is a count of bytes written via write(2) system calls to files in XFS file systems. It can be used in conjunction with the write_calls count to calculate the average size of the write operations to files in XFS file systems.
* xs_read_bytes
** This is a count of bytes read via read(2) system calls to files in XFS file systems. It can be used in conjunction with the read_calls count to calculate the average size of the read operations to files in XFS file systems.


== NOTES ==
Many of these statistics are monotonically increasing counters, and of course are subject to counter overflow (the final three listed above are 64-bit values, all others are 32-bit values).  As such they are of limited value in this raw form - if you are interested in monitoring throughput (e.g. bytes read/written per second), or other rates of change, you will be better served by investigating the PCP package more thoroughly - it contains a number of performance analysis tools which can help in this regard.
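
As the counters are monotonic, a rate is simply the difference between two samples divided by the sampling interval. For example, a rough sketch that estimates read/write throughput from the xpc line (field order as described in the xpc section above):

<pre>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* read the "xpc" line: xs_xstrat_bytes xs_write_bytes xs_read_bytes */
static int read_xpc(unsigned long long *wr, unsigned long long *rd)
{
	FILE *fp = fopen("/proc/fs/xfs/stat", "r");
	char tok[64];
	unsigned long long xstrat;
	int found = 0;

	if (!fp)
		return -1;
	while (fscanf(fp, "%63s", tok) == 1) {
		if (strcmp(tok, "xpc") == 0 &&
		    fscanf(fp, "%llu %llu %llu", &xstrat, wr, rd) == 3) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found ? 0 : -1;
}

int main(void)
{
	unsigned long long wr1, rd1, wr2, rd2;
	const int interval = 5;		/* seconds between samples */

	if (read_xpc(&wr1, &rd1))
		return 1;
	sleep(interval);
	if (read_xpc(&wr2, &rd2))
		return 1;

	/* monotonic counters: rate = delta / interval */
	printf("write: %llu B/s, read: %llu B/s\n",
	       (wr2 - wr1) / interval, (rd2 - rd1) / interval);
	return 0;
}
</pre>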

== External links ==
# Linux kernel sources: [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_stats.h;hb=HEAD xfs_stats.h]
# [http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsmisc/xfs_stats.pl?rev=1.7;content-type=text%2Fplain xfs_stats.pl] - script to parse and display xfs statistics
# Developers on [irc://irc.freenode.org/xfs irc]