MediaWiki API result

{
    "batchcomplete": "",
    "continue": {
        "gapcontinue": "Shrinking_Support",
        "continue": "gapcontinue||"
    },
    "warnings": {
        "main": {
            "*": "Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes."
        },
        "revisions": {
            "*": "Because \"rvslots\" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used."
        }
    },
    "query": {
        "pages": {
            "1440": {
                "pageid": 1440,
                "ns": 0,
                "title": "Reliable Detection and Repair of Metadata Corruption",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "= Future Directions - Reliability =\n\nFrom http://oss.sgi.com/archives/xfs/2008-09/msg00802.html\n\n== Reliable Detection and Repair of Metadata Corruption ==\n\n\n\n\nThis can be broken down into specific phases. Firstly, we cannot repair a\ncorruption we have not detected. Hence the first thing we need to do is\nreliable detection of errors and corruption. Once we can reliably detect errors\nin structures and verified that we are propagating all the errors reported from\nlower layers into XFS correctly, we can look at ways of handling them more\nrobustly. In many cases, the same type of error needs to be handled differently\ndue to the context in which the error occurs.  This introduces extra complexity\ninto this problem.\n\nRather than continually refering to specific types of problems (such as\ncorruption or error handling) I'll refer to them as 'exceptions'. This avoids\nthinking about specific error conditions through specific paths and so helps us\nto look at the issues from a more general or abstract point of view.\n\n== Exception Detection ==\n\n\nOur current approach to exception detection is entirely reactive and rather\nslapdash - we read a metadata block from disk and check certain aspects of it\n(e.g. the magic number) to determine if it is the block we wanted. We have no\nway of verifying that it is the correct block of metadata of the type\nwe were trying to read; just that it is one of that specific type. We\ndo bounds checking on critical fields, but this can't detect bit errors\nin those fields. There's many fields we don't even bother to check because\nthe range of valid values are not limited.\n\nEffectively, this can be broken down into three separate areas:\n\n\t- ensuring what we've read is exactly what we wrote\n\t- ensuring what we've read is the block we were supposed to read\n\t- robust contents checking\n\nFirstly, if we introduce a mechanism that we can use to ensure what we read is\nsomething that the filesystem wrote, we can detect a whole range of exceptions\nthat are caused in layers below the filesystem (software and hardware). The\nbest method for this is to use a guard value that travels with the metadata it\nis guarding. The guard value needs to be derived from the contents of the\nblock being guarded. Any event that changes the guard or the contents it is\nguarding will immediately trigger an exception handling process when the\nmetadata is read in. Some examples of what this will detect are:\n\n\t- bit errors in media/busses/memory after guard is calculated\n\t- uninitialised blocks being returned from lower layers (dmcrypt\n\t  had a readahead cancelling bug that could do this)\n\t- zeroed sectors as a result of double sector failures\n\t  in RAID5 systems\n\t- overwrite by data blocks\n\t- partial overwrites (e.g. due to power failure)\n\nThe simplest method for doing this is introducing a checksum or CRC into each\nblock. We can calculate this for each different type of metadata being written\njust before they are written to disk, hence we are able to provide a guard that\ntravels all the way to and from disk with the metadata itself. Given that\nmetadata blocks can be a maximum of 64k in size, we don't need a hugely complex\nCRC or number of bits to protect blocks of this size. A 32 bit CRC will allow\nus to reliably detect 15 bit errors on a 64k block, so this would catch almost\nall types of bit error exceptions that occur. 
Such a guard will also detect almost all\nother types of major content change that might occur due to an exception.\nIt has been noted that we should select the guard algorithm to be one that\nhas (or is targeted for) widespread hardware acceleration support.\n\nThe other advantage this provides us with is a very fast method of determining\nif a corrupted btree is a result of a lower layer problem or indeed an XFS\nproblem. That is, instead of always getting a WANT_CORRUPTED_GOTO btree\nexception and shutdown, we'll get a 'bad CRC' exception before we even start\nprocessing the contents. This will save us much time when triaging corrupt\nbtrees - we won't spend time chasing problems that result from (potentially\nsilent or unhandled) lower layer exceptions.\n\nWhile a metadata block guard will protect us against content change, it won't\nprotect us against blocks that are written to the wrong location on disk. This,\nunfortunately, happens more often than anyone would like and can be very\ndifficult to track down when it does occur. To protect against this problem,\nmetadata needs to be self-describing on disk. That is, if we read a block\non disk, there needs to be enough information in that block to determine\nthat it is the correct block for that location.\n\nCurrently we have a very simplistic method of determining that we really have\nread the correct block - the magic numbers in each metadata structure.  This\nonly enables us to identify type - we still need location and filesystem to\nreally determine if the block we've read is the correct one. We need the\nfilesystem identifier because misdirected writes can cross filesystem\nboundaries.  This is easily done by including the UUID of the filesystem in\nevery individually referenceable metadata structure on disk.\n\nFor block based metadata structures such as btrees, AG headers, etc, we\ncan add the block number directly to the header structures, hence enabling\neasy checking. e.g. for btree blocks, we already have sibling pointers in the\nheader, so adding a long 'self' pointer makes a great deal of sense.\nFor inodes, adding the inode number into the inode core will provide exactly\nthe same protection - we'll now know that the inode we are reading is the\none we are supposed to have read. We can make similar modifications to dquots\nto make them self identifying as well.\n\nNow that we are able to verify the metadata we read from disk is what we wrote\nand that it is the correct metadata block, the only thing that remains is more\nrobust checking of the content. In many cases we already do this in DEBUG\ncode but not in runtime code. For example, when we read an inode cluster\nin we only check the first inode for a matching magic number, whereas in\ndebug code we check every inode in the cluster.\n\nIn some cases, there is not much point in doing this sort of detailed checking;\nit's pretty hard to check the validity of the contents of a btree block without\ndoing a full walk of the tree and that is prohibitive overhead for production\nsystems. The added block guards and self identifiers should be sufficient to\ncatch all non-filesystem based exceptions in this case, whilst the existing\nexception detection should catch all others. With the btree factoring that\nis being done for this work, all of the btrees should end up protected by\nWANT_CORRUPTED_GOTO runtime exception checking.\n\nWe also need to verify that metadata is sane before we use it. 
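\n\nFor instance, a schematic bounds check of on-disk references - the structure and helper names are invented for illustration, not actual XFS code - might look like this:\n\n<pre>\n/* Illustrative filesystem geometry - not the real xfs_mount layout */\nstruct fs_geometry {\n    unsigned int        agcount;    /* number of allocation groups */\n    unsigned long long  agblocks;   /* blocks per allocation group */\n};\n\n/*\n * Validate an (AG number, AG block) pair taken from an on-disk record\n * before it is used to index the per-AG array or follow a reference.\n * Returning an error here turns a potential out-of-bounds access (and\n * kernel panic) into an exception that can be handled.\n */\nstatic int check_ag_reference(const struct fs_geometry *geo,\n                              unsigned int agno,\n                              unsigned long long agbno)\n{\n    if (agno >= geo->agcount)\n        return -1;          /* AG number outside the filesystem */\n    if (agbno >= geo->agblocks)\n        return -1;          /* block outside the bounds of the AG */\n    return 0;\n}\n</pre>\n\n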
To give a concrete case: if\nwe pull a block number out of a btree record in a block that has passed all\nother validity checks, it may still be invalid due to corruption prior to writing\nit to disk. In these cases we need to ensure the block number lands\nwithin the filesystem and/or within the bounds of the specific AG.\n\nSimilar checking is needed for pretty much any forward or backwards reference\nwe are going to follow or use in an algorithm somewhere. This will help\nprevent kernel panics caused by out-of-bounds references (e.g. using an unchecked ag\nnumber to index the per-AG array) by turning them into a handled exception\n(which will initially be a shutdown). That is, we will turn a total system\nfailure into a (potentially recoverable) filesystem failure.\n\nAnother failure that we often have reported is that XFS has 'hung' and\ntriage indicates that the filesystem appears to be waiting for a metadata\nI/O completion to occur. We have seen in the past I/O errors not being\npropagated from the lower layers back into the filesystem, causing this\nsort of problem. We have also seen cases where there have been silent\nI/O errors and the first thing to go wrong is 'XFS has hung'.\n\nTo catch situations like this, we need to track all I/O we have in flight and\nhave some method of timing them out.  That is, if we haven't completed the I/O\nin N seconds, issue a warning and enter an exception handling process that\nattempts to deal with the problem.\n\nMy initial thought on this is that it could be implemented via the MRU cache\nwithout much extra code being needed.  The complexity with this is that we\ncan't catch data read I/O because we use the generic I/O path for read. We do\nour own data write and metadata read/write, so we can easily add hooks to track\nall these types of I/O. Hence we will initially target just metadata I/O as\nthis would only need to hook into the xfs_buf I/O submission layer.\n\nTo further improve exception detection, once guards and self-describing\nstructures are on disk, we can add filesystem scrubbing daemons that can verify\nthe structure of the filesystem pro-actively. That is, we can use background\nprocesses to discover degradation in the filesystem before it is found by a\nuser-initiated operation. This gives us the ability to do exception handling in\na context that enables further checking and potential repair of the exception.\nThis sort of exception handling may not be possible if we are in a\nuser-initiated I/O context, and certainly not if we are in a transaction\ncontext.\n\nThis will also allow us to detect errors in rarely referenced parts of\nthe filesystem, thereby giving us advance warning of degradation in filesystems\nthat we might not otherwise get (e.g. in systems without media scrubbing).\nIdeally, data scrubbing would need to be done as well, but without data guards\nit is rather hard to detect that there's been a change in the data....\n\n\n== Exception Handling ==\n\n\nOnce we can detect exceptions, we need to handle them in a sane manner.\nThe method of exception handling is two-fold:\n\n\t- retry (write) or cancel (read) asynchronous I/O\n\t- shut down the filesystem (fatal).\n\nEffectively, we either defer non-critical failures to a later point in\ntime or we come to a complete halt and prevent the filesystem from being\naccessed further. 
We have no other methods of handling exceptions.\n\nIf we look at the different types of exceptions we can have, they\nbroadly fall into:\n\n\t- media read errors\n\t- media write errors\n\t- successful media read, corrupted contents\n\nThe context in which the errors occur also influences the exception processing\nthat is required. For example, an unrecoverable metadata read error within a\ndirty transaction is a fatal error, whilst the same error during a read-only\noperation will simply log the error to syslog and return an error to userspace.\n\nFurthermore, the storage subsystem plays a part in deciding how to handle\nerrors. The reason is that in many storage configurations I/O errors can be\ntransient. For example, in a SAN a broken fibre can cause a failover to a\nredundant path, however the in-flight I/O on the failed path is usually timed out and\nan error returned. We don't want to shut down the filesystem on such an error -\nwe want to wait for failover to a redundant path and then retry the I/O. If the\nfailover succeeds, then the I/O will succeed. Hence any robust method of\nexception handling needs to consider that I/O exceptions may be transient.\n\nIn the absence of redundant metadata, there is little we can do right now\non a permanent media read error. There are a number of approaches we\ncan take for handling the exception:\n\n\t- try reading the block again. Normally we don't get an error\n\t  returned until the device has given up on trying to recover it.\n\t  If it's a transient failure, then we should eventually get a\n\t  good block back. If a retry fails, then:\n\n\t- inform the lower layer that it needs to perform recovery on that\n\t  block before trying to read it again. For path failover situations,\n\t  this should block until a redundant path is brought online. If no\n\t  redundant path exists or recovery from parity/error coding blocks\n\t  fails, then we cannot recover the block and we have a fatal error\n\t  situation.\n\nUltimately, however, we reach a point where we have to give up - the metadata\nno longer exists on disk and we have to enter a repair process to fix the\nproblem. That is, shut down the filesystem and get a human to intervene\nand fix the problem.\n\nAt this point, the only way we can prevent a shutdown situation from occurring\nis to have redundant metadata on disk. That is, whenever we get an error\nreported, we can immediately retry by reading from an alternate metadata block.\nIf we can read from the alternate block, we can continue onwards without\nthe user even knowing there is a bad block in the filesystem. Of course, we'd\nneed to log the event for the administrator to take action on at some point\nin the future.\n\nEven better, we can mostly avoid this intervention if we have alternate\nmetadata blocks. That is, we can repair blocks that are returning read errors\nduring the exception processing. In the case of media errors, they can\ngenerally be corrected simply by re-writing the block that was returning the\nerror. This will force drives to remap the bad blocks internally so the next\nread from that location will return valid data. This, if my understanding is\ncorrect, is the same process that ZFS and BTRFS use to recover from and correct\nsuch errors.\n\nNOTE: Adding redundant metadata can be done in several different ways. I'm not\ngoing to address that here as it is a topic all to itself. 
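\n\nPutting that retry ladder into schematic form - every function below is an invented placeholder for an I/O primitive assumed to exist elsewhere, not existing XFS code:\n\n<pre>\n/* Placeholders for I/O primitives assumed to be provided elsewhere */\nint read_meta_block(void *buf, unsigned long long blkno);\nint read_alternate_meta_block(void *buf, unsigned long long blkno);\nint write_meta_block(const void *buf, unsigned long long blkno);\nint request_lower_layer_recovery(unsigned long long blkno);\nvoid log_repair_event(unsigned long long blkno);\n\nenum read_result { READ_OK, READ_FATAL };\n\n/* Illustrative exception-handling ladder for a failed metadata read */\nstatic enum read_result handle_read_error(void *buf, unsigned long long blkno)\n{\n    /* 1. The failure may be transient - simply try the read again. */\n    if (read_meta_block(buf, blkno) == 0)\n        return READ_OK;\n\n    /* 2. Ask the lower layers to recover the block (e.g. wait for path failover). */\n    if (request_lower_layer_recovery(blkno) == 0 &&\n        read_meta_block(buf, blkno) == 0)\n        return READ_OK;\n\n    /* 3. Fall back to a redundant copy, if one exists. */\n    if (read_alternate_meta_block(buf, blkno) == 0) {\n        /* Re-write the primary location so the drive remaps the bad block. */\n        write_meta_block(buf, blkno);\n        log_repair_event(blkno);\n        return READ_OK;\n    }\n\n    /* 4. Nothing left to try - shut down and get a human to intervene. */\n    return READ_FATAL;\n}\n</pre>\n\n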
The focus of this\ndocument is to outline how the redundant metadata could be used to enhance\nexception processing and prevent a large number of cases where we currently\nshut down the filesystem.\n\nTODO:\n\tTransient write error\n\tPermanent write error\n\tCorrupted data on read\n\tCorrupted data on write (detected during guard calculation)\n\tI/O timeouts\n\tMemory corruption\n\n\n== Reverse Mapping ==\n\n\nIt is worth noting that even redundant metadata doesn't solve all our\nproblems. Realistically, all that redundant metadata gives us is the ability\nto recover from top-down traversal exceptions. It does not help exception\nhandling of occurrences such as double sector failures (i.e. loss of redundancy\nand a metadata block). Double sector failures are the most common cause\nof RAID5 data loss - loss of a disk followed by a sector read error during\nrebuild on one of the remaining disks.\n\nIn this case, we've got a block on disk that is corrupt. We know what block it\nis, but we have no idea who the owner of the block is. If it is a metadata\nblock, then we can recover it if we have redundant metadata.  Even if this is\nuser data, we still want to be able to tell them what file got corrupted by the\nfailure event.  However, without doing a top-down traverse of the filesystem we\ncannot find the owner of the block that was corrupted.\n\nThis is where we need a reverse block map. Every time we do an allocation of\nan extent we know who the owner of the block is. If we record this information\nin a separate tree then we can do a simple lookup to find the owner of any\nblock and start an exception handling process to repair the damage. Ideally\nwe also need to include information about the type of block as well. For\nexample, an inode can own:\n\n\t- data blocks\n\t- data fork BMBT blocks\n\t- attribute blocks\n\t- attribute fork BMBT blocks\n\nSo keeping track of owner + type would help indicate what sort of exception\nhandling needs to take place. For example, a missing data fork BMBT block means there\nwill be unreferenced extents across the filesystem. These 'lost extents'\ncould be recovered by reverse map traversal to find all the BMBT and data\nblocks owned by that inode and finding the ones that are not referenced.\nIf the reverse map held sufficient extra metadata - such as the offset within the\nfile for the extent - the exception handling process could rebuild the BMBT\ntree completely without needing any external help.\n\nIt would seem to me that the reverse map needs to be a long-pointer format\nbtree and held per-AG. It needs long pointers because the owner of an extent\ncan be anywhere in the filesystem, and it needs to be per-AG to avoid adverse\neffect on allocation parallelism.\n\nThe format of the reverse map record will be dependent on the amount of\nmetadata we need to store. We need:\n\n\t- owner (64 bit, primary record)\n\t- {block, len} extent descriptor\n\t- type\n\t- per-type specific metadata (e.g. offset for data types).\n\nLooking at worst case here, say we have 32 bytes per record, the worst case\nspace usage of the reverse map btree would be roughly 62 records per 4k\nblock. With a 1TB allocation group, we have 2^28 4k blocks in the AG\nthat could require unique reverse mappings. 
That gives us roughly 2^22\n4k blocks for the reverse map, or 2^34 bytes - roughly 16GB per 1TB\nof space.\n\nIt may be a good idea to allocate this space at mkfs time (tagged as unwritten\nso it doesn't need zeroing) to avoid allocation overhead and potential free\nspace fragmentation as the reverse map index grows and shrinks. If we do\nthis we could even treat this as an array/skip list where a given block in the\nAG has a fixed location in the map. This will require more study to determine\nthe advantages and disadvantages of such approaches.\n\n\n== Recovering From Errors During Transactions ==\n\n\nOne of the big problems we face with exception recovery is what to do\nwhen we take an exception inside a dirty transaction. At present, any\nerror is treated as a fatal error, the transaction is cancelled and\nthe filesystem is shut down. Even though we may have a context which\ncan return an error, we are unable to revert the changes we have\nmade during the transaction and so cannot back out.\n\nEffectively, a cancelled dirty transaction looks exactly like in-memory\nstructure corruption. That is, what is in memory is different to that\non disk, in the log or in asynchronous transactions yet to be written\nto the log. Hence we cannot simply return an error and continue.\n\nTo be able to do this, we need to be able to undo changes made in a given\ntransaction. The method XFS uses for journalling - write-ahead logging -\nmakes this difficult to do. A transaction proceeds in the following\norder:\n\n\t- allocate transaction\n\t- reserve space in the journal for transaction\n\t- repeat until change is complete:\n\t\t- lock item\n\t\t- join item to transaction\n\t\t- modify item\n\t\t- record region of change to item\n\t- transaction commit\n\nEffectively, we modify structures in memory then record where we\nchanged them for the transaction commit to write to disk. Unfortunately,\nthis means we overwrite the original state of the items in memory,\nleaving us with no way to back out those changes from memory if\nsomething goes wrong.\n\nHowever, based on the observation that we are supposed to join an item to the\ntransaction *before* we start modifying it, it is possible to record the state\nof the item before we start changing it. That is, we have a hook that can\nallow us to take a copy of the unmodified item when we join it to the\ntransaction.\n\nIf we have an unmodified copy of the item in memory, then if the transaction\nis cancelled when dirty, we have the information necessary to undo, or roll\nback, the changes made in the transaction. This would allow us to return\nthe in-memory state to that prior to the transaction starting, thereby\nensuring that the in-memory state matches the rest of the filesystem and\nallowing us to return an error to the calling context.\n\nThis is not without overhead: we would have to copy every metadata item\nentirely in every transaction. This will increase the CPU overhead\nof each transaction as well as the memory required. 
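\n\nA minimal sketch of the idea, using generic invented names - in practice the hook would live in the individual log item types, much as the xfs_buf_log_item debug code discussed below already does for buffers:\n\n<pre>\n#include <stdlib.h>\n#include <string.h>\n\n/* Illustrative transaction item - not the real XFS log item layout */\nstruct txn_item {\n    void   *data;       /* live in-memory copy of the metadata */\n    size_t  size;\n    void   *saved;      /* pristine copy taken at join time */\n};\n\n/* Take a copy of the unmodified item when it first joins the transaction. */\nstatic int txn_join_item(struct txn_item *item)\n{\n    item->saved = malloc(item->size);\n    if (!item->saved)\n        return -1;      /* with rollback, this could fail the transaction with ENOMEM */\n    memcpy(item->saved, item->data, item->size);\n    return 0;\n}\n\n/* On a dirty cancel, restore the pre-transaction state instead of shutting down. */\nstatic void txn_rollback_item(struct txn_item *item)\n{\n    memcpy(item->data, item->saved, item->size);\n    free(item->saved);\n    item->saved = NULL;\n}\n\n/* On a successful commit the snapshot is simply discarded. */\nstatic void txn_commit_item(struct txn_item *item)\n{\n    free(item->saved);\n    item->saved = NULL;\n}\n</pre>\n\n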
Of the two, it is the memory\nrequirement more than the CPU overhead that concerns me - we may need\nto ensure we have a memory pool associated with transaction reservation\nthat guarantees us enough memory is available to complete the transaction.\nHowever, given that we could roll back transactions, we could now *fail\ntransactions* with ENOMEM and not have to shut down the filesystem, so this\nmay be an acceptable trade-off.\n\nIn terms of implementation, it is worth noting that there is debug code in\nthe xfs_buf_log_item for checking that all the modified regions of a buffer\nwere logged. Importantly, this is implemented by copying the original buffer\nin the item initialisation when it is first attached to a transaction. In\nother words, this debug code implements the mechanism we need to be able\nto roll back changes made in a transaction. Other item types would require\nsimilar changes to be made.\n\nOverall, this doesn't look like a particularly complex change to make; the\nonly real question is how much overhead it is going to introduce. With CPUs\ngrowing more cores all the time, and XFS being aimed at extremely\nmulti-threaded workloads, this overhead may not be a concern for long.\n\n\n== Failure Domains ==\n\n\nIf we plan to have redundant metadata, or even try to provide fault isolation\nbetween different parts of the filesystem namespace, we need to know about\nindependent regions of the filesystem. 'Independent Regions' (IR) are ranges\nof the filesystem block address space that don't share resources with\nany other range.\n\nA classic example of a filesystem made up of multiple IRs is a linear\nconcatenation of multiple drives into a larger address space.  The address\nspace associated with each drive can operate independently from the other\ndrives, and a failure of one drive will not affect the operation of the address\nspaces associated with other drives in the linear concatenation.\n\nA Failure Domain (FD) is made up of one or more IRs. IRs cannot be shared\nbetween FDs - IRs are not independent if they are shared! Effectively, an\nFD is an encoding of the address space within the filesystem that lower level\nfailures (from below the filesystem) will not propagate outside. The geometry\nand redundancy in the underlying storage will determine the nature of the\nIRs available to the filesystem.\n\nTo use redundant metadata effectively for recovering from fatal lower layer\nloss or corruption, we really need to be able to place said redundant\nmetadata in different FDs. That way a loss in one domain can be recovered\nfrom a domain that is still intact. It also means that it is extremely\ndifficult to lose or corrupt all copies of a given piece of metadata;\nthat would require multiple independent faults to occur in a localised\ntemporal window. Concurrent multiple component failure in multiple\nIRs is considered to be quite unlikely - if such an event were to\noccur, it is likely that there is more to worry about than filesystem\nconsistency (like putting out the fire in the data center).\n\nAnother use of FDs is to try to minimise the number of domain boundaries\neach object in the filesystem crosses. If an object is wholly contained\nwithin an FD, and that object is corrupted, then the repair problem is\nisolated to that FD, not the entire filesystem. 
That is, by making\nallocation strategies and placement decisions aware of failure domain\nboundaries we can constrain the location of related data and metadata.\nOnce locality is constrained, the scope of repairing an object if\nit becomes corrupted is reduced to that of ensuring the FD is consistent.\n\nThere are many ways of limiting cross-domain dependencies; I will\nnot try to detail them here. Likewise, there are many ways of introducing\nsuch information into XFS - mkfs, dynamically via allocation policies,\netc - so I won't try to detail them, either. The main point to be\nmade is that to make full use of redundant metadata and to reduce\nthe scope of common repair problems, we need to pay attention to\nhow the system can fail, to ensure that we can recover from failures\nas quickly as possible."
                    }
                ]
            },
            "1458": {
                "pageid": 1458,
                "ns": 0,
                "title": "Runtime Stats",
                "revisions": [
                    {
                        "contentformat": "text/x-wiki",
                        "contentmodel": "wikitext",
                        "*": "This page intends to describe info available from <tt>/proc/fs/xfs/stat</tt>\n\n__TOC__ \n\n\n== Overview ==\nBeing advanced filesystem, XFS provides some internal statistics to user's view, which can be helpful on debugging/understanding IO characteristics and optimizing performance. Data available in /proc/fs/xfs/stat as dump of variables values grouped by type of information it holds.\n\n== output example ==\n\n<pre>\nextent_alloc 4260849 125170297 4618726 131131897\nabt 29491162 337391304 11257328 11133039\nblk_map 381213360 115456141 10903633 69612322 7448401 507596777 0\nbmbt 771328 6236258 602114 86646\ndir 21253907 6921870 6969079 779205554\ntrans 126946406 38184616 6342392\nig 17754368 2019571 102 15734797 0 15672217 3962470\nlog 129491915 3992515264 458018 153771989 127040250\npush_ail 171473415 0 6896837 3324292 8069877 65884 1289485 0 22535 7337\nxstrat 4140059 0\nrw 1595677950 1046884251\nattr 194724197 0 7 0\nicluster 20772185 2488203 13909520\nvnodes 62578 15959666 0 0 15897088 15897088 15897088 0\nbuf 2090581631 1972536890 118044776 225145 9486625 0 0 2000152616 809762\nxpc 6908312903680 67735504884757 19760115252482\ndebug 0\n</pre>\n\n== Fields table ==\nNumbers shown in output example above are presented as table of value names. Cells with <span style=\"background-color: red;\">red background</span> lack meaningful description and should be edited.\n{| style=\"white-space:nowrap; text-align:center; border-spacing: 2px; border: 1px solid #666666; font-family: Verdana, Cursor;font-size: 10px;font-weight: bold;\" \n |-style=\"background-color: #f0f0f0;\"\n | [[#extent_alloc|extent_alloc - Extent Allocation]]\n | xs_allocx\n | xs_allocb\n | xs_freex\n | xs_freeb\n |- style=\"background-color: #bfbfff;\"\n | [[#abt|abt - Allocation Btree]]\n | xs_abt_lookup\n | xs_abt_compare\n | xs_abt_insrec\n | xs_abt_delrec\n |-style=\"background-color: #f0f0f0;\"\n | [[#blk_map|blk_map - Block Mapping]]\n | xs_blk_mapr\n | xs_blk_mapw\n | xs_blk_unmap\n | xs_add_exlist\n | xs_del_exlist\n | xs_look_exlist\n | xs_cmp_exlist\n |-style=\"background-color: #bfbfff;\"\n | [[#bmbt|bmbt - Block Map Btree]]\n | xs_bmbt_lookup\n | xs_bmbt_compare\n | xs_bmbt_insrec\n | xs_bmbt_delrec\n |-style=\"background-color: #f0f0f0;\"\n | [[#dir|dir - Directory Operations]]\n | xs_dir_lookup\n | xs_dir_create\n | xs_dir_remove\n | xs_dir_getdents\n |-style=\"background-color: #bfbfff;\"\n | [[#trans|trans - Transactions]]\n | xs_trans_sync\n | xs_trans_async\n | xs_trans_empty\n |-style=\"background-color: #f0f0f0;\"\n | [[#ig|ig - Inode Operations]]\n | xs_ig_attempts\n | xs_ig_found\n | xs_ig_frecycle\n | xs_ig_missed\n | xs_ig_dup\n | xs_ig_reclaims\n | xs_ig_attrchg\n |-style=\"background-color: #bfbfff;\"\n | [[#log|log - Log Operations]]\n | xs_log_writes\n | xs_log_blocks\n | xs_log_noiclogs\n | xs_log_force\n |  <span style=\"background-color: red;\">xs_log_force_sleep</span>\n |-style=\"background-color: #f0f0f0;\"\n | [[#push_ail|push_ail - Tail-Pushing Stats]]\n | <span style=\"background-color: red;\">xs_try_logspace</span>\n | <span style=\"background-color: red;\">xs_sleep_logspace</span>\n | <span style=\"background-color: red;\">xs_push_ail</span>\n | <span style=\"background-color: red;\">xs_push_ail_success</span>\n | <span style=\"background-color: red;\">xs_push_ail_pushbuf</span>\n | <span style=\"background-color: red;\">xs_push_ail_pinned</span>\n | <span style=\"background-color: red;\">xs_push_ail_locked</span>\n | <span style=\"background-color: 
red;\">xs_push_ail_flushing</span>\n | <span style=\"background-color: red;\">xs_push_ail_restarts</span>\n | <span style=\"background-color: red;\">xs_push_ail_flush</span>\n |-style=\"background-color: #bfbfff;\"\n | [[#xstrat|xstrat - IoMap Write Convert]]\n | xs_xstrat_quick\n | xs_xstrat_split\n |-style=\"background-color: #f0f0f0;\"\n | [[#rw|rw - Read/Write Stats]]\n | xs_write_calls\n | xs_read_calls\n |-style=\"background-color: #bfbfff;\"\n | [[#attr|attr - Attribute Operations]]\n | xs_attr_get\n | xs_attr_set\n | xs_attr_remove\n | xs_attr_list\n |-style=\"background-color: #f0f0f0;\"\n | [[#icluster|icluster - Inode Clustering]]\n | xs_iflush_count\n | <span style=\"background-color: red;\">xs_icluster_flushcnt</span>\n | xs_icluster_flushinode\n |-style=\"background-color: #bfbfff;\"\n | [[#vnodes|vnodes - Vnode Statistics]]\n | <span style=\"background-color: red;\">vn_active</span>\n | <span style=\"background-color: red;\">vn_alloc</span>\n | <span style=\"background-color: red;\">vn_get</span>\n | <span style=\"background-color: red;\">vn_hold</span>\n | <span style=\"background-color: red;\">vn_rele</span>\n | <span style=\"background-color: red;\">vn_reclaim</span>\n | <span style=\"background-color: red;\">vn_remove</span>\n | <span style=\"background-color: red;\">vn_free</span>\n |-style=\"background-color: #f0f0f0;\"\n | [[#buf|buf - Buf Statistics]]\n | <span style=\"background-color: red;\">xb_get</span>\n | <span style=\"background-color: red;\">xb_create</span>\n | <span style=\"background-color: red;\">xb_get_locked</span>\n | <span style=\"background-color: red;\">xb_get_locked_waited</span>\n | <span style=\"background-color: red;\">xb_busy_locked</span>\n | <span style=\"background-color: red;\">xb_miss_locked</span>\n | <span style=\"background-color: red;\">xb_page_retries</span>\n | <span style=\"background-color: red;\">xb_page_found</span>\n | <span style=\"background-color: red;\">xb_get_read</span>\n |-style=\"background-color: #bfbfff;\"\n | [[#xpc|xpc - eXtended Precision Counters]]\n | xs_xstrat_bytes\n | xs_write_bytes\n | xs_read_bytes\n |}\n\n\n\n== Fields description: ==\n\n=== extent_alloc - Extent Allocation ===\n<span id=\"extent_alloc\">\n* xs_allocx (xfs.allocs.alloc_extent) \n** Number of file system extents allocated over all XFS filesystems. \n* xs_allocb (xfs.allocs.alloc_block)\n** Number of file system blocks allocated over all XFS filesystems. \n* xs_freex (xfs.allocs.free_extent) \n** Number of file system extents freed over all XFS filesystems. \n* xs_freeb (xfs.allocs.free_block) \n** Number of file system blocks freed over all XFS filesystems. 
\n\n=== abt - Allocation Btree ===\n<span id=\"abt\">\n* xs_abt_lookup (xfs.alloc_btree.lookup)\n** Number of lookup operations in XFS filesystem allocation btrees.\n* xs_abt_compare (xfs.alloc_btree.compare)\n** Number of compares in XFS filesystem allocation btree lookups.\n* xs_abt_insrec (xfs.alloc_btree.insrec)\n** Number of extent records inserted into XFS filesystem allocation btrees.\n* xs_abt_delrec (xfs.alloc_btree.delrec)\n** Number of extent records deleted from XFS filesystem allocation btrees.\n\n=== blk_map - Block Mapping ===\n<span id=\"blk_map\">\n* xs_blk_mapr (xfs.block_map.read_ops)\n** Number of block map for read operations performed on XFS files.\n* xs_blk_mapw (xfs.block_map.write_ops)\n** Number of block map for write operations performed on XFS files.\n* xs_blk_unmap (xfs.block_map.unmap)\n** Number of block unmap (delete) operations performed on XFS files.\n* xs_add_exlist (xfs.block_map.add_exlist)\n** Number of extent list insertion operations for XFS files.\n* xs_del_exlist (xfs.block_map.del_exlist)\n** Number of extent list deletion operations for XFS files.\n* xs_look_exlist (xfs.block_map.look_exlist)\n** Number of extent list lookup operations for XFS files.\n* xs_cmp_exlist (xfs.block_map.cmp_exlist)\n** Number of extent list comparisons in XFS extent list lookups.\n\n=== bmbt - Block Map Btree ===\n<span id=\"bmbt\">\n* xs_bmbt_lookup (xfs.bmap_btree.lookup)\n** Number of block map btree lookup operations on XFS files.\n* xs_bmbt_compare (xfs.bmap_btree.compare)\n** Number of block map btree compare operations in XFS block map lookups.\n* xs_bmbt_insrec (xfs.bmap_btree.insrec)\n** Number of block map btree records inserted for XFS files.\n* xs_bmbt_delrec (xfs.bmap_btree.delrec)\n** Number of block map btree records deleted for XFS files.\n\n=== dir - Directory Operations ===\n<span id=\"dir\">\n* xs_dir_lookup (xfs.dir_ops.lookup)\n** This is a count of the number of file name directory lookups in XFS filesystems. It counts only those lookups which miss in the operating system's directory name lookup cache and must search the real directory structure for the name in question.  The count is incremented once for each level of a pathname search that results in a directory lookup.\n* xs_dir_create (xfs.dir_ops.create)\n** This is the number of times a new directory entry was created in XFS filesystems. Each time that a new file, directory, link, symbolic link, or special file is created in the directory hierarchy the count is incremented.\n* xs_dir_remove (xfs.dir_ops.remove)\n** This is the number of times an existing directory entry was removed in XFS filesystems. Each time that a file, directory, link, symbolic link, or special file is removed from the directory hierarchy the count is incremented.\n* xs_dir_getdents (xfs.dir_ops.getdents)\n** This is the number of times the XFS directory getdents operation was performed. The getdents operation is used by programs to read the contents of directories in a file system independent fashion.  This count corresponds exactly to the number of times the getdents(2) system call was successfully used on an XFS directory.\n\n=== trans - Transactions ===\n<span id=\"trans\">\n* xs_trans_sync (xfs.transactions.sync)\n** This is the number of meta-data transactions which waited to be committed to the on-disk log before allowing the process performing the transaction to continue. 
These transactions are slower and more expensive than asynchronous transactions, because they force the in memory log buffers to be forced to disk more often and they wait for the completion of the log buffer writes. Synchronous transactions include file truncations and all directory updates when the file system is mounted with the 'wsync' option.\n* xs_trans_async (xfs.transactions.async)\n** This is the number of meta-data transactions which did not wait to be committed to the on-disk log before allowing the process performing the transaction to continue. These transactions are faster and more efficient than synchronous transactions, because they commit their data to the in memory log buffers without forcing those buffers to be written to disk. This allows multiple asynchronous transactions to be committed to disk in a single log buffer write. Most transactions used in XFS file systems are asynchronous.\n* xs_trans_empty (xfs.transactions.empty)\n** This is the number of meta-data transactions which did not actually change anything. These are transactions which were started for some purpose, but in the end it turned out that no change was necessary.\n\n=== ig - Inode Operations ===\n<span id=\"ig\">\n* xs_ig_attempts (xfs.inode_ops.ig_attempts)\n** This is the number of times the operating system looked for an XFS inode in the inode cache. Whether the inode was found in the cache or needed to be read in from the disk is not indicated here, but this can be computed from the ig_found and ig_missed counts.\n* xs_ig_found (xfs.inode_ops.ig_found)\n** This is the number of times the operating system looked for an XFS inode in the inode cache and found it. The closer this count is to the ig_attempts count the better the inode cache is performing.\n* xs_ig_frecycle (xfs.inode_ops.ig_frecycle)\n** This is the number of times the operating system looked for an XFS inode in the inode cache and saw that it was there but was unable to use the in memory inode because it was being recycled by another process.\n* xs_ig_missed (xfs.inode_ops.ig_missed)\n** This is the number of times the operating system looked for an XFS inode in the inode cache and the inode was not there. The further this count is from the ig_attempts count the better.\n* xs_ig_dup (xfs.inode_ops.ig_dup)\n** This is the number of times the operating system looked for an XFS inode in the inode cache and found that it was not there but upon attempting to add the inode to the cache found that another process had already inserted it.\n* xs_ig_reclaims (xfs.inode_ops.ig_reclaims)\n** This is the number of times the operating system recycled an XFS inode from the inode cache in order to use the memory for that inode for another purpose. Inodes are recycled in order to keep the inode cache from growing without bound. If the reclaim rate is high it may be beneficial to raise the vnode_free_ratio kernel tunable variable to increase the size of the inode cache.\n* xs_ig_attrchg (xfs.inode_ops.ig_attrchg)\n** This is the number of times the operating system explicitly changed the attributes of an XFS inode. For example, this could be to change the inode's owner, the inode's size, or the inode's timestamps.\n\n=== log - Log Operations ===\n<span id=\"log\">\n* xs_log_writes (xfs.log.writes)\n** This variable counts the number of log buffer writes going to the physical log partitions of all XFS filesystems. Log data traffic is proportional to the level of meta-data updating. 
Log buffer writes get generated when they fill up or external syncs occur.\n* xs_log_blocks (xfs.log.blocks)\n** This variable counts (in 512-byte units) the information being written to the physical log partitions of all XFS filesystems. Log data traffic is proportional to the level of meta-data updating. The rate with which log data gets written depends on the size of internal log buffers and disk write speed. Therefore, filesystems with very high meta-data updating may need to stripe the log partition or put the log partition on a separate drive.\n* xs_log_noiclogs (xfs.log.noiclogs)\n** This variable keeps track of times when a logged transaction can not get any log buffer space. When this occurs, all of the internal log buffers are busy flushing their data to the physical on-disk log.\n* xs_log_force (xfs.log.force)\n** The number of times the in-core log is forced to disk.  It is equivalent to the number of successful calls to the function xfs_log_force().\n* xs_log_force_sleep (xfs.log.force_sleep)\n** Value exported from the xs_log_force_sleep field of struct xfsstats.\n\n=== push_ail - Tail-Pushing Stats ===\n<span id=\"push_ail\">\n* xs_try_logspace (xfs.log_tail.try_logspace)\n** Value from the xs_try_logspace field of struct xfsstats. \n* xs_sleep_logspace (xfs.log_tail.sleep_logspace)\n** Value from the xs_sleep_logspace field of struct xfsstats.\n* xs_push_ail (xfs.log_tail.push_ail.pushes)\n** The number of times the tail of the AIL is moved forward.  It is equivalent to the number of successful calls to the function xfs_trans_push_ail(). \n* xs_push_ail_success (xfs.log_tail.push_ail.success)\n** Value from xs_push_ail_success field of struct xfsstats.\n* xs_push_ail_pushbuf (xfs.log_tail.push_ail.pushbuf)\n** Value from xs_push_ail_pushbuf field of struct xfsstats.\n* xs_push_ail_pinned (xfs.log_tail.push_ail.pinned)\n** Value from xs_push_ail_pinned field of struct xfsstats.\n* xs_push_ail_locked (xfs.log_tail.push_ail.locked)\n** Value from xs_push_ail_locked field of struct xfsstats.\n* xs_push_ail_flushing (xfs.log_tail.push_ail.flushing)\n** Value from xs_push_ail_flushing field of struct xfsstats.\n* xs_push_ail_restarts (xfs.log_tail.push_ail.restarts)\n** Value from xs_push_ail_restarts field of struct xfsstats.\n* xs_push_ail_flush (xfs.log_tail.push_ail.flush)\n** Value from xs_push_ail_flush field of struct xfsstats.\n\n=== xstrat - IoMap Write Convert ===\n<span id=\"xstrat\">\n*xs_xstrat_quick (xfs.xstrat.quick)\n** This is the number of buffers flushed out by the XFS flushing daemons which are written to contiguous space on disk. The buffers handled by the XFS daemons are delayed allocation buffers, so this count gives an indication of the success of the XFS daemons in allocating contiguous disk space for the data being flushed to disk.\n*xs_xstrat_split (xfs.xstrat.split)\n** This is the number of buffers flushed out by the XFS flushing daemons which are written to non-contiguous space on disk. The buffers handled by the XFS daemons are delayed allocation buffers, so this count gives an indication of the failure of the XFS daemons in allocating contiguous disk space for the data being flushed to disk. 
Large values in this counter indicate that the file system has become fragmented.\n\n=== rw - Read/Write Stats ===\n<span id=\"rw\">\n* xs_write_calls\n** This is the number of write(2) system calls made to files in XFS file systems.\n* xs_read_calls\n** This is the number of read(2) system calls made to files in XFS file systems.\n\n=== attr - Attribute Operations ===\n<span id=\"attr\">\n* xs_attr_get\n** The number of \"get\" operations performed on extended file attributes within XFS filesystems.  The \"get\" operation retrieves the value of an extended attribute.\n* xs_attr_set\n** The number of \"set\" operations performed on extended file attributes within XFS filesystems.  The \"set\" operation creates and sets the value of an extended attribute.\n* xs_attr_remove\n** The number of \"remove\" operations performed on extended file attributes within XFS filesystems.  The \"remove\" operation deletes an extended attribute.\n* xs_attr_list\n** The number of \"list\" operations performed on extended file attributes within XFS filesystems.  The \"list\" operation retrieves the set of extended attributes associated with a file.\n\n=== icluster - Inode Clustering ===\n<span id=\"icluster\">\n* xs_iflush_count\n** This is the number of calls to xfs_iflush which gets called when an inode is being flushed (such as by bdflush or tail pushing). xfs_iflush searches for other inodes in the same cluster which are dirty and flushable.\n* xs_icluster_flushcnt\n** Value from xs_icluster_flushcnt field of struct xfsstats.\n* xs_icluster_flushinode\n** This is the number of times that the inode clustering was not able to flush anything but the one inode it was called with.\n\n=== vnodes - Vnode Statistics ===\n<span id=\"vnodes\">\n* vn_active\n** Number of vnodes not on free lists.\n* vn_alloc\n** Number of times vn_alloc called.\n* vn_get\n** Number of times vn_get called.\n* vn_hold\n** Number of times vn_hold called.\n* vn_rele\n** Number of times vn_rele called.\n* vn_reclaim\n**  Number of times vn_reclaim called.\n* vn_remove\n** Number of times vn_remove called.\n\n=== buf - Buf Statistics ===\n<span id=\"buf\">\n* xb_get\n* xb_create\n* xb_get_locked\n* xb_get_locked_waited\n* xb_busy_locked\n* xb_miss_locked\n* xb_page_retries\n* xb_page_found\n* xb_get_read\n\n=== xpc - eXtended Precision Counters ===\n<span id=\"xpc\">\n* xs_xstrat_bytes\n** This is a count of bytes of file data flushed out by the XFS flushing daemons.\n* xs_write_bytes\n** This is a count of bytes written via write(2) system calls to files in XFS file systems. It can be used in conjunction with the write_calls count to calculate the average size of the write operations to files in XFS file systems.\n* xs_read_bytes\n** This is a count of bytes read via read(2) system calls to files in XFS file systems. It can be used in conjunction with the read_calls count to calculate the average size of the read operations to files in XFS file systems.\n\n\n== NOTES ==\nMany of these statistics are monotonically increasing counters, and of course are subject to counter overflow (the final three listed above are 64-bit values, all others are 32-bit values).  As such they are of limited value in this raw form - if you are interested in monitoring throughput (e.g. 
bytes read/written per second), or other rates of change, you will be better served by investigating the PCP package more thoroughly - it contains a number of performance analysis tools which can help in this regard.\n\n== External links ==\n# Linux kernel sources: [http://oss.sgi.com/cgi-bin/gitweb.cgi?p=xfs/xfs.git;a=blob;f=fs/xfs/linux-2.6/xfs_stats.h;hb=HEAD xfs_stats.h]\n# [http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsmisc/xfs_stats.pl?rev=1.7;content-type=text%2Fplain xfs_stats.pl] - script to parse and display xfs statistics\n# Developers on [irc://irc.freenode.org/xfs irc]"
                    }
                ]
            }
        }
    }
}