<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ka9q</id>
	<title>xfs.org - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://xfs.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ka9q"/>
	<link rel="alternate" type="text/html" href="https://xfs.org/index.php/Special:Contributions/Ka9q"/>
	<updated>2026-04-20T13:34:05Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.42.3</generator>
	<entry>
		<id>https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=2120</id>
		<title>Improving Metadata Performance By Reducing Journal Overhead</title>
		<link rel="alternate" type="text/html" href="https://xfs.org/index.php?title=Improving_Metadata_Performance_By_Reducing_Journal_Overhead&amp;diff=2120"/>
		<updated>2010-11-07T20:38:39Z</updated>

		<summary type="html">&lt;p&gt;Ka9q: /* Improving Metadata Performance By Reducing Journal Overhead */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Improving Metadata Performance By Reducing Journal Overhead ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
XFS currently uses asynchronous write-ahead logging to ensure that changes to&lt;br /&gt;
the filesystem structure are preserved on crash.  It does this by logging&lt;br /&gt;
detailed records of the changes being made to each object on disk during a&lt;br /&gt;
transaction. Every byte that is modified needs to be recorded in the journal.&lt;br /&gt;
&lt;br /&gt;
There are two issues with this approach. The first is that transactions can&lt;br /&gt;
modify a *lot* of metadata to complete a single operation. Worse is the fact&lt;br /&gt;
that the average size of a transaction grows as structures get larger and&lt;br /&gt;
deeper, so performance on larger, fuller filesystems drops off as log bandwidth&lt;br /&gt;
is consumed by fewer, larger transactions.&lt;br /&gt;
&lt;br /&gt;
The second is that we re-log previous changes that are active in the journal&lt;br /&gt;
if the object is modified again. Hence, if an object is modified repeatedly, the&lt;br /&gt;
dirty parts of the object get rewritten over and over again. In the worst case,&lt;br /&gt;
frequently logged buffers will be entirely dirty, and so even if we only change&lt;br /&gt;
a single byte in the buffer we&#039;ll log the entire buffer.&lt;br /&gt;
&lt;br /&gt;
An example of how needless this can be is removing all the files in a&lt;br /&gt;
directory, which results in the directory blocks being logged over and over&lt;br /&gt;
again before finally being freed and made stale in the log. If we are freeing&lt;br /&gt;
the entire contents of the directory, the only transaction we really need in&lt;br /&gt;
the journal w.r.t. directory buffers is the &#039;remove, stale and free&#039;&lt;br /&gt;
transaction; all other changes are irrelevant because we don&#039;t care about&lt;br /&gt;
changes to free space. Depending on the directory block size, we might log each&lt;br /&gt;
directory buffer tens to hundreds of times before making it stale...&lt;br /&gt;
&lt;br /&gt;
Clearly we have two different axes along which to approach this problem:&lt;br /&gt;
&lt;br /&gt;
	- reduce the amount we log in a given transaction&lt;br /&gt;
	- reduce the number of times we re-log objects.&lt;br /&gt;
&lt;br /&gt;
Both of these things give the same end result - we require less bandwidth to&lt;br /&gt;
the journal to log changes that are happening in the filesystem. Let&#039;s start&lt;br /&gt;
by looking at how to reduce re-logging of objects.&lt;br /&gt;
&lt;br /&gt;
== Asynchronous Transaction Aggregation ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The first observation that we need to make is that we are already doing&lt;br /&gt;
asynchronous journalling for everything other than explicitly synchronous&lt;br /&gt;
transactions. This means we are aggregating completed transactions in memory&lt;br /&gt;
before writing them to disk. This reduces the number of disk I/Os needed to&lt;br /&gt;
write the log, but it does nothing to help prevent relogging of items.&lt;br /&gt;
&lt;br /&gt;
The second observation is that we store asynchronous committed transactions&lt;br /&gt;
in two forms while they are being written to disk:&lt;br /&gt;
&lt;br /&gt;
	- the physical form in the log buffer that will be written&lt;br /&gt;
	- the logical form attached to the log buffer so that on I/O completion&lt;br /&gt;
	  of the log buffer the items in the transaction can be unpinned and&lt;br /&gt;
	  moved to or updated in the AIL for later writeback.&lt;br /&gt;
&lt;br /&gt;
The fact that we store the logical form of the transaction until after the&lt;br /&gt;
log buffer is written to the journal is important - it means the transaction&lt;br /&gt;
and all its dirty items live longer than the process that creates and commits&lt;br /&gt;
the transaction. This allows us to redefine what &#039;transaction commit&#039; actually&lt;br /&gt;
means.&lt;br /&gt;
&lt;br /&gt;
A transaction commit currently takes the following steps:&lt;br /&gt;
&lt;br /&gt;
	- apply superblock and dquot changes to in-core structures&lt;br /&gt;
	- build a vector array to all the dirty regions in all the items in&lt;br /&gt;
	  the transaction.&lt;br /&gt;
	- write the vector array into the log buffer (may trigger log I/O)&lt;br /&gt;
	- release unused transaction reservations to in-core structures&lt;br /&gt;
	- attach transaction to log buffer callbacks&lt;br /&gt;
	- write a commit record into the log buffer for the transaction&lt;br /&gt;
	- unlock all the items locked in the transaction&lt;br /&gt;
	- release the log buffer (may trigger log I/O)&lt;br /&gt;
	- if synchronous transaction, issue a synchronous log force to&lt;br /&gt;
	  get the transaction on disk.&lt;br /&gt;
&lt;br /&gt;
Based on the observation that the transaction structure exists until it is&lt;br /&gt;
freed during log buffer I/O completion, we really don&#039;t have to format the&lt;br /&gt;
transaction into a log buffer during the transaction commit - we could&lt;br /&gt;
simply queue it into a list for later processing. Synchronous&lt;br /&gt;
transactions don&#039;t change this - they just queue the transaction then&lt;br /&gt;
do a log force to cause the transaction queue to be flushed to disk.&lt;br /&gt;
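&lt;br /&gt;
The reshaped commit path described above can be sketched as follows. This is an&lt;br /&gt;
illustrative model only - the names are hypothetical, not the real XFS&lt;br /&gt;
structures:&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative sketch of deferred commit: transactions are queued in
# logical form at commit time; formatting to the log happens later.
# All names here are hypothetical, not real XFS code.

class Transaction:
    def __init__(self, items, synchronous=False):
        self.items = items          # logical list of dirty items
        self.synchronous = synchronous

class AsyncTransactionQueue:
    def __init__(self):
        self.pending = []
        self.flushed = []

    def commit(self, tx):
        # 'commit' now just queues the logical transaction...
        self.pending.append(tx)
        # ...and only a synchronous transaction forces the log.
        if tx.synchronous:
            self.log_force()

    def log_force(self):
        # Drain the whole queue into a single formatted write.
        self.flushed.append(list(self.pending))
        self.pending = []
```
&lt;br /&gt;
The key point is that nothing is formatted into a log buffer until the force,&lt;br /&gt;
so asynchronous commits become pure in-memory queue operations.&lt;br /&gt;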
&lt;br /&gt;
Now that we have an asynchronous transaction queue in logical format, we can&lt;br /&gt;
take our time deciding when and how best to write it to disk. If we have&lt;br /&gt;
the situation where we are relogging items, we will have a given item&lt;br /&gt;
in multiple queued transactions. If we then write out each transaction as an&lt;br /&gt;
individual commit, as we currently do, the first commit we write would contain&lt;br /&gt;
changes made by later transactions, because the item carries its latest state.&lt;br /&gt;
This will cause problems for recovery.&lt;br /&gt;
&lt;br /&gt;
Hence what we really want to do is aggregate all those transactions into a&lt;br /&gt;
single large journal commit. This makes the journalling model more of a&lt;br /&gt;
&#039;checkpoint of changes&#039; than a &#039;transactional change&#039; model. By committing&lt;br /&gt;
a set of transactions rather than just a single transaction per commit&lt;br /&gt;
record, we can avoid needing to commit items several times to the log&lt;br /&gt;
if they are modified in multiple transactions. During recovery, we only&lt;br /&gt;
recover the entire commit so we only need a single version of each item&lt;br /&gt;
that encompasses all the changes in the commit period.&lt;br /&gt;
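&lt;br /&gt;
A toy model of the aggregation makes the saving obvious: however many queued&lt;br /&gt;
transactions touch an item, the checkpoint only carries its final version.&lt;br /&gt;
Names here are hypothetical:&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model: each queued transaction maps an item id to its new state.
# Aggregating into one checkpoint keeps only the latest state per item,
# so an item relogged N times is written to the journal once.

def build_checkpoint(queued_transactions):
    checkpoint = {}
    for tx in queued_transactions:
        for item_id, state in tx.items():
            checkpoint[item_id] = state   # later state overwrites earlier
    return checkpoint

txs = [
    {"dir-block-7": "v1", "inode-42": "v1"},
    {"dir-block-7": "v2"},
    {"dir-block-7": "v3", "inode-42": "v2"},
]
```
&lt;br /&gt;
Here three transactions dirtying the same directory block produce one logged&lt;br /&gt;
copy of it rather than three.&lt;br /&gt;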
&lt;br /&gt;
As an aside, if we have large numbers of items per commit record now,&lt;br /&gt;
it makes sense to start optimising the recovery read-ahead by sorting&lt;br /&gt;
all the items in the commit record before issuing readahead on them.&lt;br /&gt;
This will reduce the seeking the readahead triggers somewhat, so should&lt;br /&gt;
lead to faster recovery times.&lt;br /&gt;
&lt;br /&gt;
The down sides to this approach are:&lt;br /&gt;
&lt;br /&gt;
	- holds items pinned in memory for longer, thereby increasing&lt;br /&gt;
	  the chances of triggering a log force to unpin them.&lt;br /&gt;
	- partial log forces (i.e. those to a specific LSN) are no longer&lt;br /&gt;
	  really possible as we do not have multiple independent queues&lt;br /&gt;
	  (iclogbufs) holding the transactions.&lt;br /&gt;
	- log forces become more expensive by having to drain the entire&lt;br /&gt;
	  async transaction queue.&lt;br /&gt;
	- synchronous transactions become more expensive by having to&lt;br /&gt;
	  drain the entire async transaction queue.&lt;br /&gt;
	- possible &#039;interesting&#039; interactions with tail-pushing if we&lt;br /&gt;
	  allow too many async transactions to be queued without flushing&lt;br /&gt;
	  them.&lt;br /&gt;
&lt;br /&gt;
The main concern with this approach is ensuring that we don&#039;t adversely affect&lt;br /&gt;
fsync() performance. For example, ext3 uses a checkpoint based journalling&lt;br /&gt;
system that has a very long checkpoint period (5 seconds).  As a result, a&lt;br /&gt;
synchronous operation such as an fsync() can be forced to flush up to 5 seconds&lt;br /&gt;
worth of transactions to disk. In ordered mode, this also involves flushing&lt;br /&gt;
data, so the fsync() latency can be measured in tens of seconds on a busy&lt;br /&gt;
filesystem. This is known as the &#039;sync the world&#039; problem, and currently XFS&lt;br /&gt;
does not suffer from this at all. &lt;br /&gt;
&lt;br /&gt;
[Data point: Recent testing of this phenomenon by Chris Mason showed XFS took&lt;br /&gt;
less than one second to fsync a 4k write in the presence of a background&lt;br /&gt;
streaming write; BTRFS took two seconds and ext3 took five seconds. ]&lt;br /&gt;
&lt;br /&gt;
To avoid this type of latency, we should not be allowing too many transactions&lt;br /&gt;
to accumulate in the async transaction queue. If we look at optimising&lt;br /&gt;
workloads such as sequential creates or deletes in a single directory then, in&lt;br /&gt;
theory, accumulating just 10 async transactions into a single commit record&lt;br /&gt;
should provide close to an order of magnitude reduction in bandwidth to the log&lt;br /&gt;
under these workloads. We also reduce the number of log writes by aggregating&lt;br /&gt;
like this and that will give us even larger gains by avoiding seeks to write&lt;br /&gt;
log buffers out.&lt;br /&gt;
&lt;br /&gt;
Hence I don&#039;t think the async transaction queue needs to be all that deep&lt;br /&gt;
to realise substantial gains, and hence the impact on synchronous transaction&lt;br /&gt;
latency can be kept quite low as a result.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Atomic Multi-Transaction Operations ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
A feature asynchronous transaction aggregation makes possible is atomic&lt;br /&gt;
multi-transaction operations.  On the first transaction we hold the queue in&lt;br /&gt;
memory, preventing it from being committed. We can then do further transactions&lt;br /&gt;
that will end up in the same commit record, and on the final transaction we&lt;br /&gt;
unlock the async transaction queue. This will allow all those transactions to be&lt;br /&gt;
applied atomically. This is far simpler than any other method I&#039;ve been looking&lt;br /&gt;
at to do this.&lt;br /&gt;
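&lt;br /&gt;
The hold/release mechanism described above can be sketched like this&lt;br /&gt;
(hypothetical names, illustrative only):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of atomic multi-transaction operations on top of the async
# queue: while 'held', the queue cannot be flushed, so every transaction
# committed in between lands in the same checkpoint. Hypothetical names.

class HoldableQueue:
    def __init__(self):
        self.pending = []
        self.held = False
        self.checkpoints = []

    def commit(self, tx, hold=False, release=False):
        self.pending.append(tx)
        if hold:
            self.held = True      # first transaction of the atomic op
        if release:
            self.held = False     # final transaction of the atomic op

    def try_flush(self):
        if self.held or not self.pending:
            return False
        self.checkpoints.append(list(self.pending))
        self.pending = []
        return True
```
&lt;br /&gt;
Everything committed between the hold and the release is guaranteed to appear&lt;br /&gt;
in one commit record, and hence is applied atomically by recovery.&lt;br /&gt;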
&lt;br /&gt;
After a bit of reflection, I think this feature may be necessary for correct&lt;br /&gt;
implementation of existing logging techniques. The way we currently implement&lt;br /&gt;
rolling transactions (with permanent log reservations and rolling&lt;br /&gt;
dup/commit/re-reserve sequences) would seem to require all the commits in a&lt;br /&gt;
rolling transaction to be included in a single commit record.  If I understand&lt;br /&gt;
history and the original design correctly, these rolling transactions were&lt;br /&gt;
implemented so that large, complex transactions would not pin the tail of the&lt;br /&gt;
log as they progressed.  IOWs, they implicitly use re-logging to keep the tail&lt;br /&gt;
of the log moving forward as they progress and continue to modify items in the&lt;br /&gt;
transaction.&lt;br /&gt;
&lt;br /&gt;
Given we are using asynchronous transaction aggregation as a method of reducing&lt;br /&gt;
re-logging, it would make sense to prevent these sorts of transactions from&lt;br /&gt;
pinning the tail of the log at all. Further, because we are effectively&lt;br /&gt;
disturbing the concept of unique transactions, I don&#039;t think that allowing a&lt;br /&gt;
rolling transaction to span aggregated commits is valid as we are going to be&lt;br /&gt;
ignoring the transaction IDs that are used to identify individual transactions.&lt;br /&gt;
&lt;br /&gt;
Hence I think it is a good idea to simply replace rolling transactions with&lt;br /&gt;
atomic multi-transaction operations. This may also allow us to split some of&lt;br /&gt;
the large compound transactions into smaller, more self contained transactions.&lt;br /&gt;
This would reduce reservation pressure on log space in the common case where&lt;br /&gt;
all the corner cases in the transactions are not taken. In terms of&lt;br /&gt;
implementation, I think we can initially augment the permanent transaction&lt;br /&gt;
reservation/release interface to achieve this. With a working implementation,&lt;br /&gt;
we can then look to changing to a more explicit interface and slowly work to&lt;br /&gt;
remove the &#039;permanent log transaction&#039; concept entirely. This should simplify&lt;br /&gt;
the log code somewhat....&lt;br /&gt;
&lt;br /&gt;
Note: This asynchronous transaction aggregation is originally based on a&lt;br /&gt;
concept floated by Nathan Scott called &#039;Delayed Logging&#039; after observing how&lt;br /&gt;
ext3 implemented journalling.  This never passed more than a concept&lt;br /&gt;
description phase....&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Operation Based Logging ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The second approach to reducing log traffic is to change exactly what we&lt;br /&gt;
log in the transactions. At the moment, what we log is the exact change to&lt;br /&gt;
the item that is being made. For things like inodes and dquots, this isn&#039;t&lt;br /&gt;
particularly expensive because it is already a very compact form. The issue&lt;br /&gt;
comes with changes that are logged in buffers.&lt;br /&gt;
&lt;br /&gt;
The prime example of this is a btree modification that involves either removing&lt;br /&gt;
or inserting a record into a buffer. The records are kept in compact form, so an&lt;br /&gt;
insert or remove will also move other records around in the buffer. In the worst&lt;br /&gt;
case, a single insert or remove of a 16 byte record can dirty an entire block&lt;br /&gt;
(4k generally, but could be up to 64k). In this case, if we were to log the&lt;br /&gt;
btree operation (e.g. insert {record, index}) rather than the resultant change&lt;br /&gt;
on the buffer the overhead of a btree operation is fixed. Such logging also&lt;br /&gt;
allows us to avoid needing to log the changes due to splits and merges - we just&lt;br /&gt;
replay the operation and subsequent splits/merges get done as part of replay.&lt;br /&gt;
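&lt;br /&gt;
The replay semantics can be illustrated with a deliberately simplified model -&lt;br /&gt;
a sorted block of records where replaying the logical operation redoes the&lt;br /&gt;
record movement, so the movement itself never needs to be logged:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: operation-based logging replays the logical operation
# (e.g. 'insert record at position') instead of applying byte deltas.
# Inserting into a compact sorted block shifts every later record,
# but the logged op is a fixed-size description. Hypothetical model.

def apply_op(block, op):
    kind, rec = op
    if kind == "insert":
        block.append(rec)
        block.sort()              # replay redoes the record movement
    elif kind == "remove":
        block.remove(rec)
    return block

def replay(block, oplog):
    for op in oplog:
        apply_op(block, op)
    return block
```
&lt;br /&gt;
Note that the log never records which records moved - that falls out of&lt;br /&gt;
replaying the operation, just as splits and merges would in the real structure.&lt;br /&gt;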
&lt;br /&gt;
The result of this is that complex transactions no longer need as much log space&lt;br /&gt;
as all the possible changes they can cause - we only log the basic operations that&lt;br /&gt;
are occurring and their result. Hence transactions end up being much smaller,&lt;br /&gt;
vary less in size between empty and full filesystems, etc. An example set of&lt;br /&gt;
operations describing all the changes made by an extent allocation on an inode&lt;br /&gt;
would be:&lt;br /&gt;
&lt;br /&gt;
	- inode X intent to allocate extent {off, len}&lt;br /&gt;
	- AGCNT btree update record in AG X {old rec} {new rec values}&lt;br /&gt;
	- AGBNO btree delete record in AG X {block, len}&lt;br /&gt;
	- inode X BMBT btree insert record {off, block, len}&lt;br /&gt;
	- inode X delta&lt;br /&gt;
&lt;br /&gt;
This comes down to a relatively small, bounded amount of space which is close to&lt;br /&gt;
the minimum an existing allocation transaction would consume.  However, with this&lt;br /&gt;
method of logging the transaction size does not increase with the size of&lt;br /&gt;
structures or the amount of updates necessary to complete the operations.&lt;br /&gt;
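&lt;br /&gt;
A back-of-envelope comparison shows why the op-based size is bounded. The&lt;br /&gt;
numbers here are purely illustrative assumptions, not measured XFS values:&lt;br /&gt;
&lt;br /&gt;
```python
# Back-of-envelope comparison (illustrative numbers only): logging
# five fixed-size operation records versus logging whole dirty buffers
# for the same allocation. Buffer logging grows with block size; the
# op-based description stays constant.

OP_RECORD_SIZE = 64                  # assumed bytes per logged op
ops_logged = 5                       # the five ops listed above
op_based_bytes = ops_logged * OP_RECORD_SIZE

def buffer_based_bytes(block_size, buffers_dirtied=4):
    # worst case: each touched buffer is logged in full
    return block_size * buffers_dirtied
```
&lt;br /&gt;
With these assumed numbers, the op-based form is a few hundred bytes regardless&lt;br /&gt;
of block size, while worst-case buffer logging scales from tens of kilobytes at&lt;br /&gt;
4k blocks to hundreds of kilobytes at 64k blocks.&lt;br /&gt;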
&lt;br /&gt;
A major difference to the existing transaction system is that re-logging&lt;br /&gt;
of items doesn&#039;t fit very neatly with operation based logging. &lt;br /&gt;
&lt;br /&gt;
There are three main disadvantages to this approach:&lt;br /&gt;
&lt;br /&gt;
	- recovery becomes more complex - it will need to change substantially&lt;br /&gt;
	  to accommodate operation replay rather than just reading from disk&lt;br /&gt;
	  and applying deltas.&lt;br /&gt;
	- we have to create a whole new set of item types and add the necessary&lt;br /&gt;
	  hooks into the code to log all the operations correctly.&lt;br /&gt;
	- re-logging is probably not possible, and that introduces &lt;br /&gt;
	  differences to the way we&#039;ll need to track objects for flushing. It&lt;br /&gt;
	  may, in fact, require transaction IDs in all objects to allow us&lt;br /&gt;
	  to determine what the last transaction that modified the item&lt;br /&gt;
	  on disk was during recovery.&lt;br /&gt;
&lt;br /&gt;
Changing the logging strategy as described is a much more fundamental change to&lt;br /&gt;
XFS than asynchronous transaction aggregation. It will be difficult to change&lt;br /&gt;
to such a model in an evolutionary manner; it is more of a &#039;flag day&#039; style&lt;br /&gt;
change where the entire functionality needs to be added in one hit. Given that&lt;br /&gt;
we will also still have to support the old log format, it doesn&#039;t enable us to&lt;br /&gt;
remove any code, either.&lt;br /&gt;
&lt;br /&gt;
Given that we are likely to see major benefits in the problem workloads as a&lt;br /&gt;
result of asynchronous transaction aggregation, it may not be necessary to&lt;br /&gt;
completely rework the transaction subsystem. Combining aggregation with an&lt;br /&gt;
ongoing process of targeted reduction of transaction size will provide benefits&lt;br /&gt;
out to at least the medium term. It is unclear whether this direction will be&lt;br /&gt;
sufficient in the long run until we can measure the benefit that aggregation&lt;br /&gt;
will provide.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reducing Transaction Overhead ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To switch tracks completely, I have not addressed general issues with overhead&lt;br /&gt;
in the transaction subsystem itself. There are several points where the&lt;br /&gt;
transaction subsystem will single thread because of filesystem scope locks and&lt;br /&gt;
structures.  We have, for example, the log grant lock for protecting&lt;br /&gt;
reservation and used log space, the AIL lock for tracking dirty metadata, the&lt;br /&gt;
log state lock for state transition of log buffers and other associated&lt;br /&gt;
structure modifications.&lt;br /&gt;
&lt;br /&gt;
We have already started down the path of reducing contention in&lt;br /&gt;
various paths. For example:&lt;br /&gt;
&lt;br /&gt;
	- changing iclog reference counts to atomics to avoid needing the log&lt;br /&gt;
	  state lock on every transaction commit&lt;br /&gt;
	- protecting iclog callback lists with a per-iclog lock instead of the log&lt;br /&gt;
	  state lock&lt;br /&gt;
	- removing the AIL lock from the transaction reserve path by isolating&lt;br /&gt;
	  AIL tail pushing to a single thread instead of being done&lt;br /&gt;
	  synchronously.&lt;br /&gt;
&lt;br /&gt;
Asynchronous transaction aggregation is likely to perturb the current known&lt;br /&gt;
behaviour and bottlenecks as a result of moving all of the log interfacing out&lt;br /&gt;
of the direct transaction commit path.  Similar to moving the AIL pushing into&lt;br /&gt;
its own thread, this will mean that there will typically only be a single&lt;br /&gt;
thread formatting and writing to iclog buffers. This will remove much of the&lt;br /&gt;
parallelism that puts excessive pressure on many of these locks.&lt;br /&gt;
&lt;br /&gt;
I am certain that asynchronous transaction aggregation will open up new areas&lt;br /&gt;
of optimisation in the log formatting and dispatch code - it will probably&lt;br /&gt;
enable us to remove a lot of the complexity because we will be able to directly&lt;br /&gt;
control the parallelism in the formatting and dispatch of log buffers. This&lt;br /&gt;
implies that we may not need to be limited to a fixed pool of fixed sized log&lt;br /&gt;
buffers for writing transactions to disk.&lt;br /&gt;
&lt;br /&gt;
However, it is probably best to leave consideration of such optimisations until&lt;br /&gt;
after the asynchronous transaction aggregation is implemented and we can&lt;br /&gt;
directly observe the pain points that become apparent as a result of such a&lt;br /&gt;
change.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Reducing Recovery Time ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
With 2GB logs, recovery can take an awfully long time due to the need&lt;br /&gt;
to read each object synchronously as we process the journal. An obvious&lt;br /&gt;
way to avoid this is to add another pass to the processing to do asynchronous&lt;br /&gt;
readahead of all the objects in the log before doing the processing passes.&lt;br /&gt;
This will populate the cache as quickly as possible and hide any read latency&lt;br /&gt;
that could occur as we process commit records.&lt;br /&gt;
&lt;br /&gt;
A logical extension to this is to sort the objects in ascending offset order&lt;br /&gt;
before issuing I/O on them. That will further optimise the readahead I/O&lt;br /&gt;
to reduce seeking and hence should speed up the read phase of recovery&lt;br /&gt;
further.&lt;br /&gt;
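&lt;br /&gt;
The sort itself is trivial; the point is that readahead is then issued in&lt;br /&gt;
ascending disk order rather than journal order. A sketch, assuming items carry&lt;br /&gt;
a (disk address, length) pair (hypothetical representation):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: sort recovered log items by their disk address before
# issuing readahead, so the read phase sweeps the disk in one
# direction instead of seeking back and forth in journal order.

def readahead_order(log_items):
    # log_items: list of (daddr, length) pairs in journal order
    return sorted(log_items, key=lambda item: item[0])
```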
&lt;br /&gt;
== ToDo ==&lt;br /&gt;
Further investigation of recovery for future optimisation.&lt;/div&gt;</summary>
		<author><name>Ka9q</name></author>
	</entry>
</feed>