[rlc-10/6.12.0-211.26.1.el10_2] Multiple patches tested (7 commits)#1384

Open
ciq-kernel-automation[bot] wants to merge 7 commits into rlc-10/6.12.0-211.26.1.el10_2 from {jmaple}_rlc-10/6.12.0-211.26.1.el10_2

@ciq-kernel-automation

Summary

This PR has been automatically created after successful completion of all CI stages.

Commit Message(s)

writeback: Avoid contention on wb->list_lock when switching inodes

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit e1b849cfa6b61f1c866a908c9e8dd9b5aaab820b
upstream-diff | The change to bdi_writeback propagates a kABI breakage
	through every pointer to it and to backing_dev_info, so we have
	to use RH_KABI_EXTEND() on bdi_writeback to prevent the CRC
	miscalculation.

writeback: Avoid softlockup when switching many inodes

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 66c14dccd810d42ec5c73bb8a9177489dfd62278

writeback: Avoid excessively long inode switching times

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 9a6ebbdbd41235ea3bc0c4f39e2076599b8113cc

writeback: Add tracepoint to track pending inode switches

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 0cee64c547e3c9cda646af3e075a64f445ee8148

writeback: Fix use after free in inode_switch_wbs_work_fn()

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 6689f01d6740cf358932b3e97ee968c6099800d9

writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

jira SECO-535
bugfix: writeback softlockups
commit-author Baokun Li <libaokun@linux.alibaba.com>
commit cba38ec4cbd3a7b8b942a8d52531a05be8a9ff0d

writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount()

jira SECO-535
bugfix: writeback softlockups
commit-author Baokun Li <libaokun@linux.alibaba.com>
commit e90a6d668e26e00a72df2d09c173b563468f09c9

Test Results

✅ Build Stage

| Architecture | Build Time | Total Time |
| --- | --- | --- |
| x86_64 | 41m 8s | 41m 57s |
| aarch64 | 24m 38s | 25m 16s |

✅ Boot Verification

✅ Kernel Selftests

| Architecture | Passed | Failed | Compared Against | Status |
| --- | --- | --- | --- | --- |
| x86_64 | 428 | 64 | rlc-10/6.12.0-211.26.1.el10_2 | ⚠️ No baseline available |
| aarch64 | 375 | 60 | rlc-10/6.12.0-211.26.1.el10_2 | ⚠️ No baseline available |

✅ LTP Results

| Architecture | Passed | Failed | Compared Against | Status |
| --- | --- | --- | --- | --- |
| x86_64 | 1481 | 79 | rlc-10/6.12.0-211.26.1.el10_2 | ⚠️ No baseline available |
| aarch64 | 1452 | 80 | rlc-10/6.12.0-211.26.1.el10_2 | ⚠️ No baseline available |

🤖 This PR was automatically generated by GitHub Actions
Run ID: 28248930605

PlaidCat added 7 commits June 26, 2026 11:42
jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit e1b849c
upstream-diff | The change to bdi_writeback propagates a kABI breakage
	through every pointer to it and to backing_dev_info, so we have
	to use RH_KABI_EXTEND() on bdi_writeback to prevent the CRC
	miscalculation.
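
The idea behind the upstream-diff note can be sketched as follows. This is a simplified model, not the actual contents of the kernel's kABI header: `MODEL_KABI_EXTEND`, `struct model_bdi_writeback` and its `switch_queue` member are hypothetical names chosen for illustration. The sketch assumes the new member is appended at the end of the structure and is hidden from the symbol-version (CRC) generator, which is assumed to define `__GENKSYMS__` when it parses the header.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a kABI "extend" macro: the CRC generator
 * sees nothing, the compiler sees the new member.
 */
#ifdef __GENKSYMS__
#define MODEL_KABI_EXTEND(_new)
#else
#define MODEL_KABI_EXTEND(_new) _new;
#endif

struct model_list_head {
	struct model_list_head *next, *prev;
};

/* Hypothetical, heavily reduced model of a writeback structure. */
struct model_bdi_writeback {
	unsigned long state;
	struct model_list_head b_dirty;
	/* Appended last so the offsets of existing members do not move. */
	MODEL_KABI_EXTEND(struct model_list_head switch_queue)
};

/* The extension must come after every pre-existing member. */
static_assert(offsetof(struct model_bdi_writeback, switch_queue) >
	      offsetof(struct model_bdi_writeback, b_dirty),
	      "extension must be appended at the end");

/* Offset at which the appended member starts. */
static size_t model_extension_offset(void)
{
	return offsetof(struct model_bdi_writeback, switch_queue);
}
```

Appending at the end only helps when the structure is always allocated by the kernel itself; that condition is an assumption of this sketch and is not stated in the PR.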

There can be multiple inode switch works that are trying to switch
inodes to / from the same wb. This can happen in particular if a cgroup
that owns many (thousands of) inodes exits and we need to switch them
all. In this case several inode_switch_wbs_work_fn() instances will
just be spinning on the same wb->list_lock while only one of them makes
forward progress. This wastes CPU cycles and quickly leads to softlockup
reports and an unusable system.

Instead of running several inode_switch_wbs_work_fn() instances in
parallel switching to the same wb and contending on wb->list_lock, run
just one work item per wb and manage a queue of isw items switching to
this wb.
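
The "one work item per wb plus a queue of isw items" scheme can be sketched in userspace C. This is a minimal model rather than the kernel code: `struct wb_model`, `struct isw_item` and the `model_*` functions are hypothetical names, and the C11 atomic stack only imitates the llist_add() / llist_del_all() pattern quoted in a later commit message of this PR. The work counter is a plain int because the sketch is driven from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one queued switch request. */
struct isw_item {
	struct isw_item *next;
	int nr_inodes;
};

/* Hypothetical stand-in for the target wb. */
struct wb_model {
	_Atomic(struct isw_item *) pending;	/* lock-free queue of requests */
	int work_queued;			/* times the single work item was queued */
	int inodes_switched;
};

/* Push an item; return true if the queue was empty before the push. */
static bool model_queue_push(struct wb_model *wb, struct isw_item *item)
{
	struct isw_item *head = atomic_load(&wb->pending);

	assert(item != NULL);
	do {
		item->next = head;
	} while (!atomic_compare_exchange_weak(&wb->pending, &head, item));
	return head == NULL;
}

/* Only the producer that makes the queue non-empty queues the work. */
static void model_queue_isw(struct wb_model *wb, struct isw_item *item)
{
	if (model_queue_push(wb, item))
		wb->work_queued++;
}

/* The single worker detaches the whole queue at once and drains it. */
static void model_switch_work(struct wb_model *wb)
{
	struct isw_item *list = atomic_exchange(&wb->pending, NULL);

	while (list) {
		struct isw_item *next = list->next;

		wb->inodes_switched += list->nr_inodes;
		list = next;
	}
}
```

In this model, any number of requests queued before the worker runs result in a single queued work item, so there is only one consumer per wb and nothing left to contend on a per-wb lock.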

	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Jan Kara <jack@suse.cz>
(cherry picked from commit e1b849c)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 66c14dc

process_inode_switch_wbs_work() can be switching over 100 inodes to a
different cgroup. Since switching an inode requires counting all dirty &
under-writeback pages in the address space of each inode, this can take
a significant amount of time. Add a possibility to reschedule after
processing each inode to avoid softlockups.

	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Jan Kara <jack@suse.cz>
	Signed-off-by: Christian Brauner <brauner@kernel.org>
(cherry picked from commit 66c14dc)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 9a6ebbd

With the lazytime mount option enabled, we can be switching many dirty
inodes to the parent cgroup on cgroup exit. The numbers observed in
practice when the systemd slice of a large cron job exits can easily
reach hundreds of thousands or millions. However, the logic in
inode_do_switch_wbs() that sorts the inode into the appropriate place in
the b_dirty list of the target wb has linear complexity in the number of
dirty inodes, so the overall time complexity of switching all the inodes
is quadratic. This leads to workers being pegged for hours, consuming
100% of the CPU while switching inodes to the parent wb.

Simple reproducer of the issue:
  FILES=10000
  # Filesystem mounted with lazytime mount option
  MNT=/mnt/
  echo "Creating files and switching timestamps"
  for (( j = 0; j < 50; j ++ )); do
      mkdir $MNT/dir$j
      for (( i = 0; i < $FILES; i++ )); do
          echo "foo" >$MNT/dir$j/file$i
      done
      touch -a -t 202501010000 $MNT/dir$j/file*
  done
  wait
  echo "Syncing and flushing"
  sync
  echo 3 >/proc/sys/vm/drop_caches

  echo "Reading all files from a cgroup"
  mkdir /sys/fs/cgroup/unified/mycg1 || exit
  echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit
  for (( j = 0; j < 50; j ++ )); do
      cat /mnt/dir$j/file* >/dev/null &
  done
  wait
  echo "Switching wbs"
  # Now rmdir the cgroup after the script exits

We need to maintain b_dirty list ordering to keep writeback happy, so
instead of sorting the inode into the appropriate place, just append it
at the end of the list and clobber dirtied_time_when. This may result in
inode writeback starting later after a cgroup switch; however, cgroup
switches are rare so it shouldn't matter much. Since the cgroup had
write access to the inode, there are no practical concerns about
possible DoS issues.
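
The difference between the two strategies can be sketched in userspace C. This is an illustrative model under stated assumptions, not the kernel implementation: `struct model_inode`, `struct model_dirty_list` and the `model_*` functions are hypothetical, the list here is ordered oldest to newest, and a single `dirtied_when` field stands in for the timestamps the kernel tracks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an inode and a time-ordered dirty list. */
struct model_inode {
	struct model_inode *next;
	unsigned long dirtied_when;
};

struct model_dirty_list {
	struct model_inode *head, *tail;
};

/* Old approach: walk the list to find the slot. Returns nodes walked. */
static size_t model_sorted_insert(struct model_dirty_list *l,
				  struct model_inode *inode)
{
	struct model_inode **pos = &l->head;
	size_t steps = 0;

	assert(inode != NULL);
	while (*pos && (*pos)->dirtied_when <= inode->dirtied_when) {
		pos = &(*pos)->next;
		steps++;
	}
	inode->next = *pos;
	*pos = inode;
	if (!inode->next)
		l->tail = inode;
	return steps;
}

/* New approach: overwrite the timestamp with "now" and append in O(1). */
static void model_append_clobber(struct model_dirty_list *l,
				 struct model_inode *inode, unsigned long now)
{
	assert(inode != NULL);
	inode->dirtied_when = now;
	inode->next = NULL;
	if (l->tail)
		l->tail->next = inode;
	else
		l->head = inode;
	l->tail = inode;
}

/* Ordering check: timestamps never decrease from head to tail. */
static bool model_is_sorted(const struct model_dirty_list *l)
{
	for (const struct model_inode *i = l->head; i && i->next; i = i->next)
		if (i->dirtied_when > i->next->dirtied_when)
			return false;
	return true;
}
```

Because the clobbered timestamp is the current time, it is not older than anything already on the list, so the append preserves the ordering without a walk. The cost is the one the commit message names: the switched inode looks more recently dirtied than it really is.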

	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Jan Kara <jack@suse.cz>
	Signed-off-by: Christian Brauner <brauner@kernel.org>
(cherry picked from commit 9a6ebbd)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 0cee64c

Add trace_inode_switch_wbs_queue tracepoint to allow insight into how
many inodes are queued to switch their bdi_writeback structure.

	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Jan Kara <jack@suse.cz>
	Signed-off-by: Christian Brauner <brauner@kernel.org>
(cherry picked from commit 0cee64c)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

jira SECO-535
bugfix: writeback softlockups
commit-author Jan Kara <jack@suse.cz>
commit 6689f01

inode_switch_wbs_work_fn() has a loop like:

  wb_get(new_wb);
  while (1) {
    list = llist_del_all(&new_wb->switch_wbs_ctxs);
    /* Nothing to do? */
    if (!list)
      break;
    ... process the items ...
  }

Adding items to the list looks like:

wb_queue_isw()
  if (llist_add(&isw->list, &wb->switch_wbs_ctxs))
    queue_work(isw_wq, &wb->switch_work);

Because inode_switch_wbs_work_fn() loops when processing isw items, it
can happen that wb->switch_work is pending while wb->switch_wbs_ctxs is
empty. This is a problem because in that case wb can get freed (no isw
items -> no wb reference) while the work is still pending causing
use-after-free issues.

We cannot just fix this by cancelling work when freeing wb because that
could still trigger problematic 0 -> 1 transitions on wb refcount due to
wb_get() in inode_switch_wbs_work_fn(). It could be all handled with
more careful code but that seems unnecessarily complex so let's avoid
that until it is proven that the looping actually brings practical
benefit. Just remove the loop from inode_switch_wbs_work_fn() instead.
That way when wb_queue_isw() queues work, we are guaranteed we have
added the first item to wb->switch_wbs_ctxs and nobody is going to
remove it (and drop the wb reference it holds) until the queued work
runs.
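
The hazard and the effect of removing the loop can be replayed with a small sequential model. Everything here is a hypothetical simplification: `struct model_wb` and the `model_*` functions are not kernel names, each queued item is assumed to hold exactly one wb reference, and the worker's own wb_get() is left out. The `late_items` parameter stands for items that arrive while the worker is already running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the commit message reasons about. */
struct model_wb {
	int queued_items;	/* items currently on the switch list */
	bool work_pending;	/* work item queued but not yet started */
	int refs;		/* wb references held by unprocessed items */
};

/* Adding the first item to an empty list queues the work. */
static void model_add_item(struct model_wb *wb)
{
	wb->refs++;
	if (wb->queued_items++ == 0)
		wb->work_pending = true;
}

/* Detach everything currently on the list; returns the batch size. */
static int model_take_all(struct model_wb *wb)
{
	int n = wb->queued_items;

	wb->queued_items = 0;
	return n;
}

/*
 * Run the worker once. Returns true if the hazardous state is reached:
 * work still pending while no item holds a reference on the wb.
 */
static bool model_run_worker(struct model_wb *wb, bool loop, int late_items)
{
	int batch;

	wb->work_pending = false;	/* the queued work starts executing */
	batch = model_take_all(wb);
	for (int i = 0; i < late_items; i++)
		model_add_item(wb);	/* list was empty, so work is re-queued */
	assert(wb->refs >= batch);
	wb->refs -= batch;		/* processed items drop their references */
	while (loop && (batch = model_take_all(wb)) > 0)
		wb->refs -= batch;	/* looping also consumes the late items */
	return wb->work_pending && wb->refs == 0;
}
```

With `loop` set, the late item is consumed by the running worker although its own work is still pending, which is the state the commit message describes. Without the loop, the late item stays on the list and keeps its reference until the pending work runs.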

Fixes: e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes")
CC: stable@vger.kernel.org
	Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260413093618.17244-2-jack@suse.cz
	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Christian Brauner <brauner@kernel.org>
(cherry picked from commit 6689f01)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

…h_wbs()

jira SECO-535
bugfix: writeback softlockups
commit-author Baokun Li <libaokun@linux.alibaba.com>
commit cba38ec

When a container exits, the following BUG_ON() is occasionally triggered:

==================================================================
 VFS: Busy inodes after unmount of sdb (ext4)
 ------------[ cut here ]------------
 kernel BUG at fs/super.c:695!
 CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1
 pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
 pc : generic_shutdown_super+0xf0/0x100
 lr : generic_shutdown_super+0xf0/0x100
 Call trace:
  generic_shutdown_super+0xf0/0x100
  kill_block_super+0x20/0x48
  ext4_kill_sb+0x28/0x60
  deactivate_locked_super+0x54/0x130
  deactivate_super+0x84/0xa0
  cleanup_mnt+0xa4/0x140
  __cleanup_mnt+0x18/0x28
  task_work_run+0x78/0xe0
  do_notify_resume+0x204/0x240
==================================================================

The root cause is a race between cgroup_writeback_umount() and
inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between
inode_prepare_wbs_switch() returning true and the subsequent
wb_queue_isw() call. The following sequence triggers the issue:

      CPU A (umount)           |          CPU B (writeback)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 inode_switch_wbs/cleanup_offline_cgwb
                                  atomic_inc(&isw_nr_in_flight)
                                  inode_prepare_wbs_switch
                                   -> passes SB_ACTIVE check
                                   __iget(inode)
 generic_shutdown_super
  sb->s_flags &= ~SB_ACTIVE
  cgroup_writeback_umount(sb)
   smp_mb()
   atomic_read(&isw_nr_in_flight)
   rcu_barrier()
    -> no pending RCU callbacks
   flush_workqueue(isw_wq)
    -> nothing queued, returns
  evict_inodes(sb)
   -> Inode skipped as isw still holds a ref.
  sop->put_super(sb)
   /* destroys percpu counters */
  -> VFS: Busy inodes after unmount!
                                  wb_queue_isw()
                                   queue_work(isw_wq, ...)
                                  /* later in work function */
                                  inode_switch_wbs_work_fn
                                   process_inode_switch_wbs
                                    iput() -> evict
                                     percpu_counter_dec() // UAF!

Fix this by extending the RCU read-side critical section in
inode_switch_wbs() and cleanup_offline_cgwb() to cover from
inode_prepare_wbs_switch() through wb_queue_isw().  Since there is
no sleep in this window, rcu_read_lock() can be used.  Then add a
synchronize_rcu() in cgroup_writeback_umount() before the existing
rcu_barrier(), so that all in-flight switchers that have passed the
SB_ACTIVE check have completed queue_work() before flush_workqueue()
is called.
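
The ordering that the fix enforces can be illustrated with a sequential model. It does not use RCU at all and is only a sketch under loud assumptions: `struct model_sb` and the `model_*` functions are hypothetical, one queued work item is assumed to pin exactly one inode, and "waiting for the window to close" is modeled by letting the in-flight switcher finish its queueing step before the flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sequential model; none of these names are kernel API. */
struct model_sb {
	bool active;			/* stands in for SB_ACTIVE */
	bool switcher_in_window;	/* between the check and the queueing */
	int held_inodes;		/* inodes pinned by in-flight switches */
	int queued_work;		/* switch work visible to a flush */
};

/* Switcher, first half: pass the active check and pin the inode. */
static bool model_prepare_switch(struct model_sb *sb)
{
	if (!sb->active)
		return false;
	sb->switcher_in_window = true;
	sb->held_inodes++;
	return true;
}

/* Switcher, second half: make the work visible and leave the window. */
static void model_queue_switch(struct model_sb *sb)
{
	sb->queued_work++;
	sb->switcher_in_window = false;
}

/* Flush: run all queued work; each item drops one inode reference. */
static void model_flush(struct model_sb *sb)
{
	assert(sb->held_inodes >= sb->queued_work);
	sb->held_inodes -= sb->queued_work;
	sb->queued_work = 0;
}

/* Umount; returns inodes still pinned at eviction time (0 is the goal). */
static int model_umount(struct model_sb *sb, bool wait_for_window)
{
	sb->active = false;
	if (wait_for_window && sb->switcher_in_window)
		model_queue_switch(sb);	/* waiting lets the switcher finish */
	model_flush(sb);
	return sb->held_inodes;
}
```

Without the wait, the flush finds nothing queued and one inode is still pinned at eviction, matching the "Busy inodes after unmount" path in the diagram. With the wait, the work is queued before the flush, and a switcher that starts later fails the active check.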

The existing rcu_barrier() is intentionally retained so this fix can
be backported unchanged to stable kernels (5.10.y, 6.6.y, ...) that
still queue switches via queue_rcu_work(). It is a no-op on current
mainline (since commit e1b849c ("writeback: Avoid contention on
wb->list_lock when switching inodes")) and is removed in a follow-up
patch.

Fixes: a1a0e23 ("writeback: flush inode cgroup wb switches instead of pinning super_block")
	Cc: stable@vger.kernel.org
	Suggested-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz
	Reviewed-by: Jan Kara <jack@suse.cz>
	Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Link: https://patch.msgid.link/20260521095016.2791354-2-libaokun@linux.alibaba.com
	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
(cherry picked from commit cba38ec)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

…unt()

jira SECO-535
bugfix: writeback softlockups
commit-author Baokun Li <libaokun@linux.alibaba.com>
commit e90a6d6

Commit e1b849c ("writeback: Avoid contention on wb->list_lock when
switching inodes") replaced the queue_rcu_work() based scheduling of
inode wb switches with a plain queue_work().  Since then no switcher
goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount()
has no callbacks of its own to wait for.  It still drains unrelated
call_rcu() callbacks from other subsystems on busy systems, which
incidentally slows umount down; drop it.

Fixes: e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes")
	Reviewed-by: Jan Kara <jack@suse.cz>
	Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Link: https://patch.msgid.link/20260521095016.2791354-3-libaokun@linux.alibaba.com
	Acked-by: Tejun Heo <tj@kernel.org>
	Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
(cherry picked from commit e90a6d6)
	Signed-off-by: Jonathan Maple <jmaple@ciq.com>

ciq-kernel-automation bot added the created-by-kernelci label Jun 26, 2026

Labels

created-by-kernelci Tag PRs that were automatically created when a user branch was pushed to the repo (kernelCI)
