[rlc-10/6.12.0-211.26.1.el10_2] Multiple patches tested (7 commits)#1384
Open
ciq-kernel-automation[bot] wants to merge 7 commits into
Open
[rlc-10/6.12.0-211.26.1.el10_2] Multiple patches tested (7 commits)#1384ciq-kernel-automation[bot] wants to merge 7 commits into
ciq-kernel-automation[bot] wants to merge 7 commits into
Conversation
jira jira SECO-535 bugfix: writeback softlockups commit-author Jan Kara <jack@suse.cz> commit e1b849c upstream-diff | Due to the change in bdi_writeback it propagates a kabi breakage through every pointer version of this and backing_dev_info we have to use RH_KABI_EXTEND() on bdi_writeback to prevent the CRC miscalculation. There can be multiple inode switch works that are trying to switch inodes to / from the same wb. This can happen in particular if some cgroup exits which owns many (thousands) inodes and we need to switch them all. In this case several inode_switch_wbs_work_fn() instances will be just spinning on the same wb->list_lock while only one of them makes forward progress. This wastes CPU cycles and quickly leads to softlockup reports and unusable system. Instead of running several inode_switch_wbs_work_fn() instances in parallel switching to the same wb and contending on wb->list_lock, run just one work item per wb and manage a queue of isw items switching to this wb. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> (cherry picked from commit e1b849c) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira SECO-535 bugfix: writeback softlockups commit-author Jan Kara <jack@suse.cz> commit 66c14dc process_inode_switch_wbs_work() can be switching over 100 inodes to a different cgroup. Since switching an inode requires counting all dirty & under-writeback pages in the address space of each inode, this can take a significant amount of time. Add a possibility to reschedule after processing each inode to avoid softlockups. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> (cherry picked from commit 66c14dc) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira SECO-535 bugfix: writeback softlockups commit-author Jan Kara <jack@suse.cz> commit 9a6ebbd With lazytime mount option enabled we can be switching many dirty inodes on cgroup exit to the parent cgroup. The numbers observed in practice when systemd slice of a large cron job exits can easily reach hundreds of thousands or millions. The logic in inode_do_switch_wbs() which sorts the inode into appropriate place in b_dirty list of the target wb however has linear complexity in the number of dirty inodes thus overall time complexity of switching all the inodes is quadratic leading to workers being pegged for hours consuming 100% of the CPU and switching inodes to the parent wb. Simple reproducer of the issue: FILES=10000 # Filesystem mounted with lazytime mount option MNT=/mnt/ echo "Creating files and switching timestamps" for (( j = 0; j < 50; j ++ )); do mkdir $MNT/dir$j for (( i = 0; i < $FILES; i++ )); do echo "foo" >$MNT/dir$j/file$i done touch -a -t 202501010000 $MNT/dir$j/file* done wait echo "Syncing and flushing" sync echo 3 >/proc/sys/vm/drop_caches echo "Reading all files from a cgroup" mkdir /sys/fs/cgroup/unified/mycg1 || exit echo $$ >/sys/fs/cgroup/unified/mycg1/cgroup.procs || exit for (( j = 0; j < 50; j ++ )); do cat /mnt/dir$j/file* >/dev/null & done wait echo "Switching wbs" # Now rmdir the cgroup after the script exits We need to maintain b_dirty list ordering to keep writeback happy so instead of sorting inode into appropriate place just append it at the end of the list and clobber dirtied_time_when. This may result in inode writeback starting later after cgroup switch however cgroup switches are rare so it shouldn't matter much. Since the cgroup had write access to the inode, there are no practical concerns of the possible DoS issues. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> (cherry picked from commit 9a6ebbd) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira SECO-535 bugfix: writeback softlockups commit-author Jan Kara <jack@suse.cz> commit 0cee64c Add trace_inode_switch_wbs_queue tracepoint to allow insight into how many inodes are queued to switch their bdi_writeback structure. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> (cherry picked from commit 0cee64c) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
jira SECO-535 bugfix: writeback softlockups commit-author Jan Kara <jack@suse.cz> commit 6689f01 inode_switch_wbs_work_fn() has a loop like: wb_get(new_wb); while (1) { list = llist_del_all(&new_wb->switch_wbs_ctxs); /* Nothing to do? */ if (!list) break; ... process the items ... } Now adding of items to the list looks like: wb_queue_isw() if (llist_add(&isw->list, &wb->switch_wbs_ctxs)) queue_work(isw_wq, &wb->switch_work); Because inode_switch_wbs_work_fn() loops when processing isw items, it can happen that wb->switch_work is pending while wb->switch_wbs_ctxs is empty. This is a problem because in that case wb can get freed (no isw items -> no wb reference) while the work is still pending causing use-after-free issues. We cannot just fix this by cancelling work when freeing wb because that could still trigger problematic 0 -> 1 transitions on wb refcount due to wb_get() in inode_switch_wbs_work_fn(). It could be all handled with more careful code but that seems unnecessarily complex so let's avoid that until it is proven that the looping actually brings practical benefit. Just remove the loop from inode_switch_wbs_work_fn() instead. That way when wb_queue_isw() queues work, we are guaranteed we have added the first item to wb->switch_wbs_ctxs and nobody is going to remove it (and drop the wb reference it holds) until the queued work runs. Fixes: e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20260413093618.17244-2-jack@suse.cz Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> (cherry picked from commit 6689f01) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
…h_wbs() jira SECO-535 bugfix: writeback softlockups commit-author Baokun Li <libaokun@linux.alibaba.com> commit cba38ec When a container exits, the following BUG_ON() is occasionally triggered: ================================================================== VFS: Busy inodes after unmount of sdb (ext4) ------------[ cut here ]------------ kernel BUG at fs/super.c:695! CPU: 3 PID: 6 Comm: containerd-shim Tainted: G OE K 6.6 #1 pstate: 63400009 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) pc : generic_shutdown_super+0xf0/0x100 lr : generic_shutdown_super+0xf0/0x100 Call trace: generic_shutdown_super+0xf0/0x100 kill_block_super+0x20/0x48 ext4_kill_sb+0x28/0x60 deactivate_locked_super+0x54/0x130 deactivate_super+0x84/0xa0 cleanup_mnt+0xa4/0x140 __cleanup_mnt+0x18/0x28 task_work_run+0x78/0xe0 do_notify_resume+0x204/0x240 ================================================================== The root cause is a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb(). There is a window between inode_prepare_wbs_switch() returning true and the subsequent wb_queue_isw() call. Following is the process that triggers the issue: CPU A (umount) | CPU B (writeback) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ inode_switch_wbs/cleanup_offline_cgwb atomic_inc(&isw_nr_in_flight) inode_prepare_wbs_switch -> passes SB_ACTIVE check __iget(inode) generic_shutdown_super sb->s_flags &= ~SB_ACTIVE cgroup_writeback_umount(sb) smp_mb() atomic_read(&isw_nr_in_flight) rcu_barrier() -> no pending RCU callbacks flush_workqueue(isw_wq) -> nothing queued, returns evict_inodes(sb) -> Inode skipped as isw still holds a ref. sop->put_super(sb) /* destroys percpu counters */ -> VFS: Busy inodes after unmount! wb_queue_isw() queue_work(isw_wq, ...) /* later in work function */ inode_switch_wbs_work_fn process_inode_switch_wbs iput() -> evict percpu_counter_dec() // UAF! Fix this by extending the RCU read-side critical section in inode_switch_wbs() and cleanup_offline_cgwb() to cover from inode_prepare_wbs_switch() through wb_queue_isw(). Since there is no sleep in this window, rcu_read_lock() can be used. Then add a synchronize_rcu() in cgroup_writeback_umount() before the existing rcu_barrier(), so that all in-flight switchers that have passed the SB_ACTIVE check have completed queue_work() before flush_workqueue() is called. The existing rcu_barrier() is intentionally retained so this fix can be backported unchanged to stable kernels (5.10.y, 6.6.y, ...) that still queue switches via queue_rcu_work(). It is a no-op on current mainline (since commit e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes")) and is removed in a follow-up patch. Fixes: a1a0e23 ("writeback: flush inode cgroup wb switches instead of pinning super_block") Cc: stable@vger.kernel.org Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/all/mxnjq2l6guusfchvauxr3v7c4bwjasybxlleqbbh4efloeqspz@iqylk76ohufz Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260521095016.2791354-2-libaokun@linux.alibaba.com Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> (cherry picked from commit cba38ec) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
…unt() jira SECO-535 bugfix: writeback softlockups commit-author Baokun Li <libaokun@linux.alibaba.com> commit e90a6d6 Commit e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes") replaced the queue_rcu_work() based scheduling of inode wb switches with a plain queue_work(). Since then no switcher goes through call_rcu(), so rcu_barrier() in cgroup_writeback_umount() has no callbacks of its own to wait for. It still drains unrelated call_rcu() callbacks from other subsystems on busy systems, which incidentally slows umount down; drop it. Fixes: e1b849c ("writeback: Avoid contention on wb->list_lock when switching inodes") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun@linux.alibaba.com> Link: https://patch.msgid.link/20260521095016.2791354-3-libaokun@linux.alibaba.com Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> (cherry picked from commit e90a6d6) Signed-off-by: Jonathan Maple <jmaple@ciq.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR has been automatically created after successful completion of all CI stages.
Commit Message(s)
Test Results
✅ Build Stage
✅ Boot Verification
✅ Kernel Selftests
✅ LTP Results
🤖 This PR was automatically generated by GitHub Actions
Run ID: 28248930605