Skip to content

add pseudo header split#2

Open
nwochtma wants to merge 1 commit into
walking-machine:ixgbevf-libethfrom
nwochtma:larysa-check-2025-09-12
Open

add pseudo header split#2
nwochtma wants to merge 1 commit into
walking-machine:ixgbevf-libethfrom
nwochtma:larysa-check-2025-09-12

Conversation

@nwochtma

Copy link
Copy Markdown

No description provided.

@walking-machine walking-machine left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of nits in this review, but one thing that needs to be fixed is I think there would be memory leaks for larger packets, because you always alloc the same number of header buffers as data buffers, but if we have 2 data buffers per packet, we only need 8 new header buffers per 16 new data buffers, the remaining 8 would be leaked. There several solutions to such a problem:

  • If total packet size (READ_ONCE(adapter->netdev->mtu) + LIBETH_RX_LL_LEN) is 3K or greater, do not use pseudo header split, I prefer this option
  • do separate next_to_use and next_to_clean accounting for header buffer

Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf.h Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf.h Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
@alobakin

Copy link
Copy Markdown

A lot of nits in this review, but one thing that needs to be fixed is I think there would be memory leaks for larger packets, because you always alloc the same number of header buffers as data buffers, but if we have 2 data buffers per packet, we only need 8 new header buffers per 16 new data buffers, the remaining 8 would be leaked. There several solutions to such a problem:

The solution is to recycle the buffer to the page pool just like in idpf and ice (libeth_xdp_process_buff() does that by default when the size is zero).
I think it should be like that: W/A only under xdp->data, but process_buff() always when there's hdr_pp.

* If total packet size (`READ_ONCE(adapter->netdev->mtu) + LIBETH_RX_LL_LEN`) is 3K or greater, do not use pseudo header split, I prefer this option

You mean during the configuration or on a per-packet basis? The latter is not possible: buffers from the data FQ have a different layout when created with the hsplit flag, they lack headroom and tailroom.

The initial idea of this W/A is that: if we can't fine-tune the HW-writable buffer len using a 128-byte stride, then we always create header pools, so that data pools always have pow-2 buffers (almost the same as construct_skb() / "legacy-rx", but faster and more generic).

* do separate `next_to_use` and `next_to_clean` accounting for header buffer

Please don't :D

@nwochtma nwochtma force-pushed the larysa-check-2025-09-12 branch 5 times, most recently from 7e71a22 to 35f9f52 Compare September 29, 2025 14:23
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
#define IXGBEVF_RX_PAGE_LEN(hr) (ALIGN_DOWN(LIBETH_RX_PAGE_LEN(hr), \
IXGBE_SRRCTL_BSIZEPKT_STEP))

#define IXGBEVF_FLAG_PSEUDO_HSPLIT BIT(0)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just name it HSPLIT maybe. Both pseudo and normal hsplit won't differ too much.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we do not have header split, so it would be confusing

Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
@nwochtma nwochtma force-pushed the larysa-check-2025-09-12 branch from 35f9f52 to 68621c1 Compare October 2, 2025 12:38
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
@walking-machine

walking-machine commented Oct 2, 2025

Copy link
Copy Markdown
Owner

A lot of nits in this review, but one thing that needs to be fixed is I think there would be memory leaks for larger packets, because you always alloc the same number of header buffers as data buffers, but if we have 2 data buffers per packet, we only need 8 new header buffers per 16 new data buffers, the remaining 8 would be leaked. There several solutions to such a problem:

The solution is to recycle the buffer to the page pool just like in idpf and ice (libeth_xdp_process_buff() does that by default when the size is zero). I think it should be like that: W/A only under xdp->data, but process_buff() always when there's hdr_pp.

* If total packet size (`READ_ONCE(adapter->netdev->mtu) + LIBETH_RX_LL_LEN`) is 3K or greater, do not use pseudo header split, I prefer this option

You mean during the configuration or on a per-packet basis? The latter is not possible: buffers from the data FQ have a different layout when created with the hsplit flag, they lack headroom and tailroom.

When MTU + LL is greater than 3K, it behaves just like any other ixgbevf HW, so it would not be per-packet (see, I have used netdev->mtu), as every packet buffer would have headroom + 3K + tailroom. Like the only advantage of the newer HW here is the packet size filter, which obviously does not help in case of bigger MTU.

The initial idea of this W/A is that: if we can't fine-tune the HW-writable buffer len using a 128-byte stride, then we always create header pools, so that data pools always have pow-2 buffers (almost the same as construct_skb() / "legacy-rx", but faster and more generic).

* do separate `next_to_use` and `next_to_clean` accounting for header buffer

Please don't :D

@nwochtma nwochtma force-pushed the larysa-check-2025-09-12 branch from 68621c1 to fa3678e Compare October 3, 2025 12:35
Comment thread drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c Outdated
@nwochtma nwochtma force-pushed the larysa-check-2025-09-12 branch from fa3678e to 395ae75 Compare October 9, 2025 12:33

@walking-machine walking-machine left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks fine to me

struct page_pool *hdr_pp;

struct xdp_rxq_info xdp_rxq;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newlines not related to the patch

union ixgbe_adv_rx_desc *rx_desc;
struct libeth_fqe *rx_buffer;
struct libeth_fqe *hdr_buff;
unsigned int hdr_size = 0;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move it under if() now

hdr_size = ixgbevf_rx_hsplit_wa(hdr_buff,
rx_buffer,
size);
size -= hdr_size ? : size;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we cannot copy from the buffer, we zero its size, so it is dropped?

This patch introduces pseudo header split support in the ixgbevf
driver, specifically targeting ixgbe_mac_82599_vf.

On older hardware (e.g. ixgbe_mac_82599_vf), RX DMA write size can only be
limited in 1K increments. This causes issues when attempting to fit
multiple packets per page, as a DMA write may overwrite the
headroom of the next packet.

To address this, introduce pseudo header split support, where the hardware
copies the full L2 header into a dedicated header buffer. This avoids the
need for HR/TR alignment and allows safe skb construction from the header
buffer without risking overwrites.

Signed-off-by: Natalia Wochtman <natalia.wochtman@intel.com>
@nwochtma nwochtma force-pushed the larysa-check-2025-09-12 branch from 395ae75 to 5341a01 Compare October 13, 2025 07:41
walking-machine pushed a commit that referenced this pull request Jun 16, 2026
With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see:

   ======================================================
   WARNING: possible circular locking dependency detected
   7.0.0-debug+ #43 Tainted: G        W
   ------------------------------------------------------
   kswapd0/82 is trying to acquire lock:
   ffff800080ec3870 (reservation_ww_class_acquire){+.+.}-{0:0}, at: msm_gem_shrinker_scan+0x17c/0x400 [msm]

   but task is already holding lock:
   ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (fs_reclaim){+.+.}-{0:0}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          fs_reclaim_acquire+0xd0/0xf0
          dma_resv_lockdep+0x224/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}:
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          dma_resv_lockdep+0x1a8/0x348
          do_one_initcall+0x84/0x5d0
          do_initcalls+0x194/0x1d8
          kernel_init_freeable+0x128/0x180
          kernel_init+0x2c/0x160
          ret_from_fork+0x10/0x20

   -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}:
          check_prev_add+0x114/0x790
          validate_chain+0x594/0x6f0
          __lock_acquire+0x4d0/0xad0
          lock_acquire.part.0+0xc4/0x248
          lock_acquire+0x8c/0x248
          drm_gem_lru_scan+0x1ac/0x440
          msm_gem_shrinker_scan+0x17c/0x400 [msm]
          do_shrink_slab+0x150/0x4a0
          shrink_slab+0x144/0x460
          shrink_one+0x9c/0x1b0
          shrink_many+0x27c/0x5c0
          shrink_node+0x344/0x550
          balance_pgdat+0x2c0/0x988
          kswapd+0x11c/0x318
          kthread+0x10c/0x128
          ret_from_fork+0x10/0x20

   other info that might help us debug this:
   Chain exists of:
     reservation_ww_class_acquire --> reservation_ww_class_mutex --> fs_reclaim
    Possible unsafe locking scenario:
          CPU0                    CPU1
          ----                    ----
     lock(fs_reclaim);
                                  lock(reservation_ww_class_mutex);
                                  lock(fs_reclaim);
     lock(reservation_ww_class_acquire);

    *** DEADLOCK ***
   1 lock held by kswapd0/82:
    #0: ffffc31709b263b8 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x88/0x988

   stack backtrace:
   CPU: 4 UID: 0 PID: 82 Comm: kswapd0 Tainted: G        W           7.0.0-debug+ #43 PREEMPT(full)
   Tainted: [W]=WARN
   Hardware name: LENOVO 21BX0016US/21BX0016US, BIOS N3HET94W (1.66 ) 09/15/2025
   Call trace:
    show_stack+0x20/0x40 (C)
    dump_stack_lvl+0x9c/0xd0
    dump_stack+0x18/0x30
    print_circular_bug+0x114/0x120
    check_noncircular+0x178/0x198
    check_prev_add+0x114/0x790
    validate_chain+0x594/0x6f0
    __lock_acquire+0x4d0/0xad0
    lock_acquire.part.0+0xc4/0x248
    lock_acquire+0x8c/0x248
    drm_gem_lru_scan+0x1ac/0x440
    msm_gem_shrinker_scan+0x17c/0x400 [msm]
    do_shrink_slab+0x150/0x4a0
    shrink_slab+0x144/0x460
    shrink_one+0x9c/0x1b0
    shrink_many+0x27c/0x5c0
    shrink_node+0x344/0x550
    balance_pgdat+0x2c0/0x988
    kswapd+0x11c/0x318
    kthread+0x10c/0x128
    ret_from_fork+0x10/0x20

kswapd0 holding fs_reclaim calls the MSM shrinker, which calls
dma_resv_lock. This in turn acquires fs_reclaim.

Fix this deadlock by using dma_resv_trylock() instead, dropping the
subsequently unused passed wait-wound lock 'ticket'.

Cc: stable@vger.kernel.org
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Fixes: fe4952b ("drm/msm: Convert vm locking")
Patchwork: https://patchwork.freedesktop.org/patch/723564/
Message-ID: <20260508065722.18785-1-daniel@quora.org>
[rob: fixup compile errors, replace lockdep splat with something legible]
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
walking-machine pushed a commit that referenced this pull request Jun 16, 2026
Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).

Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:

* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.

* FDB port flushing: Only requires the FDB lock.

* Multicast attributes: Only require the multicast lock.

* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.

* Backup port: Only requires RTNL. Read locklessly by the data path.

This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].

Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.

[1]
BUG: sleeping function called from invalid context at net/core/dev_addr_lists.c:1262
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 372, name: bash
preempt_count: 201, expected: 0
RCU nest depth: 0, expected: 0
5 locks held by bash/372:
#0: ffff88810c51c3f0 (sb_writers#7){.+.+}-{0:0}, at: ksys_write (fs/read_write.c:740)
#1: ffff888115ce9480 (&of->mutex){+.+.}-{4:4}, at: kernfs_fop_write_iter (fs/kernfs/file.c:343)
#2: ffff88810b9fd330 (kn->active#37){.+.+}-{0:0}, at: kernfs_fop_write_iter (fs/kernfs/file.c:80 fs/kernfs/file.c:344)
#3: ffffffffa59473a0 (rtnl_mutex){+.+.}-{4:4}, at: brport_store (net/bridge/br_sysfs_if.c:326)
#4: ffff8881099d2d58 (&br->lock){+...}-{3:3}, at: brport_store (./include/linux/spinlock.h:348 net/bridge/br_sysfs_if.c:345)
Preemption disabled at:
 0x0
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:94 lib/dump_stack.c:120)
__might_resched.cold (kernel/sched/core.c:9163)
netif_rx_mode_run (net/core/dev_addr_lists.c:1262)
netif_rx_mode_sync (net/core/dev_addr_lists.c:1428)
dev_set_promiscuity (net/core/dev_api.c:289)
br_manage_promisc (net/bridge/br_if.c:135 net/bridge/br_if.c:172)
br_port_flags_change (net/bridge/br_if.c:242 net/bridge/br_if.c:747)
store_learning (net/bridge/br_sysfs_if.c:79 net/bridge/br_sysfs_if.c:235)
brport_store (net/bridge/br_sysfs_if.c:346)
kernfs_fop_write_iter (fs/kernfs/file.c:352)
new_sync_write (fs/read_write.c:595)
vfs_write (fs/read_write.c:688)
ksys_write (fs/read_write.c:740)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:121)

Fixes: 78cd408 ("net: add missing instance lock to dev_set_promiscuity")
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526064818.272516-3-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
walking-machine pushed a commit that referenced this pull request Jun 16, 2026
Ido Schimmel says:

====================
bridge: Fix sleep in atomic context

Under certain circumstances the bridge driver can call
dev_set_promiscuity() while holding the bridge spin lock. This is a
problem as dev_set_promiscuity() might sleep.

Patches #1-#2 fix the problem in the netlink and sysfs configuration
paths by only taking the lock where it is actually needed, thereby
avoiding calling dev_set_promiscuity() from an atomic context.

Patch #3 adds test cases for both configuration paths in rtnetlink.sh
which already includes test cases for similar issues.

Note that dev_set_promiscuity() can sleep either when it takes the net
device mutex or when calling netif_rx_mode_sync(). I encountered the
problem with the latter, but blamed the former since it came earlier.
====================

Link: https://patch.msgid.link/20260526064818.272516-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
walking-machine pushed a commit that referenced this pull request Jun 16, 2026
…ernel/git/ath/ath

Jeff Johnson says:
==================
ath.git patches for v7.2 (PR #2)

For ath12k:
- Add thermal throttling and cooling device support
- Add support for handling incumbent signal interference in 6 GHz
- Add support for channel 177 in the 5 GHz band

In addition, a large number of cleanup and minor bug fixing across
all supported drivers.
==================

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
walking-machine pushed a commit that referenced this pull request Jun 25, 2026
The xe driver keeps track of whether to probe display, and whether
display hardware is there, using xe->info.probe_display. It gets set to
false if there's no display after intel_display_device_probe(). However,
the display may also be disabled via fuses, detected at a later time in
intel_display_device_info_runtime_init().

In this case, the xe driver does for_each_intel_crtc() on uninitialized
mode config in xe_display_flush_cleanup_work(), leading to a NULL
pointer dereference, and generally calls display code with display info
cleared.

Check for intel_display_device_present() after
intel_display_device_info_runtime_init(), and reset
xe->info.probe_display as necessary. Also do unset_display_features()
for completeness, although display runtime init has already done
that. This will need to be unified across all cases later.

Move intel_display_device_info_runtime_init() call slightly earlier,
similar to i915, to avoid a bunch of unnecessary setup for no display
cases.

Note #1: The xe driver has no business doing low level display plumbing
like for_each_intel_crtc() to begin with. It all needs to happen in
display code.

Note #2: The actual bug is present already in commit 44e6949
("drm/xe/display: Implement display support"), but the oops was likely
introduced later at commit ddf6492 ("drm/xe/display: Make display
suspend/resume work on discrete").

Fixes: 44e6949 ("drm/xe/display: Implement display support")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/7904
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/work_items/6150
Cc: stable@vger.kernel.org # v6.8+
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Link: https://patch.msgid.link/20260515160920.1082842-1-jani.nikula@intel.com
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit 7c3eb9f)
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
walking-machine pushed a commit that referenced this pull request Jun 25, 2026
Revisit and reorganize the locking and lock coverage of the
ap->lock spinlock as used in the two sysfs functions
se_bind_store() and se_associate_store().

A kernel run reported a possible deadlock situation, caused by
holding the spinlock (ap->lock) while triggering a uevent.
The fix rearranges the code protected by the spinlock by excluding
the uevent invocation, which does not require protection.

Additionally, the start of the protected region is moved earlier
to cover more lines, ensuring a consistent view of the AP queue
state between reading and updating its struct fields.

	=====================================================
	WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
	7.1.0-20260601.rc6.git12.516b5dbd4d4a.300.fc44.s390x+debug #1 Not tainted
	-----------------------------------------------------
	setupseguest.sh/11034 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
	000001c991f498e8 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x5a/0x6d0
	and this task is already holding:
	000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
	which would create a new lock dependency:
	 (&aq->lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}
	but this new dependency connects a SOFTIRQ-irq-safe lock:
	 (&aq->lock){+.-.}-{2:2}
	... which became SOFTIRQ-irq-safe at:
	  __lock_acquire+0x5ae/0x15a0
	  lock_acquire+0x14c/0x400
	  _raw_spin_lock_bh+0x58/0xb0
	  ap_tasklet_fn+0x72/0xd0
	  tasklet_action_common+0x174/0x1b0
	  handle_softirqs+0x180/0x5c0
	  irq_exit_rcu+0x196/0x200
	  do_ext_irq+0x12a/0x4d0
	  ext_int_handler+0xc6/0xf0
	  folio_zero_user+0x1c6/0x240
	  folio_zero_user+0x182/0x240
	  vma_alloc_anon_folio_pmd+0xa0/0x1d0
	  __do_huge_pmd_anonymous_page+0x3a/0x200
	  __handle_mm_fault+0x56c/0x590
	  handle_mm_fault+0xa2/0x370
	  do_exception+0x292/0x590
	  __do_pgm_check+0x136/0x3e0
	  pgm_check_handler+0x114/0x160
	to a SOFTIRQ-irq-unsafe lock:
	 (fs_reclaim){+.+.}-{0:0}
	... which became SOFTIRQ-irq-unsafe at:
	...
	  __lock_acquire+0x5ae/0x15a0
	  lock_acquire+0x14c/0x400
	  __fs_reclaim_acquire+0x44/0x50
	  fs_reclaim_acquire+0xbe/0x100
	  fs_reclaim_correct_nesting+0x20/0x70
	  dotest+0x5e/0x148
	  locking_selftest+0x2854/0x2a88
	  start_kernel+0x3b2/0x4f0
	  startup_continue+0x2e/0x40
	other info that might help us debug this:
	 Possible interrupt unsafe locking scenario:
	       CPU0                    CPU1
	       ----                    ----
	  lock(fs_reclaim);
				       local_irq_disable();
				       lock(&aq->lock);
				       lock(fs_reclaim);
	  <Interrupt>
	    lock(&aq->lock);
	 *** DEADLOCK ***
	4 locks held by setupseguest.sh/11034:
	 #0: 000000c485d01440 (sb_writers#4){.+.+}-{0:0}, at: vfs_write+0x2fc/0x380
	 #1: 000000c4d2283288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x12a0x270
	 #2: 000000c4a1830e48 (kn->active#172){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x1e/0x270
	 #3: 000000c4a1a12378 (&aq->lock){+.-.}-{2:2}, at: se_bind_store+0x96/0x3a0
	the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
	-> (&aq->lock){+.-.}-{2:2} {
	   HARDIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    _raw_spin_lock_bh+0x58/0xb0
			    ap_queue_init_state+0x2e/0x50
			    ap_scan_domains+0x5d6/0x620
			    ap_scan_adapter+0x4c0/0x810
			    ap_scan_bus+0x70/0x350
			    ap_scan_bus_wq_callback+0x56/0x80
			    process_one_work+0x2ba/0x820
			    worker_thread+0x21a/0x400
			    kthread+0x164/0x190
			    __ret_from_fork+0x4c/0x340
			    ret_from_fork+0xa/0x30
	   IN-SOFTIRQ-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    _raw_spin_lock_bh+0x58/0xb0
			    ap_tasklet_fn+0x72/0xd0
			    tasklet_action_common+0x174/0x1b0
			    handle_softirqs+0x180/0x5c0
			    irq_exit_rcu+0x196/0x200
			    do_ext_irq+0x12a/0x4d0
			    ext_int_handler+0xc6/0xf0
			    folio_zero_user+0x1c6/0x240
			    folio_zero_user+0x182/0x240
			    vma_alloc_anon_folio_pmd+0xa0/0x1d0
			    __do_huge_pmd_anonymous_page+0x3a/0x200
			    __handle_mm_fault+0x56c/0x590
			    handle_mm_fault+0xa2/0x370
			    do_exception+0x292/0x590
			    __do_pgm_check+0x136/0x3e0
			    pgm_check_handler+0x114/0x160
	   INITIAL USE at:
			   __lock_acquire+0x5ae/0x15a0
			   lock_acquire+0x14c/0x400
			   _raw_spin_lock_bh+0x58/0xb0
			   ap_queue_init_state+0x2e/0x50
			   ap_scan_domains+0x5d6/0x620
			   ap_scan_adapter+0x4c0/0x810
			   ap_scan_bus+0x70/0x350
			   ap_scan_bus_wq_callback+0x56/0x80
			   process_one_work+0x2ba/0x820
			   worker_thread+0x21a/0x400
			   kthread+0x164/0x190
			   __ret_from_fork+0x4c/0x340
			   ret_from_fork+0xa/0x30
	 }
	 ... key      at: [<000001c9936e8aa0>] __key.7+0x0/0x10
	the dependencies between the lock to be acquired
	 and SOFTIRQ-irq-unsafe lock:
	-> (fs_reclaim){+.+.}-{0:0} {
	   HARDIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    __fs_reclaim_acquire+0x44/0x50
			    fs_reclaim_acquire+0xbe/0x100
			    fs_reclaim_correct_nesting+0x20/0x70
			    dotest+0x5e/0x148
			    locking_selftest+0x2854/0x2a88
			    start_kernel+0x3b2/0x4f0
			    startup_continue+0x2e/0x40
	   SOFTIRQ-ON-W at:
			    __lock_acquire+0x5ae/0x15a0
			    lock_acquire+0x14c/0x400
			    __fs_reclaim_acquire+0x44/0x50
			    fs_reclaim_acquire+0xbe/0x100
			    fs_reclaim_correct_nesting+0x20/0x70
			    dotest+0x5e/0x148
			    locking_selftest+0x2854/0x2a88
			    start_kernel+0x3b2/0x4f0
			    startup_continue+0x2e/0x40
	   INITIAL USE at:
			   __lock_acquire+0x5ae/0x15a0
			   lock_acquire+0x14c/0x400
			   __fs_reclaim_acquire+0x44/0x50
			   fs_reclaim_acquire+0xbe/0x100
			   fs_reclaim_correct_nesting+0x20/0x70
			   dotest+0x5e/0x148
			   locking_selftest+0x2854/0x2a88
			   start_kernel+0x3b2/0x4f0
			   startup_continue+0x2e/0x40
	 }
	 ... key      at: [<000001c991f498e8>] __fs_reclaim_map+0x0/0x30
	 ... acquired at:
	   check_prev_add+0x178/0xf40
	   __lock_acquire+0x12aa/0x15a0
	   lock_acquire+0x14c/0x400
	   __fs_reclaim_acquire+0x44/0x50
	   fs_reclaim_acquire+0xbe/0x100
	   __kmalloc_cache_noprof+0x5a/0x6d0
	   kobject_uevent_env+0xd4/0x420
	   ap_send_se_bind_uevent+0x48/0x70
	   se_bind_store+0x146/0x3a0
	   kernfs_fop_write_iter+0x18c/0x270
	   vfs_write+0x23c/0x380
	   ksys_write+0x88/0x120
	   __do_syscall+0x170/0x750
	   system_call+0x72/0x90
	stack backtrace:
	CPU: 6 UID: 0 PID: 11034 Comm: setupseguest.sh Not tainted 7.1.0-20260601.rc6.git2.516b5dbd4d4a.300.fc44.s390x+debug #1 PREEMPT
	Hardware name: IBM 9175 ME1 701 (KVM/Linux)
	Call Trace:
	 [<000001c98ffa0a7e>] dump_stack_lvl+0xae/0x108
	 [<000001c9900a6d7a>] print_bad_irq_dependency+0x47a/0x480
	 [<000001c9900a7184>] check_irq_usage+0x404/0x4c0
	 [<000001c9900a73b8>] check_prev_add+0x178/0xf40
	 [<000001c9900aaf1a>] __lock_acquire+0x12aa/0x15a0
	 [<000001c9900ab35c>] lock_acquire+0x14c/0x400
	 [<000001c9903be454>] __fs_reclaim_acquire+0x44/0x50
	 [<000001c9903be51e>] fs_reclaim_acquire+0xbe/0x100
	 [<000001c9903cf4ca>] __kmalloc_cache_noprof+0x5a/0x6d0
	 [<000001c9910ca9d4>] kobject_uevent_env+0xd4/0x420
	 [<000001c990d84098>] ap_send_se_bind_uevent+0x48/0x70
	 [<000001c990d87416>] se_bind_store+0x146/0x3a0
	 [<000001c99057da7c>] kernfs_fop_write_iter+0x18c/0x270
	 [<000001c99047712c>] vfs_write+0x23c/0x380
	 [<000001c990477438>] ksys_write+0x88/0x120
	 [<000001c9910f64e0>] __do_syscall+0x170/0x750
	 [<000001c99110a412>] system_call+0x72/0x90
	INFO: lockdep is turned off.

Fixes: 4179c39 ("s390/ap: Implement SE bind and associate uevents")
Reported-by: Ingo Franzki <ifranzki@linux.ibm.com>
Suggested-by: Finn Callies <fcallies@linux.ibm.com>
Reviewed-by: Finn Callies <fcallies@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
walking-machine pushed a commit that referenced this pull request Jun 25, 2026
tl;dr: Use stop_machine() and a state machine based on the
"MULTI_STOP" pattern to implement core TDX module update logic.

Long version:

TDX module updates require careful synchronization with other TDX
operations. The requirements are (#1/#2 reflect current behavior that
must be preserved):

1. SEAMCALLs need to be callable from both process and IRQ contexts.
2. SEAMCALLs need to be able to run concurrently across CPUs
3. During updates, only update-related SEAMCALLs are permitted; all
   other SEAMCALLs shouldn't be called.
4. During updates, all online CPUs must participate in the update work.

No single lock primitive satisfies all requirements. For instance,
rwlock_t handles #1/#2 but fails #4: CPUs spinning with IRQs disabled
cannot be directed to perform update work.

Use stop_machine() as it is the only well-understood mechanism that can
meet all requirements.

And TDX module updates consist of several steps (See Intel Trust Domain
Extensions (Intel TDX) Module Base Architecture Specification, Chapter
"TD-Preserving TDX module Update"). Ordering requirements between steps
mandate lockstep synchronization across all CPUs.

multi_cpu_stop() provides a good example of executing a multi-step task
in lockstep across CPUs, but it does not synchronize the individual
steps inside the callback itself.

Implement a similar state machine as the skeleton for TDX module
updates. Each state represents one step in the update flow, and the
state advances only after all CPUs acknowledge completion of the current
step. This acknowledgment mechanism provides the required lockstep
execution.

The update flow is intentionally simpler than multi_cpu_stop() in two ways:

  a) use a spinlock to protect the control data instead of atomic_t and
     explicit memory barriers.

  b) omit touch_nmi_watchdog() and rcu_momentary_eqs(), which exist
     there for debugging and are not strictly needed for this update flow

Potential alternative to stop_machine()
=======================================
An alternative approach is to lock all KVM entry points and kick all
vCPUs. Here, KVM entry points refer to KVM VM/vCPU ioctl entry points,
implemented in KVM common code (virt/kvm). Adding a locking mechanism
there would affect all architectures KVM supports. And to lock only TDX
vCPUs, new logic would be needed to identify TDX vCPUs, which the KVM
common code currently lacks. This would add significant complexity and
maintenance overhead to KVM for this TDX-specific use case, so don't take
this approach.

[ dhansen: normal changelog/style munging ]

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://patch.msgid.link/20260520133909.409394-15-chao.gao@intel.com
walking-machine pushed a commit that referenced this pull request Jun 25, 2026
…ted-traffic'

Ido Schimmel says:

====================
ipv6: Honor oif when choosing nexthop for locally generated traffic

Patch #1 is a preparation patch following the comment from Sashiko on
v2. See details in the commit message.

Patch #2 aligns IPv6 with IPv4 and changes IPv6 route lookup to prefer a
nexthop whose nexthop device matches the specified oif.

Patch #3 adds a selftest.
====================

Link: https://patch.msgid.link/20260611154605.992528-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants