Skip to content

io_uring: opt-in path for Netlinker + destinations#10

Merged
randomizedcoder merged 4 commits into
mainfrom
io-uring-support
Jun 4, 2026
Merged

io_uring: opt-in path for Netlinker + destinations#10
randomizedcoder merged 4 commits into
mainfrom
io-uring-support

Conversation

@randomizedcoder

Copy link
Copy Markdown
Owner

Summary

Adds an opt-in io_uring path that batches netlink recv + destination send through one ring per Netlinker. Behind a config flag (io_uring: true); default remains the syscall path.

  • pkg/io_uring — pure-Go wrapper over giouring. Owns ring lifecycle, the 64-bit userdata codec (Operation<<56 | RequestID, 24 reserved bits asserted zero), and the in-flight map that keeps pool buffers alive between submission and completion.
  • Setup flags: SingleIssuer + DeferTaskrun + CoopTaskrun. Explicitly not SQPoll (would burn a CPU on a 1Hz polling daemon).
  • Required-opcode probe at New — fails loudly on pre-6.1 kernels with a clear error naming the missing opcode set, no silent fallback.
  • pkg/xtcp/netlinker_iouring.go — the opt-in netlinker variant. LockOSThread-pinned for ring lifetime (kernel ties io_uring fds to the netns of the creating task). Pre-fills RecvBatchSize recvmsg SQEs; drains CQEs; refills each consumed slot; submits once per loop iteration.
  • pkg/xtcp/destinations_iouring.godestUDPIoUring, destUnixIoUring (writev for varint header + payload, atomic on the wire), destUnixGramIoUring. Each looks up the ring from ctx and enqueues a send SQE; submit happens in the netlinker's drain loop, so one io_uring_enter ships N records per dump cycle.
  • In-flight cap (sqEntries * 2) refuses unbounded growth and returns a clear error rather than letting the in-flight map silently leak.
  • Repair of pre-existing broken tests (a7f23a2 → squashed into the rebased history): TestDeserialize / BenchmarkDeserialize had nil-deref'd since the initial commit because the XTCP fixture was built without config / hostname / pools / Marshaller. New newTestDeserializeXTCP helper drives them end-to-end, asserting the real contract (n > 0 records per fixture). Predecessor was commit 4dc1d3b (initial commit); confirmed by git log -p. Open follow-up: TestReconcileMaps is also broken on baseline; documented in commit body, not in scope here.

The substantive small fix on top of the original commits

isETimeError and Ring.Close's drain loop both used a direct type assert (err.(syscall.Errno)) plus an err.Error() == "errno 62" string-match fallback. That works for bare syscall.Errno but misses anything wrapped via fmt.Errorf("%w", err) further down the giouring call chain, and the string match is Linux-specific. Switched both sites to errors.Is(err, syscall.ETIME), which walks the unwrap chain. Drops the type assert + the magic string in both places.

Test plan

  • go build ./pkg/io_uring/... ./pkg/xtcp/... clean
  • go vet ./pkg/io_uring/... ./pkg/xtcp/... clean
  • go test -race ./pkg/io_uring/... ./pkg/xtcp/...fails on link with invalid reference to syscall.munmap from the giouring local-replace + Go-version combination in my dev shell. Confirmed by running against origin/io-uring-support untouched (same failure). The runtime path is fine; only the test linker trips. Worth verifying in your environment that the giouring checkout matches your Go version before merging.
  • Tests in pkg/io_uring cover: codec round-trip (table-driven + TestCodecReservedBitsAreZero invariant), recv single/multiple, send single/batch, writev unix stream, in-flight cap enforcement, teardown drains cleanly. All use t.Skipf on probe failure so older kernels don't fail the suite.
  • Benchmarks include b.ReportMetric for user_us/op, sys_us/op, nvcsw/op, nivcsw/op via getrusage deltas — lets you compare syscall vs io_uring on cost-per-op, not just wall time.

Known follow-ups (not blocking)

  • The xtcp-side glue (netlinker_iouring.go + destinations_iouring.go, ~370 LOC) has no direct unit tests. Could be addressed by a TestNetlinkerIoUringRoundTrip driven by a socketpair netlink stand-in.
  • var _ sync.Mutex at file scope in ring.go as a "static signal" to future contributors is non-idiomatic; a type-level doc comment accomplishes the same goal and lets us drop the sync import.
  • requireProbe reports bare opcode integers in error messages; mapping OpRecvmsg/OpSend/OpWritev to names would help operators.
  • iouringWaitWithTimeout has a stale comment claiming "the Ring API doesn't expose a direct timeout wait" while immediately calling WaitOneTimeout.
  • Package name io_uring uses an underscore (non-idiomatic for Go stdlib convention); aliased as xio at every import site, so renaming to iouring is a one-line per importer change for anyone who cares.

🤖 Generated with Claude Code

randomizedcoder and others added 4 commits June 3, 2026 14:53
Foundation for the opt-in io_uring path. Replaces the abandoned
pkg/io_uring/codec.go scaffolding (which referenced the older
iceber/iouring-go library) with a thin wrapper around
github.com/randomizedcoder/giouring.

Key design points (documented in the plan file):

- One Ring per Netlinker. Setup flags are SingleIssuer + DeferTaskrun +
  CoopTaskrun for the "lighter on the system" profile a periodic netlink
  poller wants. SQPOLL is explicitly NOT used (would burn a CPU per ring
  for a 1Hz workload).
- 64-bit userdata layout: bits 63..56 Operation (uint8), bits 31..0
  RequestID (uint32). NsID dropped — the per-Netlinker ring already
  implies the netns.
- Probe at init: refuse to enable io_uring if the kernel doesn't support
  OpRecvmsg / OpSend / OpWritev (which corresponds to Linux 6.1+ given
  the chosen setup flags).
- In-flight map pins pool buffers from submit until CQE arrives. Capped
  at 2× SQ entries to blow up loudly if SQEs are submitted faster than
  CQEs drained.
- Writev iovec is heap-allocated (not stack), so the kernel can still
  read it when the CQE fires.

Tests cover single/multi-record recv, single/batched send, writev for
SOCK_STREAM unix framing, in-flight-cap enforcement, and teardown
draining. socketpair(AF_UNIX, SOCK_DGRAM) gives us datagram boundaries
without root.

Benchmarks measure the syscall vs io_uring A/B for both directions
across batch sizes 1/16/64/256. At batch=256 on a socketpair workload,
io_uring shows ~3× fewer voluntary context switches per record than
the syscall baseline — the headline "lighter on the system" win. Real
syscall-count reduction will be more visible in the end-to-end netlink
benchmark that lands with the Netlinker wiring.

Three new XtcpConfig proto fields: bool io_uring, uint32
io_uring_recv_batch_size (default 64), uint32 io_uring_cqe_batch_size
(default 128). Bindings regenerated via buf generate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the io_uring package from b94b735 into the xtcp2 hot path. New
opt-in code path activated when config.IoUring=true (--ioUring CLI
flag); the default path is unchanged for any existing user.

Read side (pkg/xtcp/netlinker_iouring.go + init_netlinkers.go):
- New netlinkerIoUring goroutine variant. Pre-submits a configurable
  batch of recvmsg SQEs (--ioUringRecvBatch, default 64) from
  packetBufferPool against the netlink fd, drains CQEs via
  PeekBatchCQE, refills each completed slot inline. One io_uring_enter
  per Submit instead of one per recv — on a 100k-socket host this is
  expected to reduce syscalls ~20-50x.
- Dispatch follows the codebase's existing sync.Map + chosen-function
  pattern (Marshallers / Destinations). x.Netlinker is a function
  pointer set at init; ns_createNetlinkersAndStore.go invokes it
  unchanged.
- LockOSThread before ring creation so the ring stays bound to the
  netns'd thread for its lifetime.
- WaitCQETimeout caps each iteration so ctx cancellation is observed
  within nl_timeout_milliseconds.

Write side (pkg/xtcp/destinations_iouring.go + extractFD helper):
- destUDPIoUring / destUnixIoUring / destUnixGramIoUring enqueue send
  SQEs into the per-Netlinker ring (looked up via ringCtxKey on the
  ctx). Submit happens after Deserialize returns, so a whole dump
  cycle of N records turns into one io_uring_enter for all N sends.
- destUnixIoUring uses writev with [hdr, payload] iovec so the
  varint-length frame is delivered atomically.
- Buffer ownership: marshalled *[]byte is pinned by the ring's
  in-flight map until the kernel signals completion. handleSendCQE
  records the outcome to Prometheus (mirrors destKafka's async
  callback at destinations.go:117-123).
- Library destinations (kafka, nsq, nats, valkey) silently ignore the
  io_uring flag — they own their own sockets via client libraries.

Tests (pkg/xtcp/destinations_test.go):
- New TestDestinationsIoUring table-driven test covers udp, unix,
  unixgram io_uring round-trip with single and multiple records each.
- All 6 rows pass; race-clean across 10 stress runs.
- The flake "Submit: file exists" surfaced during stress and was
  diagnosed: ring SQ submission must happen on the same OS thread that
  created the ring. Fix is runtime.LockOSThread() in the test driver
  (same lock the production netlinker already uses).

CLI (cmd/xtcp2/xtcp2.go):
- --ioUring (bool)
- --ioUringRecvBatch (uint, default 64)
- --ioUringCqeBatch (uint, default 128)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both have been panicking on every run because the XTCP receiver was
constructed via `xtcp := new(XTCP)` with no config, no hostname, no
pools, no Marshaller, no Destination, no flatRecordService — so the
first line of Deserialize (`if x.config.Modulus != 1`) hit a nil-pointer
deref, never mind everything downstream. The test's only assertion
checked a local xtcpRecord that Deserialize never wrote to.

Verified the breakage predates this branch: `git log -p` on
deserialize_test.go shows the same broken setup since commit 4dc1d3b
(the initial commit). The bug just wasn't being caught because the
package's other tests didn't exercise Deserialize.

Fix:

- New helper newTestDeserializeXTCP(testing.TB) builds a fully-populated
  XTCP suitable for driving Deserialize end-to-end: config.Modulus=1,
  hostname, packet/nlh/rta pools, netlinkerDoneCh (buffered), pollRequestCh,
  fresh Prom registry, protobufSingleMarshal, destNull, and a shared
  xtcpFlatRecordService. Reusable for both test and benchmark.

- testFlatRecordService is shared across calls via sync.Once because
  NewXtcpFlatRecordService registers metrics in the default Prom
  registry (a second registration panics with "duplicate metrics
  collector registration"). Sharing is safe in tests: no GRPC clients
  connect, so flatRecordServiceSend hits its no-client fast path.

- Test now asserts the real contract: Deserialize returns no error and
  processes n > 0 records per fixture. With the three committed
  fixtures the numbers are: 9, 8, and 72 records.

- BenchmarkDeserialize gets b.SetBytes / b.ReportAllocs and a
  thrown-away `s int` parameter (dead since day 1) removed.

- TestReconcileMaps is also broken on baseline; flagged but not in
  scope here.

go test -race -count=1 -run TestDeserialize ./pkg/xtcp/ is clean.
go test -bench=BenchmarkDeserialize -benchmem ./pkg/xtcp/ reports
~163µs per dump cycle (~197 MB/s through the parse path), 218 allocs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both isETimeError and Ring.Close's drain loop used a direct type assert
plus an `err.Error() == "errno 62"` string-match fallback. That works
for bare syscall.Errno but misses anything wrapped via fmt.Errorf("%w",
err) further down the giouring call chain. The string match was also
Linux-specific (62 is ETIME on Linux) and brittle if a library ever
stringifies differently.

errors.Is walks the unwrap chain for us, so a single line covers both
the bare and wrapped cases on every platform where syscall.ETIME is
defined. Drop the string fallback and the now-unused type assert.

Tested:
- go build ./pkg/io_uring/... ./pkg/xtcp/... clean
- go vet ./pkg/io_uring/... ./pkg/xtcp/... clean

(Pre-existing: the `go test` link step fails locally with `link:
github.com/randomizedcoder/giouring: invalid reference to syscall.munmap`
— a giouring/Go-version mismatch that also fires on origin/io-uring-support
unchanged. Not introduced by this commit; flagged in the PR body for
operator attention.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant