io_uring: opt-in path for Netlinker + destinations #10
Merged
Foundation for the opt-in io_uring path. Replaces the abandoned pkg/io_uring/codec.go scaffolding (which referenced the older iceber/iouring-go library) with a thin wrapper around github.com/randomizedcoder/giouring.

Key design points (documented in the plan file):
- One Ring per Netlinker. Setup flags are SingleIssuer + DeferTaskrun + CoopTaskrun for the "lighter on the system" profile a periodic netlink poller wants. SQPOLL is explicitly NOT used (would burn a CPU per ring for a 1Hz workload).
- 64-bit userdata layout: bits 63..56 Operation (uint8), bits 31..0 RequestID (uint32). NsID dropped: the per-Netlinker ring already implies the netns.
- Probe at init: refuse to enable io_uring if the kernel doesn't support OpRecvmsg / OpSend / OpWritev (which corresponds to Linux 6.1+ given the chosen setup flags).
- In-flight map pins pool buffers from submit until CQE arrives. Capped at 2× SQ entries to blow up loudly if SQEs are submitted faster than CQEs drained.
- Writev iovec is heap-allocated (not stack), so the kernel can still read it when the CQE fires.

Tests cover single/multi-record recv, single/batched send, writev for SOCK_STREAM unix framing, in-flight-cap enforcement, and teardown draining. socketpair(AF_UNIX, SOCK_DGRAM) gives us datagram boundaries without root.

Benchmarks measure the syscall vs io_uring A/B for both directions across batch sizes 1/16/64/256. At batch=256 on a socketpair workload, io_uring shows ~3× fewer voluntary context switches per record than the syscall baseline, the headline "lighter on the system" win. Real syscall-count reduction will be more visible in the end-to-end netlink benchmark that lands with the Netlinker wiring.

Three new XtcpConfig proto fields: bool io_uring, uint32 io_uring_recv_batch_size (default 64), uint32 io_uring_cqe_batch_size (default 128). Bindings regenerated via buf generate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
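The userdata layout described in this commit message can be illustrated with a small sketch. This is not the code in pkg/io_uring: the names `Encode`, `Decode`, `ErrReservedBits`, and the opcode values `OpRecv` / `OpSend` are assumptions made for illustration. Only the bit positions (Operation in bits 63..56, RequestID in bits 31..0, the 24 bits between them reserved and zero) come from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// Operation identifies the kind of request. The concrete values below are
// hypothetical; the real package defines its own set.
type Operation uint8

const (
	OpRecv Operation = 1
	OpSend Operation = 2
)

const (
	opShift      = 56                       // Operation lives in bits 63..56
	reservedMask = uint64(0x00FFFFFF) << 32 // bits 55..32 must stay zero
)

// ErrReservedBits is returned when a decoded value has reserved bits set.
var ErrReservedBits = errors.New("userdata: reserved bits 55..32 are not zero")

// Encode packs op into bits 63..56 and id into bits 31..0.
func Encode(op Operation, id uint32) uint64 {
	return uint64(op)<<opShift | uint64(id)
}

// Decode splits a userdata value back into its parts and rejects values
// whose reserved bits are not zero.
func Decode(ud uint64) (Operation, uint32, error) {
	if ud&reservedMask != 0 {
		return 0, 0, ErrReservedBits
	}
	return Operation(ud >> opShift), uint32(ud), nil
}

func main() {
	ud := Encode(OpSend, 42)
	op, id, err := Decode(ud)
	fmt.Printf("userdata=%#x op=%d id=%d err=%v\n", ud, op, id, err)
}
```

Checking the reserved bits on decode is one way to catch a corrupted or foreign userdata value early, at the cost of one mask per CQE.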
Wires the io_uring package from b94b735 into the xtcp2 hot path. New opt-in code path activated when config.IoUring=true (--ioUring CLI flag); the default path is unchanged for any existing user.

Read side (pkg/xtcp/netlinker_iouring.go + init_netlinkers.go):
- New netlinkerIoUring goroutine variant. Pre-submits a configurable batch of recvmsg SQEs (--ioUringRecvBatch, default 64) from packetBufferPool against the netlink fd, drains CQEs via PeekBatchCQE, refills each completed slot inline. One io_uring_enter per Submit instead of one per recv; on a 100k-socket host this is expected to reduce syscalls ~20-50x.
- Dispatch follows the codebase's existing sync.Map + chosen-function pattern (Marshallers / Destinations). x.Netlinker is a function pointer set at init; ns_createNetlinkersAndStore.go invokes it unchanged.
- LockOSThread before ring creation so the ring stays bound to the netns'd thread for its lifetime.
- WaitCQETimeout caps each iteration so ctx cancellation is observed within nl_timeout_milliseconds.

Write side (pkg/xtcp/destinations_iouring.go + extractFD helper):
- destUDPIoUring / destUnixIoUring / destUnixGramIoUring enqueue send SQEs into the per-Netlinker ring (looked up via ringCtxKey on the ctx). Submit happens after Deserialize returns, so a whole dump cycle of N records turns into one io_uring_enter for all N sends.
- destUnixIoUring uses writev with [hdr, payload] iovec so the varint-length frame is delivered atomically.
- Buffer ownership: marshalled *[]byte is pinned by the ring's in-flight map until the kernel signals completion. handleSendCQE records the outcome to Prometheus (mirrors destKafka's async callback at destinations.go:117-123).
- Library destinations (kafka, nsq, nats, valkey) silently ignore the io_uring flag: they own their own sockets via client libraries.
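The buffer-ownership rule (a marshalled `*[]byte` stays pinned until its CQE arrives, with the map capped at 2× SQ entries) might look roughly like the sketch below. The type `InFlight` and its methods `Pin` / `Release` are hypothetical names, not the ring's actual API, and the sketch assumes the map is only touched by the single goroutine that drives the ring, so it carries no lock.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInFlightFull signals that SQEs are being submitted faster than CQEs
// are drained.
var ErrInFlightFull = errors.New("in-flight map full: CQEs are not being drained")

// InFlight keeps buffers reachable between SQE submission and CQE arrival.
// It is not safe for concurrent use (assumed single-issuer).
type InFlight struct {
	max  int
	next uint32
	bufs map[uint32]*[]byte
}

// NewInFlight caps the map at twice the submission queue size.
func NewInFlight(sqEntries int) *InFlight {
	return &InFlight{max: 2 * sqEntries, bufs: make(map[uint32]*[]byte)}
}

// Pin stores buf and returns the request ID to embed in the SQE userdata.
func (f *InFlight) Pin(buf *[]byte) (uint32, error) {
	if len(f.bufs) >= f.max {
		return 0, ErrInFlightFull
	}
	// Skip IDs still in flight; terminates because len(bufs) < max.
	for {
		if _, used := f.bufs[f.next]; !used {
			break
		}
		f.next++
	}
	id := f.next
	f.next++
	f.bufs[id] = buf
	return id, nil
}

// Release removes and returns the buffer for a completed request so the
// caller can hand it back to its pool.
func (f *InFlight) Release(id uint32) (*[]byte, bool) {
	buf, ok := f.bufs[id]
	delete(f.bufs, id)
	return buf, ok
}

func main() {
	f := NewInFlight(1) // cap = 2
	payload := []byte("record")
	id, _ := f.Pin(&payload)
	_, _ = f.Pin(&payload)
	_, err := f.Pin(&payload) // third pin exceeds the cap
	fmt.Println("third pin:", err)
	_, ok := f.Release(id)
	fmt.Println("released:", ok)
}
```

Returning an error at the cap, rather than growing the map, turns a stalled completion path into a visible failure instead of a slow leak.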
Tests (pkg/xtcp/destinations_test.go):
- New TestDestinationsIoUring table-driven test covers udp, unix, unixgram io_uring round-trip with single and multiple records each.
- All 6 rows pass; race-clean across 10 stress runs.
- The flake "Submit: file exists" surfaced during stress and was diagnosed: ring SQ submission must happen on the same OS thread that created the ring. Fix is runtime.LockOSThread() in the test driver (same lock the production netlinker already uses).

CLI (cmd/xtcp2/xtcp2.go):
- --ioUring (bool)
- --ioUringRecvBatch (uint, default 64)
- --ioUringCqeBatch (uint, default 128)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
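The thread-affinity rule behind the "Submit: file exists" flake can be sketched as a goroutine that locks itself to one OS thread and runs every piece of ring work there. The `pinnedWorker` type below is purely illustrative and does not exist in the repository; the actual fix is a plain `runtime.LockOSThread()` call in the test driver and in the netlinker goroutine.

```go
package main

import (
	"fmt"
	"runtime"
)

// pinnedWorker runs every submitted function on one locked OS thread.
// Hypothetical helper: ring creation, Submit, and Close would all be
// passed to Do so they share the creating thread.
type pinnedWorker struct {
	jobs chan func()
	done chan struct{}
}

func newPinnedWorker() *pinnedWorker {
	w := &pinnedWorker{jobs: make(chan func()), done: make(chan struct{})}
	go func() {
		// Lock before any ring work so the scheduler cannot migrate
		// this goroutine to another thread between calls.
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		defer close(w.done)
		for job := range w.jobs {
			job()
		}
	}()
	return w
}

// Do runs fn on the pinned thread and waits for it to finish.
func (w *pinnedWorker) Do(fn func()) {
	finished := make(chan struct{})
	w.jobs <- func() {
		defer close(finished)
		fn()
	}
	<-finished
}

// Close stops the worker after the current job completes.
func (w *pinnedWorker) Close() {
	close(w.jobs)
	<-w.done
}

func main() {
	w := newPinnedWorker()
	defer w.Close()
	w.Do(func() { fmt.Println("create ring here") })
	w.Do(func() { fmt.Println("submit on the same thread") })
}
```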
Both TestDeserialize and BenchmarkDeserialize have been panicking on every run because the XTCP receiver was constructed via `xtcp := new(XTCP)` with no config, no hostname, no pools, no Marshaller, no Destination, no flatRecordService, so the first line of Deserialize (`if x.config.Modulus != 1`) hit a nil-pointer deref, never mind everything downstream. The test's only assertion checked a local xtcpRecord that Deserialize never wrote to.

Verified the breakage predates this branch: `git log -p` on deserialize_test.go shows the same broken setup since commit 4dc1d3b (the initial commit). The bug just wasn't being caught because the package's other tests didn't exercise Deserialize.

Fix:
- New helper newTestDeserializeXTCP(testing.TB) builds a fully-populated XTCP suitable for driving Deserialize end-to-end: config.Modulus=1, hostname, packet/nlh/rta pools, netlinkerDoneCh (buffered), pollRequestCh, fresh Prom registry, protobufSingleMarshal, destNull, and a shared xtcpFlatRecordService. Reusable for both test and benchmark.
- testFlatRecordService is shared across calls via sync.Once because NewXtcpFlatRecordService registers metrics in the default Prom registry (a second registration panics with "duplicate metrics collector registration"). Sharing is safe in tests: no GRPC clients connect, so flatRecordServiceSend hits its no-client fast path.
- Test now asserts the real contract: Deserialize returns no error and processes n > 0 records per fixture. With the three committed fixtures the numbers are: 9, 8, and 72 records.
- BenchmarkDeserialize gains b.SetBytes / b.ReportAllocs and loses a thrown-away `s int` parameter (dead since day 1).
- TestReconcileMaps is also broken on baseline; flagged but not in scope here.

`go test -race -count=1 -run TestDeserialize ./pkg/xtcp/` is clean. `go test -bench=BenchmarkDeserialize -benchmem ./pkg/xtcp/` reports ~163µs per dump cycle (~197 MB/s through the parse path), 218 allocs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
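The sync.Once sharing pattern mentioned for testFlatRecordService can be sketched without Prometheus. Everything here is a stand-in: `registry`, `newFlatRecordService`, and `sharedFlatRecordService` are hypothetical names that only mimic the "second registration panics" behaviour described above.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a stand-in for a metrics registry that panics on duplicates.
type registry struct {
	mu    sync.Mutex
	names map[string]bool
}

func (r *registry) MustRegister(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.names[name] {
		panic("duplicate metrics collector registration: " + name)
	}
	r.names[name] = true
}

var defaultRegistry = &registry{names: map[string]bool{}}

// flatRecordService stands in for a service whose constructor registers
// metrics in the process-wide default registry.
type flatRecordService struct{ name string }

func newFlatRecordService() *flatRecordService {
	defaultRegistry.MustRegister("flat_record_service")
	return &flatRecordService{name: "flat_record_service"}
}

var (
	sharedOnce sync.Once
	sharedSvc  *flatRecordService
)

// sharedFlatRecordService constructs the service at most once, so repeated
// test and benchmark setups never trigger a second registration.
func sharedFlatRecordService() *flatRecordService {
	sharedOnce.Do(func() { sharedSvc = newFlatRecordService() })
	return sharedSvc
}

func main() {
	a := sharedFlatRecordService()
	b := sharedFlatRecordService()
	fmt.Println("same instance:", a == b)
}
```

The trade-off is shared state between tests, which the commit message argues is acceptable because no client ever connects to the service in tests.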
Both isETimeError and Ring.Close's drain loop used a direct type assert
plus an `err.Error() == "errno 62"` string-match fallback. That works
for bare syscall.Errno but misses anything wrapped via fmt.Errorf("%w",
err) further down the giouring call chain. The string match was also
Linux-specific (62 is ETIME on Linux) and brittle if a library ever
stringifies differently.
errors.Is walks the unwrap chain for us, so a single line covers both
the bare and wrapped cases on every platform where syscall.ETIME is
defined. Drop the string fallback and the now-unused type assert.
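
The difference between the two checks can be reproduced in isolation. In this sketch `isETimeOld` and `isETime` are illustrative names rather than the functions in the repository, and the wrapped error is built by hand to stand in for whatever the giouring call chain might return.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// Sample errors: a bare errno and the same errno wrapped with %w.
var (
	bare    error = syscall.ETIME
	wrapped       = fmt.Errorf("wait cqe: %w", syscall.ETIME)
)

// isETimeOld mirrors the replaced approach: a direct type assertion, which
// only matches when err is itself a syscall.Errno.
func isETimeOld(err error) bool {
	errno, ok := err.(syscall.Errno)
	return ok && errno == syscall.ETIME
}

// isETime walks the unwrap chain, so it matches bare and wrapped errors.
func isETime(err error) bool {
	return errors.Is(err, syscall.ETIME)
}

func main() {
	fmt.Println(isETimeOld(bare), isETimeOld(wrapped)) // true false
	fmt.Println(isETime(bare), isETime(wrapped))       // true true
}
```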
Tested:
- go build ./pkg/io_uring/... ./pkg/xtcp/... clean
- go vet ./pkg/io_uring/... ./pkg/xtcp/... clean
(Pre-existing: the `go test` link step fails locally with `link:
github.com/randomizedcoder/giouring: invalid reference to syscall.munmap`
— a giouring/Go-version mismatch that also fires on origin/io-uring-support
unchanged. Not introduced by this commit; flagged in the PR body for
operator attention.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Adds an opt-in `io_uring` path that batches netlink recv + destination send through one ring per Netlinker. Behind a config flag (`io_uring: true`); default remains the syscall path.

- `pkg/io_uring`: pure-Go wrapper over giouring. Owns ring lifecycle, the 64-bit userdata codec (`Operation<<56 | RequestID`, 24 reserved bits asserted zero), and the in-flight map that keeps pool buffers alive between submission and completion.
- Setup flags: `SingleIssuer + DeferTaskrun + CoopTaskrun`. Explicitly not `SQPoll` (would burn a CPU on a 1Hz polling daemon).
- Probe at `New`: fails loudly on pre-6.1 kernels with a clear error naming the missing opcode set, no silent fallback.
- `pkg/xtcp/netlinker_iouring.go`: the opt-in netlinker variant. LockOSThread-pinned for ring lifetime (kernel ties io_uring fds to the netns of the creating task). Pre-fills `RecvBatchSize` recvmsg SQEs; drains CQEs; refills each consumed slot; submits once per loop iteration.
- `pkg/xtcp/destinations_iouring.go`: `destUDPIoUring`, `destUnixIoUring` (writev for varint header + payload, atomic on the wire), `destUnixGramIoUring`. Each looks up the ring from `ctx` and enqueues a send SQE; submit happens in the netlinker's drain loop, so one io_uring_enter ships N records per dump cycle.
- In-flight cap (`sqEntries * 2`) refuses unbounded growth and returns a clear error rather than letting the in-flight map silently leak.
- Test fixture fix (`a7f23a2` → squashed into the rebased history): `TestDeserialize` / `BenchmarkDeserialize` had nil-deref'd since the initial commit because the XTCP fixture was built without config / hostname / pools / Marshaller. New `newTestDeserializeXTCP` helper drives them end-to-end, asserting the real contract (n > 0 records per fixture). The broken setup dates back to commit `4dc1d3b` (initial commit); confirmed by `git log -p`. Open follow-up: `TestReconcileMaps` is also broken on baseline; documented in commit body, not in scope here.

The substantive small fix on top of the original commits
`isETimeError` and `Ring.Close`'s drain loop both used a direct type assert (`err.(syscall.Errno)`) plus an `err.Error() == "errno 62"` string-match fallback. That works for bare `syscall.Errno` but misses anything wrapped via `fmt.Errorf("%w", err)` further down the giouring call chain, and the string match is Linux-specific. Switched both sites to `errors.Is(err, syscall.ETIME)`, which walks the unwrap chain. Drops the type assert + the magic string in both places.

Test plan
- `go build ./pkg/io_uring/... ./pkg/xtcp/...` clean
- `go vet ./pkg/io_uring/... ./pkg/xtcp/...` clean
- `go test -race ./pkg/io_uring/... ./pkg/xtcp/...`: fails on link with `invalid reference to syscall.munmap` from the giouring local-replace + Go-version combination in my dev shell. Confirmed by running against `origin/io-uring-support` untouched (same failure). The runtime path is fine; only the test linker trips. Worth verifying in your environment that the giouring checkout matches your Go version before merging.
- Tests in `pkg/io_uring` cover: codec round-trip (table-driven + `TestCodecReservedBitsAreZero` invariant), recv single/multiple, send single/batch, writev unix stream, in-flight cap enforcement, teardown drains cleanly. All use `t.Skipf` on probe failure so older kernels don't fail the suite.
- Benchmarks use `b.ReportMetric` for `user_us/op`, `sys_us/op`, `nvcsw/op`, `nivcsw/op` via `getrusage` deltas, which lets you compare syscall vs io_uring on cost-per-op, not just wall time.

Known follow-ups (not blocking)
- The `pkg/xtcp` wiring (`netlinker_iouring.go` + `destinations_iouring.go`, ~370 LOC) has no direct unit tests. Could be addressed by a `TestNetlinkerIoUringRoundTrip` driven by a `socketpair` netlink stand-in.
- `var _ sync.Mutex` at file scope in `ring.go` as a "static signal" to future contributors is non-idiomatic; a type-level doc comment accomplishes the same goal and lets us drop the `sync` import.
- `requireProbe` reports bare opcode integers in error messages; mapping `OpRecvmsg` / `OpSend` / `OpWritev` to names would help operators.
- `iouringWaitWithTimeout` has a stale comment claiming "the Ring API doesn't expose a direct timeout wait" while immediately calling `WaitOneTimeout`.
- Package name `io_uring` uses an underscore (non-idiomatic for Go stdlib convention); aliased as `xio` at every import site, so renaming to `iouring` is a one-line per importer change for anyone who cares.

🤖 Generated with Claude Code