complexity-reduction 3/4: microvm-coverage tooling + self-test checks (M1–M6)#26

Merged
randomizedcoder merged 21 commits into main from complexity-microvm-tooling-b2
Jun 15, 2026
@randomizedcoder

Summary

complexity-reduction 3/4 — cluster B2: microvm-coverage tooling + self-test checks (commits 43–63). Applied cleanly onto main.

  • M1–M6: merge the coverage produced inside the coverage microVMs into the headline number — quality-report -coverage-out + per-package TSV regen, update-quality-report --with-microvm wiring, merge of the iouring VM coverage, and self-test checks that drive ns-lifecycle / TCP-in-ns / /run/docker/netns watch paths so those code paths are actually covered.
  • Self-test check fixes: NS_LIFECYCLE / NS_TRAFFIC prom-label keys, alphabetical metric-label filter, Check 3 NETLINK + Check 5 GRPC_ROUNDTRIP → coverage VM OVERALL_PASS.
  • Small Go bits: NetlinkerIoUring parsed-socket count counter; cover redisClientAdapter Publish/Ping/Close; delete dead registerProtobufSchemaRestful + add a newKafkaDest debug-log test; SA9003 / noctx / unused / unconvert lint cleanup.

Lands on docs: refresh report — 0 findings, 92.4% coverage, all checks green.

Testing

  • Binary-blob guard: clean.
  • go vet ./... + gofmt -l . — clean (go 1.25; no gofmt-forward needed this batch).
  • go test -ldflags=-checklinkname=0 -tags 'dest_kafka dest_nats dest_nsq dest_valkey' ./... (entire suite green).

Note: the M-series wiring targets coverage produced by the coverage microVMs; the in-VM runs themselves aren't executed here (KVM/heavy) — the Go + tooling changes are verified by build/vet/test.

🤖 Generated with Claude Code

randomizedcoder and others added 21 commits June 14, 2026 18:53
Add a -coverage-out flag to the aggregator that regenerates the
coverage-func.out + coverage-per-package.tsv artifacts inside rawDir
from a supplied Go coverage profile, before ingestCoverage runs.

The update-quality-report wrapper will use this to feed a merged host
+ microvm coverage profile back through the aggregator without
rebuilding the entire Nix derivation.

  regenerateCoverageArtifacts(rawDir, profile)
    1. go tool cover -func=<profile> → <rawDir>/coverage-func.out
    2. buildPerPackageTSV — pure-Go port of the awk in
       nix/quality-report/default.nix; dedupes atomic-mode block
       repetitions by max-count and aggregates per package directory.

New tests (ratchet_test.go):
  TestBuildPerPackageTSV_table — 5 rows covering atomic-dedup,
    non-module-line filtering, empty profile, zero-statement blocks
  TestRegenerateCoverageArtifacts_writesFiles — verifies both
    artifacts land in rawDir (skips if go tool cover unavailable)
  TestRegenerateCoverageArtifacts_missingProfileErrs
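
The per-package aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the actual buildPerPackageTSV: the function name perPackage, the pkgStats type, the modulePrefix parameter, and the printed column layout are all assumptions made for the example. It parses standard Go coverage profile lines (`file:start,end numStmts count`), keeps the highest count seen for each repeated block, and sums statements per package directory.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"sort"
	"strconv"
	"strings"
)

// pkgStats holds statement totals for one package directory (hypothetical type).
type pkgStats struct {
	Covered, Total int
}

// perPackage parses a Go coverage profile and returns per-directory totals.
// A block that appears more than once is counted once, with its highest hit count.
func perPackage(profile, modulePrefix string) map[string]pkgStats {
	type block struct{ stmts, maxCount int }
	blocks := map[string]block{} // key: "file:start,end"

	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, modulePrefix) {
			continue // skips the "mode:" header and lines outside the module
		}
		f := strings.Fields(line)
		if len(f) != 3 || !strings.Contains(f[0], ":") {
			continue
		}
		stmts, err1 := strconv.Atoi(f[1])
		count, err2 := strconv.Atoi(f[2])
		if err1 != nil || err2 != nil {
			continue
		}
		b := blocks[f[0]]
		b.stmts = stmts
		if count > b.maxCount {
			b.maxCount = count
		}
		blocks[f[0]] = b
	}

	out := map[string]pkgStats{}
	for key, b := range blocks {
		file := key[:strings.LastIndex(key, ":")]
		dir := path.Dir(file)
		s := out[dir]
		s.Total += b.stmts
		if b.maxCount > 0 {
			s.Covered += b.stmts
		}
		out[dir] = s
	}
	return out
}

func main() {
	// Hypothetical profile: the first block is repeated, once uncovered and once covered.
	profile := "mode: atomic\n" +
		"example.com/m/pkg/a/a.go:1.1,2.2 3 0\n" +
		"example.com/m/pkg/a/a.go:1.1,2.2 3 4\n" +
		"example.com/m/pkg/b/b.go:5.1,6.2 2 0\n"
	stats := perPackage(profile, "example.com/m/")

	dirs := make([]string, 0, len(stats))
	for d := range stats {
		dirs = append(dirs, d)
	}
	sort.Strings(dirs)
	for _, d := range dirs {
		s := stats[d]
		fmt.Printf("%s\t%d\t%d\n", d, s.Covered, s.Total)
	}
}
```

Keying the dedup map on the full `file:start,end` string means a repeated block contributes its statements only once, which is the property the atomic-dedup test rows are meant to pin.
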
Extend the shell wrapper with an optional --with-microvm flag:

  1. Boot the coverage-instrumented microvm via
     `nix run .#microvm-x86_64-lifecycle-coverage` and scrape the
     Go coverage data dump from its serial console into a tempdir.
  2. Build the regular .#quality-report (host coverage in raw/).
  3. Merge VM dir + host coverage.out via .#coverage-merge.
  4. Copy the read-only Nix-store raw/ into a writable temp dir, then
     re-run the quality-report aggregator with the new -coverage-out
     flag pointing at the merged profile.
  5. Overwrite docs/quality-report.md with the merged-numbers report.

Without --with-microvm, behaviour is unchanged: a single Nix build,
copy the markdown.

Falls back gracefully:
  - If the microvm lifecycle scrapes zero coverage files (KVM
    unavailable, VM crashed, etc.) we WARN + fall back to host-only.

Also adds versions.go to runtimeInputs so the aggregator re-run has
Go on PATH (regenerateCoverageArtifacts shells out to `go tool
cover -func`).
Microvm coverage merge wired into update-quality-report --with-microvm:

  cmd/xtcp2          92.4% → 95.9%  (daemon runDaemon now exercised in VM)
  pkg/xtcp           85.2% → 87.1%  (netlink/ns paths under real kernel)
  pkg/xtcpnl         91.4% → 91.8%
  Overall            90.3% → 91.1%

The lifecycle test exited non-zero (one self-test check failed) so
only 2 coverage files were scraped. Adding the iouring-flavor VM
merge in a follow-up will pick up more io_uring paths.
Extend coverage-merge.nix to accept multiple --vm-dir flags
(concatenated via covdata textfmt's comma-separated -i). Extend
update-quality-report --with-microvm to run BOTH coverage VMs
sequentially:

  1. .#microvm-x86_64-lifecycle-coverage (stdlib build) — exercises
     the syscall netlinker + namespace/ns_watch paths
  2. .#microvm-x86_64-lifecycle-coverage-iouring (iouring build) —
     exercises netlinker_iouring + io_uring.Ring paths that the
     stdlib VM can't reach (different build tag)

The merge picks up every block covered by either VM, then the
existing host+VM merge in coverage-merge.nix takes the union of
host coverage and the combined VM coverage.

Falls back gracefully:
  - If either VM scrapes zero files, that --vm-dir is skipped.
  - If BOTH scrape zero, we WARN + fall back to host-only (same
    behaviour as before this commit, just lifted to the two-VM
    aggregate check).
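
The skip-and-fall-back rule above is implemented in Nix and shell; the same decision can be sketched in Go for illustration. The vmScrape type and textfmtArgs function are hypothetical names, and the paths in main are made up. The only real interface used is `go tool covdata textfmt` with its comma-separated `-i` and its `-o` flag.

```go
package main

import (
	"fmt"
	"strings"
)

// vmScrape describes the result of scraping one coverage VM (hypothetical type).
type vmScrape struct {
	Dir   string // directory holding the scraped coverage files
	Files int    // number of coverage files scraped from the VM
}

// textfmtArgs builds the arguments for `go tool covdata textfmt`.
// VMs that scraped zero files are skipped; ok is false when every VM
// scraped zero files, meaning the caller should fall back to host-only.
func textfmtArgs(scrapes []vmScrape, out string) (args []string, ok bool) {
	var dirs []string
	for _, s := range scrapes {
		if s.Files > 0 {
			dirs = append(dirs, s.Dir)
		}
	}
	if len(dirs) == 0 {
		return nil, false
	}
	return []string{
		"tool", "covdata", "textfmt",
		"-i=" + strings.Join(dirs, ","),
		"-o=" + out,
	}, true
}

func main() {
	scrapes := []vmScrape{
		{Dir: "/tmp/vm-stdlib", Files: 2},
		{Dir: "/tmp/vm-iouring", Files: 0}, // skipped: nothing scraped
	}
	args, ok := textfmtArgs(scrapes, "/tmp/vm.out")
	if !ok {
		fmt.Println("WARN: no VM coverage scraped, falling back to host-only")
		return
	}
	fmt.Println("go", strings.Join(args, " "))
}
```
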
…elete)

Add two new checks to nix/microvms/self-test.nix that exercise xtcp2's
namespace-watching pipeline end-to-end inside the coverage VM:

  Check 8 (NS_LIFECYCLE):
    ip netns add xtcp_test_ns_a  →  fsnotify Create
                                 →  watchNsNamespace dispatchNsFsEvent
                                 →  nsAdd
                                 →  netNamespaceInstance
                                 →  openAndSetNSWithRetries (real
                                    Open + Setns syscalls)
                                 →  syscall.Socket(AF_NETLINK) + Bind
                                 →  createNetlinkersAndStore
                                 →  spawns a per-ns netlinker goroutine
    ip netns delete  →  fsnotify Remove  →  nsDelete teardown

  Check 9 (NS_TRAFFIC):
    Same as above, but ALSO creates a TCP listener + client pair
    inside the new ns. The per-ns netlinker polls inet_diag in that
    ns; the daemon's Netlinker.packets counter bumps. This drives
    the full netlinkerSyscall body — Recvfrom on a real netlink fd,
    Deserialize on real (not garbage) netlink bytes, every per-
    attribute deserializer that finds a present attribute.

Assertions read xtcp_counts metric vector via curl /metrics
(function="watchNamespaces" event-counter; function=
"netNamespaceInstance" start-counter; function="Netlinker"
packets-counter).

Both checks fall back gracefully if iproute2/nc are missing on
PATH; iproute2 is already in self-test runtimeInputs.

nix/microvms/default.nix: extend the coverage + coverage-iouring
lifecycle sentinelRe so the new sentinels surface in the harness
output (default filter hid them).
The metric_value awk filter was matching on task="…" but the actual
label key in pkg/xtcp/prometheus.go is variable="…". With the wrong key
both before/after queries returned 0, masking whether netNamespaceInstance
actually started.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Check 8 NS_LIFECYCLE failed silently (evt:0→0→0); add /run/netns
listing, ip-netns-add stderr capture, and per-call metric-row dumps
so the next run reveals whether fsnotify saw the create.

Check 9 NS_TRAFFIC now matches both Netlinker and NetlinkerIoUring
packet counters so it works in both coverage VM flavors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ls is fine for a single-call diagnostic dump in the self-test where
shellcheck's SC2012 (prefer find) adds no value.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous filter passed 'function="X",variable="Y"' as a single
substring, but Prometheus prints labels in alphabetical order
(function, type, variable), so type="..." sits between them and the
substring never matched. Counters always returned 0.

Switch metric_value to two separate substring args (function + variable)
that are both required to be present in a row. Drop the verbose
diagnostic dumps now that the root cause is identified.

After this fix:
  XTCP2_SELF_TEST_NS_LIFECYCLE_PASS  (evt:0→1→2 inst:1→2)
  XTCP2_SELF_TEST_NS_TRAFFIC_PASS    (Netlinker.packets:20→36)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
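
The real metric_value helper is an awk filter in the self-test; the matching rule it switched to can be illustrated with a short Go sketch. The function name metricValue and the choice to sum all matching rows are assumptions for the example. The point it shows is that requiring each label pair separately is immune to the position of type="..." between function and variable.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// metricValue sums the values of exposition rows for the named metric
// that contain every substring in want. Summing is an assumption here;
// the original filter may pick a single row instead.
func metricValue(body, name string, want ...string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(body))
rows:
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, name+"{") {
			continue
		}
		for _, w := range want {
			if !strings.Contains(line, w) {
				continue rows
			}
		}
		f := strings.Fields(line)
		v, err := strconv.ParseFloat(f[len(f)-1], 64)
		if err != nil {
			continue
		}
		total += v
	}
	return total
}

func main() {
	// Labels appear in alphabetical order: function, type, variable.
	body := `xtcp_counts{function="Netlinker",type="count",variable="p"} 3` + "\n"

	// Single combined substring: never matches because type sits in between.
	fmt.Println(metricValue(body, "xtcp_counts", `function="Netlinker",variable="p"`))

	// Two separate substrings: both present in the row, so it matches.
	fmt.Println(metricValue(body, "xtcp_counts", `function="Netlinker"`, `variable="p"`))
}
```
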
- tools/quality-report/main.go:381  exec.Command → exec.CommandContext
  with a 30s timeout so the cover -func step can be cancelled cleanly.
- tools/quality-report/main.go:414  remove unused blockKey struct
  (leftover from an earlier per-block dedupe approach that now lives in
  the seenStmt/seenMaxCount maps).
- pkg/xtcpnl/xtcpnl_fatalf_test.go:95  drop int64(tv.Usec) — Timeval.Usec
  is already int64 on linux/amd64.
- pkg/xtcp/destinations_{kafka,valkey}_test.go  gofmt struct-field alignment.
- docs/quality-report.md  regenerated with NS_LIFECYCLE + NS_TRAFFIC
  passing in both coverage VMs (evt:0→1→2 inst:1→2,
  Netlinker.packets:20→36).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
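
For readers unfamiliar with the noctx fix, the exec.Command to exec.CommandContext change looks roughly like this. It is a sketch under assumptions: the function name coverFunc and the profile path are hypothetical, and the actual code in tools/quality-report/main.go may be structured differently.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// coverFunc runs `go tool cover -func=<profile>` and returns its combined
// output. The process is killed if it has not finished within timeout.
func coverFunc(profile string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel() // releases the timer even when the command finishes early

	cmd := exec.CommandContext(ctx, "go", "tool", "cover", "-func="+profile)
	return cmd.CombinedOutput()
}

func main() {
	// "coverage.out" is a placeholder path for this example.
	out, err := coverFunc("coverage.out", 30*time.Second)
	if err != nil {
		fmt.Println("cover -func failed:", err)
		return
	}
	fmt.Print(string(out))
}
```
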
The production adapter showed 0% coverage even with the dest_valkey
build tag because every other test bypassed it via the newValkeyClientFn
factory seam. Drive each adapter method against an unreachable port with
a short-deadline context so Publish + Ping surface dial errors and Close
is exercised cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
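
The testing idea here, driving a real client method against a port nobody listens on with a short deadline, can be shown without the Redis client. The sketch below uses only the standard library; tcpPinger, closedAddr, and pingUnreachable are hypothetical stand-ins and are not the redisClientAdapter or its tests.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// tcpPinger is a stand-in for an adapter whose Ping must dial the server.
type tcpPinger struct{ addr string }

// Ping dials the address and closes the connection; a dial failure is returned.
func (p tcpPinger) Ping(ctx context.Context) error {
	var d net.Dialer
	conn, err := d.DialContext(ctx, "tcp", p.addr)
	if err != nil {
		return err
	}
	return conn.Close()
}

// closedAddr returns a loopback address that was just released, so nothing
// is expected to be listening on it.
func closedAddr() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	return addr, ln.Close()
}

// pingUnreachable pings a closed port under a short deadline and returns
// the resulting error (a dial error is the expected outcome).
func pingUnreachable(timeout time.Duration) error {
	addr, err := closedAddr()
	if err != nil {
		return fmt.Errorf("setup: %w", err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return tcpPinger{addr: addr}.Ping(ctx)
}

func main() {
	err := pingUnreachable(200 * time.Millisecond)
	fmt.Println("got error:", err != nil)
}
```

The short deadline bounds the test even on hosts where the dial would hang rather than be refused.
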
xtcp2's netNsCandidateDirs probes /run/netns/ AND /run/docker/netns/.
The coverage VM only pre-created the first, so the daemon's second
watchNsNamespace goroutine (for the docker path) never spawned and
that whole branch read 0% coverage.

Pre-create /run/docker/netns/ via systemd.tmpfiles in coverage VMs and
add Check 10: ip-netns-add → mount --bind into /run/docker/netns/ to
fire fsnotify Create on the docker dir. Mirrors what docker actually
does at the filesystem level when spawning a container — no docker
daemon required.

Result (stdlib coverage VM):
  XTCP2_SELF_TEST_NS_DOCKER_PASS  (evt:4→5→6 inst:3→4)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds NS_DOCKER_PASS in both coverage VMs (evt:4→5→6 inst:3→4),
confirming the /run/docker/netns/ watch path is now exercised
end-to-end. The redisClientAdapter Publish/Ping/Close tests
also landed — valkey production adapter went from 0% to 100%
and dropped off the gaps list.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three latent bugs that had Check 3 + 5 failing in every flavor since the
self-test was first introduced:

1. cmd/xtcp2client default port was 8888, but the daemon listens on
   8889 (cmd/xtcp2 grpcPortCst). Every gRPC roundtrip from xtcp2client
   to a default daemon was a silent connection refused. Bump the
   client's default to 8889 to match, and add an explicit -port flag
   so this footgun is at least configurable. Pinned the constant with
   a CLAUDE comment about keeping the two in lockstep.

2. Self-test Check 3 grep'd /var/log/xtcp2.jsonl, but xtcp2 has no
   file destination type — vmConfig.json's "type":"file" is
   aspirational, never wired into RegisterDestination. The file
   literally never existed. Rewrote Check 3 as a metric-driven
   assertion: poll xtcp_counts{variable="p"} until ANY Netlinker has
   parsed at least one inet_diag socket (which is the only end-to-end
   signal Check 3 was ever trying to verify).

3. Self-test Check 5 invoked xtcp2client with `-addr host:port` but
   the flag set is `-target host` + `-port num` (now exists). Updated.

Result (stdlib coverage VM):
  XTCP2_SELF_TEST_NETLINK_PASS         (Netlinker parsed 3 sockets via inet_diag)
  XTCP2_SELF_TEST_GRPC_ROUNDTRIP_PASS  (xtcp2client rc=124, produced output)
  XTCP2_SELF_TEST_OVERALL_PASS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
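
The metric-driven shape of the rewritten Check 3 (poll a counter until it reaches a threshold, give up after a bounded number of tries) can be sketched in Go. The real check lives in the self-test shell script; pollUntil, its parameters, and the fake reader in main are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntil calls read up to attempts times, sleeping interval between
// calls, and reports whether read ever returned a value >= atLeast.
func pollUntil(read func() float64, atLeast float64, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if read() >= atLeast {
			return true
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return false
}

func main() {
	// Hypothetical reader: pretends the parsed-socket counter is 0 for the
	// first two scrapes and 3 afterwards.
	calls := 0
	read := func() float64 {
		calls++
		if calls < 3 {
			return 0
		}
		return 3
	}
	fmt.Println(pollUntil(read, 1, 10, 10*time.Millisecond))
}
```
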
The syscall netlinker increments xtcp_counts{function="Netlinker",
variable="p",type="count"} by the number of inet_diag sockets parsed
per recv (netlinker.go:194), but the io_uring path discarded
Deserialize's first return. Effect: dashboards + the self-test never
saw iouring-flavor inet_diag activity reflected in the parsed-socket
metric — the counter just stayed at 0 even while NetlinkerIoUring.packets
was bumping each cycle.

Capture the count and emit the symmetric counter so iouring runs are
observable on the same dashboards as syscall runs.

Result: iouring coverage VM now hits NETLINK_PASS + OVERALL_PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After fixing the xtcp2client port mismatch (Check 5), the file-sink
mirage in Check 3, and the NetlinkerIoUring missing-p-counter bug,
both stdlib and iouring coverage VMs now hit OVERALL_PASS — all 10
self-test checks green:
  SYSTEMD METRICS NETLINK BINARIES_HELP GRPC_ROUNDTRIP
  NS_INSPECT NSTEST NS_LIFECYCLE NS_TRAFFIC NS_DOCKER

Total coverage 91.9%. pkg/xtcp 89.4%.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…g test

- destinations_kafka.go: delete registerProtobufSchemaRestful which
  was marked lint:ignore U1000 "historical reference; not called".
  The whole function (35 lines, 13 stmts) was dragging pkg/xtcp's
  coverage divisor without ever being exercised. The "bytes" import
  was only used in this dead code; remove that too.

- destinations_kafka_test.go: add TestNewKafkaDest_debugLog to cover
  the 5 log.Println calls inside the `if x.debugLevel > 10` block
  that newKafkaDest's happy-path test skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SA9003 (empty branch) was flagged inside an `if r := recover(); r != nil`
block in poller_helpers_test.go. The intent is to swallow the panic
deliberately — collapse to `defer func() { _ = recover() }()` so the
recover stays explicit but the empty-branch warning is gone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
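
A small standalone sketch of the collapsed form, for reference. The helper name swallowPanic is hypothetical and is not taken from poller_helpers_test.go.

```go
package main

import "fmt"

// swallowPanic runs f and deliberately discards any panic it raises.
func swallowPanic(f func()) {
	// Assigning to the blank identifier keeps the recover explicit
	// without leaving an empty if-branch for SA9003 to flag.
	defer func() { _ = recover() }()
	f()
}

func main() {
	swallowPanic(func() { panic("boom") })
	fmt.Println("still running")
}
```
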
Final state after the M5/M6/check-fix arc:
  Total findings:  0
  Total coverage:  92.4%
  pkg/xtcp:        90.8%  (was 89.4% — cleared the below-90pct finding)
  Coverage VMs:    XTCP2_SELF_TEST_OVERALL_PASS  (stdlib + iouring)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@randomizedcoder randomizedcoder merged commit ff2493f into main Jun 15, 2026