fedify-dev · dahlia · Jun 16, 2026 · Jun 13, 2026
diff --git a/CHANGES.md b/CHANGES.md
@@ -315,11 +315,22 @@ To be released.
     that keep `triggerSinks` allowlisting enabled.  This change is published
     as benchmark scenario schema version 2.  [[#744], [#785], [#801], [#802]]
 
+ -  Added `fedify bench compare` for CI-friendly performance regression gates.
+    The command checks out base and head refs into temporary worktrees, starts
+    the benchmark target for each ref, runs the same suite, and fails when the
+    head regresses beyond `--max-regression` plus the measured per-run noise
+    band.  Benchmark scenarios now run three times by default and aggregate
+    repeated runs with median latency/throughput and pessimistic correctness
+    results.  This change is published as benchmark report schema version 3
+    and comparison report schema version 1.  [[#744], [#786], [#804]]
+
 [#783]: https://github.com/fedify-dev/fedify/issues/783
 [#784]: https://github.com/fedify-dev/fedify/issues/784
 [#785]: https://github.com/fedify-dev/fedify/issues/785
+[#786]: https://github.com/fedify-dev/fedify/issues/786
 [#801]: https://github.com/fedify-dev/fedify/pull/801
 [#802]: https://github.com/fedify-dev/fedify/pull/802
+[#804]: https://github.com/fedify-dev/fedify/pull/804
 
 ### @fedify/fixture
 

diff --git a/docs/manual/benchmarking.md b/docs/manual/benchmarking.md
@@ -100,7 +100,6 @@ crypto cost is real.
 > types, a few options the format accepts are also not implemented yet and are
 > rejected up front with a clear message:
 >
->  -  `runs` greater than `1` (repeated runs).
 >  -  An `inbox` `activity` that is not a `Create` carrying an embedded `Note`;
 >     that is, a non-`Create` `type`, a non-`Note` `object.type`, or
 >     `embedObject: false`.
@@ -262,6 +261,29 @@ Signing is kept off the send critical path, set per scenario with `signing`:
     (open-loop only; Poisson arrivals may still sign a few extra during the
     run).
 
+### Repeated runs
+
+Each scenario runs three times by default.  Set `runs` in `defaults` to change
+the whole suite, or set `runs` on one scenario to override the default for that
+scenario:
+
+~~~~ yaml
+defaults:
+  runs: 5
+scenarios:
+- name: ci-smoke
+  type: webfinger
+  runs: 1
+  recipient: acct:alice@localhost
+~~~~
+
+Repeated runs are aggregated for stable CI gates.  Latency and throughput
+metrics use the median run, request totals and error buckets are summed, queue
+depth uses the worst observed maximum, and `successRate` uses the worst run so
+one bad run is not hidden by clean neighbors.  The JSON report records
+`runCount` for every scenario and includes per-run measurements in `runs` when
+the scenario ran more than once.
+
 ### Output
 
 Choose the format with `--format text` (default), `json`, or `markdown`;
@@ -288,7 +310,80 @@ CI check.  Keep CI gates on robust signals such as success rate, error counts,
 and gross throughput or latency floors; precise latency-percentile regression
 belongs in a controlled environment, not a shared CI runner.
 
-[report schema]: https://json-schema.fedify.dev/bench/report-v2.json
+[report schema]: https://json-schema.fedify.dev/bench/report-v3.json
+
+### Comparing two revisions
+
+Use `fedify bench compare` when a CI job should compare a change against a base
+revision on the same runner instead of relying on an absolute threshold:
+
+~~~~ sh
+fedify bench compare \
+  --base origin/main \
+  --head HEAD \
+  --file scenario.yaml \
+  --start-command "pnpm dev" \
+  --ready-url http://127.0.0.1:3000/health \
+  --max-regression 15%
+~~~~
+
+The command creates temporary detached worktrees for the base and head refs,
+starts the target command inside each worktree, waits for `--ready-url`, then
+runs the same suite from the current checkout against that target.  The two
+targets run sequentially, so they can use the same port.  Dependencies are not
+installed automatically; either prepare both refs in the job before comparing
+or make `--start-command` perform the needed build/start steps.
+
+If `--target` is omitted, the benchmark target defaults to the origin of
+`--ready-url`.  Pass `--target` when readiness and benchmark traffic use
+different URLs.  The comparison report can be written as text, JSON, or
+Markdown with the same `--format` and `--output` options as an ordinary
+benchmark run; JSON output validates against the [comparison report schema].
+
+`--max-regression` accepts either a ratio such as `0.15` or a percentage such
+as `15%`.  For each scenario, `fedify bench compare` compares performance
+metrics from the scenario's `expect` block when they are latency or rate
+metrics; if no such metric is present, it compares `latency.p95` and
+`throughputPerSec`.  A head result passes when the measured regression is
+within `--max-regression` plus the observed per-run noise band.  The command
+exits with status 1 when the head run fails its own `expect` gate or a
+comparison exceeds that allowance; configuration and orchestration failures
+exit with status 2.
+
+Use short, broad suites in shared CI:
+
+~~~~ yaml
+defaults:
+  runs: 3
+  duration: 20s
+  warmup: 5s
+scenarios:
+- name: inbox-ci
+  type: inbox
+  # ...
+  expect:
+    successRate: ">= 99%"
+    latency.p95: "< 500ms"
+~~~~
+
+Use a controlled performance runner for narrower regression checks:
+
+~~~~ yaml
+defaults:
+  runs: 7
+  duration: 2m
+  warmup: 20s
+scenarios:
+- name: inbox-lab
+  type: inbox
+  # ...
+  expect:
+    successRate: ">= 99.9%"
+    latency.p95: "< 120ms"
+    throughputPerSec: "> 250/s"
+~~~~
+
+[comparison report schema]: https://json-schema.fedify.dev/bench/compare-report-v1.json
 
 ### Safety
 

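Note on the "Repeated runs" section added above: the aggregation rules it describes (median for latency and throughput, sums for request totals and errors, worst maximum for queue depth, worst run for `successRate`) can be sketched as follows. This is only an illustration, not the code in this PR. The `RunResult` shape, the function names, the single `errors` counter standing in for the error buckets, and the handling of an even run count are all assumptions.

```typescript
// Hypothetical per-run shape; the real report fields may differ.
interface RunResult {
  latencyP95Ms: number;
  throughputPerSec: number;
  requests: number;
  errors: number; // stands in for the per-bucket error counts
  successRate: number; // 0..1
  maxQueueDepth: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Assumption: an even number of runs averages the two middle values.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function aggregate(runs: RunResult[]) {
  if (runs.length === 0) throw new RangeError("At least one run is required");
  return {
    runCount: runs.length,
    // Median keeps one noisy run from moving the headline number.
    latencyP95Ms: median(runs.map((r) => r.latencyP95Ms)),
    throughputPerSec: median(runs.map((r) => r.throughputPerSec)),
    // Totals are summed across runs.
    requests: runs.reduce((total, r) => total + r.requests, 0),
    errors: runs.reduce((total, r) => total + r.errors, 0),
    // Pessimistic correctness: the worst run wins.
    successRate: Math.min(...runs.map((r) => r.successRate)),
    maxQueueDepth: Math.max(...runs.map((r) => r.maxQueueDepth)),
  };
}

const aggregated = aggregate([
  { latencyP95Ms: 90, throughputPerSec: 200, requests: 1000, errors: 0, successRate: 1, maxQueueDepth: 4 },
  { latencyP95Ms: 120, throughputPerSec: 180, requests: 1000, errors: 12, successRate: 0.988, maxQueueDepth: 9 },
  { latencyP95Ms: 95, throughputPerSec: 210, requests: 1000, errors: 0, successRate: 1, maxQueueDepth: 5 },
]);
console.log(aggregated);
```

In this made-up sample the middle run is the bad one: it does not move the median latency (95) or throughput (200), but it still determines `successRate` (0.988) and `maxQueueDepth` (9), which matches the "one bad run is not hidden" rule in the docs.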
diff --git a/packages/cli/src/bench/__fixtures__/compare-reports/basic.json b/packages/cli/src/bench/__fixtures__/compare-reports/basic.json
@@ -0,0 +1,83 @@
+{
+  "$schema": "https://json-schema.fedify.dev/bench/compare-report-v1.json",
+  "schemaVersion": 1,
+  "tool": { "name": "@fedify/cli", "version": "2.3.0" },
+  "environment": {
+    "runtime": "deno",
+    "runtimeVersion": "2.5.0",
+    "os": "linux",
+    "cpuCount": 16
+  },
+  "startedAt": "2026-06-04T12:00:00.000Z",
+  "finishedAt": "2026-06-04T12:03:00.000Z",
+  "suite": { "name": "Inbox regression suite", "configHash": "sha256:abc123" },
+  "maxRegression": 0.15,
+  "base": {
+    "ref": "origin/main",
+    "report": {
+      "$schema": "https://json-schema.fedify.dev/bench/report-v3.json",
+      "schemaVersion": 3,
+      "tool": { "name": "@fedify/cli", "version": "2.3.0" },
+      "environment": {
+        "runtime": "deno",
+        "runtimeVersion": "2.5.0",
+        "os": "linux",
+        "cpuCount": 16
+      },
+      "target": {
+        "url": "http://localhost:3000",
+        "fedifyVersion": "2.3.0",
+        "statsAvailable": true
+      },
+      "startedAt": "2026-06-04T12:00:00.000Z",
+      "finishedAt": "2026-06-04T12:01:00.000Z",
+      "suite": {
+        "name": "Inbox regression suite",
+        "configHash": "sha256:abc123"
+      },
+      "passed": true,
+      "scenarios": []
+    }
+  },
+  "head": {
+    "ref": "HEAD",
+    "report": {
+      "$schema": "https://json-schema.fedify.dev/bench/report-v3.json",
+      "schemaVersion": 3,
+      "tool": { "name": "@fedify/cli", "version": "2.3.0" },
+      "environment": {
+        "runtime": "deno",
+        "runtimeVersion": "2.5.0",
+        "os": "linux",
+        "cpuCount": 16
+      },
+      "target": {
+        "url": "http://localhost:3000",
+        "fedifyVersion": "2.3.0",
+        "statsAvailable": true
+      },
+      "startedAt": "2026-06-04T12:02:00.000Z",
+      "finishedAt": "2026-06-04T12:03:00.000Z",
+      "suite": {
+        "name": "Inbox regression suite",
+        "configHash": "sha256:abc123"
+      },
+      "passed": true,
+      "scenarios": []
+    }
+  },
+  "comparisons": [
+    {
+      "scenario": "inbox-shared",
+      "metric": "latency.p95",
+      "direction": "lower-is-better",
+      "base": 91,
+      "head": 94,
+      "regression": 0.03296703296703297,
+      "noiseBand": 0.02,
+      "allowedRegression": 0.16999999999999998,
+      "pass": true
+    }
+  ],
+  "passed": true
+}
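Note on the fixture above: the `comparisons` entry is consistent with `regression = (head - base) / base` and `allowedRegression = maxRegression + noiseBand`. The sketch below reproduces that arithmetic and the ratio-or-percentage parsing that the docs describe for `--max-regression`. All function and type names are hypothetical, the `higher-is-better` branch and the use of `<=` for "within" are assumptions, and how the noise band itself is measured is not shown here.

```typescript
// Hypothetical names throughout; only the numeric values come from basic.json.
type Direction = "lower-is-better" | "higher-is-better";

interface Comparison {
  regression: number;
  allowedRegression: number;
  pass: boolean;
}

// Accepts a ratio such as "0.15" or a percentage such as "15%".
function parseMaxRegression(input: string): number {
  const text = input.trim();
  const value = text.endsWith("%")
    ? Number(text.slice(0, -1)) / 100
    : Number(text);
  if (text === "" || text === "%" || !Number.isFinite(value) || value < 0) {
    throw new RangeError(`Invalid --max-regression value: ${input}`);
  }
  return value;
}

function compareMetric(
  direction: Direction,
  base: number,
  head: number,
  maxRegression: number,
  noiseBand: number,
): Comparison {
  // A positive regression means the head is worse than the base.
  // Assumption: higher-is-better metrics simply flip the sign.
  // This sketch does not guard against a base value of zero.
  const regression = direction === "lower-is-better"
    ? (head - base) / base
    : (base - head) / base;
  const allowedRegression = maxRegression + noiseBand;
  return {
    regression,
    allowedRegression,
    pass: regression <= allowedRegression,
  };
}

// Values taken from the fixture: base 91, head 94, noise band 0.02.
const result = compareMetric(
  "lower-is-better",
  91,
  94,
  parseMaxRegression("15%"),
  0.02,
);
console.log(result);
```

With the fixture's numbers the regression is about 3.3%, well inside the roughly 17% allowance, so `pass` is `true`. The fixture's `0.16999999999999998` is the floating-point result of `0.15 + 0.02`.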
diff --git a/packages/cli/src/bench/__fixtures__/reports/inbox-report.json b/packages/cli/src/bench/__fixtures__/reports/inbox-report.json
@@ -1,6 +1,6 @@
 {
-  "$schema": "https://json-schema.fedify.dev/bench/report-v2.json",
-  "schemaVersion": 2,
+  "$schema": "https://json-schema.fedify.dev/bench/report-v3.json",
+  "schemaVersion": 3,
   "tool": { "name": "@fedify/cli", "version": "2.3.0" },
   "environment": {
     "runtime": "deno",
@@ -86,7 +86,8 @@
           "pass": true
         }
       ],
-      "passed": true
+      "passed": true,
+      "runCount": 1
     }
   ]
 }