Production monitoring guide #813
Add a new manual chapter, docs/manual/monitoring.md, that turns Fedify's OpenTelemetry metrics into a starter dashboard and a set of alert rules. The OpenTelemetry chapter and the deployment guide already document the metrics and name the federation signals an operator should watch, but neither shows how to build a first dashboard or decide which failures should page someone.

The guide covers six dashboard panels (queue backlog, inbox processing latency, outbound delivery attempts, outbound delivery failure rate, permanent delivery failures, and signature verification latency), PromQL alert examples that each explain the failure they catch, the OpenTelemetry-to-Prometheus name translation, an OpenTelemetry Collector pipeline, cardinality guidance for dashboard and alert authors, and the boundary between Fedify metrics and the runtime, database, queue-backend, and host-platform metrics it does not emit.

It stays vendor-neutral and notes that every threshold is a starting point rather than a default. Spikes in remote 404/410 responses are framed as investigation alerts rather than paging alerts, since remote account deletion and instance churn are normal fediverse behavior.

Wire the page into the VitePress manual sidebar, and link to it from the observability section of the deployment guide and from the instrumented-metrics section of the OpenTelemetry chapter. Add "Prometheus" and "OpenTelemetry Collector" to the Hongdown proper-noun list so those words keep their capitalization in headings.

fedify-dev#743
Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
@codex review
Code Review
This pull request introduces a comprehensive 'Production monitoring' guide (docs/manual/monitoring.md) along with corresponding updates to the configuration, deployment, and OpenTelemetry documentation. It also adds 'OpenTelemetry Collector' and 'Prometheus' to the proper nouns list in .hongdown.toml. The review feedback correctly identifies multiple instances in the new monitoring guide where PromQL code blocks lack language specifications, violating the repository's style guide requirement to specify the language for quadruple-tilde code blocks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Pull request overview
Adds a new Production monitoring manual chapter that bridges Fedify’s existing OpenTelemetry metrics documentation to practical operational usage (dashboards + alerting), and links it from the existing observability docs and navigation.
Changes:
- Add docs/manual/monitoring.md with PromQL dashboard panels, alert rule examples, naming translation notes, and cardinality guidance.
- Link the new guide from OpenTelemetry and Deployment manual chapters.
- Add the new page to the VitePress manual sidebar and extend Hongdown proper-noun capitalization.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/manual/monitoring.md | New monitoring guide with PromQL dashboard/alert examples, collector pipeline, and cardinality guidance. |
| docs/manual/opentelemetry.md | Links readers from the metrics reference to the new monitoring guide. |
| docs/manual/deploy.md | Links operators from the deployment observability checklist to the monitoring guide. |
| docs/.vitepress/config.mts | Adds the monitoring guide to the manual sidebar navigation. |
| .hongdown.toml | Ensures “Prometheus” and “OpenTelemetry Collector” keep capitalization in headings. |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/manual/monitoring.md`:
- Around line 177-179: All nine PromQL code blocks in the monitoring
documentation currently use quadruple-tilde markers without language
specifications. Update each of these code block opening markers by changing
`~~~~` to `~~~~ promql` to comply with documentation standards and ensure proper
syntax highlighting in the rendered output.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 775b38a7f5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codecov Report
✅ All modified and coverable lines are covered by tests.
The PromQL query blocks used bare quadruple-tilde fences with no language identifier. Tag them as promql so they follow the repository convention that fenced code blocks specify a language. fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
The 404/410 spike and signature-failure alerts compared a per-second rate() against a whole-number threshold, so "> 1" meant more than one event per second: far above the background-churn levels the prose describes. Switch both to increase(), which counts events over the window, matching the "more than N in the last few minutes" intent the surrounding text sets up. fedify-dev#813 (comment) fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
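The difference between the two forms can be sketched in PromQL. The metric name `activitypub_delivery_permanent_failure_total`, the `http_response_status_code` label, and the threshold of 20 are assumptions made for illustration; the real names come from the guide's name-translation table, and the threshold should come from your own traffic. The two expressions are separate queries shown side by side.

```promql
# Before: rate() is per second, so "> 1" means more than one 404/410 per second,
# which is more than 900 events across a 15-minute window.
sum(rate(activitypub_delivery_permanent_failure_total{http_response_status_code=~"404|410"}[15m])) > 1

# After: increase() estimates the number of events over the window, so the
# threshold reads as "more than 20 in the last 15 minutes".
sum(increase(activitypub_delivery_permanent_failure_total{http_response_status_code=~"404|410"}[15m])) > 20
```

For the same range, `increase()` is `rate()` multiplied by the window length in seconds, so a threshold that looks small with `rate()` can hide a very large event count.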
activitypub.inbox.processing_duration is recorded in the queue worker, which runs after handleInbox() has already answered the remote with 202 Accepted. The guide described a high p95 there as remote servers timing out, which only holds for inline (no-queue) listeners. Clarify that behind a queue this is side-effect latency, and point readers at fedify.http.server.request.duration on the inbox endpoints for the latency remotes actually experience. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
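A minimal sketch of the two latency views this commit distinguishes. Both Prometheus names are assumptions derived by analogy with `activitypub_signature_key_fetch_duration_milliseconds_bucket`, which appears elsewhere in this PR; in particular the unit suffix on the HTTP server histogram may differ in your exporter. The two expressions are separate queries.

```promql
# p95 of inbox listener processing time. Behind a queue this is side-effect
# latency in the worker, not the time a remote server waits for a response.
histogram_quantile(0.95, sum by (le) (rate(activitypub_inbox_processing_duration_milliseconds_bucket[5m])))

# p95 of HTTP request handling, which is closer to the latency remotes experience.
# Narrow it to the inbox endpoints with whatever route or path label your setup exports.
histogram_quantile(0.95, sum by (le) (rate(fedify_http_server_request_duration_milliseconds_bucket[5m])))
```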
@codex review
/gemini review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 568f54724b
Code Review
This pull request introduces a comprehensive production monitoring guide (docs/manual/monitoring.md) for Fedify applications. The guide details how to map Fedify's OpenTelemetry metrics to Prometheus, configure an OpenTelemetry Collector pipeline, set up a federation-health dashboard, define alerting rules, and manage metric cardinality. Related documentation files (docs/manual/deploy.md, docs/manual/opentelemetry.md, and docs/.vitepress/config.mts) have been updated to link to this new guide, and proper nouns have been added to .hongdown.toml. There are no review comments, so I have no feedback to provide.
fedify.queue.depth carries fedify.federation.instance_id so depth series stay distinct when several Federation instances share one MeterProvider. The example summed by role alone, which collapses that label and double-counts the backlog when those instances read from the same queue backend. Document keeping the instance id in the grouping for multi-instance setups. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
activitypub.delivery.permanent_failure only counts deliveries a remote rejected with a permanent-failure status code. The guide called the permanent-failure panel the deliveries Fedify "has given up on entirely", which overstates it: deliveries abandoned after the outbox retry policy exhausts on transport errors or transient 5xx responses are recorded on activitypub.outbox.activity with processing.result="abandoned" instead. Narrow the wording and add the abandoned-outbox series so operators do not miss that class of drops. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
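The two classes of dropped deliveries can be charted as separate series. This is a sketch only: the `_total` suffixes and the `processing_result` label spelling are assumptions about how the exporter translates `activitypub.delivery.permanent_failure`, `activitypub.outbox.activity`, and `processing.result`, so check them against the guide's name-translation table. The two expressions are separate queries.

```promql
# Deliveries a remote rejected with a permanent-failure status code.
sum(rate(activitypub_delivery_permanent_failure_total[5m]))

# Deliveries abandoned after the outbox retry policy was exhausted,
# for example on transport errors or transient 5xx responses.
sum(rate(activitypub_outbox_activity_total{processing_result="abandoned"}[5m]))
```

Plotting both on one panel makes it visible when drops come from exhausted retries rather than from outright rejections.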
When a large peer disappears it produces a short burst of 404/410 permanent failures and then stops, because Fedify stops retrying permanent-failure statuses. The alert used increase(...[15m]) with for: 1h, so the burst left the 15-minute window long before the one-hour for clause elapsed and the alert never fired for the scenario it targets. Count over a one-hour range with no for clause instead, so a single burst registers and then clears on its own. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
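The interaction between the range window and the `for` clause can be sketched as follows. The metric name, the `http_response_status_code` label, and the threshold of 20 are illustrative assumptions, not values taken from the guide. The two expressions are separate alert conditions.

```promql
# Before (the rule also had `for: 1h`): a short burst leaves the 15-minute
# window long before the condition has held for an hour, so the alert never fires.
sum(increase(activitypub_delivery_permanent_failure_total{http_response_status_code=~"404|410"}[15m])) > 20

# After (no `for` clause): the one-hour range keeps the burst in view, so the
# alert fires once and clears when the burst ages out of the window.
sum(increase(activitypub_delivery_permanent_failure_total{http_response_status_code=~"404|410"}[1h])) > 20
```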
@codex review
/gemini review
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/manual/monitoring.md (1)
177-185: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Include `fedify_federation_instance_id` in the backlog example.
The note says queue depth must stay split per instance when multiple `Federation` instances share a `MeterProvider`, but the query still groups only by role. That will collapse identical depth series and double-count backlog in the multi-instance case.
Fix: change `sum by (fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})` to `sum by (fedify_federation_instance_id, fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/manual/monitoring.md` around lines 177 - 185, The PromQL query example provided in the monitoring documentation does not match the guidance in the note that follows it. The note correctly states that fedify_federation_instance_id must remain in the grouping when multiple Federation instances share a MeterProvider to avoid double-counting backlog, but the example query sum by (fedify_queue_role) only groups by role. Update the PromQL query to include both fedify_queue_role and fedify_federation_instance_id in the sum by grouping clause to align with the documented best practice and prevent collapsing identical depth series across instances.
Code Review
This pull request introduces a comprehensive "Production monitoring" guide (docs/manual/monitoring.md) for Fedify applications, which details how to translate OpenTelemetry metrics into Prometheus dashboards and alert rules. It also updates .hongdown.toml with new proper nouns, integrates the new guide into the VitePress configuration, and adds cross-references in docs/manual/deploy.md and docs/manual/opentelemetry.md. There are no review comments, so I have no feedback to provide.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5c8995b7d
In the common multi-replica topology where every Federation instance observes one shared queue backend, registerQueueDepthGauge() reports the backend's full depth from each replica. Summing fedify_queue_depth by role then multiplies the backlog by the replica count and trips depth alerts early. Switch the example to max by (fedify_queue_role), which reads the true depth for a shared backend, and say when sum is right (one separate backend per instance). This also subsumes the earlier per-instance grouping concern, since max collapses the instance and scrape labels correctly. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
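The two aggregations described in this commit look like this in PromQL, using the metric and label names quoted in the review thread. As a hypothetical example, if three replicas each report a depth of 120 for the same shared backend, `sum` yields 360 while `max` yields the true 120. The two expressions are separate queries.

```promql
# Shared queue backend: every replica reports the same full depth, so take the max.
max by (fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})

# One separate backend per instance: the depths are disjoint, so sum is correct.
sum by (fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})
```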
The signature-latency panel references activitypub_signature_key_fetch_duration_milliseconds_bucket by its Prometheus name, but that metric was absent from the name-translation table, so readers hit the name with no mapping for where it came from. Add the row. fedify-dev#813 (comment) Assisted-by: Claude Code:claude-opus-4-8
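For context, a panel query over this histogram might look like the sketch below. The Prometheus name is the one quoted in this commit; the 5-minute rate window and the 0.95 quantile are illustrative choices. The translation itself follows the usual exporter pattern: dots in the OpenTelemetry name become underscores, a millisecond unit becomes a `_milliseconds` suffix, and histograms expose a `_bucket` series with an `le` label.

```promql
# p95 time spent fetching remote keys during signature verification.
histogram_quantile(0.95, sum by (le) (rate(activitypub_signature_key_fetch_duration_milliseconds_bucket[5m])))
```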
@codex review
/gemini review
Code Review
This pull request introduces a comprehensive production monitoring guide (docs/manual/monitoring.md) for Fedify applications, detailing how to turn OpenTelemetry metrics into a federation-health dashboard and Prometheus alert rules. It also updates the documentation configuration, proper nouns list, and related manual pages (deploy.md and opentelemetry.md) to link to this new guide. There are no review comments, so I have no feedback to provide.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c5964bfd2a
Fedify already documents its OpenTelemetry metrics, and the Deployment guide already names the federation signals an operator should watch. Neither place shows how to turn those signals into a dashboard, or how to decide which failures should page someone instead of quietly scrolling past in a trace. This adds that missing layer as a new manual chapter, docs/manual/monitoring.md.
The guide is written so an operator can stand up a first dashboard without reading Fedify's source. It covers:
- … `404`/`410` responses, and signature verification failures. Each rule explains the failure it is meant to catch, and the section states up front that every threshold is a starting point rather than a default.
- … `activitypub.remote.host` is the one label whose value set grows with the fediverse and should be aggregated away by default.

The `404`/`410` alert is framed as an investigation alert rather than a paging alert, since remote account deletion and instance churn are ordinary fediverse behavior, not an incident on your side.

Supporting changes:
Scope
This is a documentation change only. It does not add, rename, or redefine any metric; every metric name and attribute it uses comes from the metrics work tracked under #316, which has already landed. Following the open questions in the issue, this first version stays text-only and Prometheus-only: no exported Grafana dashboard JSON, no non-PromQL query examples, and no screenshots. Each of those is a reasonable follow-up once the written guide is in.
Closes #743.