Production monitoring guide #813

Merged
dahlia merged 9 commits into fedify-dev:main from dahlia:docs/monitoring
Jun 21, 2026

@dahlia dahlia commented Jun 21, 2026

Member

Fedify already documents its OpenTelemetry metrics, and the Deployment guide already names the federation signals an operator should watch. Neither place shows how to turn those signals into a dashboard, or how to decide which failures should page someone instead of quietly scrolling past in a trace. This adds that missing layer as a new manual chapter, docs/manual/monitoring.md.

The guide is written so an operator can stand up a first dashboard without reading Fedify's source. It covers:

  • Six dashboard panels (queue backlog, inbox processing latency, outbound delivery attempts, outbound delivery failure rate, permanent delivery failures, and signature verification latency), each with a PromQL example and a note on how to read it.
  • Prometheus alert rules for a growing queue backlog, an outbound delivery failure spike, sustained inbox latency, spikes in remote 404/410 responses, and signature verification failures. Each rule explains the failure it is meant to catch, and the section states up front that every threshold is a starting point rather than a default.
  • The OpenTelemetry-to-Prometheus name translation, so the metric names from the OpenTelemetry chapter line up with what you actually type into a query bar.
  • A single OpenTelemetry Collector pipeline that either exposes a Prometheus scrape endpoint or forwards metrics over OTLP.
  • Cardinality guidance for dashboard and alert authors, including why activitypub.remote.host is the one label whose value set grows with the fediverse and should be aggregated away by default.
  • The boundary between Fedify's own metrics and the runtime, database, queue-backend, and host-platform metrics it deliberately does not emit.

The 404/410 alert is framed as an investigation alert rather than a paging alert, since remote account deletion and instance churn are ordinary fediverse behavior, not an incident on your side.

Supporting changes:

  • docs/.vitepress/config.mts: add the page to the manual sidebar, next to OpenTelemetry.
  • docs/manual/deploy.md and docs/manual/opentelemetry.md: link to the new guide from the observability section and from the instrumented-metrics reference, respectively.
  • .hongdown.toml: add “Prometheus” and “OpenTelemetry Collector” to the proper-noun list so they keep their capitalization in headings.

Scope

This is a documentation change only. It does not add, rename, or redefine any metric; every metric name and attribute it uses comes from the metrics work tracked under #316, which has already landed. Following the open questions in the issue, this first version stays text-only and Prometheus-only: no exported Grafana dashboard JSON, no non-PromQL query examples, and no screenshots. Each of those is a reasonable follow-up once the written guide is in.

Closes #743.

Add a new manual chapter, docs/manual/monitoring.md, that turns
Fedify's OpenTelemetry metrics into a starter dashboard and a set of
alert rules.  The OpenTelemetry chapter and the deployment guide
already document the metrics and name the federation signals an
operator should watch, but neither shows how to build a first
dashboard or decide which failures should page someone.

The guide covers six dashboard panels (queue backlog, inbox
processing latency, outbound delivery attempts, outbound delivery
failure rate, permanent delivery failures, and signature verification
latency), PromQL alert examples that each explain the failure they
catch, the OpenTelemetry-to-Prometheus name translation, an
OpenTelemetry Collector pipeline, cardinality guidance for dashboard
and alert authors, and the boundary between Fedify metrics and the
runtime, database, queue-backend, and host-platform metrics it does
not emit.  It stays vendor-neutral and notes that every threshold is
a starting point rather than a default.  Spikes in remote 404/410
responses are framed as investigation alerts rather than paging
alerts, since remote account deletion and instance churn are normal
fediverse behavior.

Wire the page into the VitePress manual sidebar, and link to it from
the observability section of the deployment guide and from the
instrumented-metrics section of the OpenTelemetry chapter.  Add
"Prometheus" and "OpenTelemetry Collector" to the Hongdown
proper-noun list so those words keep their capitalization in
headings.

fedify-dev#743

Assisted-by: Claude Code:claude-opus-4-8
Assisted-by: Codex:gpt-5.5
@dahlia dahlia added this to the Fedify 2.3 milestone Jun 21, 2026
@dahlia dahlia requested a review from Copilot June 21, 2026 04:32
@dahlia dahlia self-assigned this Jun 21, 2026
@dahlia dahlia added the component/otel OpenTelemetry integration label Jun 21, 2026
netlify Bot commented Jun 21, 2026

Deploy Preview for fedify-json-schema ready!

Name Link
🔨 Latest commit c5964bf
🔍 Latest deploy log https://app.netlify.com/projects/fedify-json-schema/deploys/6a3794c97578eb0008ef1c51
😎 Deploy Preview https://deploy-preview-813--fedify-json-schema.netlify.app

coderabbitai Bot commented Jun 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2037ccd6-290d-433d-97e1-c95155433e0e

📥 Commits

Reviewing files that changed from the base of the PR and between b194a90 and c5964bf.

📒 Files selected for processing (1)
  • docs/manual/monitoring.md

📝 Walkthrough

Adds a new docs/manual/monitoring.md production monitoring guide covering the OpenTelemetry Collector pipeline, federation health dashboard PromQL queries, Prometheus alert rules, cardinality guidance, and metric boundary definitions. The page is added to the VitePress sidebar and cross-linked from docs/manual/deploy.md and docs/manual/opentelemetry.md. The spell-checker gains "OpenTelemetry Collector" as a proper noun.

Changes

Production Monitoring Documentation

| Layer / File(s) | Summary |
| --- | --- |
| Sidebar nav, cross-links, and spell-check entry (docs/.vitepress/config.mts, docs/manual/opentelemetry.md, docs/manual/deploy.md, .hongdown.toml) | Adds the "Monitoring" entry to the VitePress manual sidebar, inserts pointer paragraphs in opentelemetry.md and deploy.md directing readers to the new guide, and registers "OpenTelemetry Collector" in the proper-nouns list. |
| Monitoring guide: introduction and OpenTelemetry Collector pipeline (docs/manual/monitoring.md) | Introduces the monitoring guide scope and assumptions, then documents the OpenTelemetry Collector pipeline configuration for routing OTLP metrics to Prometheus scraping. Explains metric-name translation conventions (dot/underscore, unit/type suffixes, counter totals, histogram expansion). |
| Monitoring guide: federation health dashboard PromQL queries (docs/manual/monitoring.md) | Defines the first federation dashboard with PromQL panels covering queue backlog, inbox processing latency, outbound delivery volume with success split, failure rate, permanent failures by HTTP status, and signature verification latency broken down by key-fetch versus verification time. |
| Monitoring guide: alert rules, cardinality guidance, and metric boundaries (docs/manual/monitoring.md) | Provides Prometheus alert rules for queue backlog growth, outbound delivery failures, sustained inbox latency, remote 404/410 spikes, and signature verification failures. Documents cardinality bounding practices and clarifies metric boundaries between Fedify instrumentation and process/runtime/database/queue/host metrics. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • Production monitoring dashboards and alerting guide #743 — This PR directly implements the acceptance criteria of that issue: monitoring.md exists with dashboard and alert examples using finalized metric names, is linked from the sidebar and existing docs, includes cardinality guidance, distinguishes Fedify metrics from platform metrics, and includes an OpenTelemetry Collector example.
  • Example Grafana dashboard and Prometheus alert rules for Fedify metrics #745 — That issue proposes companion runnable example files (docker-compose, Grafana dashboard JSON, Prometheus rules) as follow-up deliverables that would reference the written guide added in this PR.

Possibly related PRs

  • fedify-dev/fedify#697: Both PRs modify docs/manual/deploy.md to extend the observability and monitoring sections with federation health signal guidance.

Suggested labels

type/documentation

Suggested reviewers

  • sij411
  • 2chanhaeng

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Production monitoring guide' accurately describes the main change: adding a comprehensive production monitoring guide to the documentation. |
| Description check | ✅ Passed | The description is thorough and directly related to the changeset, explaining the monitoring guide's purpose, content, scope, and supporting changes in detail. |
| Linked Issues check | ✅ Passed | The PR fully addresses all coding and documentation requirements from issue #743: dashboard panels, PromQL examples, alert rules, OTEL-to-Prometheus name translation, Collector configuration, cardinality guidance, metric boundaries, sidebar navigation, and cross-references. |
| Out of Scope Changes check | ✅ Passed | All changes are directly in scope: the new monitoring.md guide, VitePress sidebar configuration, deploy.md and opentelemetry.md updates with cross-references, and .hongdown.toml proper-noun additions, all supporting the production monitoring documentation objective. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


dahlia commented Jun 21, 2026

Member Author

@codex review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive 'Production monitoring' guide (docs/manual/monitoring.md) along with corresponding updates to the configuration, deployment, and OpenTelemetry documentation. It also adds 'OpenTelemetry Collector' and 'Prometheus' to the proper nouns list in .hongdown.toml. The review feedback correctly identifies multiple instances in the new monitoring guide where PromQL code blocks lack language specifications, violating the repository's style guide requirement to specify the language for quadruple-tilde code blocks.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Production monitoring manual chapter that bridges Fedify’s existing OpenTelemetry metrics documentation to practical operational usage (dashboards + alerting), and links it from the existing observability docs and navigation.

Changes:

  • Add docs/manual/monitoring.md with PromQL dashboard panels, alert rule examples, naming translation notes, and cardinality guidance.
  • Link the new guide from OpenTelemetry and Deployment manual chapters.
  • Add the new page to the VitePress manual sidebar and extend Hongdown proper-noun capitalization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/manual/monitoring.md | New monitoring guide with PromQL dashboard/alert examples, collector pipeline, and cardinality guidance. |
| docs/manual/opentelemetry.md | Links readers from the metrics reference to the new monitoring guide. |
| docs/manual/deploy.md | Links operators from the deployment observability checklist to the monitoring guide. |
| docs/.vitepress/config.mts | Adds the monitoring guide to the manual sidebar navigation. |
| .hongdown.toml | Ensures “Prometheus” and “OpenTelemetry Collector” keep capitalization in headings. |

Comment thread docs/manual/monitoring.md
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/manual/monitoring.md`:
- Around line 177-179: All nine PromQL code blocks in the monitoring
documentation currently use quadruple-tilde markers without language
specifications. Update each of these code block opening markers by changing
`~~~~` to `~~~~ promql` to comply with documentation standards and ensure proper
syntax highlighting in the rendered output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1c204a5a-b356-4457-bb86-97aec7686422

📥 Commits

Reviewing files that changed from the base of the PR and between 35dea44 and 775b38a.

📒 Files selected for processing (5)
  • .hongdown.toml
  • docs/.vitepress/config.mts
  • docs/manual/deploy.md
  • docs/manual/monitoring.md
  • docs/manual/opentelemetry.md

Comment thread docs/manual/monitoring.md Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 775b38a7f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread docs/manual/monitoring.md Outdated
codecov Bot commented Jun 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


dahlia added 3 commits June 21, 2026 14:56
The PromQL query blocks used bare quadruple-tilde fences with no
language identifier.  Tag them as promql so they follow the
repository convention that fenced code blocks specify a language.

fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)
fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
The 404/410 spike and signature-failure alerts compared a per-second
rate() against a whole-number threshold, so "> 1" meant more than one
event per second: far above the background-churn levels the prose
describes.  Switch both to increase(), which counts events over the
window, matching the "more than N in the last few minutes" intent the
surrounding text sets up.

fedify-dev#813 (comment)
fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
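
The rate() versus increase() point in the commit message above comes down to simple arithmetic, which a short Python sketch can make concrete. This is a simplified model, not Prometheus's implementation: it ignores window-edge extrapolation and counter resets, and the counter readings, the 15-minute window, and the helper names are all hypothetical.

```python
# Simplified model of PromQL rate() and increase() over one range window.
# Real Prometheus also extrapolates to the window edges and handles counter
# resets; both are ignored here to keep the arithmetic visible.


def increase(first_sample: float, last_sample: float) -> float:
    """Events counted between the first and last sample in the window."""
    return last_sample - first_sample


def rate(first_sample: float, last_sample: float, window_seconds: float) -> float:
    """Average events per second over the window."""
    return increase(first_sample, last_sample) / window_seconds


WINDOW_SECONDS = 15 * 60  # hypothetical 15-minute range
THRESHOLD = 1             # the "> 1" comparison described in the commit

# Hypothetical counter readings: 3 failures arrived during the window.
counter_at_start, counter_at_end = 40.0, 43.0

per_second = rate(counter_at_start, counter_at_end, WINDOW_SECONDS)
events = increase(counter_at_start, counter_at_end)

rate_alert_fires = per_second > THRESHOLD
increase_alert_fires = events > THRESHOLD

print(f"rate: {per_second:.5f}/s, fires: {rate_alert_fires}")
print(f"increase: {events:.0f} events, fires: {increase_alert_fires}")
```

Under these assumed numbers, three failures in fifteen minutes is about 0.003 events per second, so a rate-based "> 1" rule stays silent while the increase-based rule crosses the threshold.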
activitypub.inbox.processing_duration is recorded in the queue worker,
which runs after handleInbox() has already answered the remote with
202 Accepted.  The guide described a high p95 there as remote servers
timing out, which only holds for inline (no-queue) listeners.  Clarify
that behind a queue this is side-effect latency, and point readers at
fedify.http.server.request.duration on the inbox endpoints for the
latency remotes actually experience.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
dahlia commented Jun 21, 2026

Member Author

@codex review

dahlia commented Jun 21, 2026

Member Author

/gemini review

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread docs/manual/monitoring.md

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 568f54724b

Comment thread docs/manual/monitoring.md
Comment thread docs/manual/monitoring.md Outdated
Comment thread docs/manual/monitoring.md Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive production monitoring guide (docs/manual/monitoring.md) for Fedify applications. The guide details how to map Fedify's OpenTelemetry metrics to Prometheus, configure an OpenTelemetry Collector pipeline, set up a federation-health dashboard, define alerting rules, and manage metric cardinality. Related documentation files (docs/manual/deploy.md, docs/manual/opentelemetry.md, and docs/.vitepress/config.mts) have been updated to link to this new guide, and proper nouns have been added to .hongdown.toml. There are no review comments, so I have no feedback to provide.

dahlia added 3 commits June 21, 2026 15:39
fedify.queue.depth carries fedify.federation.instance_id so depth
series stay distinct when several Federation instances share one
MeterProvider.  The example summed by role alone, which collapses
that label and double-counts the backlog when those instances read
from the same queue backend.  Document keeping the instance id in the
grouping for multi-instance setups.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
activitypub.delivery.permanent_failure only counts deliveries a remote
rejected with a permanent-failure status code.  The guide called the
permanent-failure panel the deliveries Fedify "has given up on
entirely", which overstates it: deliveries abandoned after the outbox
retry policy exhausts on transport errors or transient 5xx responses
are recorded on activitypub.outbox.activity with
processing.result="abandoned" instead.  Narrow the wording and add the
abandoned-outbox series so operators do not miss that class of drops.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
When a large peer disappears it produces a short burst of 404/410
permanent failures and then stops, because Fedify stops retrying
permanent-failure statuses.  The alert used increase(...[15m]) with
for: 1h, so the burst left the 15-minute window long before the
one-hour for clause elapsed and the alert never fired for the scenario
it targets.  Count over a one-hour range with no for clause instead,
so a single burst registers and then clears on its own.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
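
The interaction between the range window and the for clause described in the commit above can be replayed with a small simulation. The sketch below is a simplified model, assuming one rule evaluation per minute and a plain event count in place of increase(); the burst shape, the threshold of 10, and the function name are hypothetical.

```python
def fired_minutes(event_minutes, range_minutes, for_minutes, threshold, horizon=180):
    """Return the evaluation minutes at which the alert is firing."""
    firing = []
    pending_since = None  # minute at which the condition first became true
    for now in range(horizon + 1):
        # Simplified increase(...[range]): events in the window (now - range, now].
        in_window = sum(1 for t in event_minutes if now - range_minutes < t <= now)
        if in_window > threshold:
            if pending_since is None:
                pending_since = now
            # "for" semantics: the condition must hold this long before firing.
            if now - pending_since >= for_minutes:
                firing.append(now)
        else:
            pending_since = None
    return firing


# Hypothetical burst: 60 permanent failures spread over minutes 0, 1, and 2,
# after which the remote is no longer retried and the counter goes quiet.
burst = [0] * 20 + [1] * 20 + [2] * 20
THRESHOLD = 10  # hypothetical

old_rule = fired_minutes(burst, range_minutes=15, for_minutes=60, threshold=THRESHOLD)
new_rule = fired_minutes(burst, range_minutes=60, for_minutes=0, threshold=THRESHOLD)

print("15m range, for 1h:", old_rule or "never fires")
print("1h range, no for: fires from minute", new_rule[0], "to minute", new_rule[-1])
```

In this model the 15-minute rule is only pending for about a quarter of an hour and never reaches its one-hour for clause, while the one-hour range fires on the first evaluation and clears by itself once the burst ages out of the window.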
dahlia commented Jun 21, 2026

Member Author

@codex review

dahlia commented Jun 21, 2026

Member Author

/gemini review

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/manual/monitoring.md (1)

177-185: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include fedify_federation_instance_id in the backlog example.

The note says queue depth must stay split per instance when multiple Federation instances share a MeterProvider, but the query still groups only by role. That will collapse identical depth series and double-count backlog in the multi-instance case.

Fix
-sum by (fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})
+sum by (fedify_federation_instance_id, fedify_queue_role) (fedify_queue_depth{fedify_queue_depth_state="queued"})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/manual/monitoring.md` around lines 177 - 185, The PromQL query example
provided in the monitoring documentation does not match the guidance in the note
that follows it. The note correctly states that fedify_federation_instance_id
must remain in the grouping when multiple Federation instances share a
MeterProvider to avoid double-counting backlog, but the example query sum by
(fedify_queue_role) only groups by role. Update the PromQL query to include both
fedify_queue_role and fedify_federation_instance_id in the sum by grouping
clause to align with the documented best practice and prevent collapsing
identical depth series across instances.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c78ed060-17ae-4f2f-818d-237df9f90c06

📥 Commits

Reviewing files that changed from the base of the PR and between 568f547 and a5c8995.

📒 Files selected for processing (1)
  • docs/manual/monitoring.md

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive "Production monitoring" guide (docs/manual/monitoring.md) for Fedify applications, which details how to translate OpenTelemetry metrics into Prometheus dashboards and alert rules. It also updates .hongdown.toml with new proper nouns, integrates the new guide into the VitePress configuration, and adds cross-references in docs/manual/deploy.md and docs/manual/opentelemetry.md. There are no review comments, so I have no feedback to provide.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a5c8995b7d

Comment thread docs/manual/monitoring.md Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread docs/manual/monitoring.md
dahlia added 2 commits June 21, 2026 16:19
In the common multi-replica topology where every Federation instance
observes one shared queue backend, registerQueueDepthGauge() reports
the backend's full depth from each replica.  Summing fedify_queue_depth
by role then multiplies the backlog by the replica count and trips depth
alerts early.  Switch the example to max by (fedify_queue_role), which
reads the true depth for a shared backend, and say when sum is right
(one separate backend per instance).  This also subsumes the earlier
per-instance grouping concern, since max collapses the instance and
scrape labels correctly.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
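
The sum-versus-max reasoning in the commit above is easy to check by hand. The Python sketch below mimics the two aggregations over hypothetical scrape samples; the replica names, the "outbox" role value, and the depth figures are made up for illustration and are not taken from Fedify.

```python
from collections import defaultdict


def aggregate_by_role(samples, combine):
    """Group depth samples by role and combine each group, like `<agg> by (role)`."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["role"]].append(sample["depth"])
    return {role: combine(depths) for role, depths in grouped.items()}


# Hypothetical scrape: three replicas all observe ONE shared queue backend,
# so each reports the backend's full depth for the same role.
shared_backend = [
    {"instance": "replica-a", "role": "outbox", "depth": 120},
    {"instance": "replica-b", "role": "outbox", "depth": 120},
    {"instance": "replica-c", "role": "outbox", "depth": 120},
]

summed = aggregate_by_role(shared_backend, sum)  # multiplies by replica count
maxed = aggregate_by_role(shared_backend, max)   # reads the true shared depth

# Hypothetical opposite case: each instance owns a separate backend, so the
# depths are independent and adding them up is the right total.
separate_backends = [
    {"instance": "replica-a", "role": "outbox", "depth": 40},
    {"instance": "replica-b", "role": "outbox", "depth": 30},
    {"instance": "replica-c", "role": "outbox", "depth": 50},
]
separate_total = aggregate_by_role(separate_backends, sum)

print("shared backend, sum:", summed)
print("shared backend, max:", maxed)
print("separate backends, sum:", separate_total)
```

With a shared backend the summed value is three times the real backlog, which is the early-alert problem the commit describes; max returns the real figure, and sum stays correct only when every instance has its own backend.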
The signature-latency panel references
activitypub_signature_key_fetch_duration_milliseconds_bucket by its
Prometheus name, but that metric was absent from the name-translation
table, so readers hit the name with no mapping for where it came from.
Add the row.

fedify-dev#813 (comment)

Assisted-by: Claude Code:claude-opus-4-8
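
For readers who want to see how an OpenTelemetry name ends up as a Prometheus series name, the translation conventions mentioned in this PR (dots to underscores, unit suffix, counter totals, histogram expansion) can be approximated in a few lines of Python. This is a rough sketch of the general convention, not the Collector's exporter code: the function name is hypothetical, and the instrument kinds and the millisecond unit passed in below are assumptions rather than facts confirmed by this thread.

```python
def prometheus_names(otel_name: str, kind: str, unit_suffix: str = "") -> list[str]:
    """Approximate the Prometheus series names produced for one OpenTelemetry metric."""
    # Dots are not valid in classic Prometheus metric names.
    base = otel_name.replace(".", "_")
    # A unit, when present, is appended in spelled-out form.
    if unit_suffix:
        base = f"{base}_{unit_suffix}"
    if kind == "counter":
        # Monotonic counters gain a _total suffix.
        return [f"{base}_total"]
    if kind == "histogram":
        # Histograms expand into three series families.
        return [f"{base}_bucket", f"{base}_sum", f"{base}_count"]
    # Gauges keep the base name.
    return [base]


# Metric names below appear in this PR; kinds and units are assumed.
print(prometheus_names("fedify.queue.depth", "gauge"))
print(prometheus_names("activitypub.delivery.permanent_failure", "counter"))
print(prometheus_names("activitypub.inbox.processing_duration", "histogram", "milliseconds"))
```

The gauge case matches the fedify_queue_depth name used elsewhere in this thread; the other two outputs depend on the assumed kind and unit, so the guide's own translation table remains the place to confirm exact names.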
dahlia commented Jun 21, 2026

Member Author

@codex review

dahlia commented Jun 21, 2026

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive production monitoring guide (docs/manual/monitoring.md) for Fedify applications, detailing how to turn OpenTelemetry metrics into a federation-health dashboard and Prometheus alert rules. It also updates the documentation configuration, proper nouns list, and related manual pages (deploy.md and opentelemetry.md) to link to this new guide. There are no review comments, so I have no feedback to provide.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5964bfd2a

Comment thread docs/manual/monitoring.md

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@dahlia dahlia requested review from 2chanhaeng and sij411 June 21, 2026 08:08

@2chanhaeng 2chanhaeng left a comment

Member

Looks great!

@dahlia dahlia merged commit 6152d3d into fedify-dev:main Jun 21, 2026
21 checks passed
@dahlia dahlia deleted the docs/monitoring branch June 21, 2026 11:43

Labels

component/otel OpenTelemetry integration

Development

Successfully merging this pull request may close these issues.

Production monitoring dashboards and alerting guide

3 participants