Skip to content

test(mem-repro): passthrough memory-leak reproduction harness#396

Open
coderdan wants to merge 1 commit into
mainfrom
dan/mem-repro-harness
Open

test(mem-repro): passthrough memory-leak reproduction harness#396
coderdan wants to merge 1 commit into
mainfrom
dan/mem-repro-harness

Conversation

@coderdan
Copy link
Copy Markdown
Contributor

@coderdan coderdan commented May 29, 2026

Follow-up to #395 (BUG-300). Adds the harness used to reproduce and root-cause the passthrough memory leak, so the result is reproducible by anyone.

Contents (scripts/mem-repro/)

  • soak.sh — sustained load + cgroup RSS sampling; the script that surfaces the leak.
  • run-docker.sh / run.sh — single-run variants (container / host process).
  • sample-rss.sh — standalone RSS sampler.
  • loadgen/ — Go load generator (parameterised INSERTs via pgx); prepleak/ — probe for the named-prepared-statement path.
  • schema.sql — plaintext table so all traffic is passthrough.
  • summary.md — trigger (empty encrypt config), root cause, reproduction table, and the fix (fix(passthrough): drain per-statement queues to stop memory leak (BUG-300) #395).

Notes

  • Self-contained under scripts/mem-repro/; not part of the Rust build. No local Go needed for the dockerised runners.
  • Raw run outputs (results/) are git-ignored.
  • The jemalloc allocator experiment from the investigation is not included (it showed no benefit; tracked separately if we want it).

How it was used

Empty the encrypt config, then soak.sh. On an affected build RSS climbs monotonically (300k inserts → 317 MiB, no release); with #395 it plateaus (~49 MiB).

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a passthrough memory leak without encryption configuration causing unbounded memory growth and OOM conditions.
  • Documentation

    • Added comprehensive incident report documenting the memory leak issue and resolution details.
  • Tests

    • Added memory reproduction and soak-test harness for validating memory stability.

Review Change Stack

Tooling used to reproduce and root-cause the passthrough memory leak
(BUG-300, fixed in #395): a Go load generator (loadgen/), a sustained-load
soak runner with cgroup RSS sampling (soak.sh), single-run and host-process
variants, a plaintext schema, and a probe for the named-prepared-statement
path (prepleak/). summary.md documents the trigger (empty encrypt config),
root cause, and verification.

Self-contained under scripts/mem-repro/; not part of the Rust build. Raw run
outputs (results/) are git-ignored. The jemalloc allocator experiment from the
investigation is intentionally not included.

Relates to BUG-300.
Copilot AI review requested due to automatic review settings May 29, 2026 13:55
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

Walkthrough

This PR adds a complete memory reproduction harness for BUG-300, a passthrough memory leak occurring with empty encryption configs on unconfigured proxy deployments. The changes include comprehensive incident documentation, a SQL schema for test workloads, Go-based load generators, memory sampling utilities, and bash orchestration scripts for both host-based and containerized test execution.

Changes

Memory Reproduction Harness for BUG-300

Layer / File(s) Summary
Incident report and harness documentation
scripts/mem-repro/summary.md, scripts/mem-repro/README.md
Customer-facing incident report explaining the passthrough memory leak, immediate workaround via CS_DEVELOPMENT__DISABLE_MAPPING=true, root cause analysis of per-statement cleanup skipping in passthrough mode, and reproduction methodology. README documents harness layout, prerequisites, step-by-step reproduction via soak.sh, variant test matrix (baseline, allocator arena, mapping workaround, control), and the prepleak named-statement probe.
Schema and build configuration
scripts/mem-repro/schema.sql, scripts/mem-repro/.gitignore
SQL schema defining the plaintext credit_data_order_v2 table matching customer configuration (id, organization_id, order_id, account_review, full_report/raw_report jsonb, timestamps). .gitignore excludes test output results/ directory.
Load generator and statement probe tools
scripts/mem-repro/go.mod, scripts/mem-repro/loadgen/main.go, scripts/mem-repro/prepleak/main.go
Go module with pgx and uuid dependencies; load generator that stress-tests concurrent inserts via transaction pools with configurable worker/count parameters and fixed JSON payloads; statement probe that repeatedly Prepares uniquely named statements on a single connection to exercise unbounded per-statement paths.
Memory sampling and test orchestration
scripts/mem-repro/sample-rss.sh, scripts/mem-repro/run.sh, scripts/mem-repro/run-docker.sh, scripts/mem-repro/soak.sh
sample-rss.sh captures per-process RSS/VSZ to CSV at configurable intervals; run.sh orchestrates single test runs against local proxy, starting/stopping RSS sampler and computing peak/final memory from output; run-docker.sh executes load generator in a transient golang container against Docker proxy, sampling cgroup memory.current to track leak; soak.sh runs multi-round sustained load via Docker with cooldown observation to validate memory plateau behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Operator
  participant Orchestrator as Test Orchestrator
  participant Sampler as Memory Sampler
  participant LoadGen as Load Generator
  participant Proxy
  participant Database

  Operator->>Orchestrator: run.sh or soak.sh
  Orchestrator->>Sampler: Start background RSS sampling
  Orchestrator->>Orchestrator: Capture baseline memory
  Orchestrator->>LoadGen: Launch load generator
  LoadGen->>Proxy: Insert operations (many transactions)
  Proxy->>Database: Execute INSERTs via passthrough
  Sampler->>Proxy: Poll RSS continuously
  Database->>Proxy: Acknowledge writes
  Proxy->>LoadGen: Transaction commits
  Orchestrator->>Orchestrator: Wait cooldown period
  Orchestrator->>Sampler: Stop sampling
  Orchestrator->>Orchestrator: Compute peak/final RSS from CSV
  Orchestrator->>Operator: Report memory metrics and leak detection
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • tobyhede

Poem

🐰 A harness born to track the leak,
Where memory graphs shall plateau, not peak,
Load-gen stress-tests, samplers keep watch,
And through the scripts, the bug we've scotched!
Soak it down, let cooling winds blow,
Now plaintext tables no longer grow. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: a reproduction harness for investigating a passthrough memory leak, which aligns with the PR's primary purpose.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dan/mem-repro-harness

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a self-contained reproduction harness under scripts/mem-repro/ to reliably surface, measure, and document the passthrough memory leak (BUG-300 / follow-up to #395), including Docker-based runners, RSS sampling, a Go load generator, and a written investigation summary.

Changes:

  • Added shell scripts to run single-shot and soak-style repros while sampling RSS (host + Docker/cgroup).
  • Added Go-based load generators (loadgen/, prepleak/) plus Go module metadata for reproducible runs.
  • Added schema + write-up docs (schema.sql, summary.md, README.md) and ignored results/ outputs.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
scripts/mem-repro/README.md Documents harness purpose, layout, prerequisites, and how to reproduce/compare variants
scripts/mem-repro/summary.md Detailed incident summary, root cause explanation, and reproduction table/results
scripts/mem-repro/soak.sh Sustained-load Docker soak runner with cgroup memory sampling and CSV output
scripts/mem-repro/run-docker.sh Single-run Docker-based runner + cgroup memory sampling
scripts/mem-repro/run.sh Single-run host-process runner + RSS sampling via ps
scripts/mem-repro/sample-rss.sh Standalone RSS/VSZ sampler that appends to CSV
scripts/mem-repro/schema.sql Plaintext schema to keep traffic passthrough for the repro
scripts/mem-repro/loadgen/main.go Concurrent insert load generator using pgxpool
scripts/mem-repro/prepleak/main.go Probe for named-prepared-statement retention path
scripts/mem-repro/go.mod Go module for loadgen/ and prepleak/
scripts/mem-repro/go.sum Go dependency lockfile for the harness
scripts/mem-repro/.gitignore Ignores generated results/ output directory

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/mem-repro/soak.sh
ROUNDS="${2:-6}"
COUNT="${3:-50000}"
WORKERS="${4:-16}"
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
DSN="postgres://${PGUSER}:${PGPASS}@proxy:6432/${PGDB}?sslmode=disable"

# Resolve the docker network the proxy is attached to.
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
Comment thread scripts/mem-repro/soak.sh
Comment on lines +30 to +31
docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null 2>&1 || true
Comment on lines +51 to +55
docker run --rm --network "${NET}" \
-v "${HERE}:/src" -w /src \
-e GOFLAGS=-mod=mod \
golang:1.22-bookworm \
bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'"
Comment on lines +185 to +189
if err := tx.Commit(ctx); err != nil {
log.Printf("Insert failed to commit: %v", err)
atomic.AddInt64(&completed, 1)
continue
}
Comment thread scripts/mem-repro/soak.sh
Comment on lines +11 to +14
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${HERE}/../.." && pwd)"
LABEL="${1:?usage: soak.sh <label> [rounds] [count] [workers]}"
ROUNDS="${2:-6}"
Comment on lines +16 to +20
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${HERE}/../.." && pwd)"

LABEL="${1:?usage: run-docker.sh <label> [count] [workers]}"
COUNT="${2:-20000}"
Comment thread scripts/mem-repro/run.sh
Comment on lines +22 to +26
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${HERE}/../.." && pwd)"

LABEL="${1:?usage: run.sh <label> [count] [workers] [dsn]}"
COUNT="${2:-20000}"
Comment on lines +69 to +70
Harness: `scripts/mem-repro/` (`soak.sh`). Allocator is irrelevant (glibc ==
jemalloc); the leak is logical retention, not fragmentation.
Comment thread scripts/mem-repro/soak.sh
Comment on lines +5 to +6
# Runs several back-to-back rounds of simple-protocol inserts (max per-statement
# churn) and keeps sampling through a cooldown to see whether RSS is released.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (1)
scripts/mem-repro/summary.md (1)

44-58: ⚡ Quick win

Use symbols or permalinks instead of live source line numbers.

These backend.rs:175 / frontend.rs:380 references will drift as the code changes, which makes the root-cause section hard to verify later. Symbol names alone, or commit-permalinks tied to the investigation snapshot, will age much better.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/summary.md` around lines 44 - 58, Replace fragile numeric
file:line references in the root-cause section with stable symbols or a commit
permalink: reference the backend symbol(s) (e.g., the function that returns
early such as backend::execute or backend::handle_statement and the methods
complete_execution()/finish_session()), the frontend symbols (e.g.,
frontend::start_session, frontend::set_execute*), and context::is_passthrough /
encrypt_config.is_empty / mapping_disabled, or add a commit-permalink for the
investigation snapshot; update the three places that currently say
"backend.rs:175", "frontend.rs:380", and "context/mod.rs:780" to use those
symbol names or permalinks so the explanation remains verifiable as code moves.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/mem-repro/loadgen/main.go`:
- Around line 155-188: The worker currently treats Begin/Exec/Commit failures as
"completed" (atomic.AddInt64(&completed, 1)) and continues, causing main() to
finish with exit 0 even on DB errors; change the worker to propagate the first
error back to main by sending the error (with context cancel or a buffered errCh
and sync.Once) whenever pool.Begin(ctx), tx.Exec(ctx, ...), or tx.Commit(ctx)
fail (and still rollback where appropriate), stop incrementing completed for
failed attempts so completed only counts successful commits, and have main()
listen for that error and return/exit non-zero on the first reported error.
- Around line 17-25: The flags validation is incomplete: after flag.Parse() add
checks that *count and *workers are positive (>0) and fail fast; specifically,
validate the variables count and workers (and keep the existing dsn check) and
call log.Fatal with clear messages like "count must be > 0" or "workers must be
> 0" when invalid so main (and the insert loop that uses *count and worker
goroutines that rely on *workers) never proceeds with negative or zero values.

In `@scripts/mem-repro/run-docker.sh`:
- Around line 29-30: The NET assignment concatenates all network names with no
separator, producing an invalid --network when proxy is attached to multiple
networks; change the docker inspect template so each network key is emitted on
its own line and then keep the head -n1 selection. Concretely, update the NET
assignment (the variable NET and its docker inspect invocation) to use a newline
in the template, e.g. '{{range $k,$v :=
.NetworkSettings.Networks}}{{$k}}{{"\n"}}{{end}}' | head -n1, so NET contains
exactly one valid network name.

In `@scripts/mem-repro/sample-rss.sh`:
- Around line 24-29: The script currently resolves PID by taking the first pgrep
match into PID, which is unsafe when multiple processes match PATTERN; change
the logic around the PID resolution (the PID variable and the pgrep -f
"$PATTERN" invocation) to require exactly one match: run pgrep -f "$PATTERN",
count matches, and if count == 0 exit with a clear error, if count > 1 exit with
a clear error asking the user to provide an explicit PID or tighten PATTERN (do
not silently pick head -n1). Ensure error messages reference PATTERN and the
ambiguous match count so callers know why it failed.

In `@scripts/mem-repro/schema.sql`:
- Around line 10-19: The schema must be deterministic: replace the current
CREATE TABLE IF NOT EXISTS for credit_data_order_v2 with a deterministic setup
(e.g., run DROP TABLE IF EXISTS credit_data_order_v2; then CREATE TABLE ... with
the exact columns and defaults) or switch to a dedicated harness table name and
ensure the harness that expects the plaintext passthrough table (the loadgen
main harness) targets that name; alternatively add an explicit schema
verification step that inspects credit_data_order_v2 columns/types and fails
fast on drift so the loadgen harness cannot silently run against an incompatible
shape.

In `@scripts/mem-repro/soak.sh`:
- Line 17: The NET assignment in soak.sh uses a docker inspect template that
concatenates all network names into one token; change the template passed to
docker inspect in the NET assignment (the command starting with docker inspect
proxy --format ...) to emit one network name per line (e.g. add a newline inside
the range template) so that piping to head -n1 returns a single valid network;
update the NET variable assignment accordingly (same fix as applied to
run-docker.sh) so the template prints each .NetworkSettings.Networks key on its
own line before head -n1.
- Around line 29-32: The loop currently swallows failures because the docker run
... go run ./loadgen command ends with "|| true"; remove that "|| true" and
instead check the command's exit status (the docker run invocation inside the
for r in $(seq 1 "${ROUNDS}") loop) immediately after it; if the exit code is
non‑zero, echo an error mentioning the LABEL and round number and exit non‑zero
(or break) so failed load rounds are not treated as successful before printing
the RSS summary.

In `@scripts/mem-repro/summary.md`:
- Around line 11-13: The fenced code block containing the environment variable
line "CS_DEVELOPMENT__DISABLE_MAPPING=true" needs a language tag; change the
opening triple backticks from ``` to ```bash (or ```sh) so the snippet is marked
as shell and markdownlint stops flagging it, e.g., update the fence around that
exact snippet in the summary.md file.

---

Nitpick comments:
In `@scripts/mem-repro/summary.md`:
- Around line 44-58: Replace fragile numeric file:line references in the
root-cause section with stable symbols or a commit permalink: reference the
backend symbol(s) (e.g., the function that returns early such as
backend::execute or backend::handle_statement and the methods
complete_execution()/finish_session()), the frontend symbols (e.g.,
frontend::start_session, frontend::set_execute*), and context::is_passthrough /
encrypt_config.is_empty / mapping_disabled, or add a commit-permalink for the
investigation snapshot; update the three places that currently say
"backend.rs:175", "frontend.rs:380", and "context/mod.rs:780" to use those
symbol names or permalinks so the explanation remains verifiable as code moves.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e1af32e3-7568-4047-9c4a-7f816f99d5dd

📥 Commits

Reviewing files that changed from the base of the PR and between 100829b and e387a42.

⛔ Files ignored due to path filters (1)
  • scripts/mem-repro/go.sum is excluded by !**/*.sum
📒 Files selected for processing (11)
  • scripts/mem-repro/.gitignore
  • scripts/mem-repro/README.md
  • scripts/mem-repro/go.mod
  • scripts/mem-repro/loadgen/main.go
  • scripts/mem-repro/prepleak/main.go
  • scripts/mem-repro/run-docker.sh
  • scripts/mem-repro/run.sh
  • scripts/mem-repro/sample-rss.sh
  • scripts/mem-repro/schema.sql
  • scripts/mem-repro/soak.sh
  • scripts/mem-repro/summary.md

Comment on lines +17 to +25
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()

if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate -count and -workers as positive values.

Right now Line 143 will panic for a negative -count, and -workers=0 falls through to a successful Done! without inserting anything. This harness should fail fast on invalid inputs instead of producing misleading no-op runs.

Proposed fix
 	flag.Parse()
 
 	if *dsn == "" {
 		log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
 	}
+	if *count <= 0 {
+		log.Fatal("-count must be > 0")
+	}
+	if *workers <= 0 {
+		log.Fatal("-workers must be > 0")
+	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()
if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
var (
dsn = flag.String("dsn", "postgres://bloomuser:password@127.0.0.1:6432/pdts_db?sslmode=disable", "Database connection string (DSN)")
count = flag.Int("count", 10000, "Number of inserts to perform")
workers = flag.Int("workers", 10, "Number of concurrent workers")
)
flag.Parse()
if *dsn == "" {
log.Fatal("DSN is required. Use -dsn flag to provide database connection string")
}
if *count <= 0 {
log.Fatal("-count must be > 0")
}
if *workers <= 0 {
log.Fatal("-workers must be > 0")
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/loadgen/main.go` around lines 17 - 25, The flags validation
is incomplete: after flag.Parse() add checks that *count and *workers are
positive (>0) and fail fast; specifically, validate the variables count and
workers (and keep the existing dsn check) and call log.Fatal with clear messages
like "count must be > 0" or "workers must be > 0" when invalid so main (and the
insert loop that uses *count and worker goroutines that rely on *workers) never
proceeds with negative or zero values.

Comment on lines +155 to +188
tx, err := pool.Begin(ctx)
if err != nil {
log.Printf("Insert failed to begin tx: %v", err)
atomic.AddInt64(&completed, 1)
continue
}

// NOTE: The original repro issued `SET CIPHERSTASH.KEYSET_NAME`
// here to mirror repo.setKeysetNameOnTx. For the passthrough
// (no-encryption) memory test that SET is omitted: this proxy has
// a default keyset configured, which rejects the SET, and it is
// irrelevant to the per-statement type-check churn we measure.

_, err = tx.Exec(ctx, query,
id,
orgID,
orderID,
false,
testJSON,
testJSON,
now,
now,
)
if err != nil {
tx.Rollback(ctx)
log.Printf("Insert failed: %v", err)
atomic.AddInt64(&completed, 1)
continue
}

if err := tx.Commit(ctx); err != nil {
log.Printf("Insert failed to commit: %v", err)
atomic.AddInt64(&completed, 1)
continue
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the run when a worker hits a database error.

Lines 156-188 currently count begin/exec/commit failures as completed work and still let main() exit 0 with Done!. That makes a broken repro look successful and can invalidate any RSS series collected around it. Please propagate the first worker error back to main and return a non-zero exit instead of continuing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/loadgen/main.go` around lines 155 - 188, The worker
currently treats Begin/Exec/Commit failures as "completed"
(atomic.AddInt64(&completed, 1)) and continues, causing main() to finish with
exit 0 even on DB errors; change the worker to propagate the first error back to
main by sending the error (with context cancel or a buffered errCh and
sync.Once) whenever pool.Begin(ctx), tx.Exec(ctx, ...), or tx.Commit(ctx) fail
(and still rollback where appropriate), stop incrementing completed for failed
attempts so completed only counts successful commits, and have main() listen for
that error and return/exit non-zero on the first reported error.

Comment on lines +29 to +30
# Resolve the docker network the proxy is attached to.
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pick one Docker network name instead of concatenating all of them.

Line 30 renders every attached network key with no separator, so a multi-network proxy container produces an invalid --network value. head -n1 only works if the template prints one network per line.

Proposed fix
-NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
+NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{println $k}}{{end}}' | head -n1)"
+if [[ -z "${NET}" ]]; then
+  echo "Could not determine the Docker network for container 'proxy'." >&2
+  exit 1
+fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Resolve the docker network the proxy is attached to.
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
# Resolve the docker network the proxy is attached to.
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{println $k}}{{end}}' | head -n1)"
if [[ -z "${NET}" ]]; then
echo "Could not determine the Docker network for container 'proxy'." >&2
exit 1
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/run-docker.sh` around lines 29 - 30, The NET assignment
concatenates all network names with no separator, producing an invalid --network
when proxy is attached to multiple networks; change the docker inspect template
so each network key is emitted on its own line and then keep the head -n1
selection. Concretely, update the NET assignment (the variable NET and its
docker inspect invocation) to use a newline in the template, e.g. '{{range $k,$v
:= .NetworkSettings.Networks}}{{$k}}{{"\n"}}{{end}}' | head -n1, so NET contains
exactly one valid network name.

Comment on lines +24 to +29
# Resolve the PID once so we follow a single process for the whole run.
PID="$(pgrep -f "$PATTERN" | head -n1 || true)"
if [[ -z "${PID}" ]]; then
echo "No process matching '${PATTERN}' is running. Start the proxy first." >&2
exit 1
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject ambiguous process matches instead of sampling the first PID.

Lines 25-29 silently pick the first pgrep match. If more than one process matches the pattern, the CSV can end up tracking the wrong process while still looking valid. For a repro harness, it's safer to require exactly one match or ask for an explicit PID.

Proposed fix
-# Resolve the PID once so we follow a single process for the whole run.
-PID="$(pgrep -f "$PATTERN" | head -n1 || true)"
-if [[ -z "${PID}" ]]; then
+# Resolve the PID once so we follow a single process for the whole run.
+mapfile -t PIDS < <(pgrep -f "$PATTERN" || true)
+if (( ${`#PIDS`[@]} == 0 )); then
   echo "No process matching '${PATTERN}' is running. Start the proxy first." >&2
   exit 1
 fi
+if (( ${`#PIDS`[@]} > 1 )); then
+  echo "Pattern '${PATTERN}' matched multiple processes: ${PIDS[*]}" >&2
+  echo "Use a more specific pattern or extend the script to accept an explicit PID." >&2
+  exit 1
+fi
+PID="${PIDS[0]}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Resolve the PID once so we follow a single process for the whole run.
PID="$(pgrep -f "$PATTERN" | head -n1 || true)"
if [[ -z "${PID}" ]]; then
echo "No process matching '${PATTERN}' is running. Start the proxy first." >&2
exit 1
fi
# Resolve the PID once so we follow a single process for the whole run.
mapfile -t PIDS < <(pgrep -f "$PATTERN" || true)
if (( ${`#PIDS`[@]} == 0 )); then
echo "No process matching '${PATTERN}' is running. Start the proxy first." >&2
exit 1
fi
if (( ${`#PIDS`[@]} > 1 )); then
echo "Pattern '${PATTERN}' matched multiple processes: ${PIDS[*]}" >&2
echo "Use a more specific pattern or extend the script to accept an explicit PID." >&2
exit 1
fi
PID="${PIDS[0]}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/sample-rss.sh` around lines 24 - 29, The script currently
resolves PID by taking the first pgrep match into PID, which is unsafe when
multiple processes match PATTERN; change the logic around the PID resolution
(the PID variable and the pgrep -f "$PATTERN" invocation) to require exactly one
match: run pgrep -f "$PATTERN", count matches, and if count == 0 exit with a
clear error, if count > 1 exit with a clear error asking the user to provide an
explicit PID or tighten PATTERN (do not silently pick head -n1). Ensure error
messages reference PATTERN and the ambiguous match count so callers know why it
failed.

Comment on lines +10 to +19
CREATE TABLE IF NOT EXISTS credit_data_order_v2 (
id uuid PRIMARY KEY,
organization_id uuid NOT NULL,
order_id uuid NOT NULL,
account_review boolean NOT NULL DEFAULT false,
full_report jsonb NOT NULL,
raw_report jsonb NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make the repro schema deterministic.

CREATE TABLE IF NOT EXISTS will silently reuse any existing credit_data_order_v2, even if it still has a different shape from a prior test setup. That breaks the “self-contained repro harness” goal, because scripts/mem-repro/loadgen/main.go:136-177 will then run against whatever schema happened to be there instead of the plaintext passthrough table this harness expects.

Consider dropping/recreating the table in a scratch DB, using a dedicated harness table name, or adding an explicit schema verification step that fails on drift.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/schema.sql` around lines 10 - 19, The schema must be
deterministic: replace the current CREATE TABLE IF NOT EXISTS for
credit_data_order_v2 with a deterministic setup (e.g., run DROP TABLE IF EXISTS
credit_data_order_v2; then CREATE TABLE ... with the exact columns and defaults)
or switch to a dedicated harness table name and ensure the harness that expects
the plaintext passthrough table (the loadgen main harness) targets that name;
alternatively add an explicit schema verification step that inspects
credit_data_order_v2 columns/types and fails fast on drift so the loadgen
harness cannot silently run against an incompatible shape.

Comment thread scripts/mem-repro/soak.sh
ROUNDS="${2:-6}"
COUNT="${3:-50000}"
WORKERS="${4:-16}"
NET="$(docker inspect proxy --format '{{range $k,$v := .NetworkSettings.Networks}}{{$k}}{{end}}' | head -n1)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix Docker network resolution here as well.

Line 17 has the same concatenation bug as run-docker.sh: if proxy is attached to multiple networks, the template returns a single invalid combined name. Emit one network per line before head -n1.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/soak.sh` at line 17, The NET assignment in soak.sh uses a
docker inspect template that concatenates all network names into one token;
change the template passed to docker inspect in the NET assignment (the command
starting with docker inspect proxy --format ...) to emit one network name per
line (e.g. add a newline inside the range template) so that piping to head -n1
returns a single valid network; update the NET variable assignment accordingly
(same fix as applied to run-docker.sh) so the template prints each
.NetworkSettings.Networks key on its own line before head -n1.

Comment thread scripts/mem-repro/soak.sh
Comment on lines +29 to +32
for r in $(seq 1 "${ROUNDS}"); do
docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null 2>&1 || true
echo "[$LABEL] round ${r}/${ROUNDS} done; RSS=$(docker exec proxy cat /sys/fs/cgroup/memory.current|awk '{printf "%.1f",$1/1048576}') MiB"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not swallow failed load rounds.

Line 31 ends the docker run ... go run command with || true, so the script prints round ... done even when the load generator fails. That makes the final RSS summary unsafe to interpret because it may describe a run that never actually exercised the proxy.

Proposed fix
-  docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
-    bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null 2>&1 || true
+  docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
+    bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for r in $(seq 1 "${ROUNDS}"); do
docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null 2>&1 || true
echo "[$LABEL] round ${r}/${ROUNDS} done; RSS=$(docker exec proxy cat /sys/fs/cgroup/memory.current|awk '{printf "%.1f",$1/1048576}') MiB"
for r in $(seq 1 "${ROUNDS}"); do
docker run --rm --network "${NET}" -v "${HERE}:/src" -w /src -e GOFLAGS=-mod=mod golang:1.22-bookworm \
bash -c "go mod tidy >/dev/null 2>&1; go run ./loadgen -count ${COUNT} -workers ${WORKERS} -dsn '${DSN}'" >/dev/null
echo "[$LABEL] round ${r}/${ROUNDS} done; RSS=$(docker exec proxy cat /sys/fs/cgroup/memory.current|awk '{printf "%.1f",$1/1048576}') MiB"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/soak.sh` around lines 29 - 32, The loop currently swallows
failures because the docker run ... go run ./loadgen command ends with "||
true"; remove that "|| true" and instead check the command's exit status (the
docker run invocation inside the for r in $(seq 1 "${ROUNDS}") loop) immediately
after it; if the exit code is non‑zero, echo an error mentioning the LABEL and
round number and exit non‑zero (or break) so failed load rounds are not treated
as successful before printing the RSS summary.

Comment on lines +11 to +13
```
CS_DEVELOPMENT__DISABLE_MAPPING=true
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language tag to this fenced block.

markdownlint is already flagging this snippet; marking it as bash/sh keeps the doc lint-clean.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 11-11: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/mem-repro/summary.md` around lines 11 - 13, The fenced code block
containing the environment variable line "CS_DEVELOPMENT__DISABLE_MAPPING=true"
needs a language tag; change the opening triple backticks from ``` to ```bash
(or ```sh) so the snippet is marked as shell and markdownlint stops flagging it,
e.g., update the fence around that exact snippet in the summary.md file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants