
refactor(scrapy): make AsyncThread timeout configurable#955

Merged
vdusek merged 3 commits into master from refactor/scrapy-async-thread-timeout on Jun 9, 2026

@vdusek vdusek commented Jun 9, 2026

`AsyncThread.run_coro` hardcoded a 60s timeout. This PR adds a configurable `default_timeout` constructor argument (default unchanged at 60s), used when a per-call timeout is not given, and documents why each Scrapy consumer (scheduler, HTTP cache) owns its own event-loop thread.
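
For readers who have not opened the diff, the fallback described above can be sketched roughly as follows. This is a simplified stand-in, not the code in `src/apify/scrapy/_async_thread.py`: the class name `AsyncThreadSketch`, the use of float seconds for the timeout, and the helper `answer` are assumptions made for illustration.

```python
from __future__ import annotations

import asyncio
import threading
from typing import Any, Coroutine


class AsyncThreadSketch:
    """Hypothetical stand-in: an event loop running in a background thread."""

    def __init__(self, default_timeout: float = 60.0) -> None:
        # Assumption: timeouts are plain seconds; the real type may differ.
        self._default_timeout = default_timeout
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro: Coroutine[Any, Any, Any], timeout: float | None = None) -> Any:
        # A per-call timeout wins; otherwise fall back to the constructor default.
        effective = self._default_timeout if timeout is None else timeout
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=effective)

    def close(self) -> None:
        # Stop the loop from its own thread, wait for the thread, then close.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def answer() -> int:
    """Hypothetical coroutine used only to exercise the sketch."""
    return 42


thread = AsyncThreadSketch(default_timeout=5.0)
default_result = thread.run_coro(answer())              # uses default_timeout
override_result = thread.run_coro(answer(), timeout=1)  # per-call override
thread.close()
```

Comparing against `None` rather than relying on truthiness keeps an explicit `timeout=0` from silently falling back to the default.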

Part of splitting the larger Scrapy integration fix (fix/scrapy-integration) into reviewable pieces.

@vdusek vdusek added the adhoc and t-tooling labels Jun 9, 2026
@vdusek vdusek self-assigned this Jun 9, 2026
@github-actions github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 9, 2026

codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.33%. Comparing base (10203bc) to head (85237fa).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/apify/scrapy/_async_thread.py | 40.00% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #955      +/-   ##
==========================================
- Coverage   86.38%   86.33%   -0.06%     
==========================================
  Files          48       48              
  Lines        2916     2919       +3     
==========================================
+ Hits         2519     2520       +1     
- Misses        397      399       +2     

| Flag | Coverage Δ |
|---|---|
| e2e | ? |
| integration | 57.79% <0.00%> (-0.03%) ⬇️ |
| unit | 74.47% <40.00%> (-0.08%) ⬇️ |

@vdusek vdusek requested a review from Pijukatel June 9, 2026 11:03
@vdusek vdusek marked this pull request as ready for review June 9, 2026 11:03
Comment thread src/apify/scrapy/_async_thread.py Outdated
@vdusek vdusek requested a review from Pijukatel June 9, 2026 12:39
@vdusek vdusek merged commit a6b6839 into master Jun 9, 2026
26 of 28 checks passed
@vdusek vdusek deleted the refactor/scrapy-async-thread-timeout branch June 9, 2026 12:42
vdusek added a commit that referenced this pull request Jun 18, 2026
…ut setting (#979)

## Description

Fixes several defects in the Scrapy integration's background event-loop
thread (`AsyncThread`), the scheduler, and the HTTP cache storage, and
makes the loop timeout configurable.

## Fixes

- **`run_coro` startup race** — the `is_running()` guard fired
spuriously when a coroutine was submitted before the loop thread reached
`run_forever()` (observed ~122/500 in `scheduler.open()`). It now guards
on `is_closed()`. A coroutine queued on a not-yet-running loop runs once
the loop starts; only a closed loop raises.
- **`close()` thread leak** — if task cancellation timed out or raised,
the loop was never stopped or joined. Stop, join, and the
forced-shutdown fallback now run in a `finally`, and the original error
still propagates.
- **`close()` second call** — a repeated close raised `RuntimeError:
Event loop is closed`. An `is_closed()` early-return makes it a no-op.
- **`close()` ignored its `timeout`** for the cancellation step (it used
the constructor default). It now passes the caller's timeout through.
- **`run_coro` timeout** left the coroutine running. It now cancels the
future on timeout.
- **HTTP cache open/cleanup thread leaks** — `open_spider` now closes
the thread if opening the key-value store fails (matching
`ApifyScheduler.open`). The expiration sweep runs inside `try` with
`close()` in a `finally`.
- **Configurable timeout (#955)** — new
`APIFY_ASYNC_THREAD_TIMEOUT_SECS` setting, wired into the scheduler (via
`from_crawler`) and the cache storage.
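
The `AsyncThread` fixes listed above can be illustrated with a small self-contained sketch. It is not the SDK's implementation: `LoopThreadSketch`, `_cancel_pending`, the float-seconds timeouts, and the helper coroutines are assumptions, and the forced-shutdown fallback mentioned above is left out for brevity.

```python
from __future__ import annotations

import asyncio
import concurrent.futures
import threading
from typing import Any, Coroutine


class LoopThreadSketch:
    """Hypothetical stand-in for a background event-loop thread."""

    def __init__(self, default_timeout: float = 60.0) -> None:
        self._default_timeout = default_timeout
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro: Coroutine[Any, Any, Any], timeout: float | None = None) -> Any:
        # Guard on is_closed(), not is_running(): a coroutine submitted before
        # the thread reaches run_forever() is queued and runs once the loop starts.
        if self._loop.is_closed():
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise RuntimeError('Event loop is closed')
        effective = self._default_timeout if timeout is None else timeout
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        try:
            return future.result(timeout=effective)
        except concurrent.futures.TimeoutError:
            future.cancel()  # do not leave the coroutine running after a timeout
            raise

    async def _cancel_pending(self) -> None:
        current = asyncio.current_task()
        tasks = [t for t in asyncio.all_tasks() if t is not current]
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

    def close(self, timeout: float | None = None) -> None:
        if self._loop.is_closed():
            return  # a repeated close is a no-op
        effective = self._default_timeout if timeout is None else timeout
        try:
            # The caller's timeout is passed through to the cancellation step.
            self.run_coro(self._cancel_pending(), timeout=effective)
        finally:
            # Stop and join even if cancellation timed out or raised;
            # the original error still propagates after this block.
            self._loop.call_soon_threadsafe(self._loop.stop)
            self._thread.join(timeout=effective)
            if not self._thread.is_alive():
                self._loop.close()


async def quick() -> str:
    return 'ok'


async def slow() -> None:
    await asyncio.sleep(10)


sketch = LoopThreadSketch(default_timeout=5.0)
first = sketch.run_coro(quick())  # may be submitted before run_forever() starts
try:
    sketch.run_coro(slow(), timeout=0.05)
    timed_out = False
except concurrent.futures.TimeoutError:
    timed_out = True
sketch.close(timeout=2.0)
sketch.close()  # second call returns early
```

Catching `concurrent.futures.TimeoutError` works across Python versions; from Python 3.11 it is an alias of the built-in `TimeoutError`.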

## Error logging

The integration now follows consistent conventions for caught
exceptions:

- **`except … as exc:` → `logger.warning(f'… {exc}')`, swallowed** — for
*expected, recoverable* conditions handled locally: a malformed or
legacy stored payload skipped as a cache/queue miss, or non-UTF-8
headers preserved in the serialized request. A short message plus the
exception text, with no traceback, because it is not a bug.
- **`except Exception:` → `logger.exception('…')`, swallowed** — for
*unexpected* failures handled at a terminal point: the cleanup sweep,
shutdown, or skip-and-continue. `logger.exception` attaches the full
traceback, and nothing re-raises because the error is handled here.
- **`except …:` → `raise` (no logging)** — when the error is re-raised
and the caller or Scrapy logs it with a traceback anyway. `run_coro`'s
timeout path cancels the future and re-raises without logging, so the
failure is reported once.
- **`except Exception:` → `logger.exception('…'); raise`** — the
boundary log, used only where local context materially helps *and* the
propagated error would otherwise be logged only generically or not at
all. The scheduler's `next_request` / `enqueue_request` /
`has_pending_requests` are called synchronously by the Scrapy engine
(not inside a Deferred), so without this log the Apify-specific context
would be lost.
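
Side by side, the four conventions above look roughly like this. The function names, log messages, and the logger name are hypothetical and chosen only to show the shape of each pattern; they are not taken from the integration's code.

```python
from __future__ import annotations

import concurrent.futures
import json
import logging
from typing import Any, Callable

logger = logging.getLogger('conventions_sketch')  # hypothetical logger name


def load_payload(raw: str) -> Any:
    """Expected, recoverable: warn with the exception text, no traceback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning(f'Skipping malformed payload: {exc}')
        return None


def run_cleanup(sweep: Callable[[], None]) -> None:
    """Unexpected, handled at a terminal point: full traceback, swallowed."""
    try:
        sweep()
    except Exception:
        logger.exception('Cleanup sweep failed')


def wait_for(future: concurrent.futures.Future, timeout: float) -> Any:
    """Re-raised and logged by the caller: cancel, raise, log nothing here."""
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()
        raise


def next_item(fetch: Callable[[], Any]) -> Any:
    """Boundary log: add local context, then re-raise."""
    try:
        return fetch()
    except Exception:
        logger.exception('Failed to fetch the next request')
        raise
```

Only the last pattern logs and raises, which is why the text above restricts it to call sites where the propagated error would otherwise lose its context.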

**Why `logger.exception` replaced `traceback.print_exc()`:**
`traceback.print_exc()` writes a bare traceback straight to stderr,
bypassing logging entirely. It has no level, no logger name, no message,
and ignores Scrapy's and the SDK's log configuration and handlers.
`logger.exception(msg)` logs at ERROR through the configured logging, so
it is routed, formatted, and filterable like every other log line. It
adds a message explaining *what* failed and still attaches the full
traceback automatically, which makes including the exception object in
the message (`{exc}`) redundant (ruff TRY401).
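
A short sketch of the difference, using only the standard library: the record produced by `logger.exception` passes through the configured handler and formatter, so it carries a level, a logger name, and a message ahead of the traceback. The logger name `apify.sketch` and the message are made up for the example.

```python
import io
import logging

# Capture output in memory so the formatted record can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s %(name)s: %(message)s'))

logger = logging.getLogger('apify.sketch')  # hypothetical logger name
logger.addHandler(handler)
logger.propagate = False

try:
    int('not a number')
except ValueError:
    # No {exc} in the message: the traceback is attached automatically.
    logger.exception('Failed to parse the stored value')

output = stream.getvalue()
first_line = output.splitlines()[0]
```

`traceback.print_exc()` in the same `except` block would have written only the traceback to stderr, with nothing for a handler or filter to act on.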

## Tests

New `tests/unit/scrapy/test_async_thread.py` covers the startup race,
run-after-close, timeout cancellation, idempotent close, the caller
timeout reaching the shutdown step, and stop/join when task cancellation
fails. The scheduler and HTTP cache test modules gain coverage for the
timeout setting, closing the thread on open failure, and the
cleanup-failure path still closing the thread.

Labels

adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team.


3 participants