refactor(scrapy): make AsyncThread timeout configurable by vdusek · Pull Request #955 · apify/apify-sdk-python

vdusek · 2026-06-09T10:45:05Z

AsyncThread.run_coro hardcoded a 60s timeout. This adds a configurable default_timeout constructor argument (default unchanged at 60s), used when a per-call timeout is not given, and documents why each Scrapy consumer (scheduler, HTTP cache) owns its own event-loop thread.
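
As a rough illustration of the change described above, the sketch below shows one way a constructor-level default could be combined with a per-call override. It is not the SDK's code: the class name `AsyncThreadSketch`, the helper `resolve_timeout`, the use of `timedelta`, and the literal string `'default'` as the sentinel are all assumptions made for this example.

```python
import asyncio
import threading
from datetime import timedelta
from typing import Any, Coroutine, Union


class AsyncThreadSketch:
    """Hypothetical stand-in for AsyncThread (an assumption, not the SDK's code)."""

    def __init__(self, default_timeout: timedelta = timedelta(seconds=60)) -> None:
        # The default stays at 60s unless the consumer asks for something else.
        self._default_timeout = default_timeout
        self._loop = asyncio.new_event_loop()
        # Each instance owns its own event loop running in a dedicated thread.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def resolve_timeout(self, timeout: Union[timedelta, str]) -> timedelta:
        # 'default' is an assumed sentinel meaning "use the constructor value".
        return self._default_timeout if timeout == 'default' else timeout

    def run_coro(
        self,
        coro: Coroutine[Any, Any, Any],
        timeout: Union[timedelta, str] = 'default',
    ) -> Any:
        # Submit the coroutine to the background loop and block for the result.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=self.resolve_timeout(timeout).total_seconds())

    def close(self) -> None:
        # Stop the loop from its own thread, wait for the thread, then close.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

With this shape, a consumer such as a scheduler could construct its thread with a longer default once, and individual calls would only pass `timeout=` when they need something different.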

Part of splitting the larger Scrapy integration fix (fix/scrapy-integration) into reviewable pieces.

codecov · 2026-06-09T10:46:36Z

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.33%. Comparing base (10203bc) to head (85237fa).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/apify/scrapy/_async_thread.py | 40.00% | 3 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #955      +/-   ##
==========================================
- Coverage   86.38%   86.33%   -0.06%     
==========================================
  Files          48       48              
  Lines        2916     2919       +3     
==========================================
+ Hits         2519     2520       +1     
- Misses        397      399       +2

| Flag | Coverage Δ | |
|---|---|---|
| e2e | `?` | |
| integration | `57.79% <0.00%> (-0.03%)` | ⬇️ |
| unit | `74.47% <40.00%> (-0.08%)` | ⬇️ |


…ut setting (#979)

## Description

Fixes several defects in the Scrapy integration's background event-loop thread (`AsyncThread`), the scheduler, and the HTTP cache storage, and makes the loop timeout configurable.

## Fixes

- **`run_coro` startup race**: the `is_running()` guard fired spuriously when a coroutine was submitted before the loop thread reached `run_forever()` (observed ~122/500 in `scheduler.open()`). It now guards on `is_closed()`. A coroutine queued on a not-yet-running loop runs once the loop starts; only a closed loop raises.
- **`close()` thread leak**: if task cancellation timed out or raised, the loop was never stopped or joined. Stop, join, and the forced-shutdown fallback now run in a `finally`, and the original error still propagates.
- **`close()` second call**: a repeated close raised `RuntimeError: Event loop is closed`. An `is_closed()` early-return makes it a no-op.
- **`close()` ignored its `timeout`** for the cancellation step (it used the constructor default). It now passes the caller's timeout through.
- **`run_coro` timeout** left the coroutine running. It now cancels the future on timeout.
- **HTTP cache open/cleanup thread leaks**: `open_spider` now closes the thread if opening the key-value store fails (matching `ApifyScheduler.open`). The expiration sweep runs inside `try` with `close()` in a `finally`.
- **Configurable timeout (#955)**: new `APIFY_ASYNC_THREAD_TIMEOUT_SECS` setting, wired into the scheduler (via `from_crawler`) and the cache storage.

## Error logging

The integration now follows consistent conventions for caught exceptions:

- **`except … as exc:` → `logger.warning(f'… {exc}')`, swallowed**: for *expected, recoverable* conditions handled locally: a malformed or legacy stored payload skipped as a cache/queue miss, or non-UTF-8 headers preserved in the serialized request. A short message plus the exception text, with no traceback, because it is not a bug.
- **`except Exception:` → `logger.exception('…')`, swallowed**: for *unexpected* failures handled at a terminal point: the cleanup sweep, shutdown, or skip-and-continue. `logger.exception` attaches the full traceback, and nothing re-raises because the error is handled here.
- **`except …:` → `raise` (no logging)**: when the error is re-raised and the caller or Scrapy logs it with a traceback anyway. `run_coro`'s timeout path cancels the future and re-raises without logging, so the failure is reported once.
- **`except Exception:` → `logger.exception('…'); raise`**: the boundary log, used only where local context materially helps *and* the propagated error would otherwise be logged only generically or not at all. The scheduler's `next_request` / `enqueue_request` / `has_pending_requests` are called synchronously by the Scrapy engine (not inside a Deferred), so without this log the Apify-specific context would be lost.

**Why `logger.exception` replaced `traceback.print_exc()`:** `traceback.print_exc()` writes a bare traceback straight to stderr, bypassing logging entirely. It has no level, no logger name, no message, and ignores Scrapy's and the SDK's log configuration and handlers. `logger.exception(msg)` logs at ERROR through the configured logging, so it is routed, formatted, and filterable like every other log line. It adds a message explaining *what* failed and still attaches the full traceback automatically, which makes including the exception object in the message (`{exc}`) redundant (ruff TRY401).

## Tests

New `tests/unit/scrapy/test_async_thread.py` covers the startup race, run-after-close, timeout cancellation, idempotent close, the caller timeout reaching the shutdown step, and stop/join when task cancellation fails. The scheduler and HTTP cache test modules gain coverage for the timeout setting, closing the thread on open failure, and the cleanup-failure path still closing the thread.
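
The error-logging conventions in the commit message above can be illustrated with a small, self-contained sketch. Everything here is hypothetical and chosen only for the example: the logger name and the functions `load_cached_payload`, `run_cleanup_sweep`, and `fetch_next_request` are not taken from the SDK. The sketch covers three of the four patterns; the fourth (re-raise with no logging) is simply a bare `raise`.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger('apify_scrapy_sketch')  # hypothetical logger name


def load_cached_payload(raw: str) -> Optional[Any]:
    """Expected, recoverable condition: warn with the exception text, no traceback."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Treated as a cache miss, so a short warning is enough.
        logger.warning(f'Skipping malformed cached payload: {exc}')
        return None


def run_cleanup_sweep(sweep: Callable[[], None]) -> None:
    """Unexpected failure at a terminal point: log with traceback and swallow."""
    try:
        sweep()
    except Exception:
        # logger.exception logs at ERROR and attaches the traceback itself.
        logger.exception('Cache cleanup sweep failed')


def fetch_next_request(fetch: Callable[[], Any]) -> Any:
    """Boundary log: record local context, then let the error propagate."""
    try:
        return fetch()
    except Exception:
        logger.exception('Failed to fetch the next request')
        raise
```

Note that the `logger.exception` messages do not interpolate the exception object, since the traceback already carries it.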

refactor(scrapy): make AsyncThread timeout configurable

9854382

vdusek added the labels adhoc (Ad-hoc unplanned task added during the sprint) and t-tooling (Issues with this label are in the ownership of the tooling team) Jun 9, 2026

vdusek self-assigned this Jun 9, 2026

github-actions Bot added this to the 142nd sprint - Tooling team milestone Jun 9, 2026

style(scrapy): tighten comments and docstrings

df26807

vdusek requested a review from Pijukatel June 9, 2026 11:03

vdusek marked this pull request as ready for review June 9, 2026 11:03

Pijukatel reviewed Jun 9, 2026

Comment thread src/apify/scrapy/_async_thread.py Outdated

refactor(scrapy): use 'default' sentinel for run_coro timeout

85237fa

vdusek requested a review from Pijukatel June 9, 2026 12:39

Pijukatel approved these changes Jun 9, 2026

vdusek merged commit a6b6839 into master Jun 9, 2026
26 of 28 checks passed

vdusek deleted the refactor/scrapy-async-thread-timeout branch June 9, 2026 12:42

This was referenced Jun 12, 2026

fix(scrapy): async-thread shutdown, duplicate error logs, and timeout setting #980

Closed

fix(scrapy): async-thread startup race, shutdown lifecycle, and timeout setting #979

Merged

vdusek merged 3 commits into master from refactor/scrapy-async-thread-timeout

3 participants
