refactor(scrapy): make AsyncThread timeout configurable#955
Merged
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #955 +/- ##
==========================================
- Coverage 86.38% 86.33% -0.06%
==========================================
Files 48 48
Lines 2916 2919 +3
==========================================
+ Hits 2519 2520 +1
- Misses 397 399 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Pijukatel
reviewed
Jun 9, 2026
Pijukatel
approved these changes
Jun 9, 2026
This was referenced Jun 12, 2026
vdusek
added a commit
that referenced
this pull request
Jun 18, 2026
…ut setting (#979) ## Description Fixes several defects in the Scrapy integration's background event-loop thread (`AsyncThread`), the scheduler, and the HTTP cache storage, and makes the loop timeout configurable. ## Fixes - **`run_coro` startup race** — the `is_running()` guard fired spuriously when a coroutine was submitted before the loop thread reached `run_forever()` (observed ~122/500 in `scheduler.open()`). It now guards on `is_closed()`. A coroutine queued on a not-yet-running loop runs once the loop starts; only a closed loop raises. - **`close()` thread leak** — if task cancellation timed out or raised, the loop was never stopped or joined. Stop, join, and the forced-shutdown fallback now run in a `finally`, and the original error still propagates. - **`close()` second call** — a repeated close raised `RuntimeError: Event loop is closed`. An `is_closed()` early-return makes it a no-op. - **`close()` ignored its `timeout`** for the cancellation step (it used the constructor default). It now passes the caller's timeout through. - **`run_coro` timeout** left the coroutine running. It now cancels the future on timeout. - **HTTP cache open/cleanup thread leaks** — `open_spider` now closes the thread if opening the key-value store fails (matching `ApifyScheduler.open`). The expiration sweep runs inside `try` with `close()` in a `finally`. - **Configurable timeout (#955)** — new `APIFY_ASYNC_THREAD_TIMEOUT_SECS` setting, wired into the scheduler (via `from_crawler`) and the cache storage. ## Error logging The integration now follows consistent conventions for caught exceptions: - **`except … as exc:` → `logger.warning(f'… {exc}')`, swallowed** — for *expected, recoverable* conditions handled locally: a malformed or legacy stored payload skipped as a cache/queue miss, or non-UTF-8 headers preserved in the serialized request. A short message plus the exception text, with no traceback, because it is not a bug. - **`except Exception:` → `logger.exception('…')`, swallowed** — for *unexpected* failures handled at a terminal point: the cleanup sweep, shutdown, or skip-and-continue. `logger.exception` attaches the full traceback, and nothing re-raises because the error is handled here. - **`except …:` → `raise` (no logging)** — when the error is re-raised and the caller or Scrapy logs it with a traceback anyway. `run_coro`'s timeout path cancels the future and re-raises without logging, so the failure is reported once. - **`except Exception:` → `logger.exception('…'); raise`** — the boundary log, used only where local context materially helps *and* the propagated error would otherwise be logged only generically or not at all. The scheduler's `next_request` / `enqueue_request` / `has_pending_requests` are called synchronously by the Scrapy engine (not inside a Deferred), so without this log the Apify-specific context would be lost. **Why `logger.exception` replaced `traceback.print_exc()`:** `traceback.print_exc()` writes a bare traceback straight to stderr, bypassing logging entirely. It has no level, no logger name, no message, and ignores Scrapy's and the SDK's log configuration and handlers. `logger.exception(msg)` logs at ERROR through the configured logging, so it is routed, formatted, and filterable like every other log line. It adds a message explaining *what* failed and still attaches the full traceback automatically, which makes including the exception object in the message (`{exc}`) redundant (ruff TRY401). ## Tests New `tests/unit/scrapy/test_async_thread.py` covers the startup race, run-after-close, timeout cancellation, idempotent close, the caller timeout reaching the shutdown step, and stop/join when task cancellation fails. The scheduler and HTTP cache test modules gain coverage for the timeout setting, closing the thread on open failure, and the cleanup-failure path still closing the thread.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
AsyncThread.run_corohardcoded a 60s timeout. This adds a configurabledefault_timeoutconstructor argument (default unchanged at 60s), used when a per-call timeout is not given, and documents why each Scrapy consumer (scheduler, HTTP cache) owns its own event-loop thread.Part of splitting the larger Scrapy integration fix (
fix/scrapy-integration) into reviewable pieces.