Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
256 changes: 152 additions & 104 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,47 +1,131 @@
<h1 align=center>Apify SDK for Python</h1>
<h1 align="center">Apify SDK for Python</h1>

<p align="center">
<a href="https://badge.fury.io/py/apify" rel="nofollow"><img src="https://badge.fury.io/py/apify.svg" alt="PyPI package version"></a>
<a href="https://pypi.org/project/apify/" rel="nofollow"><img src="https://img.shields.io/pypi/dm/apify" alt="PyPI package downloads"></a>
<a href="https://codecov.io/gh/apify/apify-sdk-python"><img src="https://codecov.io/gh/apify/apify-sdk-python/graph/badge.svg?token=Y6JBIZQFT6" alt="Codecov report"></a>
<a href="https://pypi.org/project/apify/" rel="nofollow"><img src="https://img.shields.io/pypi/pyversions/apify" alt="PyPI Python version"></a>
<a href="https://discord.gg/jyEM2PRvMU" rel="nofollow"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord"></a>
<strong>The official Python SDK for building <a href="https://docs.apify.com/platform/actors">Apify Actors</a>.</strong>
</p>

The Apify SDK for Python is the official library to create [Apify Actors](https://docs.apify.com/platform/actors)
in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor
event handling.
<p align="center">
<a href="https://pypi.org/project/apify/"><img src="https://badge.fury.io/py/apify.svg" alt="PyPI version"></a>
<a href="https://pypi.org/project/apify/"><img src="https://img.shields.io/pypi/dm/apify" alt="PyPI downloads"></a>
<a href="https://pypi.org/project/apify/"><img src="https://img.shields.io/badge/python-3.11%2B-blue" alt="Python versions"></a>
<a href="https://github.com/apify/apify-sdk-python/actions/workflows/on_master.yaml"><img src="https://github.com/apify/apify-sdk-python/actions/workflows/on_master.yaml/badge.svg?branch=master" alt="Build status"></a>
<a href="https://codecov.io/gh/apify/apify-sdk-python"><img src="https://codecov.io/gh/apify/apify-sdk-python/graph/badge.svg?token=Y6JBIZQFT6" alt="Coverage"></a>
<a href="https://github.com/apify/apify-sdk-python/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/apify" alt="License"></a>
<a href="https://discord.gg/jyEM2PRvMU"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord"></a>
</p>

`apify` is the official SDK for building [Apify Actors](https://docs.apify.com/platform/actors) in Python. It handles the Actor lifecycle, [storage](https://docs.apify.com/platform/storage) access, platform events, [Apify Proxy](https://docs.apify.com/platform/proxy), pay-per-event charging, and more.

If you just need to access the [Apify API](https://docs.apify.com/api/v2) from your Python applications,
check out the [Apify Client for Python](https://docs.apify.com/api/client/python) instead.
> If you only need to **consume** the [Apify API](https://docs.apify.com/api/v2) from Python (running Actors, reading datasets, managing storages) rather than building Actors, use the [Apify API client for Python](https://docs.apify.com/api/client/python) instead. It comes bundled with this SDK.

## Table of contents

- [Installation](#installation)
- [Quick start](#quick-start)
- [What are Actors?](#what-are-actors)
- [Features](#features)
- [What you can build](#what-you-can-build)
- [Usage examples](#usage-examples)
- [Documentation](#documentation)
- [Related projects](#related-projects)
- [Support and community](#support-and-community)
- [Contributing](#contributing)
- [License](#license)

## Installation

The Apify SDK for Python is available on PyPI as the `apify` package.
For default installation, using Pip, run the following:
The Apify SDK for Python requires **Python 3.11 or higher**. It is published on [PyPI](https://pypi.org/project/apify/) as the `apify` package and can be installed with [pip](https://pip.pypa.io/):

```bash
pip install apify
```

For users interested in integrating Apify with Scrapy, we provide a package extra called `scrapy`.
To install Apify with the `scrapy` extra, use the following command:
or with [uv](https://docs.astral.sh/uv/):

```bash
pip install apify[scrapy]
uv add apify
Comment thread
Pijukatel marked this conversation as resolved.
```

## Documentation
To use the Scrapy integration, install the `scrapy` extra:

```bash
pip install 'apify[scrapy]'
```

## Quick start

An Actor is a Python program that runs inside the `async with Actor:` context. The context initializes the Actor when it starts and tears it down when it finishes. Here's a minimal Actor that reads its input and stores a result:

```python
from apify import Actor


async def main() -> None:
async with Actor:
actor_input = await Actor.get_input()
Actor.log.info('Actor input: %s', actor_input)
await Actor.set_value('OUTPUT', 'Hello, world!')
```

The quickest way to scaffold a full Actor project, with the `.actor` configuration, input schema, and Dockerfile already in place, is the [Apify CLI](https://docs.apify.com/cli):

1. Install the CLI:

```bash
npm install -g apify-cli
```

2. Create a new Actor from the Python "getting started" template:

```bash
apify create my-actor --template python-start
```

For usage instructions, check the documentation on [Apify Docs](https://docs.apify.com/sdk/python/).
3. Run it locally:

## Examples
```bash
cd my-actor
apify run
```

Below are few examples demonstrating how to use the Apify SDK with some web scraping-related libraries.
To create, run, and deploy your first Actor step by step, see the [Quick start guide](https://docs.apify.com/sdk/python/docs/quick-start).

### Apify SDK with HTTPX and BeautifulSoup
## What are Actors?

Actors are serverless cloud programs that can do almost anything a human can do in a web browser. They range from small tasks, such as filling in forms or unsubscribing from online services, all the way up to scraping and processing vast numbers of web pages.

They run either locally or on the [Apify platform](https://docs.apify.com/platform/), where you can run them at scale, monitor them, schedule them, or publish and monetize them. If you're new to Apify, learn [what Apify is](https://docs.apify.com/platform/about) in the platform documentation.
Comment thread
vdusek marked this conversation as resolved.

## Features

- Run the full Actor lifecycle inside `async with Actor:`, covering init, exit, failures, status messages, and reboots ([Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle)).
- Read Actor input validated against your input schema with `Actor.get_input()` ([Actor input](https://docs.apify.com/sdk/python/docs/concepts/actor-input)).
- Read and write datasets, key-value stores, and request queues, locally or on the platform ([Working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages)).
- React to platform events such as system info, migration, and abort ([Actor events](https://docs.apify.com/sdk/python/docs/concepts/actor-events)).
- Route requests through Apify Proxy with group selection, country targeting, and rotation ([Proxy management](https://docs.apify.com/sdk/python/docs/concepts/proxy-management)).
- Start, call, abort, and metamorph other Actors and tasks, and attach webhooks to run events ([Interacting with other Actors](https://docs.apify.com/sdk/python/docs/concepts/interacting-with-other-actors), [Webhooks](https://docs.apify.com/sdk/python/docs/concepts/webhooks)).
- Monetize your Actor with pay-per-event charging ([Pay-per-event](https://docs.apify.com/sdk/python/docs/concepts/pay-per-event)).
- Reach the full [Apify API](https://docs.apify.com/api/v2) through a preconfigured `ApifyClient` ([Accessing the Apify API](https://docs.apify.com/sdk/python/docs/concepts/access-apify-api)).

## What you can build

Almost any Python project can become an Actor, including projects for:

- **Web scraping and crawling** — The SDK is fully compatible with [Crawlee](https://crawlee.dev/python), which makes Apify a natural place to deploy and scale your crawlers (see the [Crawlee guide](https://docs.apify.com/sdk/python/docs/guides/crawlee)). It also works with other popular scraping libraries, such as [Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy), [Scrapling](https://docs.apify.com/sdk/python/docs/guides/scrapling), or [Crawl4AI](https://docs.apify.com/sdk/python/docs/guides/crawl4ai).
- **Browser automation** — Drive a real browser with [Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright) or [Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium), or with higher-level tools such as [Browser Use](https://docs.apify.com/sdk/python/docs/guides/browser-use).
- **Web servers and APIs** — Run a [web server](https://docs.apify.com/sdk/python/docs/guides/running-webserver) inside an Actor to serve HTTP requests, for example to expose your scraper as a live API.
- **AI agents** — Host agents built with your framework of choice. Ready-made Actor templates cover [PydanticAI](https://apify.com/templates/python-pydanticai), [CrewAI](https://apify.com/templates/python-crewai), [LangGraph](https://apify.com/templates/python-langgraph), [LlamaIndex](https://apify.com/templates/python-llamaindex-agent), and [Smolagents](https://apify.com/templates/python-smolagents).
- **MCP servers** — Deploy a Python MCP server as an Actor and make its tools available to any MCP client. See [MCP server](https://apify.com/templates/python-mcp-empty) and [MCP proxy](https://apify.com/templates/python-mcp-proxy) templates

Whatever you build, the Apify SDK doesn't lock you into a particular framework. Bring the libraries you already use, and let Apify run your project in the cloud.

## Usage examples

The examples below show two common setups, but the same `async with Actor:` pattern works with any stack. For more, see the [guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx).

### HTTPX with BeautifulSoup

This example illustrates how to integrate the Apify SDK with [HTTPX](https://www.python-httpx.org/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) to scrape data from web pages.
Scrape pages with [HTTPX](https://www.python-httpx.org/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), using the Actor's request queue to track URLs:

```python
from bs4 import BeautifulSoup
Expand All @@ -52,45 +136,31 @@ from apify import Actor

async def main() -> None:
async with Actor:
# Retrieve the Actor input, and use default values if not provided.
actor_input = await Actor.get_input() or {}
start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

# Open the default request queue for handling URLs to be processed.
# Enqueue the start URLs into the default request queue.
request_queue = await Actor.open_request_queue()

# Enqueue the start URLs.
for start_url in start_urls:
url = start_url.get('url')
await request_queue.add_request(url)
await request_queue.add_request(start_url['url'])

# Process the URLs from the request queue.
# Process the queue until it's empty.
while request := await request_queue.fetch_next_request():
Actor.log.info(f'Scraping {request.url} ...')

# Fetch the HTTP response from the specified URL using HTTPX.
async with AsyncClient() as client:
response = await client.get(request.url)

# Parse the HTML content using Beautiful Soup.
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the desired data.
data = {
# Push the extracted data to the default dataset.
await Actor.push_data({
'url': request.url,
'title': soup.title.string,
'h1s': [h1.text for h1 in soup.find_all('h1')],
'h2s': [h2.text for h2 in soup.find_all('h2')],
'h3s': [h3.text for h3 in soup.find_all('h3')],
}

# Store the extracted data to the default dataset.
await Actor.push_data(data)
'title': soup.title.string if soup.title else None,
})
```

### Apify SDK with PlaywrightCrawler from Crawlee
### Crawlee with Playwright

This example demonstrates how to use the Apify SDK alongside `PlaywrightCrawler` from [Crawlee](https://crawlee.dev/python) to perform web scraping.
Scrape pages with [Crawlee](https://crawlee.dev/python)'s `PlaywrightCrawler`, which handles queueing, concurrency, and the browser for you:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
Expand All @@ -100,83 +170,61 @@ from apify import Actor

async def main() -> None:
async with Actor:
# Retrieve the Actor input, and use default values if not provided.
actor_input = await Actor.get_input() or {}
start_urls = [url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]
start_urls = [url['url'] for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]

# Exit if no start URLs are provided.
if not start_urls:
Actor.log.info('No start URLs specified in Actor input, exiting...')
await Actor.exit()
crawler = PlaywrightCrawler(max_requests_per_crawl=50, headless=True)

# Create a crawler.
crawler = PlaywrightCrawler(
# Limit the crawl to max requests. Remove or increase it for crawling all links.
max_requests_per_crawl=50,
headless=True,
)

# Define a request handler, which will be called for every request.
@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
url = context.request.url
Actor.log.info(f'Scraping {url}...')

# Extract the desired data.
data = {
async def handler(context: PlaywrightCrawlingContext) -> None:
Actor.log.info(f'Scraping {context.request.url} ...')
await context.push_data({
'url': context.request.url,
'title': await context.page.title(),
'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
}

# Store the extracted data to the default dataset.
await context.push_data(data)

# Enqueue additional links found on the current page.
})
# Follow links found on the page.
await context.enqueue_links()

# Run the crawler with the starting URLs.
await crawler.run(start_urls)
```

## What are Actors?
## Documentation

Actors are serverless cloud programs that can do almost anything a human can do in a web browser.
They can do anything from small tasks such as filling in forms or unsubscribing from online services,
all the way up to scraping and processing vast numbers of web pages.
The full SDK documentation lives at **[docs.apify.com/sdk/python](https://docs.apify.com/sdk/python)**. For the Apify platform itself, see the [Apify documentation](https://docs.apify.com/).

They can be run either locally, or on the [Apify platform](https://docs.apify.com/platform/),
where you can run them at scale, monitor them, schedule them, or publish and monetize them.
| Section | What you'll find |
|---|---|
| [Overview](https://docs.apify.com/sdk/python/docs/overview) | What the SDK is, what Actors are, and how the pieces fit together. |
| [Quick start](https://docs.apify.com/sdk/python/docs/quick-start) | Create, run, and deploy your first Python Actor. |
| [Concepts](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle) | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |
| [Guides](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx) | Integrations with BeautifulSoup, Parsel, Playwright, Selenium, Crawlee, Scrapy, Crawl4AI, and Browser Use, plus running a web server and using uv. |
| [Upgrading](https://docs.apify.com/sdk/python/docs/upgrading/upgrading-to-v4) | Migrating between major versions. |
| [API reference](https://docs.apify.com/sdk/python/reference) | Generated reference for every class and method. |
| [Changelog](https://docs.apify.com/sdk/python/docs/changelog) | Release history and breaking changes. |

If you're new to Apify, learn [what is Apify](https://docs.apify.com/platform/about)
in the Apify platform documentation.
## Related projects

## Creating Actors
- **[Apify API client for Python](https://docs.apify.com/api/client/python)** — talk to the Apify API directly from Python (bundled with this SDK).
- **[Crawlee for Python](https://crawlee.dev/python)** — web scraping and browser automation framework; fully compatible with this SDK.
- **[Apify SDK for JavaScript / TypeScript](https://docs.apify.com/sdk/js)** — the equivalent SDK for Node.js.
- **[Apify API client for JavaScript / TypeScript](https://docs.apify.com/api/client/js)** — the equivalent API client for Node.js.
- **[Crawlee for JavaScript / TypeScript](https://crawlee.dev)** — the original Node.js implementation of Crawlee.
- **[Apify CLI](https://docs.apify.com/cli)** — command-line tool for creating, running, and deploying Actors locally and on the platform.

To create and run Actors through Apify Console,
see the [Console documentation](https://docs.apify.com/academy/getting-started/creating-actors#choose-your-template).
## Support and community

To create and run Python Actors locally, check the documentation for
[how to create and run Python Actors locally](https://docs.apify.com/sdk/python/docs/quick-start).
- **Discord** — chat with the team and other users on the [Apify Discord server](https://discord.gg/jyEM2PRvMU).
- **GitHub issues** — report a bug or request a feature in the [issue tracker](https://github.com/apify/apify-sdk-python/issues).

## Guides
## Contributing

To see how you can use the Apify SDK with other popular libraries used for web scraping,
check out our guides for using
[BeautifulSoup with HTTPX](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx),
[Parsel with Impit](https://docs.apify.com/sdk/python/docs/guides/parsel-impit),
[Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright),
[Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium),
[Crawlee](https://docs.apify.com/sdk/python/docs/guides/crawlee),
or [Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy).
Bug reports, fixes, and improvements are welcome! See [CONTRIBUTING.md](./CONTRIBUTING.md) for the development setup, coding standards, testing, and release process. The project uses [uv](https://docs.astral.sh/uv/) for project management and [Poe the Poet](https://poethepoet.natn.io/) as a task runner; the typical loop is:

```bash
uv run poe install-dev # install dev dependencies and git hooks
uv run poe check-code # lint, type-check, and unit tests
```

## Usage concepts
## License

To learn more about the features of the Apify SDK and how to use them,
check out the Usage Concepts section in the sidebar,
particularly the guides for the [Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle),
[working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages),
[handling Actor events](https://docs.apify.com/sdk/python/docs/concepts/actor-events)
or [how to use proxies](https://docs.apify.com/sdk/python/docs/concepts/proxy-management).
Released under the [Apache License 2.0](./LICENSE).
Loading