docs: Add `HTTP headers` guide by Mantisus · Pull Request #1957 · apify/crawlee-python

Mantisus · 2026-06-09T21:50:03Z

Description

Add a guide on working with HTTP headers in web scraping (docs/guides/http_headers.mdx) with a runnable example.

Issues

Closes: Expose a browser-impersonation toggle directly on HTTP crawlers #1923

Pijukatel

Hi, I think it would be better to do just a documentation change and keep the current implementation.

I wrote the reasons into the issue, as the current wording of the issue is asking for a code change.

#1923 (comment)

vdusek

Two comments from my side.

And regarding this:

> Hi, I think it would be better to do just a documentation change and keep the current implementation.
>
> I wrote the reasons into the issue, as the current wording of the issue is asking for a code change.

Exposing high-level convenience arguments on crawlers, which configure the underlying components, is a Crawlee design choice. And we follow this all over the place - crawlers (and Actor in SDK) already act as partial facades over the components they compose. A few examples:

- `PlaywrightCrawler` exposes `headless`, `browser_type`, `browser_launch_options`, `use_incognito_pages`, `user_data_dir`, and similar `BrowserPool`/plugin internals directly on the crawler. With your approach, there should be only `browser_pool`.
- `StagehandCrawler` follows the same pattern.
- `BasicCrawler` has `use_session_pool: bool` next to the `session_pool` object, which is the same shape as `impersonate`.

So impersonate is not introducing a new pattern. It follows the same approach we already use for browsers, sessions, proxy configuration, concurrency, and more.

This design decision was made a long time ago, and we should be consistent and follow it, rather than diverging from it. And AFAIK @B4nan is a strong proponent of this approach.

> This creates the ugly edge case `HttpCrawler(impersonate=False, http_client=...)`.

We should simply validate the argument combination and raise an error when it is invalid, like in other places.

...

TL;DR: A convenience flag (with guard and documentation) is consistent with the rest of Crawlee. It also keeps the simple case simple: users can turn off impersonation without needing to know what an HTTP client is.

Pijukatel · 2026-06-10T08:16:38Z

> ...
> Exposing high-level convenience arguments on crawlers, which configure the underlying components, is a Crawlee design choice.
> ...

Those internal components usually have more arguments than what is exposed on the Crawler level, and many internal component arguments remain unexposed (which is fine). I do not think we have sufficient evidence to say that this specific internal component argument is so useful for the general user base that it deserves to be exposed on the Crawler level. JS tooling has been around for a while. Was anyone missing such an argument?

We should exercise restraint when exposing those convenience arguments. The more we have, the harder it is to understand the code.

Mantisus · 2026-06-10T11:49:55Z

Regarding the `impersonate` flag: to me, this situation is similar to #1487. Both cases require only a minor configuration change on the user’s part, and if it were entirely up to me, I would limit myself to providing documentation.

But since we’re already taking the approach of "giving the user a simple configuration option", this PR is fully consistent with that approach.

vdusek

LGTM from my side, but since we don't have consensus here yet, let's gather more input - @janbuchar, what do you think?

If we decide not to include this flag at the HttpCrawler level, what about adding it at the HTTP client level at least? It could just be a simple flag to turn this behavior on or off completely. Right now, different HTTP clients handle this differently: their interfaces and arguments differ.

janbuchar · 2026-06-26T11:12:44Z

See crawlee v4's session management guide. The way to fine-tune fingerprint impersonation (is that a good way to call it? 🤔) is already there, and it's different and arguably more capable.

Then again, achieving full parity will be a major undertaking. But from my side, adding a single flag that stops working once you set up a custom http client is not optimal and we'd have to deprecate it once we have the new session management in place.

Mantisus · 2026-06-26T12:33:33Z

> See crawlee v4's session management guide. The way to fine-tune fingerprint impersonation (is that a good way to call it? 🤔) is already there, and it's different and arguably more capable.

If we plan to implement this in v2, then I would limit this PR to documentation only.

@vdusek, what do you think about that?

vdusek · 2026-06-26T13:01:30Z

@janbuchar Thanks for your input.

@Mantisus Yeah, let's update the docs only. Thanks.

vdusek

A few comments

vdusek · 2026-06-28T10:28:27Z

+    # Set default headers on the client. They are sent on every request.
+    http_client = ImpitHttpClient(headers={'X-Api-Key': 'secret'})
+
+    crawler = HttpCrawler(http_client=http_client)


All crawler examples set max_requests_per_crawl.

Suggested change

-    crawler = HttpCrawler(http_client=http_client)
+    crawler = HttpCrawler(http_client=http_client, max_requests_per_crawl=10)

vdusek · 2026-06-28T10:28:27Z

+    async def request_handler(context: HttpCrawlingContext) -> None:
+        # `httpbin.org/headers` echoes the received request headers back.
+        response = (await context.http_response.read()).decode()
+        context.log.info(response)


Both requests hit the same URL, so the two log lines are indistinguishable. Maybe we can add a `unique_key` prefix so it's clear which response carried the per-request `Accept`?

Suggested change

-        context.log.info(response)
+        context.log.info(f'{context.request.unique_key}: {response}')
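
For anyone who wants to see the effect of this suggestion without network access, here is a rough standard-library sketch. It is not the guide's example and does not use Crawlee: `EchoHeadersHandler` is a hypothetical local stand-in for `httpbin.org/headers`, `fetch_labelled` is an illustrative helper, and the labels `default` and `with-accept` are assumed names playing the role of `unique_key`. Layering per-request headers over the defaults with a plain dict merge is also an assumption made for the demo.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed defaults, mirroring the client-level header in the diff above.
DEFAULT_HEADERS = {'X-Api-Key': 'secret'}


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Hypothetical local stand-in for `httpbin.org/headers`."""

    def do_GET(self) -> None:
        # Echo the received request headers back as JSON.
        payload = json.dumps({'headers': dict(self.headers.items())}).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format: str, *args: object) -> None:
        """Silence the server's own access log."""


def fetch_labelled(port: int, label: str, extra_headers: dict[str, str]) -> str:
    """Send one GET and return the echoed headers prefixed with `label`."""
    connection = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
    try:
        # Per-request headers are layered on top of the defaults.
        merged = {**DEFAULT_HEADERS, **extra_headers}
        connection.request('GET', '/headers', headers=merged)
        body = connection.getresponse().read().decode()
    finally:
        connection.close()
    return f'{label}: {body}'


def run_demo() -> list[str]:
    """Hit the same URL twice and return one labelled line per response."""
    server = HTTPServer(('127.0.0.1', 0), EchoHeadersHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return [
            fetch_labelled(server.server_port, 'default', {}),
            fetch_labelled(server.server_port, 'with-accept', {'Accept': 'application/json'}),
        ]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    for line in run_demo():
        print(line)
```

Without the label, the only way to tell the two lines apart would be to read the echoed JSON and look for the `Accept` entry.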

vdusek · 2026-06-28T10:28:27Z

+
+### Identity headers
+
+`User-Agent` identifies the client. Many sites serve different markup to a browser than to a crawler. Some reject requests whose `User-Agent` doesn't look like a real browser. It is one of the basic headers a server uses to identify the client, though not the only one.


Doc writing style: Prefer a contraction.

Suggested change

- `User-Agent` identifies the client. Many sites serve different markup to a browser than to a crawler. Some reject requests whose `User-Agent` doesn't look like a real browser. It is one of the basic headers a server uses to identify the client, though not the only one.
+ `User-Agent` identifies the client. Many sites serve different markup to a browser than to a crawler. Some reject requests whose `User-Agent` doesn't look like a real browser. It's one of the basic headers a server uses to identify the client, though not the only one.

vdusek · 2026-06-28T10:28:27Z

+
+`Accept` lists the formats the client wants. The same endpoint can return HTML to one `Accept` and JSON to another. If you need data from an API, try setting it to `application/json` to get JSON instead of a rendered page.
+
+`Accept-Language` lists the languages the client prefers, in priority order. It is a preference, not a switch. A server honors it only for content it actually serves in more than one language, and ignores it otherwise. Where it applies, it changes translated text, date and number formats, and sometimes currency. Set it to match the locale you expect, then confirm from the response that the server applied it.


Doc writing style: Prefer a contraction.

Suggested change

- `Accept-Language` lists the languages the client prefers, in priority order. It is a preference, not a switch. A server honors it only for content it actually serves in more than one language, and ignores it otherwise. Where it applies, it changes translated text, date and number formats, and sometimes currency. Set it to match the locale you expect, then confirm from the response that the server applied it.
+ `Accept-Language` lists the languages the client prefers, in priority order. It's a preference, not a switch. A server honors it only for content it actually serves in more than one language, and ignores it otherwise. Where it applies, it changes translated text, date and number formats, and sometimes currency. Set it to match the locale you expect, then confirm from the response that the server applied it.
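
The "priority order" in this paragraph comes from the optional `q` weights in the header value. Below is a minimal sketch of how a server might rank the listed languages, assuming the standard `q=` syntax. `parse_accept_language` is a hypothetical helper written for illustration. It is not part of Crawlee or of the guide.

```python
def parse_accept_language(value: str) -> list[tuple[str, float]]:
    """Return (language, weight) pairs, highest weight first."""
    entries: list[tuple[str, float]] = []
    for part in value.split(','):
        tag, _, params = part.strip().partition(';')
        if not tag:
            continue
        # A language without an explicit `q` has the default weight of 1.0.
        weight = 1.0
        for param in params.split(';'):
            name, _, raw = param.strip().partition('=')
            if name.strip().lower() == 'q':
                try:
                    weight = float(raw)
                except ValueError:
                    # Treat a malformed weight as "not acceptable".
                    weight = 0.0
        entries.append((tag.strip(), weight))
    # `sorted` is stable, so equal weights keep their order in the header.
    return sorted(entries, key=lambda entry: -entry[1])


if __name__ == '__main__':
    print(parse_accept_language('de;q=0.7, en-US, en;q=0.9'))
```

A server that only has German content would still answer in German here, which is the "preference, not a switch" point from the paragraph.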

vdusek · 2026-06-28T10:28:27Z

+
+## Default headers in Crawlee
+
+All built-in HTTP clients impersonate a browser by default. Instead of a bare library `User-Agent` like `python-httpx/0.27`, they send a realistic set of browser-like headers: a browser `User-Agent`, an `Accept`, an `Accept-Language`, and client hints where the client supports them. This makes a crawl look like normal browser traffic and avoids the simplest forms of blocking.


Doc writing style: The sentence opens with a bare "This" as the subject — give it a noun (the verb then agrees as "avoid").

Suggested change

- All built-in HTTP clients impersonate a browser by default. Instead of a bare library `User-Agent` like `python-httpx/0.27`, they send a realistic set of browser-like headers: a browser `User-Agent`, an `Accept`, an `Accept-Language`, and client hints where the client supports them. This makes a crawl look like normal browser traffic and avoids the simplest forms of blocking.
+ All built-in HTTP clients impersonate a browser by default. Instead of a bare library `User-Agent` like `python-httpx/0.27`, they send a realistic set of browser-like headers: a browser `User-Agent`, an `Accept`, an `Accept-Language`, and client hints where the client supports them. Such headers make a crawl look like normal browser traffic and avoid the simplest forms of blocking.
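
To make the "bare library `User-Agent`" point concrete, here is a small sketch that uses `urllib` from the standard library as a stand-in for a plain client (the paragraph names `python-httpx/0.27`, but `urllib` ships with Python and shows the same thing). `looks_like_browser` is a deliberately naive, hypothetical check and says nothing about how real anti-bot systems or Crawlee's clients work.

```python
import urllib.request


def default_library_user_agent() -> str:
    """Return the `User-Agent` that urllib attaches when none is set."""
    opener = urllib.request.build_opener()
    # `addheaders` holds the opener-level defaults as (name, value) pairs.
    return dict(opener.addheaders)['User-agent']


def looks_like_browser(user_agent: str) -> bool:
    """Naive, illustrative check; real detection inspects far more."""
    return user_agent.startswith('Mozilla/5.0')


if __name__ == '__main__':
    user_agent = default_library_user_agent()
    print(user_agent, looks_like_browser(user_agent))
```

The default value starts with `Python-urllib/`, so even the simplest server-side filter can single it out.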

vdusek · 2026-06-28T10:28:27Z

+
+Anti-bot systems look at more than header values. They look at which headers are present, their casing, and the order they arrive in. Real browsers send a consistent, recognizable set. A request that has a browser `User-Agent` but the wrong header order, or missing client hints, still looks automated.
+
+This is why `ImpitHttpClient` and `CurlImpersonateHttpClient` replicate the browser at the transport layer rather than just attaching headers. Setting a browser `User-Agent` on a plain client is not enough to pass these checks. If a target uses fingerprinting, prefer an impersonating client over hand-set headers.


Doc writing style: Replace the bare "This" opener with a noun, and contract "is not" → "isn't".

Suggested change

- This is why `ImpitHttpClient` and `CurlImpersonateHttpClient` replicate the browser at the transport layer rather than just attaching headers. Setting a browser `User-Agent` on a plain client is not enough to pass these checks. If a target uses fingerprinting, prefer an impersonating client over hand-set headers.
+ This fingerprinting is why `ImpitHttpClient` and `CurlImpersonateHttpClient` replicate the browser at the transport layer rather than just attaching headers. Setting a browser `User-Agent` on a plain client isn't enough to pass these checks. If a target uses fingerprinting, prefer an impersonating client over hand-set headers.
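
One way to see why hand-set headers fall short: a plain client may rewrite the header names you give it. The sketch below uses `urllib.request.Request` from the standard library as an example of such a client and lists the names as the request object stores them. The header values are made up for illustration, and what finally reaches the wire also depends on the library and the HTTP version, so treat this as a hint rather than a description of any particular anti-bot check.

```python
import urllib.request

# Illustrative, hand-set "browser-like" headers; the values are assumptions.
HAND_SET_HEADERS = {
    'sec-ch-ua-platform': '"Linux"',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
    'accept-language': 'en-US,en;q=0.9',
}


def stored_header_names(headers: dict[str, str]) -> list[str]:
    """Return header names as urllib keeps them on the request object."""
    request = urllib.request.Request('https://example.com/', headers=headers)
    # No network call happens here; the request is only constructed.
    return [name for name, _ in request.header_items()]


if __name__ == '__main__':
    print(stored_header_names(HAND_SET_HEADERS))
```

The names come back as `Sec-ch-ua-platform`, `User-agent`, and `Accept-language`: the casing you typed is already gone before anything is sent.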

vdusek · 2026-06-28T10:38:34Z

Hi @szaganek, could we ask you for a final doc style review? Thank you.

expose impersonate flag on HTTP crawlers

97e3c75

Mantisus self-assigned this Jun 9, 2026

Mantisus requested review from szaganek and vdusek June 9, 2026 21:51

Pijukatel reviewed Jun 10, 2026

vdusek reviewed Jun 10, 2026

Comment thread src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py Outdated

Comment thread src/crawlee/crawlers/_abstract_http/_abstract_http_crawler.py Outdated

add warning

681bc28

Mantisus requested a review from vdusek June 10, 2026 18:03

vdusek approved these changes Jun 11, 2026

vdusek changed the title ~~fix: Expose impersonate flag on HTTP crawlers.~~ fix: Expose impersonate flag on HTTP crawlers Jun 26, 2026

Mantisus added 3 commits June 26, 2026 13:20

Merge branch 'master' into http-impersonation-expose

c910ac5

drop impersonate flag

9e98458

update docs

925f97d

Mantisus changed the title ~~fix: Expose impersonate flag on HTTP crawlers~~ docs: Add HTTP headers guide Jun 26, 2026

vdusek reviewed Jun 28, 2026
