feat: allow using other HTTP clients #2661

Merged
merged 47 commits into master from decouple-http-client on Oct 23, 2024
Conversation

@janbuchar (Contributor) commented Sep 6, 2024

See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.

The following got-scraping options were not carried over to the new interface (they will still work, but they're not part of it):

  • decompress
  • resolveBodyOnly
  • allowGetBody
  • dnsLookup
  • dnsCache
  • dnsLookupIpVersion
  • retry
  • hooks
  • parseJson
  • stringifyJson
  • request
  • cache
  • cacheOptions
  • http2
  • https
  • agent
  • localAddress
  • createConnection
  • pagination
  • setHost
  • maxHeaderSize
  • methodRewriting
  • enableUnixSockets
  • context
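For illustration, a minimal sketch of what a custom client against the new, narrowed-down interface could look like. The type and class names below are simplified stand-ins, not the actual Crawlee definitions; note that none of the got-scraping options listed above appear in the generic surface:

```typescript
// Minimal stand-in types; the real ones in Crawlee are richer.
interface HttpRequest {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
}

interface HttpResponse {
    statusCode: number;
    headers: Record<string, string>;
    body: string;
}

// A hypothetical, network-free client: it just echoes the request back,
// which is enough to show the shape of the sendRequest contract.
class EchoHttpClient {
    async sendRequest(request: HttpRequest): Promise<HttpResponse> {
        return {
            statusCode: 200,
            headers: { 'content-type': 'text/plain' },
            body: `${request.method ?? 'GET'} ${request.url}`,
        };
    }
}
```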

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 6, 2024
@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 6, 2024
@github-actions (bot) left a comment:

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!


@barjin (Contributor) commented Sep 30, 2024

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

In sendRequest, we would then translate from the got-scraping interface to the independent fetch API (and, in case of GotScrapingClient, later from fetch API back to the got-scraping calls).
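The suggested translation step could be sketched like this; the option shape below is an illustrative subset of got-scraping's options, not its full interface:

```typescript
// A hypothetical got-scraping-like options shape (illustrative subset).
interface GotLikeOptions {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
    searchParams?: Record<string, string>;
}

// Translate got-scraping-style options into fetch-style arguments
// (a URL string plus an init object).
function toFetchArgs(
    options: GotLikeOptions,
): [string, { method: string; headers?: Record<string, string>; body?: string }] {
    const url = new URL(options.url);
    // got-scraping takes searchParams separately; fetch wants them in the URL.
    for (const [key, value] of Object.entries(options.searchParams ?? {})) {
        url.searchParams.set(key, value);
    }
    return [
        url.toString(),
        {
            method: options.method ?? 'GET',
            headers: options.headers,
            body: options.body,
        },
    ];
}
```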

@barjin (Contributor) commented Sep 30, 2024

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

@janbuchar (Contributor, Author) replied:
> So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Well, yeah, since got is part of our public API, we cannot completely remove it 🤷. We'll see what our default http client for 4.0 will be.

> Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

Not really, but I'd prefer to implicitly support got features that won't be part of the HTTP client interface (e.g., parseJson or cacheOptions), and this way we can do it fairly easily with index signatures on the HTTP request/response types. It would be doable with a fetch-like interface, but a bit more difficult.
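The index-signature idea can be sketched like this (type and field names here are illustrative, not the actual Crawlee definitions):

```typescript
// Known, client-agnostic fields are typed explicitly; anything else
// (parseJson, cacheOptions, ...) passes through untyped to whatever
// implementation understands it.
interface HttpRequestOptions {
    url: string;
    method?: string;
    // Index signature: extra implementation-specific options are allowed.
    [implementationSpecificOption: string]: unknown;
}

const request: HttpRequestOptions = {
    url: 'https://example.com',
    parseJson: false, // not part of the generic interface, still accepted
};
```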

> Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

Well, I didn't notice any convention, so I just look at the nearest files and follow them as best I can.

@B4nan (Member) commented Sep 30, 2024

Historically things were snake case, but I believe nobody on the current team is in favor of that (I never liked it). I would go with kebab case for anything new and maybe unify in v4 (or sooner; I don't think it has any effect downstream, since we don't allow deep imports anywhere).

@janbuchar janbuchar requested a review from B4nan October 8, 2024 14:23
@janbuchar (Contributor, Author) commented:
@barjin @B4nan I added a basic test and a guide (I'll pull out the code samples later). Can you check it out? The request type will sadly have to stay this way.

docs/guides/custom-http-client.mdx (review comments, outdated and resolved)
```ts
async sendRequest<TResponseType extends keyof ResponseTypes = "text">(
    request: HttpRequest<TResponseType>,
): Promise<HttpResponse<TResponseType>> {
    /* ... */
}
```
Member commented:
can we use fetch or axios here or is that more tricky than it sounds like? the example without any implementation is not really helpful.

if the problematic part is streaming, maybe we could provide some default dummy implementation/helper that would just load things into memory and wrap it in a stream?
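The dummy helper suggested above could look roughly like this; `bufferToStream` and `streamToString` are hypothetical names for illustration, not part of Crawlee:

```typescript
import { Readable } from 'node:stream';

// Sketch of the suggested fallback: load the whole body into memory,
// then expose it as a stream, so a client with no native streaming
// support can still satisfy a streaming interface.
function bufferToStream(body: Buffer | string): Readable {
    return Readable.from([typeof body === 'string' ? Buffer.from(body) : body]);
}

// Collect a stream back into a string (handy for consumers and tests).
function streamToString(stream: Readable): Promise<string> {
    return new Promise((resolve, reject) => {
        const chunks: Buffer[] = [];
        stream.on('data', (chunk) => chunks.push(chunk as Buffer));
        stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
        stream.on('error', reject);
    });
}
```

The obvious trade-off is memory: this buffers the entire response, so it only makes sense as a stopgap for clients that cannot stream.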

Contributor (Author) replied:

I mean, if I'm going to implement it, I would rather just ship it as code. And writing an incomplete example is not exactly helpful either.

Member replied:

What's wrong with doing both? Docs are important, you won't get around that just by publishing some package, we need them. If you don't want to do it now, let's show the got scraping client implementation here (which can be simplified, referencing the full working code), but we need something better than what we have now.

Contributor (Author) replied:

Well, the curl-impersonate thing is (allegedly) slow and doesn't work on Windows - that's not something I'd like to put on display. A complete implementation of something else (fetch, axios, ...) would take some effort, and nobody asked for it (yet). That and the fact that we're going to change the interface pretty soon anyway (to be more ergonomic, among other things) kinda deter me.

Of course, I could just sketch something here and leave out the tricky parts (e.g., redirection + cookie handling), but that doesn't feel right either. It would look like we're leaving out the gnarly parts and leaving them for the user to figure out.

I guess just linking to that gist with curl-impersonate with a disclaimer could work?

Member replied:

> but that doesn't feel right either.

It would still look better than what we have now. I am fine with leaving out tricky parts for now, but leaving out everything feels weird.

Btw, people have already asked indirectly about axios support; we have several reports of things not working with got-scraping but working with axios.

A link to that gist could also help, and I would surely add it here.

@B4nan B4nan force-pushed the decouple-http-client branch from eef1c59 to 7779572 Compare October 9, 2024 14:26
@janbuchar janbuchar requested a review from B4nan October 9, 2024 15:24
@B4nan B4nan changed the title refactor: Decouple HTTP client feat: allow using other HTTP clients Oct 10, 2024
@B4nan B4nan added the product roadmap Issues synchronized to product roadmap. label Oct 10, 2024
@B4nan B4nan force-pushed the decouple-http-client branch from cf7f363 to 5d0200b Compare October 22, 2024 14:45
@B4nan B4nan force-pushed the decouple-http-client branch from 5d0200b to aaf84bf Compare October 22, 2024 14:56
@B4nan B4nan merged commit 568c655 into master Oct 23, 2024
11 of 12 checks passed
@B4nan B4nan deleted the decouple-http-client branch October 23, 2024 11:14
@killbus commented Dec 16, 2024

Then how can I send a stream request from the router handler using sendRequest?

crawlingContext only exposes a sendRequest property created by createSendRequest.

@janbuchar (Contributor, Author) replied:
> Then how can I send a stream request from the router handler using sendRequest?
>
> crawlingContext only exposes a sendRequest property created by createSendRequest.

That's the point: you don't have to change anything in your sendRequest calls. You just set https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#httpClient to an alternative implementation of BaseHttpClient and you're done.
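In outline, the wiring looks like this. The interfaces below are simplified stand-ins for Crawlee's actual BaseHttpClient and BasicCrawlerOptions types, just to show that only the crawler construction changes:

```typescript
// Simplified stand-in shapes (the real Crawlee interfaces are richer).
interface MinimalHttpClient {
    sendRequest(request: { url: string }): Promise<{ statusCode: number; body: string }>;
}

interface MinimalCrawlerOptions {
    httpClient?: MinimalHttpClient;
}

// A custom client; in practice this would wrap fetch, axios, curl-impersonate, ...
const myClient: MinimalHttpClient = {
    async sendRequest(request) {
        return { statusCode: 200, body: `fetched ${request.url}` };
    },
};

// Route handlers keep calling sendRequest as before; only the options change.
const options: MinimalCrawlerOptions = { httpClient: myClient };
```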

Successfully merging this pull request may close these issues.

HTTP client switching