feat: allow using other HTTP clients #2661

Merged
merged 47 commits into master from decouple-http-client on Oct 23, 2024
Conversation

@janbuchar (Contributor) commented Sep 6, 2024

See https://gist.github.com/janbuchar/3a4724927de2c3a0bb16c46bb5940236 for an example curl-impersonate client.

The following got-scraping options were not carried over to the new interface (they will still work, but they're not part of it):

  • decompress
  • resolveBodyOnly
  • allowGetBody
  • dnsLookup
  • dnsCache
  • dnsLookupIpVersion
  • retry
  • hooks
  • parseJson
  • stringifyJson
  • request
  • cache
  • cacheOptions
  • http2
  • https
  • agent
  • localAddress
  • createConnection
  • pagination
  • setHost
  • maxHeaderSize
  • methodRewriting
  • enableUnixSockets
  • context
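For illustration, a minimal sketch of what a custom client against the new, narrowed-down interface could look like. The type and class names below are simplified stand-ins, not the actual Crawlee definitions; note that none of the got-scraping options listed above appear in the generic surface:

```typescript
// Minimal stand-in types; the real ones in Crawlee are richer.
interface HttpRequest {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
}

interface HttpResponse {
    statusCode: number;
    headers: Record<string, string>;
    body: string;
}

// A hypothetical, network-free client: it just echoes the request back,
// which is enough to show the shape of the sendRequest contract.
class EchoHttpClient {
    async sendRequest(request: HttpRequest): Promise<HttpResponse> {
        return {
            statusCode: 200,
            headers: { 'content-type': 'text/plain' },
            body: `${request.method ?? 'GET'} ${request.url}`,
        };
    }
}
```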

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Sep 6, 2024
@github-actions github-actions bot added this to the 97th sprint - Tooling team milestone Sep 6, 2024
@github-actions (bot) left a comment:

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!


@barjin (Contributor) commented Sep 30, 2024

So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

In sendRequest, we would then translate from the got-scraping interface to the independent fetch API (and, in case of GotScrapingClient, later from fetch API back to the got-scraping calls).
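The suggested translation step could be sketched like this; the option shape below is an illustrative subset of got-scraping's options, not its full interface:

```typescript
// A hypothetical got-scraping-like options shape (illustrative subset).
interface GotLikeOptions {
    url: string;
    method?: string;
    headers?: Record<string, string>;
    body?: string;
    searchParams?: Record<string, string>;
}

// Translate got-scraping-style options into fetch-style arguments
// (a URL string plus an init object).
function toFetchArgs(
    options: GotLikeOptions,
): [string, { method: string; headers?: Record<string, string>; body?: string }] {
    const url = new URL(options.url);
    // got-scraping takes searchParams separately; fetch wants them in the URL.
    for (const [key, value] of Object.entries(options.searchParams ?? {})) {
        url.searchParams.set(key, value);
    }
    return [
        url.toString(),
        {
            method: options.method ?? 'GET',
            headers: options.headers,
            body: options.body,
        },
    ];
}
```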

@barjin (Contributor) commented Sep 30, 2024

Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

@janbuchar (Contributor, Author) replied:
> So if I understand this correctly, we cannot get rid of got-scraping just yet because of the breaking change in BasicCrawlingContext.sendRequest, right?

Well, yeah, since got is part of our public API, we cannot completely remove it 🤷. We'll see what our default http client for 4.0 will be.

> Does it mean we have to model the BaseHttpClient to follow the got-scraping interfaces, though? It would make more sense to me to make BaseHttpClient independent (or have it follow some well-known API like fetch) and only care for the got-scraping interface where it matters, i.e. sendRequest.

Not really, but I'd prefer to implicitly support got features that won't be part of the HTTP client interface (e.g., parseJson or cacheOptions), and this way we can do it fairly easily with index signatures on the HTTP request/response types. It would be doable with a fetch-like interface, but a bit more difficult.
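The index-signature idea can be sketched like this (type and field names here are illustrative, not the actual Crawlee definitions):

```typescript
// Known, client-agnostic fields are typed explicitly; anything else
// (parseJson, cacheOptions, ...) passes through untyped to whatever
// implementation understands it.
interface HttpRequestOptions {
    url: string;
    method?: string;
    // Index signature: extra implementation-specific options are allowed.
    [implementationSpecificOption: string]: unknown;
}

const request: HttpRequestOptions = {
    url: 'https://example.com',
    parseJson: false, // not part of the generic interface, still accepted
};
```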

> Also, a completely unrelated thought / nit, but what file and directory casing convention are we following in Crawlee? It's already messy in the current codebase, but seeing snake_case files together with kebab-case in this PR made me think about it once again 😅

Well, I didn't notice any convention, so I just look at the nearest files and follow them as best I can.

@B4nan (Member) commented Sep 30, 2024

Historically things were snake case, but I believe nobody on the current team is in favor of that (I never liked it). I would go with kebab case for anything new and maybe unify in v4 (or sooner; I don't think it has any effect downstream, since we don't allow deep imports anywhere).

@janbuchar janbuchar requested a review from B4nan October 8, 2024 14:23
@janbuchar (Contributor, Author) commented:
@barjin @B4nan I added a basic test and a guide (I'll pull out the code samples later). Can you check it out? The request type will sadly have to stay this way.

docs/guides/custom-http-client.mdx (review comments, outdated and resolved)
```ts
async sendRequest<TResponseType extends keyof ResponseTypes = "text">(
    request: HttpRequest<TResponseType>,
): Promise<HttpResponse<TResponseType>> {
    /* ... */
}
```
Member commented:
can we use fetch or axios here or is that more tricky than it sounds like? the example without any implementation is not really helpful.

if the problematic part is streaming, maybe we could provide some default dummy implementation/helper that would just load things into memory and wrap it in a stream?
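The dummy helper suggested above could look roughly like this; `bufferToStream` and `streamToString` are hypothetical names for illustration, not part of Crawlee:

```typescript
import { Readable } from 'node:stream';

// Sketch of the suggested fallback: load the whole body into memory,
// then expose it as a stream, so a client with no native streaming
// support can still satisfy a streaming interface.
function bufferToStream(body: Buffer | string): Readable {
    return Readable.from([typeof body === 'string' ? Buffer.from(body) : body]);
}

// Collect a stream back into a string (handy for consumers and tests).
function streamToString(stream: Readable): Promise<string> {
    return new Promise((resolve, reject) => {
        const chunks: Buffer[] = [];
        stream.on('data', (chunk) => chunks.push(chunk as Buffer));
        stream.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
        stream.on('error', reject);
    });
}
```

The obvious trade-off is memory: this buffers the entire response, so it only makes sense as a stopgap for clients that cannot stream.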

Contributor (Author) replied:

I mean, if I'm going to implement it, I would rather just ship it as code. And writing an incomplete example is not exactly helpful either.

Member replied:

What's wrong with doing both? Docs are important, you won't get around that just by publishing some package, we need them. If you don't want to do it now, let's show the got scraping client implementation here (which can be simplified, referencing the full working code), but we need something better than what we have now.

Contributor (Author) replied:

Well, the curl-impersonate thing is (allegedly) slow and doesn't work on Windows - that's not something I'd like to put on display. A complete implementation of something else (fetch, axios, ...) would take some effort, and nobody asked for it (yet). That and the fact that we're going to change the interface pretty soon anyway (to be more ergonomic, among other things) kinda deter me.

Of course, I could just sketch something here and leave out the tricky parts (e.g., redirection + cookie handling), but that doesn't feel right either. It would look like we're leaving out the gnarly parts and leaving them for the user to figure out.

I guess just linking to that gist with curl-impersonate with a disclaimer could work?

Member replied:

> but that doesn't feel right either.

It would still look better than what we have now. I am fine with leaving out tricky parts for now, but leaving out everything feels weird.

Btw, people have already asked indirectly about axios support; we have several reports of things not working with got-scraping but working with axios.

A link to that gist could also help, and I would surely add it here.

@B4nan B4nan force-pushed the decouple-http-client branch from eef1c59 to 7779572 Compare October 9, 2024 14:26
@janbuchar janbuchar requested a review from B4nan October 9, 2024 15:24
@B4nan B4nan changed the title refactor: Decouple HTTP client feat: allow using other HTTP clients Oct 10, 2024
@B4nan B4nan added the product roadmap Issues synchronized to product roadmap. label Oct 10, 2024
@B4nan B4nan force-pushed the decouple-http-client branch from cf7f363 to 5d0200b Compare October 22, 2024 14:45
@B4nan B4nan force-pushed the decouple-http-client branch from 5d0200b to aaf84bf Compare October 22, 2024 14:56
@B4nan B4nan merged commit 568c655 into master Oct 23, 2024
11 of 12 checks passed
@B4nan B4nan deleted the decouple-http-client branch October 23, 2024 11:14
@killbus commented Dec 16, 2024

Then how can I send a stream request from the router handler using sendRequest?

crawlingContext only exposes a sendRequest property created by createSendRequest.

@janbuchar (Contributor, Author) replied:
> Then how can I send a stream request from the router handler using sendRequest?
>
> crawlingContext only exposes a sendRequest property created by createSendRequest.

That's the point: you don't have to change anything in your sendRequest calls. You just set https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#httpClient to an alternative implementation of BaseHttpClient and you're done.
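In outline, the wiring looks like this. The interfaces below are simplified stand-ins for Crawlee's actual BaseHttpClient and BasicCrawlerOptions types, just to show that only the crawler construction changes:

```typescript
// Simplified stand-in shapes (the real Crawlee interfaces are richer).
interface MinimalHttpClient {
    sendRequest(request: { url: string }): Promise<{ statusCode: number; body: string }>;
}

interface MinimalCrawlerOptions {
    httpClient?: MinimalHttpClient;
}

// A custom client; in practice this would wrap fetch, axios, curl-impersonate, ...
const myClient: MinimalHttpClient = {
    async sendRequest(request) {
        return { statusCode: 200, body: `fetched ${request.url}` };
    },
};

// Route handlers keep calling sendRequest as before; only the options change.
const options: MinimalCrawlerOptions = { httpClient: myClient };
```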

Successfully merging this pull request may close these issues.

HTTP client switching