I need to crawl a large set of hosts, some of which are hosted on a couple of "web hotel" servers. Is there a way that I can configure the collector so that it has a single delay value for all hosts that resolve to the same IP address?
One obvious solution would be to group all the relevant sites into a single crawler as seeds and use ReferenceDelayResolver with scope=crawler, but it's unpredictable which sites will link to these virtual hosts. The list of virtually hosted sites is also not static; it changes almost daily, so maintaining it would be problematic.
Suggestions?
Unfortunately, there is currently no out-of-the-box option to delay based on IP. The closest is "per site". You would have to write your own IDelayResolver or we can make this a feature request if you like.
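For anyone going the custom route, here is a minimal sketch of the per-IP bookkeeping such a resolver would need. It is plain Java with no collector dependencies: the class name `PerIpDelay` and its methods are hypothetical, and wiring it into an actual IDelayResolver implementation is left out, since the exact interface method should be checked against the collector version in use. It also assumes a single fixed delay for every IP.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper: enforces a minimum delay between hits to the same IP. */
public class PerIpDelay {
    private final long delayMillis;
    // Earliest time (epoch millis) at which each key may be hit again.
    private final ConcurrentMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    public PerIpDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Maps a URL to its delay key: the resolved IP, or the host name if lookup fails. */
    public static String keyFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return url;
        }
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return host; // fall back to per-host behaviour
        }
    }

    /** Atomically reserves the next slot for a key; returns how long to wait (ms). */
    public long reserve(String key, long now) {
        long[] wait = new long[1];
        nextAllowed.compute(key, (k, next) -> {
            long start = (next == null || next < now) ? now : next;
            wait[0] = start - now;
            return start + delayMillis;
        });
        return wait[0];
    }

    /** Blocks the calling thread until the URL's IP may be hit again. */
    public void delay(String url) throws InterruptedException {
        long wait = reserve(keyFor(url), System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

Because the key is the resolved address, virtual hosts that share an IP share one delay slot without having to be listed anywhere, which should cope with a host list that changes daily. Hosts that resolve to several addresses will only be keyed on whichever one the lookup returns.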
Do you think you have many sites using the same IP? Because setting the scope to "site" on the GenericDelayResolver could be sufficient if you suspect there is only a handful.
Setting the scope to "crawler" is the "safest" but also the slowest when you have many sites.
FYI, with your start URLs, if you set "stayOnDomain" to "true", it will not go to sites beyond those you defined as start URLs.
I have on the order of dozens of sites on a handful of IP addresses. I can probably get away with using the "site" scope with a conservative default delay for now. We are just trying to anticipate problems before I deploy the crawler to production.
Please do make a feature request, however. Much appreciated.