Question: crawling in similar domain #612
Comments
The only thing I can think of is that you added or modified the filtering rules after you ran the Collector a few times, and it got that URL from the "crawlstore" cache. Do you have the same behavior if you delete your crawlstore directory and try again (starting from scratch)? If the problem is always there, please share your full config so we can reproduce it. You may also be interested to know the snapshot version adds a new flag that lets you include subdomains when you use <startURLs stayOnDomain="true" includeSubdomains="true" stayOnPort="false" stayOnProtocol="false">
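For context, here is a minimal sketch of where those attributes sit in a full HTTP Collector 2.x configuration; the collector and crawler ids and the start URL are placeholders, and `includeSubdomains` assumes the snapshot build mentioned above:

```xml
<httpcollector id="my-collector">
  <crawlers>
    <crawler id="my-crawler">
      <!-- Stay on the start URL's domain, and also accept its subdomains
           (includeSubdomains requires the snapshot release). -->
      <startURLs stayOnDomain="true" includeSubdomains="true"
                 stayOnPort="false" stayOnProtocol="false">
        <url>http://example.com/</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```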
Thanks for replying. I tried includeSubdomains="true" and it works fine. Let me share the config with you; you will find crawler.plugin.ContainsReferenceFilter and crawler.plugin.UrlReferenceFilter in it.
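To show how such custom classes are typically wired in, here is a hedged sketch of a `referenceFilters` section; the `onMatch` attributes and the inner text are assumptions, since the thread does not show what options crawler.plugin.ContainsReferenceFilter and crawler.plugin.UrlReferenceFilter actually accept:

```xml
<!-- Hypothetical wiring: the real options these custom classes parse
     from their XML bodies are not shown in the thread. -->
<referenceFilters>
  <filter class="crawler.plugin.ContainsReferenceFilter" onMatch="include">
    example.com
  </filter>
  <filter class="crawler.plugin.UrlReferenceFilter" onMatch="exclude">
    http://example.com/private/
  </filter>
</referenceFilters>
```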
Given it works for you now, can we close? Feel free to submit a pull request if you feel your filters are ready for general use.
I met a similar problem again. Let's say I start crawling from a URL.

Both situations include the same referenceFilters.

In situation 1, the Collector doesn't extract any URLs from the start URL, even though the start URL contains a lot of URLs that are in the same domain.

In situation 2, the Collector seems to extract the URLs from the start URL, but it then follows URLs that are outside of that domain.

Also, I always remove the "crawlstore" cache (which I store in an output folder) before starting the crawl. Below are the config files, the crawl logs, and the crawled document results. Please check them.
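For comparison with the custom filters, here is a hedged sketch that keeps a crawl inside one domain using the stock RegexReferenceFilter from Norconex Collector Core; the domain is a placeholder. As I understand the documented behavior, once at least one "include" filter is present, any reference matching none of the include filters is rejected, so a pattern that matches nothing can produce symptom 1, and having no include filter at all can produce symptom 2:

```xml
<referenceFilters>
  <!-- Keep only URLs on example.com (placeholder domain). With an
       "include" filter present, anything matching no include filter
       is rejected. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">
    https?://(www\.)?example\.com/.*
  </filter>
</referenceFilters>
```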
1. cannot-extract-links
2. processed-outside-domain

For example, …
Thanks for replying.

1. cannot-extract-links
2. processed-outside-domain

For the case you mentioned, yes, I have changed the filter to …
The … As far as having …
Hi Pascal,
I am working on a website that includes different domains, such as...
In the config.xml, I did something like...
I found this solution in past issues.
However, it does not seem to work in my case.
I got the following log, in which an unwanted URL gets fetched.
I would like to ask if there is anything wrong with the config.
Thanks!
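Since the actual config above was elided, here is a hedged sketch of one common way to cover several related domains: list one start URL per domain and let stayOnDomain scope the crawl. All ids and domains are placeholders, and the scoping note is my reading of the documented behavior, not the poster's setup:

```xml
<httpcollector id="multi-domain-collector">
  <crawlers>
    <crawler id="multi-domain-crawler">
      <!-- With stayOnDomain="true", an extracted URL is kept only if it
           is on the same domain as one of the start URLs (assumed
           behavior of the crawl scope). -->
      <startURLs stayOnDomain="true" includeSubdomains="true">
        <url>http://example.com/</url>
        <url>http://example.org/</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```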