possibility of a com.norconex.collector.http.data.store.impl.dynamodb ? #448
Are you talking about the URL crawl store? If so, there are no plans for a DynamoDB implementation, but we can make this a feature request and get to it if there is enough demand. A crawl store is a cache of what has already been crawled (e.g., to help detect modifications and deletions); it is not meant to store all content and metadata. For that, you need a Committer. If you want to create a new crawl store, look at how the current ones are implemented, such as the MongoDB one (here and there). Committers are simpler to implement and are usually what you want; have a look here to get started, and again, you may want to check how existing ones were done. Does this help? If you end up creating your own, let us know!
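To make the crawl store vs. Committer distinction concrete, here is a minimal sketch of what a crawl store caches per URL. All class and method names here are hypothetical and chosen for illustration; they are not the actual Norconex crawl store API. A real DynamoDB implementation would replace the in-memory map with table reads and writes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a crawl store: a cache of what has
// already been crawled, used to detect new, modified, or deleted URLs.
// Names are assumptions for illustration, not the Norconex API.
public class CrawlStoreSketch {

    // Minimal record of a previously crawled URL.
    static final class CrawlRecord {
        final String url;
        final String checksum; // content checksum, used to detect modifications
        CrawlRecord(String url, String checksum) {
            this.url = url;
            this.checksum = checksum;
        }
    }

    // Backing map stands in for MongoDB/DynamoDB; a real implementation
    // would replace these calls with database reads/writes.
    private final Map<String, CrawlRecord> cache = new HashMap<>();

    // Record that a URL was processed in this crawl.
    void markProcessed(String url, String checksum) {
        cache.put(url, new CrawlRecord(url, checksum));
    }

    // True if the URL is new, or its content changed since the last crawl.
    boolean isNewOrModified(String url, String checksum) {
        CrawlRecord prev = cache.get(url);
        return prev == null || !prev.checksum.equals(checksum);
    }

    public static void main(String[] args) {
        CrawlStoreSketch store = new CrawlStoreSketch();
        store.markProcessed("https://example.com/a", "abc123");
        System.out.println(store.isNewOrModified("https://example.com/a", "abc123")); // false: unchanged
        System.out.println(store.isNewOrModified("https://example.com/a", "zzz999")); // true: modified
        System.out.println(store.isNewOrModified("https://example.com/b", "abc123")); // true: never crawled
    }
}
```

A Committer, by contrast, would receive the full content plus metadata for each document and push it to a target repository, which is why it is the simpler (and usually the right) extension point.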
Awesome response. Awesome software. Thank you!
Yes, I'm looking for the collector, not the committer. Point of note, however: my research indicated problems with SSL/TLS connections with Mongo. I would read that as the current collector user base may well be simply using local-network databases (maybe). Either way, any new ticket should probably mention testing with secure protocols. Thanks a ton. Can't wait to use this!
Also, I have a *completely separate question* in to Valerie Draper at Norconex; I just want to close the loop on that point. I hope it doesn't cause confusion, as both questions are technical in nature.
Regards,
- Pete Lombardo
+1 - DynamoDB is easy in AWS, and a crawler like Norconex has a calculable number of requests, which fits the DynamoDB provisioning model. A DynamoDB crawl store plus an S3 status store would mean the Norconex collector could run on an AWS spot instance once a week, and people would save a lot. With the current system, the best case is an MVStore on a persistent EBS volume that you reload for the next week's recrawl, but you still need some way to get the status off the box, e.g. a periodic s3 sync. A DynamoDB crawl store plus an S3 status store would be a nearly complete solution without resorting to DevOps-style scripting to get it done (like another user's mkfifo workaround).
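The "calculable number of requests" point can be sketched with back-of-envelope arithmetic. The rates and record sizes below are hypothetical; the only fixed fact assumed is DynamoDB's billing rule that one write capacity unit (WCU) covers one standard write of up to 1 KB per second, with item sizes rounded up to the nearest 1 KB.

```java
// Back-of-envelope sketch of DynamoDB write provisioning for a crawl store.
// One WCU = one write of up to 1 KB per second; larger items round up.
public class DynamoCapacitySketch {

    // WCUs needed to sustain itemsPerSecond writes of itemSizeBytes each.
    static long writeCapacityUnits(long itemsPerSecond, long itemSizeBytes) {
        long wcuPerItem = (itemSizeBytes + 1023) / 1024; // ceil(size / 1 KB)
        return itemsPerSecond * wcuPerItem;
    }

    public static void main(String[] args) {
        // Hypothetical: the crawler upserts ~50 crawl records/second,
        // each record (URL + checksum + timestamps) around 600 bytes.
        System.out.println(writeCapacityUnits(50, 600));  // 50 WCUs
        // Fatter records (1.5 KB) round up to 2 WCUs apiece.
        System.out.println(writeCapacityUnits(50, 1536)); // 100 WCUs
    }
}
```

Since a weekly crawl's write rate is roughly known in advance, capacity can be provisioned for the crawl window and scaled to near zero the rest of the week, which is what makes the spot-instance setup attractive.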
Along the lines of the MongoDB driver, a DynamoDB driver would be great. If there are no plans/bandwidth to make one, please post any guidance here for a novice Java developer to get started.