Crawler on update: Checksum remains unchanged even if HTML document has been modified #1071
Comments
Can you confirm something changed in the fields you use to create the MD5 checksum? A change anywhere else will not change the checksum. If that is not the case, can you share your configuration?
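To illustrate the point about field-based checksums, here is a minimal Python sketch. It is not Norconex's actual implementation; the helper `field_checksum` and the sample field names are hypothetical. It only shows that a digest computed from selected fields ignores edits made anywhere else in the document.

```python
import hashlib


def field_checksum(doc_fields, source_fields):
    """MD5 over the values of the selected fields only (hypothetical helper)."""
    md5 = hashlib.md5()
    for name in sorted(source_fields):
        for value in doc_fields.get(name, []):
            md5.update(f"{name}={value};".encode("utf-8"))
    return md5.hexdigest()


# Three versions of the same page (made-up sample data).
before = {"title": ["Demo page"], "body": ["Old text"]}
after_body_edit = {"title": ["Demo page"], "body": ["New text"]}
after_title_edit = {"title": ["Demo page v2"], "body": ["New text"]}

# Editing only the body leaves a title-based checksum unchanged.
same = field_checksum(before, ["title"]) == field_checksum(after_body_edit, ["title"])
# Editing the title changes it.
changed = field_checksum(before, ["title"]) != field_checksum(after_title_edit, ["title"])

print(same, changed)
```

If the checksum is built from a single field such as the title, only edits to that field would be expected to flag the document as modified.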
Also, the checksum is created AFTER the document is imported. That means you need to ensure the fields you use to create the checksum are still present in the document after it was imported.
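The ordering issue can be sketched as follows. This is an illustrative model only: `import_document` and `checksum_after_import` are hypothetical names, and how the real crawler treats a missing field is an assumption here (the sketch returns `None` when nothing is left to hash). It shows why a field dropped during import cannot influence a checksum computed afterwards.

```python
import hashlib


def import_document(raw_fields, keep_fields):
    """Hypothetical import step that keeps only the listed fields."""
    return {k: v for k, v in raw_fields.items() if k in keep_fields}


def checksum_after_import(imported_fields, source_fields):
    """Hash only the source fields that survived the import (assumed behavior)."""
    present = {f: imported_fields[f] for f in source_fields if f in imported_fields}
    if not present:
        return None  # assumption: nothing to hash, so the result is constant
    payload = ";".join(f"{k}={present[k]}" for k in sorted(present))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Two crawls of the same page; the title was edited between them (sample data).
crawl_1 = {"title": "Demo page", "description": "v1"}
crawl_2 = {"title": "Demo page edited", "description": "v2"}

# Case 1: the import step drops "title", so both crawls hash the same (empty) input.
dropped_1 = checksum_after_import(import_document(crawl_1, {"description"}), ["title"])
dropped_2 = checksum_after_import(import_document(crawl_2, {"description"}), ["title"])

# Case 2: the import step keeps "title", so the edit is detected.
kept_1 = checksum_after_import(import_document(crawl_1, {"title"}), ["title"])
kept_2 = checksum_after_import(import_document(crawl_2, {"title"}), ["title"])

print(dropped_1 == dropped_2, kept_1 != kept_2)
```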
Morning Pascal, Our configuration is: ....
.....
Thanks Pascal.
Are you limiting the fields present in your documents? Ensure the Importer does not eliminate the fields you represent in your checksummer. You can either share your full config, or I suggest you use the DebugTagger as the last item in your importer section (post-import). This will print all fields remaining in your document after it has been imported. You can then confirm if those fields are there for the Checksummer and if their values have changed.
Hi Pascal, Thank you very much for your response. The DebugTagger has been very useful. We're currently using only the Title field for the checksum. I'll attach a sample config file for your reference. During an incremental crawl, the document is processed correctly the first time. However, if we make additional changes and re-crawl, the crawler consistently flags the entries as unmodified and rejects them. This has been a consistent result in our testing. Thank you. Regards, Stephen Jacob
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It looks like this one fell through the cracks. Do you still have the checksum problem?
Hi,
We are running a demo site on a Microsoft IIS web server and using the latest version of Norconex Crawler.
We've configured both the documentChecksummer and metadataChecksummer, but we’re noticing that the checksum value remains the same even after modifying the HTML file on the server.
We've tried using both the "Last Modified" field and an MD5 checksum on specific fields, but the document continues to be rejected because the generated checksum remains unchanged, even when the HTML document has been modified.
Any insights or suggestions would be greatly appreciated. Thanks.