Crawler on update: Checksum remains unchanged even if HTML document has been modified #1071

Open
stejacob opened this issue Oct 21, 2024 · 7 comments

@stejacob

Hi,

We are running a demo site on a Microsoft IIS web server and using the latest version of Norconex Crawler.

We've configured both the documentChecksummer and metadataChecksummer, but we’re noticing that the checksum value remains the same even after modifying the HTML file on the server.

We've tried using both the "Last Modified" field and the MD5 checksum on specific fields, but the document continues to be rejected because the checksum generated remains unchanged, even when the HTML document has been modified.

....
Line  8680: 14:12:28.256 [es-node2.deimscloud.mil.ca#3] INFO  REJECTED_UNMODIFIED - http://es-node2.deimscloud.mil.ca/about.html - MD5DocumentChecksummer - Checksum=4529f56e11c85023cd3b815ffd1c2b1e|

....
Line 10840: 14:30:39.085 [es-node2.deimscloud.mil.ca#3] INFO  REJECTED_UNMODIFIED - http://es-node2.deimscloud.mil.ca/about.html - MD5DocumentChecksummer - Checksum=4529f56e11c85023cd3b815ffd1c2b1e|

Any insights or suggestions would be greatly appreciated. Thanks.

@essiembre
Contributor

Can you confirm something changed in the fields you use to create the MD5 checksum? A change anywhere else will not change the checksum.

If that is not the case, can you share your configuration?

@essiembre
Contributor

Also, the checksum is created AFTER the document is imported. That means you need to ensure the fields you use to create the checksum are still present in the document after it was imported.
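
To illustrate the two points above, here is a minimal Python sketch. It is not Norconex's implementation: `field_checksum` and the sample documents are hypothetical, and the hashing scheme is only an assumption chosen for illustration. It shows that a checksum built from selected fields ignores changes elsewhere in the document, and that it stays constant when the importer has dropped those fields.

```python
import hashlib


def field_checksum(fields, names):
    """Hypothetical helper: MD5 over the selected fields only.

    Fields that are absent from the document contribute nothing.
    """
    digest = hashlib.md5()
    for name in sorted(names):
        for value in fields.get(name, []):
            digest.update(f"{name}={value};".encode("utf-8"))
    return digest.hexdigest()


selected = ["title", "description"]

# Same selected fields, different body: the checksum does not change.
before = {"title": ["About us"], "description": ["Demo"], "body": ["old text"]}
body_edit = {"title": ["About us"], "description": ["Demo"], "body": ["new text"]}

# A selected field changed: the checksum changes.
title_edit = {"title": ["About our team"], "description": ["Demo"], "body": ["new text"]}

# Selected fields were removed during import: every version hashes the same.
stripped_v1 = {"body": ["old text"]}
stripped_v2 = {"body": ["new text"]}

print(field_checksum(before, selected) == field_checksum(body_edit, selected))
print(field_checksum(before, selected) == field_checksum(title_edit, selected))
print(field_checksum(stripped_v1, selected) == field_checksum(stripped_v2, selected))
```

In this sketch, the third case is the one to rule out first: if the selected fields are missing after import, the document looks unmodified no matter what changed on the server.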

@stejacob
Author

stejacob commented Oct 22, 2024

Morning Pascal,

Our configuration is:

....

<documentChecksummer class="MD5DocumentChecksummer"
                     combineFieldsAndContent="true"
                     keep="true"
                     toField="checksum">
  <fieldMatcher ignoreCase="false"
                ignoreDiacritic="false"
                method="CSV"
                partial="false"
                replaceAll="false">
        title,body_content,description</fieldMatcher>
</documentChecksummer>

.....

<metadataChecksummer class="com.norconex.collector.core.checksum.impl.GenericMetadataChecksummer"
                     keep="true"
                     targetField="metachecksum">
  <sourceFieldsRegex>title|description</sourceFieldsRegex>
</metadataChecksummer>

Thanks Pascal.

@essiembre
Contributor

Are you limiting the fields present in your documents? Ensure the Importer does not eliminate the fields you reference in your checksummer. You can either share your full config or, as I suggest, add the DebugTagger as the last item in your importer section (post-import). This will print all fields remaining in your document after it has been imported. You can then confirm whether those fields are there for the Checksummer and whether their values have changed.

@stejacob
Author

stejacob commented Oct 29, 2024

Hi Pascal,

Thank you very much for your response. The DebugTagger has been very useful.

We're currently using only the Title field for the checksum. I'll attach a sample config file for your reference.

During an incremental crawl, the change is processed correctly the first time. However, if we make additional changes and re-crawl, it consistently flags the entries as unmodified and rejects them. This has been a consistent result in our testing. Thank you.

demo_site_config.txt

Regards,

Stephen Jacob


stale bot commented Dec 29, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "stale" label (from automation, when inactive for too long) on Dec 29, 2024
@essiembre
Contributor

It looks like this one fell through the cracks. Do you still have the checksum problem?

stale bot removed the "stale" label (from automation, when inactive for too long) on Dec 29, 2024