BUG: pd.read_csv Incorrect Checksum validation for COMPOSITE Checksum #60779
Labels
Bug, IO CSV, IO Network, Needs Triage
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Boto3 >= 1.36.0 changed its default behaviour to add a CRC32 checksum where supported.
When reading S3 objects with pd.read_csv, any object stored with a COMPOSITE checksum fails to read, because the stored checksum is compared as though it were a FULL_OBJECT checksum.
A COMPOSITE checksum appears to be recorded when an object uploaded with boto3's upload_file() exceeds roughly 10 MB; upload_file() seemingly switches to a multipart upload behind the scenes at that threshold. Explicit multipart uploads will presumably behave the same way.
A test using both pandas and awswrangler is included for completeness.
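A minimal sketch of this kind of repro (pandas only), assuming a placeholder bucket and key and a DataFrame large enough to push upload_file() over the multipart threshold; the head_object checksum check is an optional assumption about newer S3 responses reporting a checksum type:

```python
import os
import tempfile

import boto3
import pandas as pd

BUCKET = "my-test-bucket"        # placeholder, not from the original report
KEY = "checksum-test/large.csv"  # placeholder

# A CSV comfortably above boto3's default multipart threshold (8 MB), so
# upload_file() performs a multipart upload and S3 records a COMPOSITE
# checksum for the object.
df = pd.DataFrame({"col": range(2_000_000)})

s3 = boto3.client("s3")
with tempfile.TemporaryDirectory() as tmpdir:
    local_path = os.path.join(tmpdir, "large.csv")
    df.to_csv(local_path, index=False)
    s3.upload_file(local_path, BUCKET, KEY)

# Optional sanity check: newer S3 responses report the checksum type, which
# should come back as COMPOSITE for a multipart upload.
head = s3.head_object(Bucket=BUCKET, Key=KEY, ChecksumMode="ENABLED")
print(head.get("ChecksumType"), head.get("ChecksumCRC32"))

# Requires s3fs. With boto3/botocore >= 1.36.0 this read fails during
# response checksum validation, since the COMPOSITE checksum is compared
# as if it were a FULL_OBJECT checksum.
out = pd.read_csv(f"s3://{BUCKET}/{KEY}")
print(out.shape)
```

With boto3 >= 1.36.0 the final read is expected to raise a checksum validation error, while the same call succeeds with earlier versions.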
Output for failing versions
With boto3 < 1.36.0, all scenarios from the example code work.
Test with older version
pip install "boto3<1.36.0"
Output from working version
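Separately from the downgrade test, a possible opt-out without pinning boto3 is sketched below. It assumes the installed botocore exposes the response_checksum_validation configuration option introduced alongside this behaviour change, and that s3fs forwards config_kwargs through pandas' storage_options to the client configuration; treat it as an untested sketch rather than a confirmed fix.

```python
import pandas as pd

# Hypothetical opt-out sketch: relax response checksum validation on the
# underlying S3 client via s3fs's config_kwargs. The option name and its
# forwarding through storage_options are assumptions, not a confirmed fix.
df = pd.read_csv(
    "s3://my-test-bucket/checksum-test/large.csv",  # placeholder path
    storage_options={
        "config_kwargs": {"response_checksum_validation": "when_required"},
    },
)
print(df.shape)
```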