Truncated Records in WARC Files

Content payload in Common Crawl archives is truncated if the content exceeds a limit of

- 500 kiB in 2008 – 2012 ARC files
- 1 MiB in WARC files (since 2013)

The truncation is required to keep the crawl archives at a limited size and to ensure that a broad sample of web pages is covered. It also prevents the archives from being filled with accidentally captured video or audio streams. In addition, the crawler needs to buffer the content temporarily, and a limit ensures that this is possible with a limited amount of RAM across many parallel connections.
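The buffering argument can be illustrated with a minimal sketch. This is not the crawler's actual code: `read_capped`, `WARC_LIMIT` and the chunk size are hypothetical names and values chosen for illustration, and the sketch assumes that content of exactly the limit size does not count as truncated.

```python
import io

# 1 MiB, the limit stated above for WARC files (since 2013)
WARC_LIMIT = 1024 * 1024


def read_capped(stream, limit=WARC_LIMIT, chunk_size=64 * 1024):
    """Read at most `limit` bytes from `stream`.

    Returns (payload, truncated). Hypothetical helper: memory use per
    connection is bounded by `limit`, no matter how large the response is.
    """
    buf = bytearray()
    while len(buf) < limit:
        # Never request more than the remaining budget
        chunk = stream.read(min(chunk_size, limit - len(buf)))
        if not chunk:
            # Stream ended before the limit was exceeded
            return bytes(buf), False
        buf.extend(chunk)
    # Limit reached: one extra byte tells us whether content was cut off
    truncated = bool(stream.read(1))
    return bytes(buf), truncated


if __name__ == "__main__":
    body = io.BytesIO(b"x" * (WARC_LIMIT + 10))
    payload, truncated = read_capped(body)
    print(len(payload), truncated)
```

With a cap like this, the worst-case buffer memory is roughly the limit multiplied by the number of parallel connections, which is why a fixed limit keeps RAM usage predictable.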

The notebooks in this folder analyze various aspects of payload truncation:

- [cc-main-2018-43-single-warc-file.ipynb] - truncation counts for a single WARC file of CC-MAIN-2018-43, in which the marking of truncated records is broken
- [cc-main-2019-35-100-warc-files.ipynb] - the marking of truncated records has been fixed for CC-MAIN-2019-35; 100 randomly selected WARC files are analyzed to verify the marking and to get more detailed metrics
- [cc-main-2019-47-truncation-by-mime-type.ipynb] - since November 2019 (CC-MAIN-2019-47), truncated records are marked in the URL indexes, which makes it possible to analyze the distribution of truncated records over the entire monthly crawl
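
As a rough illustration of the kind of per-MIME-type analysis that the index marking enables, the sketch below aggregates truncation counts from index-like rows. It does not reproduce the notebooks: the function name `truncation_by_mime`, the row keys `mime` and `truncated`, and the sample rows are all assumptions made for illustration, and the actual index field names may differ.

```python
from collections import Counter


def truncation_by_mime(rows):
    """Return {mime: (truncated_count, total_count, truncated_share)}.

    `rows` is an iterable of dicts with the hypothetical keys "mime" and
    "truncated"; a missing or empty "truncated" value means "not truncated".
    """
    total = Counter()
    truncated = Counter()
    for row in rows:
        mime = row.get("mime") or "unknown"
        total[mime] += 1
        if row.get("truncated"):
            truncated[mime] += 1
    return {
        mime: (truncated[mime], total[mime], truncated[mime] / total[mime])
        for mime in total
    }


if __name__ == "__main__":
    # Made-up rows for illustration only, not real crawl data
    sample = [
        {"mime": "text/html", "truncated": None},
        {"mime": "application/pdf", "truncated": "length"},
        {"mime": "application/pdf", "truncated": None},
    ]
    for mime, stats in sorted(truncation_by_mime(sample).items()):
        print(mime, stats)
```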
