
CHANGELIST.rst


Unreleased

  • Add support for Python 3.12 by adjusting urllib3 dependency to >=1.26.4,<1.26.16

1.7.4

  • capture_http support for chunk-encoded requests #116
  • indexer: option to enable verify_http #116
  • Enable writing block digests for warcinfo records #115

1.7.3

  • Fix documentation for capture_http filter_records #110
  • Fix capture_http with http and https proxies #113

1.7.2

  • Ensure 1.1 revisit profile used with WARC/1.1 revisits #96
  • Include record offsets in warcio check output #98
  • CI fix for python 2.7, use jinja<3.0.0 (#105)
  • Fix in StatusAndHeaders when writing, then reading record #106
  • Fix issues related to http header re-encoding, ensure correct content-length and %-encoding #106, #107

1.7.1

  • Windows fixes: Fix reading from stdin, ensure all WARCs/ARCs are treated as binary #86
  • Fix ensure_digest(block=True) breaking on an existing record, RecordBuilder supports header_filter #85

1.7.0

  • Docs and Misc Cleanup: add docs for extract tool, correct doc for get_statuscode(), move all CLI tools to separate modules for better reusability.
  • Support indexing a WARC read from stdin #79
  • Automatically %-encode urls that have a space in WARC-Target-URI #80
  • Separate record creation into RecordBuilder class to allow building WARC records without a WARCWriter, which now derives from RecordBuilder #63
  • Support the ability to optionally check ARC/WARC record's block and payload digests #54, #58, #68, #77
    • The ArchiveIterator and ArcWarcRecordLoader constructors now accept a check_digests boolean keyword argument indicating whether each record's digest should be checked; defaults to False
    • Core digest checking functionality is provided by DigestChecker and DigestVerifyingReader importable from warcio.digestverifyingreader
    • New block and payload digest checking utility class, Checker, has been added and is importable from warcio.checker
    • The CLI has been updated to provide warcio check, a command for performing block and payload digest checking
  • Ensured that ARCHeadersParser's splitting on spaces does not split on spaces inside URIs #62
  • Move the compute_headers_buffer method and headers_buff property to StatusAndHeaders and fix incorrect digests in some test WARCs #67
  • Ensured that the BaseWARCWriter does not use a mutable default value for the warc_header_dict keyword argument #70
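
As a rough illustration of the block and payload digest checking introduced in this release, the following stdlib-only sketch recomputes a digest and compares it with a recorded label. The `sha1:<BASE32>` label format and the helper names `sha1_base32_digest` and `check_digest` are assumptions made for this example; they are not warcio's own API.

```python
import base64
import hashlib


def sha1_base32_digest(data):
    """Return a 'sha1:<BASE32>' label for the given bytes (assumed label format)."""
    encoded = base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")
    return "sha1:" + encoded


def check_digest(expected_label, data):
    """Hypothetical helper: True if the data hashes to the recorded label."""
    return sha1_base32_digest(data) == expected_label


payload = b"hello world"
label = sha1_base32_digest(payload)

ok = check_digest(label, payload)                # unchanged payload
corrupted = check_digest(label, payload + b"!")  # payload altered after the digest was recorded
```

A checker built this way only reports a mismatch; deciding whether to skip, log, or raise on a failed record is left to the caller.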

1.6.3

  • Make warcio recompress more robust in fixing improperly compressed WARCs, --verbose mode for printing results #52
  • BufferedReader supports streaming all members of multi-member gzip file with read_all_members=True option.
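
To show what reading every member of a multi-member gzip file involves, here is a minimal sketch using only `zlib` and `gzip` from the standard library. It does not use BufferedReader; the function name `read_all_gzip_members` is hypothetical and the whole input is held in memory for simplicity.

```python
import gzip
import zlib


def read_all_gzip_members(data):
    """Decompress each gzip member in turn and return one bytes object per member."""
    members = []
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        members.append(decomp.decompress(data) + decomp.flush())
        # Bytes left over after the end of this member belong to the next one.
        data = decomp.unused_data
    return members


two_members = gzip.compress(b"first record\n") + gzip.compress(b"second record\n")
parts = read_all_gzip_members(two_members)
```

Per-record compressed WARCs are laid out this way, one gzip member per record, which is why leftover `unused_data` marks a member boundary rather than an error.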

1.6.2

  • Ensure any non-ascii data in http headers is %-encoded, even if non-conformant to RFC 8187 #51
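
A minimal sketch of the idea behind this fix, assuming a simple policy: header values that are already ASCII pass through untouched, and anything else is %-encoded as UTF-8. The helper name `ascii_safe_header_value` and the set of characters left unescaped are choices made for this example, not warcio's actual behavior.

```python
from urllib.parse import quote


def ascii_safe_header_value(value):
    """Return the value unchanged if it is ASCII, otherwise %-encode it as UTF-8."""
    try:
        value.encode("ascii")
    except UnicodeEncodeError:
        # Keep common header punctuation readable; this 'safe' set is an assumption.
        return quote(value, safe=" /;=:,")
    return value


plain = ascii_safe_header_value("text/html; charset=utf-8")
encoded = ascii_safe_header_value("café.txt")
```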

1.6.1

  • Fixes for warcio.utils.open() not opening files in binary mode in Python 2.7 on Windows #49
  • capture_http() various fixes and improvements, default writer, WARC-IP-Address header support #50

1.6.0

  • Support WARC/1.1 standard WARC records, reading #39 and writing #46 with microsecond precision WARC-Date
  • Support simplified semantics for capturing http traffic to a WARC #43
  • Support parsing incorrect wget 1.19 WARCs with angle brackets, e.g. WARC-Target-URI: <uri> #42
  • Correct encoding of non-ascii HTTP headers per RFC 8187 #45
  • New Util Added: warcio.utils.open provides exclusive creation mode open(..., 'x') for Python 2.7
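
The difference between second-precision and microsecond-precision WARC-Date values mentioned in these notes can be sketched with the standard library alone. The function `warc_date` and its `use_micros` flag are hypothetical names for this example and may not match how warcio exposes the option.

```python
from datetime import datetime, timezone


def warc_date(dt, use_micros=False):
    """Format an aware datetime as a UTC timestamp, optionally with microseconds."""
    dt = dt.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ" if use_micros else "%Y-%m-%dT%H:%M:%SZ"
    return dt.strftime(fmt)


ts = datetime(2020, 1, 2, 3, 4, 5, 678901, tzinfo=timezone.utc)
second_precision = warc_date(ts)
micro_precision = warc_date(ts, use_micros=True)
```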

1.5.3

  • ArchiveIterator calls new close_decompressor() function in BufferedReader instead of close() to only close decompressor, not underlying stream. #35

1.5.2

  • Write any errors during decompression to stderr #31
  • to_native_str() returns original value unchanged if not a string/bytes type
  • WARCWriter.create_revisit_record() accepts additional WARC headers dictionary
  • ArchiveIterator.close() added which calls decompressor.flush() to address possible issues in #34
  • Switch Warc-Record-ID uuid creation to uuid4() from uuid1()
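
For context on the uuid4() switch: uuid1() derives its value from the host's network address and a timestamp, while uuid4() is random. A minimal sketch of generating a record ID this way follows; the `<urn:uuid:...>` wrapper is the usual WARC-Record-ID form, and the function name `make_record_id` is hypothetical.

```python
import uuid


def make_record_id():
    """Return a random (version 4) UUID wrapped as a urn:uuid record ID."""
    return "<urn:uuid:{}>".format(uuid.uuid4())


record_id = make_record_id()
```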

1.5.1

  • remove test/data from wheel build, as it breaks latest setuptools wheel installation
  • add Content-Length when adding Content-Range via StatusAndHeaders.add_range #29
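
To illustrate why Content-Length belongs alongside Content-Range, the sketch below builds both headers for a partial response. It is a stand-alone approximation, not the StatusAndHeaders.add_range implementation; the function name `range_headers` and its parameters are assumptions.

```python
def range_headers(start, part_len, total_len):
    """Build Content-Range and Content-Length headers for a byte range."""
    end = start + part_len - 1  # Content-Range uses an inclusive end offset
    return [
        ("Content-Range", "bytes {}-{}/{}".format(start, end, total_len)),
        # The body now holds only the requested part, so the length must match it.
        ("Content-Length", str(part_len)),
    ]


headers = range_headers(10, 5, 100)
```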

1.5.0

  • new extract cli command #26 (by @nlevitt)
  • fix for writing WARC record with no content-type #27 (by @thomaspreece)
  • better verification of chunk header before attempting to de-chunk with ChunkedDataReader
  • MANIFEST.in added (by @pmlandwehr)
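
The chunk header verification mentioned above can be approximated as follows: before de-chunking, confirm that the line really looks like a hexadecimal chunk size, so that a body which is not chunk-encoded is left alone. This is a sketch under that assumption, not the ChunkedDataReader code; `parse_chunk_size` is a hypothetical helper.

```python
import re

# Hex size, optional ";extension", then CRLF.
CHUNK_SIZE_LINE = re.compile(rb"[0-9A-Fa-f]+(;[^\r\n]*)?\r\n")


def parse_chunk_size(line):
    """Return the chunk size, or None if the line is not a plausible chunk header."""
    if not CHUNK_SIZE_LINE.fullmatch(line):
        return None
    size_field = line.split(b";", 1)[0].strip()
    return int(size_field.decode("ascii"), 16)


valid = parse_chunk_size(b"1a\r\n")
with_extension = parse_chunk_size(b"4;ext=1\r\n")
not_chunked = parse_chunk_size(b"<html>\r\n")
```

Returning None instead of raising lets a caller fall back to passing the data through unmodified.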

1.4.0

  • Indexing API improvements:
    • Indexer class moved to indexer.py and all aspects of indexing process can be extended.
    • Support for accessing http headers with http:-prefixed fields #22
    • Special fields: filename field and http:status
    • JSON offset and length fields returned as strings for consistency.
    • ArchiveIterator API: add get_record_offset() and get_record_length() to return current offset/length, iterator now tracks current record
  • StatusAndHeaders accepts headers in more flexible formats (mapping, byte or string) and normalizes to string tuples #19
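
One way to picture the header normalization described in the last item is a function that accepts a mapping, a list of pairs, or raw "Name: value" lines and always returns string tuples. The sketch below is an interpretation of that changelog entry, not the StatusAndHeaders code; `normalize_headers` and the ISO-8859-1 decoding of bytes are assumptions.

```python
def _to_str(value):
    """Decode bytes (assumed ISO-8859-1) or coerce any other value to str."""
    return value.decode("iso-8859-1") if isinstance(value, bytes) else str(value)


def normalize_headers(headers):
    """Return headers as a list of (name, value) string tuples."""
    items = headers.items() if hasattr(headers, "items") else headers
    result = []
    for item in items:
        if isinstance(item, (bytes, str)):
            # A raw "Name: value" line
            name, _, value = _to_str(item).partition(":")
            result.append((name.strip(), value.strip()))
        else:
            name, value = item
            result.append((_to_str(name), _to_str(value)))
    return result


from_mapping = normalize_headers({"Content-Type": "text/html"})
from_bytes = normalize_headers([(b"Content-Length", b"5")])
from_lines = normalize_headers(["Host: example.com"])
```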

1.3.4

  • Continue reading more data to decompress (behavior introduced in 1.3.2 for brotli decompression) only if no unused data remains; otherwise, the reader is likely at the end of a gzip member.

1.3.3

  • Set default read block_size to 16384, ensure block_size is never None (caused an issue in py2.7)

1.3.2

  • Fixes issues with BufferedReader returning an empty response when the brotli decompressor requires additional data; see #21 for details

1.3.1

  • Fixes #15, including:
    • WARCWriter.create_warc_record() works correctly when specifying a payload with no length param.
    • Writing DNS records now works (tests included).
    • HTTP headers only expected for writing request, response records if the URI has a http: or https: scheme (consistent with reading).

1.3

  • Support for reading "streaming" WARC records, with no Content-Length set. Content-Length and digests computed as expected when the record is written.
  • Additional tests for streaming WARC records, loading HTTP headers+payload from buffer, POST request record, arc2warc conversion.
  • recompress command now parses records fully and generates correct block and payload digests.
  • WARCWriter.writer.create_record_from_stream() removed, redundant with ArcWarcRecordLoader()

1.2

  • Support for special field offset to include WARC record offset when indexing (by @nlevitt, #4)
  • ArchiveIterator supports full iterator semantics
  • WARC headers encoded/decoded as UTF-8, with fallback to ISO-8859-1 (see #6, #7)
  • ArchiveIterator, StatusAndHeaders and WARCWriter now available from package root (by @nlevitt, #10)
  • StatusAndHeaders supports dict-like API (by @nlevitt, #11)
  • When reading, http headers never added by default, unless ensure_http_headers=True is set (see #12, #13)
  • All tests run on Windows, CI using Appveyor
  • Additional tests for writing/reading resource, metadata records
  • warcio -V now outputs current version.
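
The UTF-8 decoding with ISO-8859-1 fallback noted in this release can be sketched in a few lines. Since every byte sequence is valid ISO-8859-1, the fallback never fails. The function name `decode_header_line` is hypothetical.

```python
def decode_header_line(raw):
    """Decode header bytes as UTF-8, falling back to ISO-8859-1 on invalid input."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


from_utf8 = decode_header_line("é".encode("utf-8"))
from_latin1 = decode_header_line(b"\xe9")  # a lone 0xE9 byte is not valid UTF-8
```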

1.1

  • Header filtering: support filtering via custom header function, instead of an exclusion list
  • Add tests for invalid data passed to recompress, remove unused code
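
To illustrate the difference between an exclusion list and a custom filter function, here is a small stand-alone sketch. The callback signature used here (name and value in, a pair or None out) is an assumption for this example and may differ from what warcio expects; `apply_header_filter` and `redact_cookies` are hypothetical names.

```python
def apply_header_filter(headers, header_filter):
    """Run each (name, value) pair through the filter; drop pairs mapped to None."""
    filtered = []
    for name, value in headers:
        result = header_filter(name, value)
        if result is not None:
            filtered.append(result)
    return filtered


def redact_cookies(name, value):
    """Example filter: drop Set-Cookie entirely and mask Cookie values."""
    lowered = name.lower()
    if lowered == "set-cookie":
        return None
    if lowered == "cookie":
        return (name, "REDACTED")
    return (name, value)


original = [("Content-Type", "text/html"), ("Set-Cookie", "id=1"), ("Cookie", "id=1")]
filtered = apply_header_filter(original, redact_cookies)
```

A function can rewrite values as well as remove them, which a plain exclusion list cannot express.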

1.0

Initial Release!