The HTTP headers appear not to be written to WARC correctly. The request headers are constructed independently of what Scrapy actually sends to the server; the response's original status line, as noted in a code comment, is discarded entirely; and the headers are normalised by Scrapy (in scrapy.http.headers.Headers). Instead, the exact bytes sent to and received from the server (at the HTTP layer) should be written to the WARC file.
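To illustrate the normalisation part of the problem, here is a simplified stand-in for the key rewriting that Scrapy's `Headers` class performs (a sketch of the behaviour, not Scrapy's actual implementation):

```python
def normalise_key(key: bytes) -> bytes:
    """Title-case a header name, mimicking how
    scrapy.http.headers.Headers normalises keys."""
    return key.title()

# The exact bytes the server sent are lost after normalisation:
print(normalise_key(b"x-custom-HEADER"))  # b'X-Custom-Header'
print(normalise_key(b"ETag"))             # b'Etag'
```

A WARC record built from these normalised names can therefore differ byte-for-byte from what was actually on the wire, even when the header values are untouched.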
I'm willing to fix this problem, but I'm not sure how to retrieve the raw information using Scrapy.
Do you have any tips on how to approach this?
Related to #15
I haven't worked much with Scrapy, so no idea. I briefly looked around and saw that Scrapy has 'downloader middleware', but it looks like those also only see abstract Request/Response objects, so even that might not be enough. (Side note: Content-Encoding decompression is handled in a middleware, but the compressed data needs to be written to WARC.)
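For reference, a downloader middleware hook looks roughly like this (a minimal sketch; the class name and the capture logic are hypothetical). It shows why this layer is likely insufficient: the hook only receives Scrapy's abstract objects, never the raw wire bytes:

```python
class RawCaptureMiddleware:
    """Hypothetical downloader middleware sketch. By the time
    process_response() runs, Scrapy has already parsed the response:
    request.headers and response.headers are normalised Headers
    objects, and the original status line is gone."""

    def process_response(self, request, response, spider):
        # Only the reconstructed, normalised view is available here.
        captured = {
            "url": request.url,
            "status": response.status,          # an integer, not the status line
            "headers": dict(response.headers),  # already normalised
        }
        # A real implementation would hand `captured` to the WARC writer,
        # but it can never recover the exact on-wire bytes from here.
        return response
```

Capturing the true wire bytes would probably require hooking lower down, at the transport level, rather than in any middleware.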
I wouldn't be surprised if it was hard to fix. Many HTTP libraries in Python make it difficult to access the raw data – or rather, the authors probably never thought about it because it's not normally necessary. However, until this is fixed, all WARCs written by crau must be considered inaccurate.