Headers not preserved correctly #16

Open
JustAnotherArchivist opened this issue Nov 13, 2019 · 2 comments

Comments

@JustAnotherArchivist

The HTTP headers do not appear to be written to the WARC file correctly. There are three problems: the request headers are constructed independently of what Scrapy actually sends to the server; the response's original status line is discarded entirely (as a code comment mentions); and the response headers are normalised by Scrapy (in scrapy.http.headers.Headers). Instead, the exact bytes sent to and received from the server at the HTTP layer should be written to the WARC file.
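
To illustrate why a parse-then-rebuild round trip cannot reproduce the original bytes, here is a minimal sketch using only the Python standard library. It does not use Scrapy; `normalise_and_rebuild` and `RAW_RESPONSE` are hypothetical names made up for this example, and the normalisation rules (title-casing names, stripping whitespace, merging repeated headers, regenerating the status line from the status code) are assumptions chosen to resemble what a typical HTTP library does.

```python
from http import HTTPStatus

# Hypothetical bytes as a server might have sent them.
RAW_RESPONSE = (
    b"HTTP/1.0 200 Fine\r\n"
    b"content-type: text/html\r\n"
    b"X-custom-HEADER:   spaced value\r\n"
    b"Set-Cookie: a=1\r\n"
    b"Set-Cookie: b=2\r\n"
    b"\r\n"
    b"<html></html>"
)


def normalise_and_rebuild(raw: bytes) -> bytes:
    """Parse a response into normalised parts and serialise it again.

    This is a stand-in for any library that only exposes abstract
    response objects; it is not Scrapy's implementation.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.split(b"\r\n")
    status_line, header_lines = lines[0], lines[1:]

    # Only the numeric status code survives; version and reason are lost.
    code = int(status_line.split(b" ", 2)[1])

    headers = {}
    for line in header_lines:
        name, _, value = line.partition(b":")
        # Title-case the name, strip whitespace, collect repeated headers.
        headers.setdefault(name.strip().title(), []).append(value.strip())

    rebuilt = [f"HTTP/1.1 {code} {HTTPStatus(code).phrase}".encode("ascii")]
    for name, values in headers.items():
        rebuilt.append(name + b": " + b", ".join(values))
    return b"\r\n".join(rebuilt) + b"\r\n\r\n" + body


rebuilt = normalise_and_rebuild(RAW_RESPONSE)
print(rebuilt == RAW_RESPONSE)  # False: the round trip is lossy
```

The rebuilt response differs from the original in its status line, header casing, whitespace, and the number of `Set-Cookie` lines, so only the bytes captured before parsing are suitable for the WARC record.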

@turicas
Owner

turicas commented Feb 19, 2023

I'm willing to fix this problem, but I'm not sure how to retrieve the raw request and response data using Scrapy.
Do you have any tips on how to approach this?
Related to #15

@JustAnotherArchivist
Author

I haven't worked much with Scrapy, so I have no idea. I briefly looked around and saw that Scrapy has 'downloader middleware', but it looks like those also only see abstract Request/Response objects, so even that might not be enough. (Side note: Content-Encoding decompression is handled in a middleware, but it is the compressed data that needs to be written to WARC.)
I wouldn't be surprised if this were hard to fix. Many HTTP libraries in Python make it difficult to access the raw data, or rather, the authors probably never thought about it because it's not normally necessary. However, until this is fixed, all WARCs written by crau must be considered inaccurate.
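
To make the Content-Encoding side note concrete, here is a minimal standard-library sketch. It does not involve Scrapy; `build_wire_response`, `split_response`, and `declared_length` are hypothetical helpers invented for this example. It shows that once a body has been decompressed, it no longer agrees with the headers that were sent alongside it.

```python
import gzip

HTML = b"<p>hello WARC</p>" * 50


def build_wire_response(html: bytes) -> bytes:
    """Build the bytes a server might send for a gzip-encoded response."""
    compressed = gzip.compress(html, mtime=0)
    head = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"Content-Encoding: gzip\r\n"
        b"Content-Length: " + str(len(compressed)).encode("ascii")
    )
    return head + b"\r\n\r\n" + compressed


def split_response(raw: bytes):
    """Split raw response bytes into the header block and the body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body


def declared_length(head: bytes) -> int:
    """Return the Content-Length value declared in the header block."""
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            return int(value.strip())
    raise ValueError("no Content-Length header")


wire = build_wire_response(HTML)
head, wire_body = split_response(wire)

# What a decompressing middleware would hand to later stages.
decoded_body = gzip.decompress(wire_body)

print(declared_length(head) == len(wire_body))     # headers match the wire bytes
print(declared_length(head) == len(decoded_body))  # but not the decoded body
```

A record that pairs the original headers with `decoded_body` is internally inconsistent, which is why the bytes as received (`wire_body` here) are the ones that belong in the WARC.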
