Assorted improvements #6

Open
wants to merge 7 commits into
base: master
from

Conversation

notevenaperson
Contributor

@scoliono here are some changes for review. 8569763 should fix #1.

Also your email visible in git log is very nice.

…een requests to prevent being blocked

1. adds the -R flag
2. Should fix scoliono#1 and
   adds the -nt flag
The overwrite prompt is removed because we no longer overwrite files unless the user explicitly asks for it (-R flag).
The prompt also got in the way of running the script non-interactively.
It's more informative to log the actual filename, which includes the page number. I also feel that gauging the progress is easy enough with (N/N) to make a percentage indicator unnecessary.

Changed from:
12% (1/8) done
25% (2/8) done
37% (3/8) done
50% (4/8) done
62% (5/8) done
75% (6/8) done
87% (7/8) done
100% (8/8) done

To:
Got ./OL370939M/100.jpg (1/8)
Got ./OL370939M/101.jpg (2/8)
Got ./OL370939M/102.jpg (3/8)
Got ./OL370939M/103.jpg (4/8)
Got ./OL370939M/104.jpg (5/8)
Got ./OL370939M/105.jpg (6/8)
Got ./OL370939M/106.jpg (7/8)
Got ./OL370939M/107.jpg (8/8)

(The command used to generate these logs was: `python3 ripper.py OL370939M -s 100 -e 107 -S 10`)
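As a rough illustration of the new log format, here is a minimal sketch that rebuilds the "Got ... (N/N)" lines shown above. The helper names `page_path` and `progress_line` are hypothetical and are not taken from ripper.py; the directory layout is inferred from the sample output only.

```python
def page_path(book_id, page, directory="."):
    # Assumed layout, inferred from the log lines: <directory>/<book_id>/<page>.jpg
    return "%s/%s/%d.jpg" % (directory, book_id, page)


def progress_line(path, done, total):
    # Name the file that was just saved and show the running count.
    return "Got %s (%d/%d)" % (path, done, total)


start, end = 100, 107
total = end - start + 1

lines = [
    progress_line(page_path("OL370939M", page), n, total)
    for n, page in enumerate(range(start, end + 1), start=1)
]

for line in lines:
    print(line)
```

Because the filename already carries the page number, each line tells the reader both which page was written and how far along the run is.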
@scoliono scoliono self-requested a review September 14, 2021 08:09
@notevenaperson
Contributor Author

@scoliono merge?

@scoliono
Owner

scoliono commented Aug 1, 2023

From my testing, this does not appear to fully circumvent Archive.org's rate limiting. After around 100 pages, you start receiving 5 KB HTML documents instead of images.
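One way to notice this failure mode before writing a bad file to disk is to inspect the first bytes of the response body. The sketch below assumes the pages are JPEGs (the files are saved as .jpg) and uses hypothetical helper names; it is not part of api.py or ripper.py.

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files begin with the SOI marker followed by another marker


def looks_like_jpeg(body):
    # True when the payload starts with the JPEG signature bytes.
    return body[:3] == JPEG_MAGIC


def looks_like_html(body):
    # Heuristic only: check the start of the payload for an HTML prologue.
    head = body[:256].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def check_page(body):
    # Raise instead of silently saving an HTML error page as NNN.jpg.
    if looks_like_html(body) or not looks_like_jpeg(body):
        raise ValueError("response is not a JPEG image; possibly rate limited")
    return body
```

A caller could treat the `ValueError` as a signal to pause and try again later rather than continuing to the next page.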
Also, a nitpick: when some pages have already been downloaded, the total count that is displayed is misleading. For example, this book has 556 pages total, and about half of the pages were already downloaded:

./chiltonstoyotace0000unse_b0s4/177.jpg (0/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/178.jpg (0/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/179.jpg (0/556) already on disk, skipping
Got ./chiltonstoyotace0000unse_b0s4/180.jpg (1/556)
Got ./chiltonstoyotace0000unse_b0s4/181.jpg (2/556)
Got ./chiltonstoyotace0000unse_b0s4/182.jpg (3/556)
Got ./chiltonstoyotace0000unse_b0s4/183.jpg (4/556)
Got ./chiltonstoyotace0000unse_b0s4/184.jpg (5/556)
Got ./chiltonstoyotace0000unse_b0s4/185.jpg (6/556)
Got ./chiltonstoyotace0000unse_b0s4/186.jpg (7/556)
Got ./chiltonstoyotace0000unse_b0s4/187.jpg (8/556)
Got ./chiltonstoyotace0000unse_b0s4/188.jpg (9/556)
Got ./chiltonstoyotace0000unse_b0s4/189.jpg (10/556)
Got ./chiltonstoyotace0000unse_b0s4/190.jpg (11/556)
Got ./chiltonstoyotace0000unse_b0s4/191.jpg (12/556)
Got ./chiltonstoyotace0000unse_b0s4/192.jpg (13/556)
Got ./chiltonstoyotace0000unse_b0s4/193.jpg (14/556)
Got ./chiltonstoyotace0000unse_b0s4/194.jpg (15/556)
Got ./chiltonstoyotace0000unse_b0s4/195.jpg (16/556)
./chiltonstoyotace0000unse_b0s4/196.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/197.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/198.jpg (16/556) already on disk, skipping
./chiltonstoyotace0000unse_b0s4/199.jpg (16/556) already on disk, skipping
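One possible way to address the nitpick is to compute the total from the pages that actually need downloading, and to leave the counter off the "skipping" lines. This is only a sketch under that assumption; `progress_lines` and its parameters are hypothetical, and the real script may prefer a different convention (for example, counting skipped pages as done).

```python
def progress_lines(pages, already_on_disk, directory):
    # The total reflects only the pages that still have to be fetched.
    total = sum(1 for page in pages if page not in already_on_disk)
    done = 0
    lines = []
    for page in pages:
        path = "%s/%d.jpg" % (directory, page)
        if page in already_on_disk:
            # No counter here: a skipped page does not advance the download count.
            lines.append("%s already on disk, skipping" % path)
        else:
            done += 1
            lines.append("Got %s (%d/%d)" % (path, done, total))
    return lines


pages = range(177, 200)
on_disk = {177, 178, 179, 196, 197, 198, 199}
lines = progress_lines(pages, on_disk, "./chiltonstoyotace0000unse_b0s4")

for line in lines:
    print(line)
```

With the excerpt above as input, the 16 missing pages are reported as 1/16 through 16/16 instead of being measured against the full page count.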

@scoliono
Owner

scoliono commented Aug 1, 2023

If you're too persistent with the requests, it looks like you can also get this traceback:

  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "~/Library/Python/3.9/lib/python/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 285, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/git/archiveripper/ripper.py", line 107, in <module>
    main()
  File "~/git/archiveripper/ripper.py", line 89, in main
    contents = client.download_page(i, args.scale)
  File "~/git/archiveripper/api.py", line 141, in download_page
    res = self.session.get(self.book_page_urls[i] + "&scale=%d" % scale, headers={
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "~/Library/Python/3.9/lib/python/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
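A dropped connection like this could be handled by retrying the request with an increasing delay instead of letting the exception end the run. The following is a minimal sketch of exponential backoff around an arbitrary callable; `retry_with_backoff` and its parameters are hypothetical and do not exist in the project. Note that the exception in the traceback is `requests.exceptions.ConnectionError`, which is a different class from Python's built-in `ConnectionError`, so it would have to be passed explicitly via `retry_on`.

```python
import time


def retry_with_backoff(fetch, attempts=5, base_delay=1.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    # Call fetch() until it succeeds or the attempts run out.
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake fetch that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Remote end closed connection without response")
    return b"page-bytes"


result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

In the script, the wrapped callable might be something like `lambda: client.download_page(i, args.scale)`, with `retry_on=(requests.exceptions.ConnectionError,)`. Whether Archive.org tolerates retries at any particular delay is not established here.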

Development

Successfully merging this pull request may close these issues.

Hangs after downloading a few pages
2 participants