Bug on instagram scraping #892

Closed
redfalcoon opened this issue May 12, 2023 · 2 comments
Labels: duplicate (This issue or pull request already exists)

Comments

@redfalcoon

Describe the bug

Running snscrape --jsonl --max-results 10 instagram-hashtag oreo
fails with the following error:

2023-05-12 10:36:36.448 CRITICAL snscrape._cli Dumped stack and locals to /tmp/snscrape_locals_wgtjomc1
Traceback (most recent call last):
  File "/usr/local/bin/snscrape", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/snscrape/_cli.py", line 320, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/usr/local/lib/python3.8/site-packages/snscrape/modules/instagram.py", line 109, in get_items
    r = self._initial_page()
  File "/usr/local/lib/python3.8/site-packages/snscrape/modules/instagram.py", line 77, in _initial_page
    r = self._get(self._initialUrl, headers = self._headers, responseOkCallback = self._check_initial_page_callback)
  File "/usr/local/lib/python3.8/site-packages/snscrape/base.py", line 266, in _get
    return self._request('GET', *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/snscrape/base.py", line 237, in _request
    success, msg = responseOkCallback(r)
  File "/usr/local/lib/python3.8/site-packages/snscrape/modules/instagram.py", line 88, in _check_initial_page_callback
    jsonData = r.text.split('<script type="text/javascript">window._sharedData = ')[1].split(';</script>')[0] # May throw an IndexError if Instagram changes something again; we just let that bubble.
IndexError: list index out of range

The same behaviour also occurs with instagram-user and instagram-location.
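
For illustration, the IndexError in the last frame is what str.split followed by [1] produces when the marker string is not present in the page: split then returns a one-element list. The sketch below is not snscrape's code. The names MARKER and extract_shared_data are hypothetical, and the marker text is copied from the traceback line. It shows one way such an extraction could report a missing marker instead of raising, assuming the page either contains the marker followed by ';</script>' or does not contain it at all.

```python
import json

# Marker string copied from the traceback above; everything else here is
# a hypothetical illustration, not part of snscrape.
MARKER = '<script type="text/javascript">window._sharedData = '
END = ';</script>'


def extract_shared_data(html):
    """Return the parsed window._sharedData object, or None if it is absent."""
    # partition never raises: if MARKER is missing, `found` is an empty string.
    _, found, rest = html.partition(MARKER)
    if not found:
        return None
    payload, found_end, _ = rest.partition(END)
    if not found_end:
        return None
    return json.loads(payload)


# A page without the marker reproduces the reported failure mode:
# 'no marker here'.split(MARKER) is a one-element list, so [1] raises IndexError.
# The guarded version returns None instead.
page_without_marker = '<html><body>no marker here</body></html>'
page_with_marker = '<html>' + MARKER + '{"entry_data": {}}' + END + '</html>'
```

With these inputs, extract_shared_data(page_without_marker) returns None, while extract_shared_data(page_with_marker) returns the decoded dictionary.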

How to reproduce

Issuing the command
snscrape --jsonl --max-results 10 instagram-hashtag oreo

Expected behaviour

JSON output for the items on the page.

Screenshots and recordings

No response

Operating system

centos 8

Python version: output of python3 --version

Python 3.8.12

snscrape version: output of snscrape --version

snscrape 0.6.2.20230321.dev13+g786815d

Scraper

snscrape --jsonl --max-results 10 instagram-hashtag oreo

How are you using snscrape?

CLI (snscrape ... as a command, e.g. in a terminal)

Backtrace

No response

Log output

2023-05-12 10:44:46.688 INFO snscrape.modules.instagram Retrieving initial data
2023-05-12 10:44:46.690 INFO snscrape.base Retrieving https://www.instagram.com/explore/tags/oreo/
2023-05-12 10:44:46.690 DEBUG snscrape.base ... with headers: {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
2023-05-12 10:44:46.690 DEBUG snscrape.base ... with environmentSettings: {'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
2023-05-12 10:44:46.690 DEBUG urllib3.connectionpool Starting new HTTPS connection (1): www.instagram.com:443
2023-05-12 10:44:46.755 DEBUG snscrape.base Connected to: ('31.13.86.174', 443)
2023-05-12 10:44:46.755 DEBUG snscrape.base Connection cipher: ('ECDHE-RSA-AES128-GCM-SHA256', 'TLSv1/SSLv3', 128)
2023-05-12 10:44:47.697 DEBUG urllib3.connectionpool https://www.instagram.com:443 "GET /explore/tags/oreo/ HTTP/1.1" 200 None
2023-05-12 10:44:47.832 INFO snscrape.base Retrieved https://www.instagram.com/explore/tags/oreo/: 200
2023-05-12 10:44:47.832 DEBUG snscrape.base ... with response headers: {'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'critical-ch': 'Sec-CH-UA-Model', 'accept-ch-lifetime': '4838400', 'accept-ch': 'viewport-width,Sec-CH-Prefers-Color-Scheme,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Platform-Version', 'reporting-endpoints': 'coep_report="https://www.facebook.com/browser_reporting/?minimize=0", default="https://www.instagram.com/error/ig_web_error_reports/?device_level=unknown"', 'report-to': '{"max_age":86400,"endpoints":[{"url":"https:\/\/www.facebook.com\/browser_reporting\/?minimize=0"}],"group":"coep_report"}, {"max_age":259200,"endpoints":[{"url":"https:\/\/www.instagram.com\/error\/ig_web_error_reports\/?device_level=unknown"}]}', 'content-security-policy-report-only': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.facebook.com *.fbcdn.net *.facebook.net 'unsafe-inline' 'unsafe-eval' blob: data: 'self' *.instagram.com static.cdninstagram.com;style-src data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com *.instagram.com static.cdninstagram.com;connect-src *.facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: blob: .instagram.com .cdninstagram.com wss://.instagram.com: 'self' wss://edge-chat.instagram.com connect.facebook.net;font-src *.facebook.com data: *.fbcdn.net *.instagram.com static.cdninstagram.com *.intern.facebook.com;img-src *.instagram.com *.facebook.com *.fbcdn.net data: blob: *.cdninstagram.com *.fbsbx.com android-webview-video-poster:;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob:;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data:;block-all-mixed-content;report-uri https://www.facebook.com/csp/reporting/?m=c&minimize=0;", 'content-security-policy': "default-src *.facebook.com *.fbcdn.net *.instagram.com data: blob:;script-src *.facebook.com *.fbcdn.net *.facebook.net 'unsafe-inline' 'unsafe-eval' blob: data: 'self' 
*.instagram.com static.cdninstagram.com;style-src data: blob: 'unsafe-inline' *.fbcdn.net *.facebook.com *.instagram.com static.cdninstagram.com;connect-src *.facebook.com facebook.com .fbcdn.net .facebook.net wss://.facebook.com: blob: .instagram.com .cdninstagram.com wss://.instagram.com: 'self' wss://edge-chat.instagram.com connect.facebook.net;font-src *.facebook.com data: *.fbcdn.net *.instagram.com static.cdninstagram.com *.intern.facebook.com;img-src *.instagram.com *.facebook.com *.fbcdn.net data: blob: *.cdninstagram.com *.fbsbx.com android-webview-video-poster: *.whatsapp.net;media-src *.facebook.com *.fbcdn.net *.instagram.com *.cdninstagram.com cdn.fbsbx.com data: blob:;frame-src *.instagram.com *.facebook.com *.fbsbx.com fbsbx.com data:;block-all-mixed-content;upgrade-insecure-requests;report-uri https://www.facebook.com/csp/reporting/?m=c&minimize=0;", 'document-policy': 'force-load-at-top', 'permissions-policy': 'accelerometer=()', 'cross-origin-resource-policy': 'rollout', 'cross-origin-embedder-policy-report-only': 'require-corp;report-to="coep_report"', 'cross-origin-opener-policy': 'same-origin-allow-popups', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Expires': 'Sat, 01 Jan 2000 00:00:00 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'X-Frame-Options': 'DENY', 'Strict-Transport-Security': 'max-age=15552000', 'Content-Type': 'text/html; charset="utf-8"', 'X-FB-Debug': '+eZyYqxs96slB6NBTlraJ0h9YS5OXMvuy+/L90+PskOEK8vZ+3YgAbE2msYaoJGl2cgSJYgRhPb0v9P9kWMIfA==', 'Date': 'Fri, 12 May 2023 08:44:47 GMT', 'X-FB-TRIP-ID': '1679558926', 'Alt-Svc': 'h3=":443"; ma=86400', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive'}

Dump of locals

No response

Additional context

No response

redfalcoon added the bug (Something isn't working) label on May 12, 2023
@AhmadGhachim

Got the same error too

@TheTechRobo
Contributor

#520 ?

JustAnotherArchivist added the duplicate (This issue or pull request already exists) label and removed the bug (Something isn't working) label on May 12, 2023
JustAnotherArchivist closed this as not planned on May 12, 2023