-
I tried using ScraperAPI last week, and it worked for the first few attempts before being blocked consistently. I thought using their API endpoint had a marginal advantage over using their proxy mode, but that could have just been coincidence. Can you try Methods 1 and 3 from their documentation using cURL with a Google Scholar publication page (one that has …)
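For reference, the two request styles can be compared directly from the command line. This is only a sketch based on ScraperAPI's public documentation; `YOUR_API_KEY` and the Scholar URL are placeholders, and the target URL should be URL-encoded when passed as a query parameter:

```shell
# Method 1: API endpoint -- pass the target page as a (URL-encoded) query parameter
curl "http://api.scraperapi.com/?api_key=YOUR_API_KEY&url=https%3A%2F%2Fscholar.google.com%2Fcitations%3Fuser%3DSOME_AUTHOR_ID"

# Method 3: proxy mode -- route the request through ScraperAPI's proxy port
curl -x "http://scraperapi:YOUR_API_KEY@proxy-server.scraperapi.com:8001" -k \
     "https://scholar.google.com/citations?user=SOME_AUTHOR_ID"
```

If one method is blocked and the other is not, that would help isolate whether the problem is on scholarly's side or ScraperAPI's.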
-
OK, I have some good news! I've figured out the issue in scholarly that makes it hard to work with ScraperAPI and will submit a fix soon. That said, I find ScraperAPI extremely slow, but it works and they have a free plan, so it's definitely worth it.
-
@danuccio Please install v1.4.2 of scholarly, run it with your ScraperAPI credentials, and let us know how it works. It will be slower with ScraperAPI, but it is reliable. For several hundred authors/publications you would most likely need to upgrade to a paid plan, but you can check first whether the free plan is enough.
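For anyone following along, here is a minimal setup sketch, assuming scholarly's documented `ProxyGenerator` interface; the API key is a placeholder and the example author query is illustrative only:

```python
from scholarly import scholarly, ProxyGenerator

# Route all scholarly traffic through ScraperAPI ("YOUR_API_KEY" is a placeholder).
pg = ProxyGenerator()
success = pg.ScraperAPI("YOUR_API_KEY")  # True if the proxy was accepted
if not success:
    raise RuntimeError("ScraperAPI proxy setup failed -- check the key and plan")
scholarly.use_proxy(pg)

# From here on, searches go through the proxy, e.g.:
author = next(scholarly.search_author("Albert Einstein"))
scholarly.fill(author, sections=["publications"])
```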
-
A few ideas, to keep it simple:
-
I am not sure whether this is technically an issue with the scholarly package, so I am posting it in the discussion section.
I am trying to scrape data from Google Scholar for several hundred authors (and their respective publications) each week. I wrote a script that integrates scholarly and should be able to do this. However, to avoid getting blocked by Google, I am looking into scholarly's proxy options. ScraperAPI looked like the most promising of them, and I integrated it into my code right before Labor Day. At the time it worked, although with a fairly high failure rate (roughly 16-33%). ScraperAPI's support team said they had a known issue with Google at the time but were fixing it. After the fix, I seemed to be able to scrape with a near-100% success rate, yet my ScraperAPI dashboard indicated a near-100% failure rate; ScraperAPI support said they had no record of me even making requests to Google, which would suggest the requests were being made from my own IP address.
My question is: has anyone else had issues with scholarly/ScraperAPI compatibility in the past week or two? I am trying to determine whether the issue is with my code or whether scholarly and ScraperAPI are no longer compatible.
On a related note, does anyone know if the Luminati option is still working since they became Bright Data?
Below I have included the most relevant portions of my code.
This chunk is where I use scholarly to call ScraperAPI, search for an author, and call a function to gather publication info.

```python
def main(author_ids, output_file, random_interval_precaution, article_limit_precaution, verbosity, api_key):
    ...
```
This chunk is a function that loops through the publications of an author and fills them using scholarly.fill().
```python
def gather_pub_info(author, random_intervals, dicList, random_interval_precaution, article_limit, verbosity, pg, api_key):
    ...
```
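Since the function bodies above are elided, here is a minimal, self-contained sketch of the kind of loop `gather_pub_info` is described as performing. The names and parameters are stand-ins, not scholarly's API; in the real script, `scholarly.fill(pub)` would play the role of `fill`:

```python
import random
import time

def gather_pub_info_sketch(publications, fill, min_wait=2.0, max_wait=5.0,
                           max_retries=3, article_limit=None):
    """Fill each publication, pausing a random interval before every request
    and retrying a few times on failure (mirrors the precautions described)."""
    filled, failed = [], []
    for i, pub in enumerate(publications):
        if article_limit is not None and i >= article_limit:
            break  # honour the article-limit precaution
        for attempt in range(max_retries):
            time.sleep(random.uniform(min_wait, max_wait))  # random-interval precaution
            try:
                filled.append(fill(pub))
                break
            except Exception:
                if attempt == max_retries - 1:
                    failed.append(pub)  # record and move on after the final retry
    return filled, failed
```

Keeping a `failed` list makes it easy to re-run only the publications that were blocked, instead of repeating the whole batch.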