I have been having a hard time getting my queries to complete lately - they run into near-infinite loops of messages like:
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
...
The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API - they fall anywhere between 3000 and 12000 results. Here is an example:
category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug
In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:
Set page size to 1000
I experimented with page sizes from 200 to 2000:
At 200, it takes ages to fetch all 10000+ results, and the much larger number of queries needed raises the risk of entering the above-mentioned infinite loop of death.
At 1000, you get more results at once, finish faster, and send fewer queries - and the risk of entering the infinite loop of death is no higher than with 500. Plus, you don't silently get just 200 results back, as seems to happen with 2000...
I thus settled for a page size of 1000 in getpapers/lib/arxiv.js:
arxiv.pagesize = 1000
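To see why the page size matters so much, here is a rough sketch of the request count needed to drain a result set (`requestsNeeded` is a hypothetical helper for illustration, not part of getpapers; 12000 is the upper end of the result counts mentioned above):

```javascript
// Each page is one API request; fewer, larger pages mean fewer chances
// to hit a malformed/empty feed mid-run.
function requestsNeeded(totalResults, pageSize) {
  return Math.ceil(totalResults / pageSize);
}

console.log(requestsNeeded(12000, 200));  // 60 requests
console.log(requestsNeeded(12000, 1000)); // 12 requests
```

At a page size of 200 you make five times as many requests as at 1000, and any one of them can trigger the retry loop.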
Set a higher delay between retries
I experimented with various delays too: the default of 3 seconds hammers the API far too fast, while 30 seconds wastes too much time sleeping. 15 or 20 seconds seem to work well, so I have set
arxiv.page_delay = 20000
in getpapers/lib/arxiv.js
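For the curious, the retry-with-delay behaviour that `arxiv.page_delay` controls can be sketched roughly like this (`fetchPage` is a hypothetical stand-in for whatever actually talks to the arXiv API, not a real getpapers function):

```javascript
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry a page fetch until the feed actually contains entries,
// backing off delayMs between attempts instead of hammering the API.
async function fetchWithRetry(fetchPage, maxRetries, delayMs) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const feed = await fetchPage();
    if (feed && feed.entries && feed.entries.length > 0) return feed;
    // Empty or malformed feed: wait before querying again.
    await sleep(delayMs);
  }
  throw new Error('Malformed response from arXiv API - no data in feed');
}
```

In getpapers terms, `delayMs` corresponds to `arxiv.page_delay`, i.e. 20000 ms with the setting above.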
Do not urlencode the whole query URL, only the parts that need it
See #178 for this.
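The gist of the encoding problem can be shown in a few lines: if the entire URL goes through `encodeURIComponent`, the structural characters (`://`, `?`, `=`) get mangled too, and the API returns an empty feed. Only the query value needs encoding. (The endpoint below is arXiv's public API base URL; the exact URL assembly inside getpapers may differ.)

```javascript
const query = 'cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]';

// Wrong: encoding the whole URL also encodes the scheme and separators.
const broken = encodeURIComponent(
  'http://export.arxiv.org/api/query?search_query=' + query
);

// Right: encode only the part that needs it.
const base = 'http://export.arxiv.org/api/query';
const url = base + '?search_query=' + encodeURIComponent(query);

console.log(url);
```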
Correct a bug where the results feed is not empty - but not complete either...
See #177 for details.
Last but not least (I will repeat myself on this): do yourself a favour and spoof your User-Agent in getpapers/lib/config.js:
config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'
With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)