getpapers 'JavaScript heap out of memory error' #191
Comments
I think the issue may be more to do with the interaction with the EUPMC API.
I receive the following output:
This time the results do output to the directory, but the progress bar appears to halt at 5% (4,990 unique downloads, as shown in the log) of the 100k specified because of this wrong hitcount error, which may be the real problem. To add, the total number of available papers with this query is supposedly 1,029,162.
Thanks!
On Thu, May 13, 2021 at 3:40 PM James Sanders ***@***.***> wrote:
I am hoping to do an extensive mine (~1 million papers ultimately). I am attempting a 100k paper mine first; however, I am running into memory errors which I would guess are due to the number of papers attempting to download?
The error I receive is:
I am aware that there seems to be a limit - I think people have mentioned that it cuts out at about 30K.
Is there any way around this, or a fix? Would this be improved if I increased the RAM on the machine (currently 8 GB)?
I didn't write the software - Rick Smith-Unna did - and I don't know Node well enough to comment.
My fix would be to try to batch it by date - divide it into d1-d2, d2-d3, etc.
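A rough, untested sketch of that batching (assuming EuropePMC accepts FIRST_PDATE range queries passed straight through -q; the year range and output directory names are only placeholders):
# one getpapers run per publication year, each into its own output directory
# FIRST_PDATE range syntax is EuropePMC's; adjust if the API expects a different field
for year in $(seq 2010 2020); do
  getpapers -q "biotechnology AND FIRST_PDATE:[${year}-01-01 TO ${year}-12-31]" -a -k 100000 -o "biotechnology_${year}"
done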
It's possible that if you restart, it will read the results_eupmc.json file, note what has already been downloaded, and skip those.
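If that skip-on-restart behaviour does hold, a crude wrapper (untested; it assumes getpapers exits with a non-zero status when it crashes, and the query and directory are placeholders) could simply keep relaunching it:
# keep restarting getpapers until it exits cleanly, relying on it
# re-reading results_eupmc.json and skipping finished downloads
until getpapers -q biotechnology -a -k 100000 -o 100k_biotechnology; do
  echo "getpapers crashed, restarting in 10s..."
  sleep 10
done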
You might try something like:
# re-run getpapers 50 times into the same output directory
for i in $(seq 1 50); do
  getpapers -q biotechnology -a -k 30000 -o 1000000_biotechnology
done
but I haven't tried it. It may apply the cutoff to the initial batch, not count the downloads.
FWIW, Ayush Garg and I are rewriting getpapers in Python so we can maintain it, but it's much slower as it's not multithreaded.
HTH
Many Thanks,
James
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
Thanks @petermr for a speedy reply. I would actually be very interested in mining papers by date, as I am hoping to elucidate trends in biotechnology over time. I don't quite understand your implementation here:
and I can't see anything in the wiki related to this - would you mind explaining a bit more about how to refine by date? Many Thanks,
I am hoping to do an extensive mine (~1 million papers ultimately). I am attempting a 100k paper mine first; however, I am running into memory errors which I would guess are due to the number of papers attempting to download?
The error I receive is:
Nothing is ultimately downloaded into the output directory, and attempts to restart fail.
Is there any way around this, or a fix? Would this be improved if I increased the RAM on the machine (currently 8 GB)?
Many Thanks,
James
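For the heap error itself, one generic Node.js workaround (not verified against getpapers; the 8192 MB figure and directory name are only examples) is to raise V8's heap limit for the run:
# give the Node process up to ~8 GB of heap instead of the default limit
NODE_OPTIONS="--max-old-space-size=8192" getpapers -q biotechnology -a -k 100000 -o 100k_biotechnology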