- Web scraper built with Python 3 and the beautifulsoup4 HTML parsing library.
- Built specifically for the Journal of Machine Learning Research (JMLR) website.
- CURRENT STATUS: In progress.
- Python 3 (https://www.python.org/download/releases/3.0/)
- beautifulsoup4 (https://www.crummy.com/software/BeautifulSoup/)
- lxml (https://lxml.de/)
- requests (https://requests.readthedocs.io/)
- unidecode (https://pypi.org/project/Unidecode/)
Use the pip package manager to install the required packages:
pip install beautifulsoup4
pip install lxml
pip install requests
pip install unidecode
- JMLR_scraper_FULL.py: Runs the full scrape, which collects titles, abstracts, abstract URLs, authors, keywords, affiliations, month of publication, volume URL, journal name, year of publication, volume list, and issue list.
- JMLR_scraper_VolumeX_abstract.py: Scrapes only the abstracts of a specific volume (X).
- JMLR_scraper_VolumeX_abstractURL: Scrapes only the abstract URLs of a specific volume.
- The remaining individual scrapers are used the same way.
python JMLR_scraper_FULL.py
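As a rough illustration of the kind of parsing these scripts perform, here is a minimal sketch that extracts titles, authors, and abstract URLs with beautifulsoup4. It is not taken from the scraper source: the `parse_volume_page` function, the `SAMPLE_HTML` markup, and the `dt`/`dd` page layout are assumptions made for the example, and the real JMLR pages may be structured differently. It uses Python's built-in `html.parser` so that it runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup, used instead of a live request.
# The actual JMLR volume pages may not follow this layout.
SAMPLE_HTML = """
<dl>
  <dt>An Example Paper Title</dt>
  <dd><b><i>Ada Author, Bob Builder</i></b>; 1(1):1-10, 2000.
      [<a href="/papers/v1/example00a.html">abs</a>]</dd>
</dl>
<dl>
  <dt>Another Example Title</dt>
  <dd><b><i>Cy Coder</i></b>; 1(2):11-20, 2000.
      [<a href="/papers/v1/example00b.html">abs</a>]</dd>
</dl>
"""


def parse_volume_page(html, base_url="https://www.jmlr.org"):
    """Return one dict per paper found in the given volume page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    papers = []
    for dt in soup.find_all("dt"):
        # Assumed layout: each <dt> title is followed by a <dd> with details.
        dd = dt.find_next_sibling("dd")
        if dd is None:
            continue
        authors_tag = dd.find("i")
        abs_link = dd.find("a", string="abs")
        papers.append({
            "title": dt.get_text(strip=True),
            "authors": authors_tag.get_text(strip=True) if authors_tag else "",
            # Resolve the relative link against the site root.
            "abstract_url": urljoin(base_url, abs_link["href"]) if abs_link else "",
        })
    return papers


if __name__ == "__main__":
    for paper in parse_volume_page(SAMPLE_HTML):
        print(paper)
```

To run this against a live page, the HTML string could be fetched with `requests.get(url).text` and passed to the same function.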
- Output is written to CSV files in the same directory as the script:
JMLR_Volume_1.csv
JMLR_Volume_2.csv
...
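The per-volume output files listed above could be produced along these lines. This is only a sketch using the standard `csv` module: the `write_volume_csv` helper and the column names are hypothetical, and the actual scripts may write different columns or use a different approach.

```python
import csv
from pathlib import Path


def write_volume_csv(volume, rows, out_dir="."):
    """Write the rows for one volume to JMLR_Volume_<volume>.csv."""
    path = Path(out_dir) / f"JMLR_Volume_{volume}.csv"
    # Hypothetical column names chosen for this example.
    fieldnames = ["title", "authors", "abstract_url"]
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    example_rows = [
        {"title": "An Example Paper Title",
         "authors": "Ada Author, Bob Builder",
         "abstract_url": "https://www.jmlr.org/papers/v1/example00a.html"},
    ]
    print(write_volume_csv(1, example_rows))
```

Writing one file per volume keeps each CSV small and allows a single volume to be re-scraped without touching the others.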
Created by Santosh Khadka [email protected]
Pull requests are welcome.