This document contains a list of libraries and resources for web scraping in R.
Note: All selected libraries are either actively maintained or widely used.
- httr2: An R library to make HTTP requests and process their responses. A modern reimagining of httr
- crul: An R6 based HTTP client for R (for developers)
- RCurl: A wrapper for libcurl
- request: HTTP requests DSL for R
- httpRequest: An R library for HTTP request protocols. Implements GET, POST, and multipart POST requests
- routr: Routing of web requests in R
- fauxpas: An R library that provides HTTP error helpers
- reqres: Powerful classes for HTTP requests and responses
- tryr: Client/Server error handling for HTTP APIs
- base64url: A fast and url-safe base64 encoder and decoder for R
- xml2: R bindings to libxml2
- XiMpLe: A library that provides a simple XML tree parser/generator
- XML: Tools for parsing and generating XML within R and S-Plus
- xml2relational: A library for converting XML documents into relational data models
- xmlconvert: A library for comfortably converting XML documents to dataframes and vice versa
- xmlr: An XML DOM package for R, similar to JDOM, implemented using Reference Classes
- csvread: A fast and specialized CSV file loader
- easycsv: An R package for easy data loading from multiple tables
- parsedate: An R package to parse dates given in arbitrary formats
- pdfsearch: A library to search PDF files for keywords
- pdftools: Tools for text extraction, rendering, and converting of PDF documents
- tabulapdf: R bindings for Tabula PDF table extractor library
- pdf: A library for programmatic conversion of PDF tables with R
- Rpoppler: PDF tools based on Poppler
- staplr: A toolkit for PDF files that provides functions to manipulate PDF files
- pdfminer: An R library that provides an interface to PDFMiner, a Python package for extracting information from PDF-files
- marquee: A Markdown parser and renderer for R Graphics
- parsemd: A library to extract the content of an R Markdown file to allow for programmatic interactions with the document’s contents
- md4r: A Markdown parser implemented using the MD4C library
- yum: Utilities to extract and process YAML fragments
- RSqlParser: A parser for SQL statements
- queryparser: A library to translate SQL queries into R expressions
- sqlparseR: A wrapper for the Python module sqlparse
- readxl: A library to read Excel files (.xls and .xlsx) into R
- readxlsb: A library to import Excel binary (.xlsb) spreadsheets into R
- exceldata: A library to streamline data import, cleaning, and recoding from Excel
- modgetxl: A shiny module for reading Excel sheets
- humaniformat: A human name parser
- parcr: Construct parser combinators in R
- qmrparser: A parser combinator in R that provides basic functions for building parsers
- robotstxt: An R library for parsing and checking robots.txt files
- r-optparse: A command-line optional argument parser
- configr: A library that implements JSON, INI, YAML, and TOML parsers
- xmlparsedata: Parse data of R code as an XML tree
- rvest: Simple web scraping for R
- Rcrawler: An R web crawler and scraper
- ralger: A library that makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2
- scrapeR: Functions to fetch and extract text content from specified web pages
- scraEP: Tools for scraping information from webpages and other XML contents, using XPath or CSS selectors
- Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
- getProxy: An R library to get a free proxy IP and port in R
- ip2proxy: An R library that lets users identify IP addresses used as VPN anonymizers, open proxies, web proxies, and Tor exits
- r.proxy: A library to set a proxy in an R console
- CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]
- Randomuseragent: A library for filtering and randomly sampling real User-Agent strings
- heapsofpapers: A library to easily download heaps of PDF and CSV files
- antiword: A library to extract text from Microsoft Word documents
- tidyrss: An R package for extracting tidy dataframes from RSS, Atom and JSON feeds
- getwiki: An R wrapper for Wikipedia data
- tidywikidatar: A library to explore Wikidata through Tidy dataframes
- RSelenium: R bindings for Selenium WebDriver
- chromote: A Chrome remote interface for R
- hayalbaz: An R package that provides a Puppeteer-inspired interface to the Chrome DevTools Protocol using chromote
- selenium-r: A low-level browser automation interface
- parsel: A tool for parallel execution of RSelenium
- selenider: A concise, lazy, and reliable wrapper for chromote and selenium
- jsonlite: A robust, high-performance JSON parser and generator for R
- geojson: GeoJSON classes for R
- yyjsonr: A fast JSON package for R
- rapidjsonr: A library to provide JSON parsing capability through the Rapidjson C++ header-only library
- RJSONIO: A package that allows conversion to and from data in JavaScript Object Notation (JSON) format
- rjson: A library that converts R objects into JSON objects and vice versa
- csv: A library to read and write CSV files with selected conventions
- csvwr: A library to read and write CSVW (i.e., CSV tables and JSON metadata)
- csvy: A library that provides for import from and export to the CSVY file format
- cleanrmd: Clean class-less R Markdown HTML documents
- r-yaml: An R package for converting objects to and from YAML
- df2yaml: A library that converts dataframes to YAML
- xlsx: An R package to interact with Excel files using the Apache POI Java library
- writexl: A zero-dependency dataframe to xlsx exporter based on libxlsxwriter
- tidyverse: Easily install and load packages from the tidyverse
- tinytable: Simple and customizable tables in R
- utf8: A library to process and print UTF-8 encoded international text
- base64: A Base64 encoder and decoder
- lubridate: A library that makes working with dates in R just that little bit easier
- date: Functions for handling dates
- datefixR: A library to standardize dates in different formats or with missing data
- dialr: A library that parses, formats, and validates international phone numbers in R
- geojsonR: A GeoJSON processing toolkit
- jsonStrings: A library to manipulate JSON strings in R
- excel.link: A library for convenient data exchange between R and Microsoft Excel
- officer: A package that lets R users manipulate Word (.docx) and PowerPoint (.pptx) documents
- parallel: A built-in R library that provides support for parallel computation, including by forking (taken from package multicore), by sockets (taken from package snow) and random-number generation
- parallelly: A library for enhancing the parallel package
- RcppThread: A library that provides a C++11-style thread class and thread pool that can safely be interrupted from R
- cronR: A simple R package for managing your cron jobs
- later: A library to schedule an R function or formula to run after a specified period of time
- taskscheduleR: A library to schedule R scripts/processes with the Windows task scheduler
- sched: A library that offers classes and functions to contact web servers while enforcing scheduling rules required by the sites
- HTTP Client: httr2 or RCurl
- HTML Parser: xml2
Scraping Library: rvest or Rcrawler
Browser Automation: RSelenium
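
As a rough illustration of how an HTTP client and an HTML parser from this list can fit together, here is a minimal sketch using httr2 and rvest (which builds on xml2). The URL, the user-agent string, the inline HTML, and the `li.pkg` selector are all placeholders invented for the example, not part of any real site. The request is only built, never sent, so the sketch runs offline; the commented-out lines show where a live fetch would go.

```r
library(httr2)
library(rvest)

# Build (but do not send) a request. URL and user agent are placeholders.
req <- request("https://example.com/packages") |>
  req_user_agent("my-scraper (contact@example.com)") |>
  req_retry(max_tries = 3)

# In a live run, the page would come from the network instead:
# page <- req |> req_perform() |> resp_body_html()

# Inline HTML stands in for the response body (hypothetical markup).
html <- '<ul>
  <li class="pkg"><a href="/rvest">rvest</a></li>
  <li class="pkg"><a href="/xml2">xml2</a></li>
</ul>'
page <- read_html(html)

# Select the links with a CSS selector, then pull out text and attributes.
links <- html_elements(page, "li.pkg a")
packages <- data.frame(
  name = html_text2(links),
  href = html_attr(links, "href")
)
print(packages)
```

For pages that render their content with JavaScript, a static fetch like this may return an empty shell; that is the case where a browser automation tool such as RSelenium would typically replace the fetch step.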