R Web Scraping

This document contains a list of libraries and resources for web scraping in R.

Libraries

Note: All selected libraries are either actively maintained or widely used.

Network

HTTP Clients

  • httr2: An R library to make HTTP requests and process their responses. A modern reimagining of httr
  • crul: An R6 based HTTP client for R (for developers)
  • RCurl: A wrapper for libcurl
  • request: HTTP requests DSL for R
  • httpRequest: An R library implementing basic HTTP request protocols: GET, POST, and multipart POST requests
  • routr: Routing of web requests in R

WebSockets

  • websocket: WebSocket client for R
  • httpuv: HTTP and WebSocket server package for R

Other

  • fauxpas: An R library that provides HTTP error helpers
  • reqres: Powerful classes for HTTP requests and responses
  • tryr: Client/Server error handling for HTTP APIs
  • base64url: A fast and url-safe base64 encoder and decoder for R

Parsers

HTML/XML Parsers

  • xml2: R bindings to libxml2
  • XiMpLe: A library that provides a simple XML tree parser/generator
  • XML: Tools for parsing and generating XML within R and S-Plus
  • xml2relational: A library for converting XML documents into relational data models
  • xmlconvert: A library for comfortably converting XML documents to dataframes and vice versa
  • xmlr: XML dom package for R similar to jdom implemented using Reference Classes

URL Parsers

  • adaR: A wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++

CSV Parsers

  • csvread: A fast and specialized CSV file loader
  • easycsv: An R package for easy data loading from multiple tables

Date and Time Parsers

  • parsedate: An R package to parse dates given in arbitrary formats

PDF Parsers

  • pdfsearch: A library to search PDF files for keywords
  • pdftools: Tools for text extraction, rendering, and converting of PDF documents
  • tabulapdf: R bindings for Tabula PDF table extractor library
  • pdf: A library for programmatic conversion of PDF tables with R
  • Rpoppler: PDF tools based on Poppler
  • staplr: A toolkit for PDF files that provides functions to manipulate PDF files
  • pdfminer: An R library that provides an interface to PDFMiner, a Python package for extracting information from PDF-files

Markdown Parsers

  • marquee: A Markdown parser and renderer for R Graphics
  • parsemd: A library to extract the content of an R Markdown file to allow for programmatic interactions with the document’s contents
  • md4r: A Markdown parser implemented using the MD4C library

YAML Parsers

  • yum: Utilities to extract and process YAML fragments

SQL Parsers

  • RSqlParser: A parser for SQL statements
  • queryparser: A library to translate SQL queries into R expressions
  • sqlparseR: A wrapper for the Python module sqlparse

Office File Parsers

  • readxl: A library to read excel files (.xls and .xlsx) into R
  • readxlsb: A library to import Excel binary (.xlsb) spreadsheets into R
  • exceldata: A library to streamline data import, cleaning, and recoding from Excel
  • modgetxl: A shiny module for reading Excel sheets

Other

  • humaniformat: A human name parser
  • parcr: Construct parser combinators in R
  • qmrparser: A parser combinator in R that provides basic functions for building parsers
  • robotstxt: An R library for parsing and checking robots.txt files
  • r-optparse: A command-line optional argument parser
  • configr: A library that implements JSON, INI, YAML, and TOML parsers
  • xmlparsedata: A library that converts the parse data of R code into an XML tree

Web Scraping

Frameworks

  • rvest: Simple web scraping for R
  • Rcrawler: An R web crawler and scraper
  • ralger: A library that makes it easy to scrape a website. Built on the shoulders of titans: rvest, xml2
  • scrapeR: Functions to fetch and extract text content from specified web pages

Tools and Plugins

  • scraEP: Tools for scraping information from webpages and other XML contents, using XPath or CSS selectors

Proxy Integration

  • Bright Data's proxy services: A proxy network with over 72 million IPs offering premium residential, datacenter, mobile, and ISP proxies. Supports state, country, ZIP, and ASN level targeting across 195 countries. Works with any HTTP client or scraping library [Bright Data's solution]
  • getProxy: An R library to get a free proxy IP and port in R
  • ip2proxy: An R library that lets users identify IP addresses used as VPN anonymizers, open proxies, web proxies, and Tor exits
  • r.proxy: A library to set a proxy in an R console

CAPTCHA Solving

  • CAPTCHA Solver: A rapid and automated CAPTCHA solver that can solve challenges from reCAPTCHA, hCaptcha, px_captcha, SimpleCaptcha, GeeTest CAPTCHA, and more [Bright Data's solution]

User-Agent Spoofing

  • Randomuseragent: A library for filtering and randomly sampling real User-Agent strings

Other

  • heapsofpapers: A library to easily download heaps of PDF and CSV files
  • antiword: A library to extract text from Microsoft Word documents
  • tidyrss: An R package for extracting tidy dataframes from RSS, Atom and JSON feeds
  • getwiki: An R wrapper for Wikipedia data
  • tidywikidatar: A library to explore Wikidata through Tidy dataframes

Web Automation

Browser Automation Frameworks

Tools and Plugins

  • parsel: A tool for parallel execution of RSelenium

Other

  • selenider: A concise, lazy, and reliable wrapper for chromote and selenium

Data Export

JSON

  • jsonlite: A robust, high-performance JSON parser and generator for R
  • geojson: GeoJSON classes for R
  • yyjsonr: A fast JSON package for R
  • rapidjsonr: A library to provide JSON parsing capability through the Rapidjson C++ header-only library
  • RJSONIO: A package that allows conversion to and from data in JavaScript Object Notation (JSON) format
  • rjson: A library that converts R objects into JSON objects and vice versa

CSV

  • csv: A library to read and write CSV files with selected conventions
  • csvwr: A library to read and write CSVW (i.e., CSV tables and JSON metadata)
  • csvy: A library that provides for import from and export to the CSVY file format

Other

  • cleanrmd: Clean class-less R Markdown HTML documents
  • r-yaml: An R package for converting objects to and from YAML
  • df2yaml: A library that converts dataframes to YAML
  • xlsx: An R package to interact with Excel files using the Apache POI Java library
  • writexl: A zero-dependency dataframe to xlsx exporter based on libxlsxwriter

Data Processing

General

  • tidyverse: Easily install and load packages from the tidyverse

Tabular Data

  • tinytable: Simple and customizable tables in R

Character Encoding

  • utf8: A library to process and print UTF-8 encoded international text
  • base64: A Base64 encoder and decoder

Date and Time

  • lubridate: A library that makes working with dates in R just that little bit easier
  • date: Functions for handling dates
  • datefixR: A library to standardize dates in different formats or with missing data

Phone Numbers

  • dialr: A library to parse, format, and validate international phone numbers in R

Other

  • geojsonR: A GeoJSON processing toolkit
  • jsonStrings: A library to manipulate JSON strings in R
  • excel.link: A library for convenient data exchange between R and Microsoft Excel
  • officer: A package that lets R users manipulate Word (.docx) and PowerPoint (.pptx) documents

Other

Multiprocessing

  • parallel: A built-in R library that provides support for parallel computation, including by forking (taken from package multicore), by sockets (taken from package snow) and random-number generation
  • parallelly: A library for enhancing the parallel package
  • RcppThread: A library that provides a C++11-style thread class and thread pool that can safely be interrupted from R

Task Scheduling

  • cronR: A simple R package for managing your cron jobs
  • later: A library to schedule an R function or formula to run after a specified period of time
  • taskscheduleR: A library to schedule R scripts/processes with the Windows task scheduler
  • sched: A library that offers classes and functions to contact web servers while enforcing scheduling rules required by the sites

Popular Web Scraping Stacks

Static Web Pages

HTTP Client + HTML Parser

  • HTTP Client: httr2 or RCurl
  • HTML Parser: xml2
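
As a rough illustration of this stack, the sketch below fetches a page with httr2 and queries it with xml2. The target URL and the User-Agent string are placeholders, not part of the list above, and the snippet assumes R 4.1 or later (for the native pipe) and network access.

```r
library(httr2)
library(xml2)

# Hypothetical target; replace with a page you are permitted to scrape.
url <- "https://example.com"

# Build the request: identify the client and retry transient failures.
req <- request(url) |>
  req_user_agent("my-scraper (contact: you@example.com)") |>  # placeholder UA
  req_retry(max_tries = 3)

# Perform the request; httr2 raises an R error on HTTP 4xx/5xx by default.
resp <- req_perform(req)

# Parse the response body as HTML; the result is an xml2 document.
doc <- resp_body_html(resp)

# Query the document with XPath through xml2.
title <- xml_text(xml_find_first(doc, "//title"))
links <- xml_attr(xml_find_all(doc, "//a"), "href")

print(title)
print(links)
```

RCurl could stand in for httr2 here; the parsing half stays the same because xml2 only needs the HTML.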

All-In-One Web Scraping Framework

  • rvest or Rcrawler
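
A minimal rvest sketch of the all-in-one approach follows. To keep it reproducible offline, it parses an inline HTML snippet built with `minimal_html()`; the markup, class names, and paths are made up for illustration. With a real site you would start from `read_html()` on the page URL instead.

```r
library(rvest)

# Inline HTML standing in for a fetched page (illustrative markup only).
page <- minimal_html('
  <ul>
    <li class="pkg"><a href="/httr2">httr2</a> <span class="kind">HTTP client</span></li>
    <li class="pkg"><a href="/xml2">xml2</a> <span class="kind">HTML/XML parser</span></li>
  </ul>
')

# Select each list item with a CSS selector.
items <- html_elements(page, "li.pkg")

# html_element() returns exactly one match per item, so the columns stay aligned.
packages <- data.frame(
  name = html_text2(html_element(items, "a")),
  url  = html_attr(html_element(items, "a"), "href"),
  kind = html_text2(html_element(items, ".kind")),
  stringsAsFactors = FALSE
)

print(packages)
```

Selecting the repeating container first and then pulling one field per container is a common way to avoid misaligned columns when some items lack a field.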

Dynamic Web Pages

All-In-One Browser Automation Framework

  • RSelenium
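
For pages that render their content with JavaScript, a sketch along these lines may help. It assumes a Selenium server is already running on localhost:4444 with Firefox available; the host, port, browser, target URL, and the fixed two-second wait are all assumptions for illustration. The rendered HTML is handed to rvest for extraction.

```r
library(RSelenium)
library(rvest)

# Assumption: a Selenium server is already listening on localhost:4444.
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444L,
  browserName = "firefox"
)
remDr$open()

# Hypothetical target URL.
remDr$navigate("https://example.com")

# Naive fixed wait for client-side rendering; real scripts usually poll
# for a specific element instead.
Sys.sleep(2)

# getPageSource() returns a list; the first element is the rendered HTML.
html <- remDr$getPageSource()[[1]]
remDr$close()

# Parse the rendered HTML with rvest, as for a static page.
page <- read_html(html)
headings <- html_text2(html_elements(page, "h1"))
print(headings)
```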

Guides and Tutorials