- [05m] 🏆 Objectives
- [05m] 🤷‍♀️ Why You Should Know This
- [15m] 📖 Overview: Web Scraping
- [20m] 💻 Game: Selector Diner
- [10m] 💻 Demo: Selecting Selectors
- [10m] BREAK
- [15m] 📖 Overview: Colly
- [20m] Activity: Colly Calls Back
- [10m] TT: Advantages and Disadvantages to Using Colly
- [30m] Video: Headless Web Scraping
- [20m] Example Code / Demo
- 📚 Resources & Credits
- Identify the critical steps to collecting data using web scraping techniques.
- Apply selectors to an HTML document to retrieve data.
- Design and create a web scraper that retrieves data from your favorite website!
- All projects need data before launching!
- Available datasets may not meet your needs, or may require additional supporting data from a source on the web.
- Save important data before a website goes offline for archival purposes.
Web scrapers crawl a website, extract its data, transform that data into a usable, structured format, and finally write it to a file or database for subsequent use.
Programs that use this design pattern follow the Extract-Transform-Load (ETL) Process.
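Here's a minimal sketch of the ETL pattern in Go. The `Record` type, the sample markup, and the output filename are hypothetical, chosen just to illustrate the three stages:

```go
package main

import (
	"encoding/json"
	"os"
	"strings"
)

// Record is a hypothetical structured row produced by the Transform step.
type Record struct {
	Title string `json:"title"`
}

// extract stands in for the crawling/downloading step; a real scraper
// would fetch this markup from a live website.
func extract() string {
	return "<h1>Hello</h1><h1>World</h1>"
}

// transform parses the raw markup into structured records.
func transform(raw string) []Record {
	var records []Record
	for _, part := range strings.Split(raw, "<h1>") {
		if title, _, found := strings.Cut(part, "</h1>"); found {
			records = append(records, Record{Title: title})
		}
	}
	return records
}

// load writes the structured data to a file for subsequent use.
func load(records []Record, path string) error {
	data, err := json.MarshalIndent(records, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0644)
}

func main() {
	// Extract -> Transform -> Load
	if err := load(transform(extract()), "records.json"); err != nil {
		panic(err)
	}
}
```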
- Not interchangeable terms!
- Crawlers download and store the contents of large numbers of sites by following the links in pages.
- How Google got famous
- Scrapers are built for the structure of a specific website.
- Use the site's own structure to extract individual, specific data elements.
- Crawling is the first step to web scraping.
Below are the most common selectors used when scraping the web to collect data; a usage sketch follows the table.
Name | Syntax | Description |
---|---|---|
Element | `a` | Any `a` element; likewise `section`, `table`, etc. |
ID | `#home-link` | The first element with `id="home-link"` |
Class | `.blog-post` | Any element with `class="blog-post"` |
Attribute | `a[href]` | Any `a` element that has an `href` attribute |
Pseudo-class | `a:first-child` | Any `a` element that is the first child of its parent |
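To see how these selectors translate into scraper code, here's a minimal sketch using Colly's `OnHTML` (Colly is introduced later in this lesson; the selectors and target URL are hypothetical placeholders):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// ID selector: the element with id="home-link"
	c.OnHTML("#home-link", func(e *colly.HTMLElement) {
		fmt.Println("home link text:", e.Text)
	})

	// Class selector: any element with class="blog-post"
	c.OnHTML(".blog-post", func(e *colly.HTMLElement) {
		fmt.Println("post found:", e.Text)
	})

	// Attribute selector: any <a> element with an href attribute
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		fmt.Println("href:", e.Attr("href"))
	})

	c.Visit("https://example.com/") // hypothetical target site
}
```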
Let's practice selectors now: they're the most important part of writing an awesome web scraper! If a selector isn't correct, nothing will be returned, and your scraper won't collect any data.
Choose the right plates while working the window at the CSS Diner. This fun game will level up your selector skills in preparation for your Web Scraper project.
The instructor will demonstrate how to find and test selectors in Chrome before integrating them into your web scraper.
- Inspect an element > right-click its node in the DOM tree > choose Copy > Copy selector.
- Test selectors using `Ctrl`+`F` in the inspector.
If your selector does not work using these methods, it WILL NOT WORK IN YOUR SCRAPER!
A popular open source package, Colly, provides a clean foundation to write any kind of crawler/scraper/spider. Features include:
- Lots of cool Go language concepts!
- Fast (>1k request/sec on a single core)
- Manages request delays and maximum concurrency per domain (see the sketch after this list)
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Distributed scraping
- Caching
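For instance, request delays and per-domain concurrency are configured with a `LimitRule`; here's a minimal sketch (the domain glob, delay, and target URL are placeholder values):

```go
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	// Async collector so multiple requests can run in parallel
	c := colly.NewCollector(colly.Async(true))

	// At most 2 concurrent requests per matching domain,
	// with a 1-second delay between requests
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("https://example.com/") // placeholder target
	c.Wait()                        // block until all async requests finish
}
```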
Colly works via a series of callbacks that are executed anytime `Visit()` is called on a collector.
Callbacks are functions that execute after another function completes.
Colly supports the following callbacks: `OnRequest`, `OnError`, `OnResponse`, `OnHTML`, `OnXML`, and `OnScraped`. The sample below demonstrates all of them except `OnXML`:
```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

// main() contains code adapted from an example found in Colly's docs:
// http://go-colly.org/docs/examples/basic/
func main() {
	// Instantiate the default collector
	c := colly.NewCollector()

	// Called for every matched element once a page's HTML is received
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// Find the link using an attribute selector:
		// a[href] matches any <a> element that includes href=""
		link := e.Attr("href")

		// Print the link
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)

		// Visit the link
		e.Request.Visit(link)
	})

	// Called before each request is made
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	// Called if an error occurs during a request
	c.OnError(func(_ *colly.Response, err error) {
		fmt.Println("Something went wrong:", err)
	})

	// Called after each response is received
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Visited", r.Request.URL)
	})

	// Called once scraping of a page (including OnHTML callbacks) is complete
	c.OnScraped(func(r *colly.Response) {
		fmt.Println("Finished", r.Request.URL)
	})

	// Start scraping on https://hackerspaces.org
	c.Visit("https://hackerspaces.org/")
}
```
With a partner, use the sample code to determine the order in which these callbacks fire. To examine the output, paste the above snippet into a new project, then build and run your executable.
- Quick to copy and paste an example from the docs and modify it to create your own web scraper.
- Lots of plugins and libraries with good documentation
- Security features allow you to cloak your scraper so it isn't detected
- Can't scrape websites that take advantage of a shadow DOM to render components, since Colly doesn't execute JavaScript
- This means you can't use Colly alone to scrape websites written in Angular, Vue, or React
- chromedp/examples: awesome `chromedp` examples; use these to get started (a minimal sketch follows below)
- droxey/makeshort: example app for creating `make.sc` shortlinks from the command line
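As a taste of the headless approach, here's a minimal `chromedp` sketch (it assumes Chrome is installed locally; the target URL and selector are placeholders):

```go
package main

import (
	"context"
	"fmt"

	"github.com/chromedp/chromedp"
)

func main() {
	// Start a headless Chrome session
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	// Navigate to the page, then read the text of the first <h1>
	// after the browser has executed the page's JavaScript
	var heading string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com/"), // placeholder URL
		chromedp.Text("h1", &heading, chromedp.ByQuery),
	)
	if err != nil {
		panic(err)
	}

	fmt.Println("heading:", heading)
}
```

Because the browser executes the page's JavaScript before the selector runs, this approach works on the client-rendered sites Colly can't handle.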
- ScrapeHero: What is Web Scraping – Part 1 – Beginner’s Guide
- W3C: Selectors
- Colly: Starter code derived from basic example.
- chromedp/examples: various `chromedp` examples
- Chrome DevTools Protocol: Chrome DevTools Protocol Domain documentation