How to Parse HTML With Golang?

Master HTML parsing in Go using Node Parser, Tokenizer, and third-party tools like Goquery, Colly, and Bright Data's Web Scrapers for efficient web scraping.

Prerequisites

A basic understanding of Go and web scraping is helpful. Ensure Go is installed on your machine. Create a new project folder and initialize it:

mkdir goparser
cd goparser
go mod init goparser

Test your setup by creating a main.go file with the following code:

package main

import "fmt"

func main() {
    fmt.Println("Hello, World!")
}

Run the file:

go run main.go

Install the HTML parsing package used in the examples below:

go get golang.org/x/net/html

Extracting Data With Node Parser

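The node parser (html.Parse) reads the whole page into a tree of html.Node values that you can walk recursively. Here's an example that extracts quotes and authors from quotes.toscrape.com: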

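package main

import (
    "fmt"
    "log"
    "net/http"

    "golang.org/x/net/html"
)

func main() {
    resp, err := http.Get("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the whole response into a tree of nodes.
    doc, err := html.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // processNode visits every node depth-first and prints the text of
    // <span class="text"> (quotes) and <small class="author"> (authors).
    var processNode func(*html.Node)
    processNode = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "span" {
            for _, a := range n.Attr {
                // FirstChild is nil for an empty element, so check before use.
                if a.Key == "class" && a.Val == "text" && n.FirstChild != nil {
                    fmt.Println("Quote:", n.FirstChild.Data)
                }
            }
        }
        if n.Type == html.ElementNode && n.Data == "small" {
            for _, a := range n.Attr {
                if a.Key == "class" && a.Val == "author" && n.FirstChild != nil {
                    fmt.Println("Author:", n.FirstChild.Data)
                }
            }
        }
        // Recurse into every child of the current node.
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            processNode(c)
        }
    }
    processNode(doc)
}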
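
The example above compares the class attribute with an exact string, which works on this page because each target element carries a single class. On pages where an element has several classes (for instance class="text highlighted"), that comparison would miss it. Below is a minimal sketch of one way to handle this, assuming class names are separated by whitespace. The helper name hasClass is hypothetical and is not part of golang.org/x/net/html.

```go
package main

import (
    "fmt"
    "strings"
)

// hasClass reports whether the whitespace-separated class attribute value
// contains name as one of its classes.
// Hypothetical helper for illustration; not part of golang.org/x/net/html.
func hasClass(classAttr, name string) bool {
    for _, c := range strings.Fields(classAttr) {
        if c == name {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(hasClass("text", "text"))             // true
    fmt.Println(hasClass("text highlighted", "text")) // true
    fmt.Println(hasClass("subtext", "text"))          // false
}
```

In the traversal code, the check a.Val == "text" could then be written as hasClass(a.Val, "text").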

Extracting Data With Tokenizer

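The tokenizer does not build a tree. It reads the page as a stream of tokens (start tags, end tags, text, and so on), so you keep track of state yourself to know which element the current text belongs to: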

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    resp, err := http.Get("http://quotes.toscrape.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    tokenizer := html.NewTokenizer(resp.Body)

    // Flags that record whether the next text token belongs to a quote or an author.
    inQuote := false
    inAuthor := false

    for {
        tt := tokenizer.Next()
        switch tt {
        case html.ErrorToken:
            // io.EOF marks the normal end of the page; anything else is a real error.
            if err := tokenizer.Err(); err != io.EOF {
                log.Fatal(err)
            }
            return
        case html.StartTagToken:
            t := tokenizer.Token()
            if t.Data == "span" {
                for _, a := range t.Attr {
                    if a.Key == "class" && a.Val == "text" {
                        inQuote = true
                    }
                }
            }
            if t.Data == "small" {
                for _, a := range t.Attr {
                    if a.Key == "class" && a.Val == "author" {
                        inAuthor = true
                    }
                }
            }
        case html.TextToken:
            if inQuote {
                fmt.Println("Quote:", strings.TrimSpace(tokenizer.Token().Data))
                inQuote = false
            }
            if inAuthor {
                fmt.Println("Author:", strings.TrimSpace(tokenizer.Token().Data))
                inAuthor = false
            }
        }
    }
}
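
Both examples above fetch the page with http.Get, which uses Go's default HTTP client. That client has no timeout, and a non-200 response (such as a 404 page) is not reported as an error, so the parser would happily process an error page. The following is a minimal sketch of a more defensive fetch using only the standard library. The names fetch, client, and newTestServer are hypothetical, the 10-second timeout is an arbitrary assumption, and the local test server only stands in for the real site so the sketch runs offline.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
    "net/http/httptest"
    "time"
)

// client is a shared HTTP client with an explicit timeout (assumed value).
var client = &http.Client{Timeout: 10 * time.Second}

// fetch GETs url and returns the body only when the server answers 200 OK.
// Hypothetical helper for illustration.
func fetch(url string) (string, error) {
    resp, err := client.Get(url)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("unexpected status: %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

// newTestServer starts a local server that serves a tiny page at "/"
// and answers 404 for every other path.
func newTestServer() *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path != "/" {
            http.NotFound(w, r)
            return
        }
        fmt.Fprint(w, `<span class="text">Hello</span>`)
    }))
}

func main() {
    srv := newTestServer()
    defer srv.Close()

    body, err := fetch(srv.URL)
    if err != nil {
        fmt.Println("fetch failed:", err)
        return
    }
    fmt.Println(body)

    // A missing page now surfaces as an error instead of being parsed.
    _, err = fetch(srv.URL + "/missing")
    fmt.Println("error for missing page:", err != nil)
}
```

To feed the result into either parser, the returned string can be wrapped with strings.NewReader and passed to html.Parse or html.NewTokenizer in place of resp.Body.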

Third-Party Alternatives

Goquery: A Go library with jQuery-like syntax that supports DOM traversal and CSS selectors.
htmlquery: Similar to Goquery but uses XPath selectors.
Colly: A full-fledged web scraping framework for Go.
Bright Data Web Scraper: An API service for scraping pages and returning data in JSON format.

Conclusion

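Now you know how to parse HTML using Go. Use the node parser when you need the full document tree to traverse, and the tokenizer when you want to stream through a page and pick out specific data without building a tree. Explore the third-party tools above for higher-level features such as CSS and XPath selectors.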
