Web Scraping With SeleniumBase

This guide shows how to simplify web scraping with SeleniumBase, walking through its advanced features step by step. If you are interested in scraping with plain Selenium instead, see our dedicated Selenium web scraping guide.

What Is SeleniumBase?

SeleniumBase is a Python framework for browser automation built on top of the Selenium/WebDriver APIs. It supports tasks ranging from end-to-end testing to web scraping and includes features like CAPTCHA bypassing and bot-detection avoidance.
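
As a quick taste, the minimal sketch below opens a page and prints its title (the target site is the one used later in this guide). SB() downloads a matching browser driver automatically if one is missing:

from seleniumbase import SB

# Minimal sketch: SB() launches the browser and tears it down on exit
with SB(headless=True) as sb:
    sb.open("https://quotes.toscrape.com/")
    print(sb.get_title())  # should print "Quotes to Scrape"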

SeleniumBase vs Selenium: Feature and API Comparison

| Feature | SeleniumBase | Selenium |
| --- | --- | --- |
| Built-in test runners | Integrates with pytest, pynose, and behave | Requires manual setup for test integration |
| Driver management | Auto-downloads the matching browser driver | Manual download and configuration |
| Web automation logic | Combines steps into a single method call | Requires multiple lines of code |
| Selector handling | Auto-detects CSS or XPath selectors | Requires explicit selector types |
| Timeout handling | Applies default timeouts to prevent failures | Fails immediately without explicit timeouts |
| Error outputs | Clean, readable error messages | Verbose, less interpretable error logs |
| Dashboards and reports | Built-in dashboards, reports, and screenshots | No built-in dashboards or reporting |
| Desktop GUI applications | Visual tools for test running | Lacks desktop GUI tools |
| Test recorder | Built-in test recorder | Requires manual script writing |
| Test case management | Provides CasePlans | No built-in test case management |
| Data app support | Includes ChartMaker for data apps | No additional tools for data apps |
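
To make the "single method call" row concrete, here is a sketch of the same form-fill in both APIs (the URL and selector come from the login example later in this guide). SeleniumBase's type() waits for the element, clears it, and fills it in one call, while plain Selenium spells out each step:

# SeleniumBase: one call waits for, clears, and fills the field
from seleniumbase import SB

with SB() as sb:
    sb.open("https://quotes.toscrape.com/login")
    sb.type("#username", "test")

# Plain Selenium: the same action takes several explicit steps
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/login")
field = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "#username"))
)
field.clear()
field.send_keys("test")
driver.quit()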

Using SeleniumBase for Web Scraping: Step-By-Step Guide

Step #1: Project Initialization

Create a project folder and set up a Python virtual environment inside it:

mkdir seleniumbase-scraper
cd seleniumbase-scraper
python -m venv env

Activate the virtual environment:

  • On Linux/macOS: source env/bin/activate
  • On Windows: env\Scripts\activate

Install SeleniumBase:

pip install seleniumbase

Step #2: SeleniumBase Test Setup

Create a scraper.py file and initialize the SB context manager, which starts the browser and shuts it down automatically:

from seleniumbase import SB

with SB() as sb:
    pass

Run the script (the SB() manager reads command-line options such as --headless, even when launched with plain python):

python3 scraper.py --headless

Step #3: Connect to the Target Page

Use open() to navigate the browser to the target site:

sb.open("https://quotes.toscrape.com/")

Step #4: Select the Quote Elements

Each quote card on the page is a .quote node, and find_elements() returns every match:

quote_elements = sb.find_elements(".quote")

Step #5: Scrape Quote Data

Iterate over the quote elements and extract the text, author, and tags from each one. The nested lookups use Selenium's By API, so import it:

from selenium.webdriver.common.by import By

for quote_element in quote_elements:
    # Quote text, with the curly quotation marks stripped
    text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
    text = text_element.text.replace("“", "").replace("”", "")
    # Author name
    author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
    author = author_element.text
    # All tags attached to the quote
    tags = [tag.text for tag in quote_element.find_elements(By.CSS_SELECTOR, ".tag")]

Step #6: Populate the Quotes Array

Initialize an empty list (quotes = []) before the scraping loop, then append a dictionary for each scraped quote:

quotes.append({"text": text, "author": author, "tags": tags})

Step #7: Implement Crawling Logic

The site spreads quotes across several pages, and the "Next →" button is only present while more pages remain. Scrape the current page first, then advance; checking for the button before scraping would skip the last page:

while True:
    # ... scrape the quotes on the current page (Steps #4-#6) ...
    if not sb.is_element_present(".next"):
        break
    sb.click(".next a")

Step #8: Export the Scraped Data

Write the collected quotes to a CSV file, joining each quote's tags with a semicolon so they fit in a single column:

import csv

with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    for quote in quotes:
        writer.writerow({"text": quote["text"], "author": quote["author"], "tags": ";".join(quote["tags"])})
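
If JSON fits your pipeline better, the standard library covers that too. A sketch writing the same quotes list to a hypothetical quotes.json:

import json

with open("quotes.json", mode="w", encoding="utf-8") as file:
    # Tags can stay as real lists here, so no joining is needed
    json.dump(quotes, file, indent=2, ensure_ascii=False)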

Step #9: Put It All Together

Here is the complete scraper:

from seleniumbase import SB
from selenium.webdriver.common.by import By
import csv

with SB() as sb:
    sb.open("https://quotes.toscrape.com/")
    quotes = []
    while True:
        quote_elements = sb.find_elements(".quote")
        for quote_element in quote_elements:
            text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
            text = text_element.text.replace("“", "").replace("”", "")
            author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
            author = author_element.text
            tags = [tag.text for tag in quote_element.find_elements(By.CSS_SELECTOR, ".tag")]
            quotes.append({"text": text, "author": author, "tags": tags})
        # Stop after scraping the last page, which has no "Next" link
        if not sb.is_element_present(".next"):
            break
        sb.click(".next a")
    with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
        writer.writeheader()
        for quote in quotes:
            writer.writerow({"text": quote["text"], "author": quote["author"], "tags": ";".join(quote["tags"])})

Run the scraper:

python3 scraper.py --headless
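
If all goes well, quotes.csv should start with rows similar to the following (taken from the site's first page; the exact content depends on the live site):

text,author,tags
The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.,Albert Einstein,change;deep-thoughts;thinking;world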

Advanced SeleniumBase Scraping Use Cases

Automate Form Filling and Submission

SeleniumBase doubles as a test framework. The test below logs in to the demo page at quotes.toscrape.com/login, which accepts any non-empty credentials, and then verifies the post-login page content:

from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)

class LoginTest(BaseCase):
    def test_submit_login_form(self):
        self.open("https://quotes.toscrape.com/login")
        self.type("#username", "test")
        self.type("#password", "test")
        self.click("input[type=\"submit\"]")
        self.assert_text("Top Ten tags")

Run the test:

pytest login.py
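
If you prefer plain pytest functions over unittest-style classes, SeleniumBase also provides an sb pytest fixture. A sketch of the same test in that style, saved as a hypothetical login_fixture.py:

# Same flow using SeleniumBase's pytest "sb" fixture
def test_submit_login_form(sb):
    sb.open("https://quotes.toscrape.com/login")
    sb.type("#username", "test")
    sb.type("#password", "test")
    sb.click('input[type="submit"]')
    sb.assert_text("Top Ten tags")

Run it the same way: pytest login_fixture.py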

Bypass Simple Anti-Bot Technologies

UC Mode (uc=True) runs the browser in an undetected configuration. uc_open_with_reconnect() opens the page and temporarily disconnects the driver so anti-bot scripts cannot detect it, while uc_gui_click_captcha() clicks the CAPTCHA checkbox through the GUI (it uses PyAutoGUI, so the browser must run headed):

from seleniumbase import SB

with SB(uc=True) as sb:
    url = "https://www.scrapingcourse.com/antibot-challenge"
    sb.uc_open_with_reconnect(url, reconnect_time=4)
    sb.uc_gui_click_captcha()
    sb.save_screenshot("screenshot.png")

Bypass Complex Anti-Bot Technologies

For tougher protections, CDP Mode controls the browser through the Chrome DevTools Protocol while the WebDriver stays disconnected, defeating checks that look for an attached automation driver:

from seleniumbase import SB

with SB(uc=True, test=True) as sb:
    url = "https://gitlab.com/users/sign_in"
    sb.activate_cdp_mode(url)
    sb.uc_gui_click_captcha()
    sb.sleep(2)
    sb.save_screenshot("screenshot.png")

Conclusion

SeleniumBase offers advanced features for web scraping, including UC Mode and CDP Mode for bypassing anti-bot measures. For more robust solutions, consider using cloud-based browsers like Scraping Browser from Bright Data.
