Add basic plugin implementation
s0ph1e committed Jan 14, 2019
1 parent 08517d5 commit fa3d203
Showing 8 changed files with 1,491 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
node_modules
.idea
7 changes: 7 additions & 0 deletions .travis.yml
@@ -0,0 +1,7 @@
language: node_js
sudo: false
node_js:
- '8'
- '9'
- '10'
- '11'
31 changes: 30 additions & 1 deletion README.md
@@ -1,2 +1,31 @@
[![Version](https://img.shields.io/npm/v/website-scraper-puppeteer.svg?style=flat)](https://www.npmjs.org/package/website-scraper-puppeteer)
[![Downloads](https://img.shields.io/npm/dm/website-scraper-puppeteer.svg?style=flat)](https://www.npmjs.org/package/website-scraper-puppeteer)
[![Build Status](https://travis-ci.org/website-scraper/website-scraper-puppeteer.svg?branch=master)](https://travis-ci.org/website-scraper/website-scraper-puppeteer)

# website-scraper-puppeteer
Plugin for website-scraper which returns html for dynamic websites using puppeteer
Plugin for [website-scraper](https://github.com/website-scraper/node-website-scraper) which returns html for dynamic websites using [puppeteer](https://github.com/GoogleChrome/puppeteer)

## Requirements
* nodejs version >= 8
* website-scraper version >= 4

## Installation
```sh
npm install website-scraper website-scraper-puppeteer
```

## Usage
```javascript
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

scrape({
    urls: ['https://www.instagram.com/gopro/'],
    directory: '/path/to/save',
    plugins: [ new PuppeteerPlugin() ]
});
```

## How it works
It starts Chromium in headless mode, opens each page, and waits until the page has loaded.
This is far from ideal, because you will often need to wait until a particular resource has loaded, click a button, or log in. The module does not currently support such functionality.
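
To make the flow above concrete, here is a minimal sketch that drives the same three actions (`beforeStart`, `afterResponse`, `afterFinish`) with a fake browser, so it runs without Chromium or puppeteer installed. `SketchPlugin`, `createFakeBrowser`, and `runLifecycle` are hypothetical names used only for illustration; they are not part of website-scraper or of this plugin. The harness assumes the scraper calls `beforeStart` once, `afterResponse` once per response, and `afterFinish` once at the end.

```javascript
// Hypothetical stand-in for the plugin: the browser launcher is injected,
// so the sketch does not depend on puppeteer.
class SketchPlugin {
    constructor(launchBrowser) {
        this.launchBrowser = launchBrowser;
    }

    apply(registerAction) {
        let browser, page;

        // Start the browser and open one page that is reused for every URL.
        registerAction('beforeStart', async () => {
            browser = await this.launchBrowser();
            page = await browser.newPage();
        });

        // For HTML responses, return the rendered page; otherwise keep the body.
        registerAction('afterResponse', async ({response}) => {
            const contentType = response.headers['content-type'];
            const isHtml = contentType && contentType.split(';')[0] === 'text/html';
            if (!isHtml) {
                return response.body;
            }
            await page.goto(response.request.href);
            return page.content();
        });

        registerAction('afterFinish', () => browser.close());
    }
}

// Fake browser that records calls instead of starting Chromium.
function createFakeBrowser(log) {
    return {
        newPage: async () => ({
            goto: async (url) => { log.push(`goto ${url}`); },
            content: async () => '<html>rendered</html>'
        }),
        close: async () => { log.push('close'); }
    };
}

// Minimal harness that invokes the registered actions in the assumed order.
async function runLifecycle(plugin, responses) {
    const actions = {};
    plugin.apply((name, handler) => { actions[name] = handler; });

    await actions.beforeStart();
    const bodies = [];
    for (const response of responses) {
        bodies.push(await actions.afterResponse({response}));
    }
    await actions.afterFinish();
    return bodies;
}
```

In this sketch only the HTML response triggers a `goto` call on the page; any other response passes through with its original body.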
28 changes: 28 additions & 0 deletions index.js
@@ -0,0 +1,28 @@
const puppeteer = require('puppeteer');

class PuppeteerPlugin {
    apply(registerAction) {
        let browser, page;

        registerAction('beforeStart', async () => {
            browser = await puppeteer.launch();
            page = await browser.newPage();
        });

        registerAction('afterResponse', async ({response}) => {
            const contentType = response.headers['content-type'];
            const isHtml = contentType && contentType.split(';')[0] === 'text/html';
            if (isHtml) {
                const url = response.request.href;
                await page.goto(url);
                return page.content();
            } else {
                return response.body;
            }
        });

        registerAction('afterFinish', () => browser.close());
    }
}

module.exports = PuppeteerPlugin;
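
The `afterResponse` handler compares the media type with an exact, case-sensitive match, so a header such as `Text/HTML ; charset=UTF-8` would be treated as non-HTML even though media types are case-insensitive. Below is a minimal sketch of a more tolerant check; `isHtmlContentType` is a hypothetical helper name and is not part of this commit.

```javascript
// Hypothetical helper: returns true when a Content-Type header value denotes HTML.
function isHtmlContentType(contentType) {
    if (typeof contentType !== 'string') {
        return false;
    }
    // Drop parameters such as "; charset=utf-8", then normalise whitespace and case.
    const mediaType = contentType.split(';')[0].trim().toLowerCase();
    return mediaType === 'text/html';
}
```

As in the committed code, only `text/html` counts as HTML here; other types such as `application/xhtml+xml` would still pass through unrendered.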