added instructions for installing the webcrawler module
Signed-off-by: Maroun Touma <[email protected]>
touma-I committed Nov 15, 2024
1 parent 46b168a commit 190969b
Showing 1 changed file with 20 additions and 1 deletion.
21 changes: 20 additions & 1 deletion transforms/universal/web2parquet/README.md
@@ -7,7 +7,7 @@ This first release of the transform, only accepts the following 4 parameters. Ad

## Parameters

For configuring the crawl, users need to identify the follow parameters:
For configuring the crawl, users need to specify the following parameters:

| parameter:type | Description |
| --- | --- |
@@ -17,6 +17,25 @@ For configuring the crawl, users need to identify the follow parameters:
| folder:str | folder where downloaded files are stored. If the folder is not empty, new files are added or replace the existing ones with the same URLs |


## Install the transform

The transform can be installed directly from PyPI. It depends on data-prep-toolkit and data-prep-connector.

```
# Quote version specifiers and extras so the shell does not treat ">" as a redirect
pip install data-prep-connector
pip install "data-prep-toolkit>=0.2.2.dev2"
pip install "data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3"
```
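
To confirm that the packages were installed, a quick check with the standard library can help. The following is a minimal sketch, not part of the transform itself; it assumes the distribution names are exactly those used in the pip commands above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the pip commands above (assumed to match PyPI).
PACKAGES = [
    "data-prep-connector",
    "data-prep-toolkit",
    "data-prep-toolkit-transform",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means pip did not install that distribution into the current environment.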

If working from a fork of the git repo, run the following from the root folder of the repo:

```
cd transforms/universal/web2parquet
make venv
source venv/bin/activate
pip install -r requirements.txt
```
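
After activating the environment, it can be useful to verify that Python is really running inside it before installing requirements. This is a small sketch using only the standard library; it assumes `make venv` creates a standard venv-style environment, and `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```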

## Invoking the transform from a notebook

To invoke the transform from a notebook, users must enable nested asyncio event loops (see https://pypi.org/project/nest-asyncio/ ), import the transform class, and call the `transform()` function as shown in the example below:
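
As background on why nested event loops need to be enabled, here is a minimal standard-library sketch. It is not the README's own example and does not use the transform class. It assumes a notebook behaves like code that is already running inside an event loop; `fetch_page` and `notebook_cell` are hypothetical stand-ins.

```python
import asyncio


async def fetch_page() -> str:
    # Hypothetical stand-in for asynchronous work such as a crawl.
    await asyncio.sleep(0)
    return "done"


async def notebook_cell() -> str:
    # Simulates a notebook cell: an event loop is already running here.
    coro = fetch_page()
    try:
        asyncio.run(coro)  # starting a second loop from a running loop is refused
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

With the nest-asyncio package installed, calling `nest_asyncio.apply()` before the nested call patches asyncio so that this pattern is allowed, which is presumably why the README asks for it.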