added instructions for installing the webcrawler module

Signed-off-by: Maroun Touma <[email protected]>
IBM · Nov 15, 2024 · 190969b
1 parent 46b168a
Showing 1 changed file with 20 additions and 1 deletion.
diff --git a/transforms/universal/web2parquet/README.md b/transforms/universal/web2parquet/README.md
@@ -7,7 +7,7 @@ This first release of the transform, only accepts the following 4 parameters. Ad
 
 ## Parameters
 
-For configuring the crawl, users need to identify the follow parameters:
+For configuring the crawl, users need to specify the following parameters:
 
 | parameter:type | Description |
 | --- | --- |
@@ -17,6 +17,25 @@ For configuring the crawl, users need to identify the follow parameters:
 | folder:str | folder where downloaded files are stored. If the folder is not empty, new files are added or replace existing ones with the same URLs |
 
 
+## Install the transform
+
+The transform can be installed directly from PyPI and depends on data-prep-toolkit and data-prep-connector.
+
+```
+pip install data-prep-connector
+pip install "data-prep-toolkit>=0.2.2.dev2"
+pip install "data-prep-toolkit-transform[web2parquet]>=0.2.2.dev3"
+```
+
+If working from a fork of the git repo, run the following from the repo's root folder:
+
+```
+cd transforms/universal/web2parquet
+make venv
+source venv/bin/activate
+pip install -r requirements.txt
+```
+
 ## Invoking the transform from a notebook
 
 In order to invoke the transform from a notebook, users must enable nested asyncio event loops (https://pypi.org/project/nest-asyncio/), import the transform class, and call the `transform()` function as shown in the example below: