Crawler transform #797

Merged
merged 26 commits into from
Nov 16, 2024
41bed68
first implementation of web2parquet for crawling/downloading from see…
touma-I Nov 8, 2024
cf516b5
use makefile template
touma-I Nov 11, 2024
acc35cd
complete full implementation and testing with python runtime
touma-I Nov 13, 2024
3e05f30
identified current requirements for web2parquet module
touma-I Nov 13, 2024
5710653
relaxed dependencies
touma-I Nov 13, 2024
80e4ebe
added build target
touma-I Nov 13, 2024
cf20268
Merge branch 'dev' into crawler-transform
touma-I Nov 13, 2024
4dcebb6
added licence block
touma-I Nov 14, 2024
137d92c
Merge branch 'dev' into crawler-transform
touma-I Nov 14, 2024
d2404f4
fix filename issue
touma-I Nov 14, 2024
1e810d0
generate cicd workflow for new transform
touma-I Nov 14, 2024
fcbcc0a
build image only if a Dockerfile is defined
touma-I Nov 14, 2024
b5031c9
Ignore page content as long as we get the right count
touma-I Nov 14, 2024
9ad3d18
rename make.cicd.target
touma-I Nov 15, 2024
c9c9779
updated notebook with example
touma-I Nov 15, 2024
b77bbe9
updated notebook with example
touma-I Nov 15, 2024
8e71177
added readme.md
touma-I Nov 15, 2024
ef7c57d
fix typos
touma-I Nov 15, 2024
8c55ad8
More typos
touma-I Nov 15, 2024
ba4b0a4
more typos
touma-I Nov 15, 2024
6ea2e76
more typos
touma-I Nov 15, 2024
670f381
reference nested asyncio project
touma-I Nov 15, 2024
46b168a
fix typo
touma-I Nov 15, 2024
190969b
added instructions for installing the webcrawler module
touma-I Nov 15, 2024
96e46c7
added the module to the transform package
touma-I Nov 15, 2024
4a59970
added requirements for web2parquet
touma-I Nov 15, 2024
133 changes: 133 additions & 0 deletions .github/workflows/test-universal-web2parquet.yml
@@ -0,0 +1,133 @@
#
# DO NOT EDIT THIS FILE: it is generated from test-transform.template, Edit there and run make to change these files
#
name: Test - transforms/universal/web2parquet

on:
workflow_dispatch:
push:
branches:
- "dev"
- "releases/**"
tags:
- "*"
paths:
- ".make.*"
- "transforms/.make.transforms"
- "transforms/universal/web2parquet/**"
- "data-processing-lib/**"
- "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"
pull_request:
branches:
- "dev"
- "releases/**"
paths:
- ".make.*"
- "transforms/.make.transforms"
- "transforms/universal/web2parquet/**"
- "data-processing-lib/**"
- "!transforms/universal/web2parquet/**/kfp_ray/**" # This is/will be tested in separate workflow
- "!data-processing-lib/**/test/**"
- "!data-processing-lib/**/test-data/**"
- "!**.md"
- "!**/doc/**"
- "!**/images/**"
- "!**.gitignore"

# Taken from https://stackoverflow.com/questions/66335225/how-to-cancel-previous-runs-in-the-pr-when-you-push-new-commitsupdate-the-curre
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
check_if_push_image:
# check whether the Docker images should be pushed to the remote repository
# The images are pushed if it is a merge to dev branch or a new tag is created.
# The latter being part of the release process.
# The images tag is derived from the value of the DOCKER_IMAGE_VERSION variable set in the .make.versions file.
runs-on: ubuntu-22.04
outputs:
publish_images: ${{ steps.version.outputs.publish_images }}
steps:
- id: version
run: |
publish_images='false'
if [[ ${GITHUB_REF} == refs/heads/dev && ${GITHUB_EVENT_NAME} != 'pull_request' && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
if [[ ${GITHUB_REF} == refs/tags/* && ${GITHUB_REPOSITORY} == IBM/data-prep-kit ]] ;
then
publish_images='true'
fi
echo "publish_images=$publish_images" >> "$GITHUB_OUTPUT"
test-src:
runs-on: ubuntu-22.04
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform source in transforms/universal/web2parquet
run: |
if [ -e "transforms/universal/web2parquet/Makefile" ]; then
make -C transforms/universal/web2parquet DOCKER=docker test-src
else
echo "transforms/universal/web2parquet/Makefile not found - source testing disabled for this transform."
fi
test-image:
needs: [check_if_push_image]
runs-on: ubuntu-22.04
timeout-minutes: 120
env:
DOCKER_REGISTRY_USER: ${{ secrets.DOCKER_REGISTRY_USER }}
DOCKER_REGISTRY_KEY: ${{ secrets.DOCKER_REGISTRY_KEY }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Free up space in github runner
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
sudo rm -rf /opt/ghc
sudo rm -rf "/usr/local/share/boost"
sudo rm -rf "$AGENT_TOOLSDIRECTORY"
sudo rm -rf /usr/share/dotnet /opt/ghc /usr/local/lib/android /usr/local/share/powershell /usr/share/swift /usr/lib/jvm /usr/local/.ghcup
sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
df -h
- name: Test transform image in transforms/universal/web2parquet
run: |
if [ -e "transforms/universal/web2parquet/Makefile" ]; then
if [ -d "transforms/universal/web2parquet/spark" ]; then
make -C data-processing-lib/spark DOCKER=docker image
fi
make -C transforms/universal/web2parquet DOCKER=docker test-image
else
echo "transforms/universal/web2parquet/Makefile not found - testing disabled for this transform."
fi
- name: Print space
# Free space as indicated here : https://github.com/actions/runner-images/issues/2840#issuecomment-790492173
run: |
df -h
docker images
- name: Publish images
if: needs.check_if_push_image.outputs.publish_images == 'true'
run: |
if [ -e "transforms/universal/web2parquet/Makefile" ]; then
make -C transforms/universal/web2parquet publish
else
echo "transforms/universal/web2parquet/Makefile not found - publishing disabled for this transform."
fi
2 changes: 1 addition & 1 deletion .make.defaults
Expand Up @@ -475,7 +475,7 @@ endif
.defaults.test-src:: venv
@# Help: Run pytest on the test directory inside the venv
source venv/bin/activate; \
export PYTHONPATH=../src; \
export PYTHONPATH=../src:../: ; \
cd test; $(PYTEST) .

# This is small convenience and the image itself must already be created.
89 changes: 89 additions & 0 deletions transforms/.make.cicd.targets
@@ -0,0 +1,89 @@
# Define the root of the local git clone so that the common rules
# know where they are running from.

# Include a library of common .transform.* targets which most
# transforms should be able to reuse. However, feel free
# to override/redefine the rules below.
include $(REPOROOT)/transforms/.make.transforms

######################################################################
## Default setting for TRANSFORM_RUNTIME uses folder name-- Old layout
TRANSFORM_PYTHON_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).transform
TRANSFORM_RAY_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).ray.transform
TRANSFORM_SPARK_RUNTIME_SRC_FILE=-m dpk_$(TRANSFORM_NAME).spark.transform

venv:: .defaults.create-venv
source venv/bin/activate && $(PIP) install -e $(REPOROOT)/data-processing-lib[ray,spark]
source venv/bin/activate && $(PIP) install -e $(REPOROOT)/data-connector-lib
if [ -e requirements.txt ]; then \
source venv/bin/activate && $(PIP) install -r requirements.txt; \
fi;


test:: .transforms.test-src test-image

clean:: .transforms.clean

## We need to think how we want to do this going forward
set-versions::

## We need to think how we want to do this going forward
build::

image::
@if [ -e Dockerfile ]; then \
$(MAKE) image-default ; \
else \
echo "Skipping image for $(shell pwd) since no Dockerfile is present"; \
fi

publish::
@if [ -e Dockerfile ]; then \
$(MAKE) publish-default ; \
else \
echo "Skipping publish for $(shell pwd) since no Dockerfile is present"; \
fi

publish-image::
@if [ -e Dockerfile ]; then \
$(MAKE) publish-image-default ; \
else \
echo "Skipping publish-image for $(shell pwd) since no Dockerfile is present"; \
fi

test-image::
@if [ -e Dockerfile ]; then \
$(MAKE) test-image-default ; \
else \
echo "Skipping test-image for $(shell pwd) since no Dockerfile is present"; \
fi

test-src:: .transforms.test-src

setup:: .transforms.setup

publish-default:: publish-image

publish-image-default:: .defaults.publish-image

test-image-default:: image .transforms.test-image-help .defaults.test-image-pytest .transforms.clean

build-lib-wheel:
make -C $(REPOROOT)/data-processing-lib build-pkg-dist

image-default:: build-lib-wheel
@$(eval LIB_WHEEL_FILE := $(shell find $(REPOROOT)/data-processing-lib/dist/*.whl))
rm -fr dist && mv $(REPOROOT)/data-processing-lib/dist .
$(eval WHEEL_FILE_NAME := $(shell basename $(LIB_WHEEL_FILE)))
$(DOCKER) build -t $(DOCKER_IMAGE_NAME) $(DOCKER_BUILD_EXTRA_ARGS) \
--platform $(DOCKER_PLATFORM) \
--build-arg EXTRA_INDEX_URL=$(EXTRA_INDEX_URL) \
--build-arg BASE_IMAGE=$(RAY_BASE_IMAGE) \
--build-arg BUILD_DATE=$(shell date -u +'%Y-%m-%dT%H:%M:%SZ') \
--build-arg WHEEL_FILE_NAME=$(WHEEL_FILE_NAME) \
--build-arg TRANSFORM_NAME=$(TRANSFORM_NAME) \
--build-arg GIT_COMMIT=$(shell git log -1 --format=%h) .
$(DOCKER) tag $(DOCKER_LOCAL_IMAGE) $(DOCKER_REMOTE_IMAGE)
rm -fr dist


7 changes: 7 additions & 0 deletions transforms/Makefile
Expand Up @@ -114,6 +114,7 @@ build-pkg-dist:
fi \
done
# Only needs to build the whl
git show --no-patch > src/data/gitshow.txt
$(MAKE) BUILD_WHEEL_EXTRA_ARG=-w .defaults.build-dist
-rm -fr src

Expand All @@ -131,3 +132,9 @@ test-pkg-dist:

publish-dist :: .defaults.publish-dist

publish-testpypi:
## when installing from testpypi, make sure you install the dependencies first (pip install data-prep-toolkit)
## and then use --extra-index-url to install this package:
## pip install --extra-index-url https://test.pypi.org/simple/ 'data-prep-toolkit-transforms[all]==x.x.x.devx'
twine upload --repository testpypi dist/*

23 changes: 12 additions & 11 deletions transforms/README-list.md
Expand Up @@ -23,18 +23,19 @@ Note: This list includes the transforms that were part of the release starting w
* [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
* [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
* language
* [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/doc_chunk/python/README.md)
* [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/doc_quality/python/README.md)
* [lang_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/lang_id/python/README.md)
* [pdf2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/pdf2parquet/python/README.md)
* [text_encoder](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/text_encoder/python/README.md)
* [pii_redactor](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/pii_redactor/python/README.md)
* [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/python/README.md)
* [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/python/README.md)
* [lang_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/lang_id/python/README.md)
* [pdf2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md)
* [text_encoder](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/python/README.md)
* [pii_redactor](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pii_redactor/python/README.md)
* universal
* [ededup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/ededup/python/README.md)
* [filter](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/filter/python/README.md)
* [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/resize/python/README.md)
* [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/tokenization/doc_chunk/python/README.md)
* [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/doc_id/python/README.md)
* [ededup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/python/README.md)
* [filter](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/filter/python/README.md)
* [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/resize/python/README.md)
* [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
* [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
* [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)



15 changes: 13 additions & 2 deletions transforms/pyproject.toml
@@ -1,6 +1,6 @@
[project]
name = "data_prep_toolkit_transforms"
version = "0.2.2.dev2"
version = "0.2.2.dev3"
requires-python = ">=3.10,<3.13"
keywords = ["transforms", "data preprocessing", "data preparation", "llm", "generative", "ai", "fine-tuning", "llmapps" ]
description = "Data Preparation Toolkit Transforms using Ray"
Expand Down Expand Up @@ -47,7 +47,8 @@ all = { file = [
"universal/profiler/python/requirements.txt",
"universal/doc_id/python/requirements.txt",
"universal/filter/python/requirements.txt",
"universal/resize/python/requirements.txt"
"universal/resize/python/requirements.txt",
"universal/web2parquet/requirements.txt"
]}

# pyproject.toml must be in a parent and cannot be in sibling
Expand All @@ -74,10 +75,20 @@ profiler = { file = ["universal/profiler/python/requirements.txt"]}
doc_id = { file = ["universal/doc_id/python/requirements.txt"]}
filter = { file = ["universal/filter/python/requirements.txt"]}
resize = { file = ["universal/resize/python/requirements.txt"]}
web2parquet = { file = ["universal/web2parquet/requirements.txt"]}

# Does not seem to work for our custom layout
# copy all files to a single src and let automatic discovery find them

[tool.setuptools.package-data]
"*" = ["*.txt"]

[tool.setuptools.packages.find]
where = ["src"]

#[tool.setuptools.package-dir]
#dpk_web2parquet = "universal/web2parquet/dpk_web2parquet"

[options]
package_dir = ["src","test"]

23 changes: 23 additions & 0 deletions transforms/universal/web2parquet/Makefile
@@ -0,0 +1,23 @@
REPOROOT=../../..
# Use make help, to see the available rules
include $(REPOROOT)/transforms/.make.cicd.targets

#
# This is intended to be included across the Makefiles provided within
# a given transform's directory tree, so must use compatible syntax.
#
################################################################################
# This defines the name of the transform and is used to match against
# expected files and is used to define the transform's image name.
TRANSFORM_NAME=$(shell basename `pwd`)

################################################################################
# This defines the transforms' version number as would be used
# when publishing the wheel. In general, only the micro version
# number should be advanced relative to the DPK_VERSION.
#
# If you change the versions numbers, be sure to run "make set-versions" to
# update version numbers across the transform (e.g., pyproject.toml).
#TRANSFORM_VERSION=$(DPK_VERSION)


52 changes: 52 additions & 0 deletions transforms/universal/web2parquet/README.md
@@ -0,0 +1,52 @@
# Web Crawler to Parquet

This transform crawls the web and downloads files in real-time.

This first release of the transform accepts only the following 4 parameters. Later releases will extend the functionality to let the user specify additional constraints such as mime-type, domain-focus, etc.


## Parameters

For configuring the crawl, users need to specify the following parameters:

| parameter:type | Description |
| --- | --- |
| urls:list | list of seed URLs (e.g., ['https://thealliance.ai'] or ['https://www.apache.org/projects','https://www.apache.org/foundation']). The list can include any number of valid URLs that are not configured to block web crawlers |
| depth:int | controls the crawling depth |
| downloads:int | maximum number of downloads stored in the download folder. Because the crawler operates asynchronously, any `downloads` of the visited URLs may be retrieved (e.g., with `downloads=10`, any 10 of them), so consecutive runs can download different files |
| folder:str | folder where downloaded files are stored. If the folder is not empty, new files are added or replace the existing ones with the same URLs |
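
To illustrate how `depth` and `downloads` bound a crawl, here is a minimal sketch over an in-memory link graph. It is not the transform's implementation: `bounded_crawl`, the `links` dictionary, and the `example.org` URLs are hypothetical, and treating the seed URLs as depth 0 is an assumption. Unlike the real asynchronous crawler, this walk is deterministic.

```python
from collections import deque


def bounded_crawl(seeds, links, depth, downloads):
    """Breadth-first walk over an in-memory link graph (hypothetical helper).

    `links` maps a URL to the URLs it references, standing in for real fetching.
    Seeds are assumed to sit at depth 0; pages deeper than `depth` are skipped.
    The walk stops once `downloads` pages have been collected.
    """
    fetched = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue and len(fetched) < downloads:
        url, level = queue.popleft()
        fetched.append(url)
        if level == depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return fetched


# Hypothetical site layout used only for illustration.
site = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/c"],
    "https://example.org/c": ["https://example.org/d"],
}
print(bounded_crawl(["https://example.org/"], site, depth=2, downloads=10))
print(bounded_crawl(["https://example.org/"], site, depth=2, downloads=2))
```

With `depth=2` the page `https://example.org/d` is never reached, and with `downloads=2` the walk stops after two pages even though more are reachable.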


## Install the transform

The transform can be installed directly from PyPI. It depends on data-prep-toolkit and data-prep-connector:

```
pip install data-prep-connector
pip install 'data-prep-toolkit>=0.2.2.dev2'
pip install 'data-prep-toolkit-transforms[web2parquet]>=0.2.2.dev3'
```

If working from a fork in the git repo, from the root folder of the git repo, do the following:

```
cd transforms/universal/web2parquet
make venv
source venv/bin/activate
pip install -r requirements.txt
```
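
After either install path, a quick check that the module can be found may save a confusing import error later. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper, and `dpk_web2parquet` is the module name used in the notebook example in this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True when a top-level module can be located on the current path."""
    return importlib.util.find_spec(module_name) is not None


# Prints True only if the transform was installed into the active environment.
print("dpk_web2parquet importable:", is_installed("dpk_web2parquet"))
```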

## Invoking the transform from a notebook

To invoke the transform from a notebook, users must enable nested asyncio event loops (https://pypi.org/project/nest-asyncio/), import the transform class, and call the `transform()` function as shown in the example below:


```
import nest_asyncio
nest_asyncio.apply()
from dpk_web2parquet.transform import Web2Parquet
Web2Parquet(urls= ['https://thealliance.ai/'],
depth=2,
downloads=10,
folder='downloads').transform()
```
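
The need for `nest_asyncio` can be reproduced with the standard library alone. The sketch below assumes the crawler starts its own event loop (the README only states that nested asyncio must be enabled); `crawl_stub` and `notebook_cell` are hypothetical stand-ins. A notebook already runs an event loop, so starting a second one with `asyncio.run()` fails unless nesting is patched in.

```python
import asyncio


async def crawl_stub():
    # Hypothetical stand-in for asynchronous crawl work.
    return "done"


async def notebook_cell():
    # Simulates code executing inside an already-running loop, as in Jupyter.
    coro = crawl_stub()
    try:
        asyncio.run(coro)  # tries to start a second loop inside the first
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return str(exc)
    return "no error"


message = asyncio.run(notebook_cell())
print(message)
```

Calling `nest_asyncio.apply()` first, as in the example above, patches asyncio so that such nested calls are allowed.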