Merge pull request #416 from dolfim-ibm/docling-pdf2md
Add pdf2parquet transform
Showing 51 changed files with 2,131 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
REPOROOT=../../.. | ||
# Use make help, to see the available rules | ||
include $(REPOROOT)/.make.defaults | ||
|
||
setup:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
clean:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
build:: | ||
@# Help: Recursively make $@ in subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
venv:: | ||
@# Help: Recursively make $@ in subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
image:: | ||
@# Help: Recursively make $@ in all subdirs | ||
@$(MAKE) RULE=$@ .recurse | ||
|
||
set-versions: | ||
@# Help: Recursively $@ in all subdirs | ||
@$(MAKE) RULE=$@ .recurse | ||
|
||
publish:: | ||
@# Help: Recursively make $@ in all subdirs | ||
@$(MAKE) RULE=$@ .recurse | ||
|
||
test-image:: | ||
@# Help: Recursively make $@ in all subdirs | ||
@$(MAKE) RULE=$@ .recurse | ||
|
||
test:: | ||
@# Help: Recursively make $@ in all subdirs | ||
@$(MAKE) RULE=$@ .recurse | ||
|
||
test-src:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
kind-load-image:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
docker-load-image:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
docker-save-image:: | ||
@# Help: Recursively make $@ in all subdirs | ||
$(MAKE) RULE=$@ .recurse | ||
|
||
.PHONY: workflow-venv | ||
workflow-venv: | ||
$(MAKE) -C kfp_ray workflow-venv | ||
|
||
.PHONY: workflow-test | ||
workflow-test: | ||
$(MAKE) -C kfp_ray workflow-test | ||
|
||
.PHONY: workflow-upload | ||
workflow-upload: | ||
$(MAKE) -C kfp_ray workflow-upload | ||
|
||
.PHONY: workflow-build | ||
workflow-build: | ||
$(MAKE) -C kfp_ray workflow-build |
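Review note: every double-colon rule in this Makefile delegates to a `.recurse` target that is not part of this diff (presumably it comes from the included `.make.defaults`). As a rough sketch of the idea only, and not the repository's actual implementation, the delegation could look like the shell function below. The name `recurse_rule` and the `MAKE` override are assumptions made for illustration.

```shell
# Hypothetical stand-in for the .recurse rule (the real one is not shown
# in this diff): run one rule in each immediate subdirectory that has
# its own Makefile, stopping at the first failure.
recurse_rule() {
    rule=$1
    # Allow the make binary to be overridden, as $(MAKE) does in a Makefile.
    make_cmd=${MAKE:-make}
    for dir in */; do
        # Skip subdirectories without a Makefile (and the unmatched glob).
        [ -f "${dir}Makefile" ] || continue
        "$make_cmd" -C "$dir" "$rule" || return 1
    done
}
```

Under these assumptions, calling `recurse_rule build` from the transform directory would run `make -C <subdir>/ build` for each runtime subdirectory that contains a Makefile.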
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# PDF2PARQUET Transform | ||
|
||
|
||
The PDF2PARQUET transform iterates through PDF files, or zip archives of PDF files, and generates Parquet files | ||
containing the converted documents in Markdown format. | ||
|
||
PDF conversion uses the [Docling package](https://github.com/DS4SD/docling). | ||
|
||
The following runtimes are available: | ||
|
||
* [python](python/README.md) - provides the base Python-based transformation | ||
implementation. | ||
* [ray](ray/README.md) - enables running the base Python transformation | ||
in a Ray runtime. | ||
* [kfp](kfp_ray/README.md) - enables running the Ray Docker image | ||
in a Kubernetes cluster using a generated `yaml` file. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
REPOROOT=${CURDIR}/../../../../ | ||
WORKFLOW_VENV_ACTIVATE=${REPOROOT}/transforms/venv/bin/activate | ||
include $(REPOROOT)/transforms/.make.workflows | ||
|
||
SRC_DIR=${CURDIR}/../ray/ | ||
|
||
PYTHON_WF := $(shell find ./ -name '*_wf.py') | ||
YAML_WF := $(patsubst %.py, %.yaml, ${PYTHON_WF}) | ||
|
||
workflow-venv: .check_python_version ${WORKFLOW_VENV_ACTIVATE} | ||
|
||
.PHONY: clean | ||
clean: | ||
@# Help: Clean up the virtual environment. | ||
rm -rf ${REPOROOT}/transforms/venv | ||
|
||
venv:: | ||
|
||
build:: | ||
|
||
setup:: | ||
|
||
test:: | ||
|
||
test-src:: | ||
|
||
test-image:: | ||
|
||
publish:: | ||
|
||
image:: | ||
|
||
kind-load-image:: | ||
|
||
docker-load-image:: | ||
|
||
docker-save-image:: | ||
|
||
.PHONY: workflow-build | ||
workflow-build: workflow-venv | ||
$(MAKE) $(YAML_WF) | ||
|
||
.PHONY: workflow-test | ||
workflow-test: workflow-build | ||
$(MAKE) .workflows.test-pipeline TRANSFORM_SRC=${SRC_DIR} PIPELINE_FILE=noop_wf.yaml | ||
|
||
.PHONY: workflow-upload | ||
workflow-upload: workflow-build | ||
@for file in $(YAML_WF); do \ | ||
$(MAKE) .workflows.upload-pipeline PIPELINE_FILE=$$file; \ | ||
done |
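Review note: `PYTHON_WF` and `YAML_WF` above pair each `*_wf.py` pipeline source with a `.yaml` target of the same base name, and `workflow-build` then makes those targets. A minimal shell sketch of that mapping follows, in case it helps when checking which files will be built. The function names `wf_yaml_name` and `list_yaml_targets` are hypothetical and do not exist in the repository.

```shell
# Hypothetical helper mirroring $(patsubst %.py, %.yaml, ...) for a single
# path. Unlike patsubst, it assumes the argument really ends in ".py".
wf_yaml_name() {
    printf '%s\n' "${1%.py}.yaml"
}

# Print the YAML target for every *_wf.py file under a directory,
# the way the find + patsubst pair in the Makefile derives YAML_WF.
list_yaml_targets() {
    find "${1:-.}" -name '*_wf.py' | while IFS= read -r py; do
        wf_yaml_name "$py"
    done
}
```

Running `list_yaml_targets .` in the `kfp_ray` directory would list the pipeline definitions that `make workflow-build` is expected to produce, assuming the compile rule in `.make.workflows` (not shown here) writes each YAML file next to its Python source.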
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# PDF2PARQUET Ray-based KubeFlow Pipeline Transformation | ||
|
||
|
||
## Summary | ||
This project allows execution of the [pdf2parquet Ray transform](../ray) as a | ||
[KubeFlow Pipeline](https://www.kubeflow.org/docs/components/pipelines/overview/). | ||
|
||
The pipeline is described in detail in the [Simplest Transform pipeline tutorial](../../../../kfp/doc/simple_transform_pipeline.md). | ||
|
||
## Compilation | ||
|
||
To compile the pipeline definitions, run | ||
```shell | ||
make workflow-build | ||
``` | ||
from this directory. This creates a virtual environment (`make workflow-venv`) and then compiles the pipeline | ||
definitions in the folder. The virtual environment is created once and shared by all transforms. | ||
|
||
Note: the pipeline definitions can be compiled and executed on KFPv1 and KFPv2. KFPv1 is currently the default. If you | ||
prefer KFPv2, run the following: | ||
```shell | ||
make clean | ||
export KFPv2=1 | ||
make workflow-build | ||
``` | ||
|
||
The next steps are described in [Deploying a pipeline](../../../../kfp/doc/simple_transform_pipeline.md#deploying-a-pipeline-) | ||
and [Executing pipeline and watching execution results](../../../../kfp/doc/simple_transform_pipeline.md#executing-pipeline-and-watching-execution-results-) |