diff --git a/.gitignore b/.gitignore
index adfb5ea..62a3f1c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -8,6 +8,7 @@ test_out
 # Environments
 .env*
 .venv*
+**/.env/
 env/
 venv/
 ENV/
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..e8ea130
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,22 @@
+# Genalog Changelog
+All notable changes to this project will be documented in this file.
+
+Types of changes
+1. `Added` for new features.
+1. `Changed` for changes in existing functionality.
+1. `Deprecated` for soon-to-be removed features.
+1. `Removed` for now removed features.
+1. `Fixed` for any bug fixes.
+1. `Security` in case of vulnerabilities.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and we adopt the [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [v0.1.0] - 2021-07-19
+### Added
+- Initial package release:
+    - 3 standard HTML document template for generation
+    - basic image degradation effects including blur, bleed-through, salt & pepper and other morphological operations.
+    - 2 flavors of text alignment algorithm: Needleman-Wunsch (shorter text segments) and RETAS (longer text segments)
+    - Full e2e NER-OCR label generation notebooks
+    - See [documentation](https://microsoft.github.io/genalog/installation.html) for more on the initial features of the package.
diff --git a/README.md b/README.md
index 8da32b7..cd35251 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,8 @@
 # Genalog - Synthetic Data Generator
 
-[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
+[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg) [![docs link](https://img.shields.io/badge/docs-jupyter--book-brightgreen)](https://microsoft.github.io/genalog/)
 
-`Genalog` is an open source, cross-platform python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
+Genalog is an open source, cross-platform python package for **gen**erating document images with synthetic noise that mimics scanned an**alog** documents (thus the name `genalog`). You can also add various text degradations to these images. The purpose of this tool is to provide a fast and efficient way to generate synthetic documents from text data by leveraging layout from templates that you create in simple HTML format.
 
 Overview
 -------------------------------------
@@ -15,11 +15,31 @@ Genalog has various capabilities:
 
 The aim of this project is to provide a complete solution for generating synthetic images from any text data rich in natural language and to imitate most of OCR noises founded in scanned text documents.
 
+Please refer to our [Genalog documentation](https://microsoft.github.io/genalog) for more tutorials.
+
+## Installation
+See the [Genalog install guide](https://microsoft.github.io/genalog/installation.html) for more details.
+
+To install the latest release:
+
+`pip install genalog`
+
+### Extra Installation Steps in MacOs and Windows
+We have a dependency on [`Weasyprint`](https://weasyprint.readthedocs.io/en/stable/install.html), which in turn has non-python dependencies including `Pango`, `cairo` and `GDK-PixBuf` that need to be installed separately.
+
+So far, `Pango`, `cairo` and `GDK-PixBuf` libraries are available in `Ubuntu-18.04` and later by default.
+
+If you are running on Windows, MacOS, or other Linux distributions, please see [installation instructions from WeasyPrint](https://weasyprint.readthedocs.io/en/stable/install.html).
+
+**NOTE**: If you encounter the errors like `no library called "libcairo-2" was found`, this is probably due to the three extra dependencies missing.
+
 ## Getting Started
+
 The following is a summary of the common applications scenarios of Genalog. Please refer the [Jupyter notebook examples](https://github.com/microsoft/genalog/blob/master/example) that make use of the core code base of Genalog and repository utilities.
 
 ### TLDR
 If you are interested in a full document generation and degration pipeline, please see the following notebook:
+
 ||Description|Indepth Jupyter Notebook Examples|
 |-|-------------------------|--------|
 |1|Analog Document Generation Pipeline|[Demo Notebook](https://github.com/microsoft/genalog/blob/master/example/generation_pipeline.ipynb)|[Here is guide to the core components](https://github.com/microsoft/genalog/blob/master/genalog/README.md)|
@@ -28,7 +48,7 @@ If you are interested in a full document generation and degration pipeline, plea
 Else we have in-depth walkthroughs of each of the module in Genalog.

- +

 ||Steps|Indepth Jupyter Notebook Examples|Quick Start Guides|
@@ -42,37 +62,13 @@ Else we have in-depth walkthroughs of each of the module in Genalog.
 We also provide notebooks for the complete end-to-end scenario of generating a synthetic dataset connecting all the components of genalog:

- +

 ||Scenario|Indepth Jupyter Notebook|
 |-|-------------------------|--------|
 |1|Synthetic Dataset Generation with LABELED NER Dataset|[Demo Notebook](https://github.com/microsoft/genalog/blob/master/example/dataset_generation.ipynb)|
-Installation
------------------------------
-We are currently in a pre-release stage. Stable release is currently pushed to the [TestPyPI](https://test.pypi.org/project/genalog/).
-
-`pip install -i https://test.pypi.org/simple/ genalog --extra-index-url https://pypi.org/simple`
-
-### Extra Installation Steps in MacOs and Windows
-We have a dependency on [`Weasyprint`](https://weasyprint.readthedocs.io/en/stable/install.html), which in turn has non-python dependencies including `Pango`, `cairo` and `GDK-PixBuf` that need to be installed separately.
-
-So far, `Pango`, `cairo` and `GDK-PixBuf` libraries are available in `Ubuntu-18.04` and later by default.
-
-If you are running on Windows, MacOS, or other Linux distributions, please see [installation instructions from WeasyPrint](https://weasyprint.readthedocs.io/en/stable/install.html).
-
-**NOTE**: If you encounter the errors like `no library called "libcairo-2" was found`, this is probably due to the three extra dependencies missing.
-
-### Installation from Source:
-
-1. Create and activate the virtual environment you want to install the package:
-    1. `python -m venv .env`
-    1. `pip install --upgrade pip setuptools`
-    1. `source .env/bin/activate` or on Windows `.env/Scripts/activate.bat`
-1. `git clone https://github.com/microsoft/genalog.git`
-1. `cd genalog`
-1. `pip install -e .`
 ### Other Requirements:
diff --git a/RELEASE.md b/RELEASE.md
new file mode 100644
index 0000000..75c8792
--- /dev/null
+++ b/RELEASE.md
@@ -0,0 +1,28 @@
+# Toucan Release Procedure
+
+Checklist for the release process of `genalog`:
+
+### Preparation
+- [x] Ensure `main` branch contains all relevant changes and PRs relating to the specific release is merged
+- [x] Create and switch to a new release branch (i.e. release-X.Y.Z)
+
+### Package Metadata Update
+- [x] Update VERSION.txt with version bump. Please reference [Semantic Versioning](https://semver.org/).
+- [x] Update [CHANGELOG.md](./CHANGELOG.md)
+- [x] Commit the above changes with title "Release vX.Y.Z"
+- [x] Generate a new git tag for the new version (e.g. `git tag -a v0.1.0 -m "Initial Release"`)
+- [x] Push the new tag to remote `git push origin v0.1.0`
+- [x] Create a new PR with the above changes into `main` branch.
+
+### Release to PyPI
+- [x] Manually trigger the [release pipeline](https://dev.azure.com/genalog-dev/genalog/_build?definitionId=2) in DevOps on the release branch, this will publish latest version of `genalog` to PyPI.
+    - [x] Select `releaseType` to `Test` to test out the release in [TestPyPI](https://test.pypi.org/project/genalog/)
+    - [x] Rerun and switch `releaseType` to production if looks good.
+- [x] If the pipeline ran successfully, check and publish the draft of this release on [Github Release](https://github.com/microsoft/genalog/releases)
+- [x] Latest version is pip-installable with:
+    - `pip install genalog`
+
+### Update Documentation on Github Page
+- [x] Staying on the release branch, `cd docs && pip install -r requirements-doc.txt`
+- [x] Build the jupyter-book with `jupyter-book build --all genalog_docs`
+- [x] Preview the HTML files, if looks good [publish to Github Page](https://jupyterbook.org/start/publish.html#publish-your-book-online-with-github-pages): `ghp-import -n -p -f genalog_docs/_build/html`
diff --git a/VERSION.txt b/VERSION.txt
index 12f8116..5d2c174 100644
--- a/VERSION.txt
+++ b/VERSION.txt
@@ -1 +1 @@
-0.0.1-alpha3
\ No newline at end of file
+0.1.0-rc5
\ No newline at end of file
diff --git a/devops/release.yml b/devops/release.yml
index c962239..2d93f7d 100644
--- a/devops/release.yml
+++ b/devops/release.yml
@@ -35,8 +35,9 @@ steps:
     pip install --upgrade pip
     pip install setuptools wheel
     python setup.py bdist_wheel --dist-dir dist
+    python setup.py sdist --dist-dir dist
   workingDirectory: $(Build.SourcesDirectory)
-  displayName: 'Building wheel package'
+  displayName: 'Building wheel package & sdist'
 
 - bash: |
     pip install twine
@@ -47,9 +48,31 @@ steps:
   inputs:
     pythonUploadServiceConnection: testpypi
   condition: ${{eq(parameters.releaseType, 'Test')}}
-  displayName: 'Twine Authentication for ${{parameters.releaseType}}'
+  displayName: 'Twine Authentication for Test'
+
+- task: TwineAuthenticate@1
+  inputs:
+    pythonUploadServiceConnection: pypi
+  condition: ${{eq(parameters.releaseType, 'Production')}}
+  displayName: 'Twine Authentication for Production'
 
 - bash: |
-    twine upload --verbose -r genalog --config-file $(PYPIRC_PATH) dist/*
+    twine upload --verbose -r genalog --config-file $(PYPIRC_PATH) dist/*.whl
   workingDirectory: $(Build.SourcesDirectory)
-  displayName: 'Uploading wheel package to ${{parameters.releaseType}} PyPI'
\ No newline at end of file
+  displayName: 'Uploading Wheel to ${{parameters.releaseType}} PyPI'
+
+- task: GitHubRelease@1
+  inputs:
+    gitHubConnection: 'github.com_laserprec'
+    repositoryName: 'microsoft/genalog'
+    action: 'create'
+    target: '$(Build.SourceVersion)'
+    tagSource: 'gitTag'
+    tagPattern: 'v.*'
+    releaseNotesFilePath: 'CHANGELOG.md'
+    assets: '$(Build.SourcesDirectory)/dist/*'
+    isDraft: true
+    changeLogCompareToRelease: 'lastFullRelease'
+    changeLogType: 'commitBased'
+  condition: ${{eq(parameters.releaseType, 'Test')}}
+  displayName: 'Prepare GitHub Release (Draft)'
\ No newline at end of file
diff --git a/docs/genalog_docs/index.md b/docs/genalog_docs/index.md
index 0fbea5a..38d5043 100644
--- a/docs/genalog_docs/index.md
+++ b/docs/genalog_docs/index.md
@@ -1,6 +1,6 @@
 # Synthetic Document Generator
 
-[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)
+[![Build Status](https://dev.azure.com/genalog-dev/genalog/_apis/build/status/Nightly-Build?branchName=main)](https://dev.azure.com/genalog-dev/genalog/_build/latest?definitionId=4&branchName=main) ![Azure DevOps tests (compact)](https://img.shields.io/azure-devops/tests/genalog-dev/genalog/4?compact_message) ![Azure DevOps coverage (main)](https://img.shields.io/azure-devops/coverage/genalog-dev/genalog/4/main) ![Python Versions](https://img.shields.io/badge/py-3.6%20%7C%203.7%20%7C%203.8%20-blue) ![Supported OSs](https://img.shields.io/badge/platform-%20linux--64%20-red) ![MIT license](https://img.shields.io/badge/License-MIT-blue.svg) [![docs link](https://img.shields.io/badge/docs-jupyter--book-brightgreen)](https://microsoft.github.io/genalog/)
 
 ````{margin}
 ```sh
diff --git a/tests/e2e/test_ocr_e2e.py b/tests/e2e/test_ocr_e2e.py
index b93c3ff..7f01dd2 100644
--- a/tests/e2e/test_ocr_e2e.py
+++ b/tests/e2e/test_ocr_e2e.py
@@ -48,9 +48,13 @@ def test_upload_images(self, use_async):
         ), f"folder {dst_folder} was not deleted"
 
 
+@pytest.mark.skip(reason=(
+    "Flaky test. Going to deprecate the ocr module in favor of the official python SDK:\n"
+    "https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/client-library?tabs=visual-studio&pivots=programming-language-python"  # noqa:E501
+))
 @pytest.mark.azure
 class TestGROKe2e:
-    @pytest.mark.parametrize("use_async", [False, True])
+    @pytest.mark.parametrize("use_async", [False])
     def test_grok_e2e(self, tmpdir, use_async):
         grok = Grok.create_from_env_var()
         src_folder = "tests/unit/ocr/data/img"
diff --git a/tox.ini b/tox.ini
index 77c3b17..d3dbc53 100644
--- a/tox.ini
+++ b/tox.ini
@@ -57,6 +57,6 @@ application-import-names=genalog, tests
 # Native flake8 configs
 max-line-length = 140
 exclude =
-    build, dist
+    build, dist, docs
     .env*,.venv* # local virtual environments
     .tox