This is a simple program for OCR and Machine Translation of PDF documents.
The backend uses ImageMagick for image processing, Tesseract for OCR, and DeepL for translation. The frontend is powered by Streamlit.
[Figure: side-by-side example showing the original document, the OCR text in the original language, and the translated text]
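As a rough illustration of the image-processing and OCR stages, the sketch below shells out to the ImageMagick and Tesseract command-line tools from Python. It is not the code in this repo: the function names, the `page-%03d.png` naming pattern, and the 300 DPI default are assumptions made for the example, and it assumes `convert` (ImageMagick 6; ImageMagick 7 installs use `magick`) and `tesseract` are on your `PATH`.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def rasterize_command(pdf_path: str, out_dir: str, dpi: int = 300) -> list[str]:
    """Build an ImageMagick command that renders each PDF page to a PNG."""
    # Hypothetical output pattern; %03d is replaced by the page index.
    pattern = str(Path(out_dir) / "page-%03d.png")
    # "convert" is the ImageMagick 6 entry point (assumption about the install).
    return ["convert", "-density", str(dpi), pdf_path, pattern]


def ocr_command(image_path: str, lang: str = "eng") -> list[str]:
    """Build a Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path: str, out_dir: str, lang: str = "eng") -> str:
    """Rasterize a PDF, OCR each page in order, and join the page texts."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf_path, out_dir), check=True)
    pages = sorted(Path(out_dir).glob("page-*.png"))
    texts = [
        subprocess.run(
            ocr_command(str(page), lang),
            check=True,
            capture_output=True,
            text=True,
        ).stdout
        for page in pages
    ]
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf", "pages", lang="fra")` would OCR a French document, provided the matching Tesseract language data is installed.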
DeepL is used for translation, and an authentication key is needed to access their API. You can get a key for free, but registration is required.
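To show where the authentication key fits, here is a minimal sketch of a request to DeepL's free-tier REST endpoint using only the Python standard library. It is an illustration, not this app's own code: the `DEEPL_AUTH_KEY` environment variable and the function names are hypothetical, and the sketch assumes a free-plan key (paid plans use a different host).

```python
import json
import os
import urllib.request

# Free-plan endpoint; paid plans use api.deepl.com instead.
DEEPL_FREE_URL = "https://api-free.deepl.com/v2/translate"


def build_translate_request(text: str, target_lang: str, auth_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for DeepL's translate endpoint."""
    body = json.dumps({"text": [text], "target_lang": target_lang}).encode("utf-8")
    return urllib.request.Request(
        DEEPL_FREE_URL,
        data=body,
        headers={
            # The key is passed in the Authorization header.
            "Authorization": f"DeepL-Auth-Key {auth_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def translate(text: str, target_lang: str = "EN-US", auth_key: str = "") -> str:
    """Send the request and return the first translation in the response."""
    # DEEPL_AUTH_KEY is a hypothetical variable name chosen for this sketch.
    key = auth_key or os.environ["DEEPL_AUTH_KEY"]
    request = build_translate_request(text, target_lang, key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["translations"][0]["text"]
```

Keeping the key in an environment variable rather than in the source avoids committing it to the repo by accident.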
Strictly speaking, you need macOS Catalina (10.15) or higher with a 64-bit processor. However, I have confirmed that the program also works on macOS Mojave (10.14).
To start, one option is to clone this repo. Open a terminal and run:
git clone https://github.com/christopher-w-murphy/Digital-HTC-Architecture.git
Alternatively, one may download the code by clicking the green Code button and then Download ZIP.
In either case, move into the repo directory and run the installer script:
cd Digital-HTC-Architecture/
bash macos_installer.sh
While still in the Digital-HTC-Architecture directory, start the program by running the following command in a terminal:
bash ocr_and_machine_translation.sh
You can now view the Streamlit app in your browser.
To stop the app, press Control+C in the terminal.
Docker users can pull the image from Docker Hub:
docker pull murphycw/digital-htc-architecture
Run the image as a container to start the program:
docker run -d -p 8501:8501 murphycw/digital-htc-architecture
Note that the Docker image has Tesseract v4 as opposed to v5, and can only OCR English and French documents.
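Because `docker run -d` starts the container in the background, nothing in the terminal tells you when the app is ready. The following sketch polls the published port until it accepts connections. The helper name and timeout values are assumptions for illustration; the host and port follow from the `-p 8501:8501` mapping above.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8501, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Not accepting connections yet; retry shortly.
            time.sleep(0.5)
    return False
```

If `wait_for_port()` returns `True`, opening http://localhost:8501 in a browser should show the Streamlit app.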
Programming Historian has instructions for setting up an OCR and translation program on Windows.