For testing transforms with S3 we use Minio, which can be installed on Linux, macOS, and Windows. The steps here assume macOS; refer to the documentation above for other platforms.
The simplest way to install Minio on Mac is using Homebrew. Use the following command:
brew install minio/stable/minio
In addition to the Minio server, install the latest stable MinIO CLI (mc) using:
brew install minio/stable/mc
Now you can start Minio server using the following command:
minio server start
Once it starts, you can connect to the server UI at the following address: http://localhost:9000
The default user name and password are both minioadmin.
The Minio server can be populated with test data using mc. First, configure mc to work with the local Minio server:
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
This sets an alias named local in mc that points to the local Minio server instance. We can now use the commands provided by mc to populate the server.
First, test the connection to the newly added MinIO deployment using the mc admin info command:
mc admin info local
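When the server is started from a script, it may not be ready by the time mc is called. The sketch below polls the MinIO liveness endpoint (/minio/health/live) before continuing. The function name wait_for_minio and its three arguments are assumptions made for this example; they are not part of mc or minio.

```shell
#!/bin/sh
# Sketch: poll the MinIO liveness endpoint until the server answers.
# wait_for_minio and its arguments are hypothetical names for this example.

wait_for_minio() {
  url="${1:-http://127.0.0.1:9000}"   # server address used in this guide
  retries="${2:-30}"                  # attempts before giving up
  delay="${3:-1}"                     # seconds to sleep between attempts
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -f: exit non-zero on HTTP errors; -s: no progress output
    if curl -fs --max-time 2 -o /dev/null "$url/minio/health/live"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Minio did not respond at $url after $retries attempts" >&2
  return 1
}
```

A typical use would be `wait_for_minio && mc admin info local`, so that the connection check only runs once the server responds.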
To copy the data to Minio, you first need to create a bucket:
mc mb local/test
Once the bucket is created, you can copy the files (assuming you are in the transforms directory) using:
mc cp --recursive tools/ingest2parquet/test-data/input/ local/test/ingest2parquet/input
mc cp --recursive code/code_quality/test-data/input/ local/test/code_quality/input
mc cp --recursive code/proglang_select/test-data/input/ local/test/proglang_select/input
mc cp --recursive code/proglang_select/test-data/languages/ local/test/proglang_select/languages
mc cp --recursive code/malware/test-data/input/ local/test/malware/input
mc cp --recursive language/doc_quality/test-data/input/ local/test/doc_quality/input
mc cp --recursive language/lang_id/ray/test-data/input/ local/test/lang_id/input
mc cp --recursive universal/blocklist/test-data/input/ local/test/blocklist/input
mc cp --recursive universal/blocklist/test-data/domains/ local/test/blocklist/domains
mc cp --recursive universal/doc_id/test-data/input/ local/test/doc_id/input
mc cp --recursive universal/ededup/test-data/input/ local/test/ededup/input
mc cp --recursive universal/fdedup/test-data/input/ local/test/fdedup/input
mc cp --recursive universal/filter/test-data/input/ local/test/filter/input
mc cp --recursive universal/noop/test-data/input/ local/test/noop/input
mc cp --recursive universal/resize/test-data/input/ local/test/resize/input
mc cp --recursive universal/tokenization/test-data/ds01/input/ local/test/tokenization/ds01/input
mc cp --recursive universal/tokenization/test-data/ds02/input/ local/test/tokenization/ds02/input
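Because the copy commands above all follow the same pattern, they can also be generated from a list of source and destination pairs. The following is a minimal sketch covering only three of the directories. The function copy_test_data and the DRY_RUN variable are hypothetical names introduced here; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build the `mc cp` commands from "source destination" pairs.
# copy_test_data and DRY_RUN are hypothetical names for this example.
# DRY_RUN=1 (the default) prints the commands; DRY_RUN=0 runs them.

ALIAS="local"   # alias configured with `mc alias set`
BUCKET="test"   # bucket created with `mc mb`

copy_test_data() {
  while read -r src dst; do
    [ -z "$src" ] && continue   # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "mc cp --recursive $src $ALIAS/$BUCKET/$dst"
    else
      # </dev/null keeps mc from consuming the list being read by the loop
      mc cp --recursive "$src" "$ALIAS/$BUCKET/$dst" </dev/null
    fi
  done <<EOF
universal/noop/test-data/input/ noop/input
universal/filter/test-data/input/ filter/input
universal/resize/test-data/input/ resize/input
EOF
}

copy_test_data
```

To cover more transforms, add further lines to the list between the EOF markers, then run the script with DRY_RUN=0 from the transforms directory.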
Note that once the data is copied, Minio stores it on the local file system, so you do not need to copy it again after a cluster restart.
The last step is to add the Minio access and secret keys that applications will use to access it. The following command:
mc admin user svcacct add --access-key "localminioaccesskey" --secret-key "localminiosecretkey" local minioadmin
creates both the access key and the secret key for use by the applications.
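One way to check that the new keys work is to register a second mc alias that logs in with them and then list the test bucket. This is a sketch under stated assumptions: the alias name localsvc, the MC variable, and the function verify_service_account are all hypothetical names chosen for this example.

```shell
#!/bin/sh
# Sketch: confirm the service account keys by logging in with them.
# MC, SVC_ALIAS and verify_service_account are hypothetical names.

MC="${MC:-mc}"         # set MC=echo to print the commands instead of running them
SVC_ALIAS="localsvc"   # second alias, separate from the admin alias "local"

verify_service_account() {
  # Register an alias that authenticates with the service account keys
  "$MC" alias set "$SVC_ALIAS" http://127.0.0.1:9000 \
    "localminioaccesskey" "localminiosecretkey" || return 1
  # List the bucket created earlier; success means the keys were accepted
  "$MC" ls "$SVC_ALIAS/test"
}
```

Calling `verify_service_account` after the service account has been created should list the contents of the test bucket if the keys are valid.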