`osdataproc` is a command-line tool for creating an OpenStack cluster with Apache Spark and Apache Hadoop configured. It comes with JupyterLab and Hail (a genomic data analysis library built on Spark) installed, as well as Netdata for monitoring.
`osdataproc` uses Terraform to spin up the cluster in OpenStack, and Ansible to set up the cluster once its nodes are running.
- Create a Python virtual environment. For example:

      python3 -m venv env

- Download Terraform (1.4 or higher) and unzip it into a location on your path, e.g. into your venv. Make sure to download the appropriate version for your operating system and architecture. For example, for macOS on amd64:

      curl https://releases.hashicorp.com/terraform/1.4.6/terraform_1.4.6_darwin_amd64.zip > terraform_1.4.6_darwin_amd64.zip
      unzip terraform_1.4.6_darwin_amd64.zip

- Source the environment, clone this repository and install the requirements into the virtual environment:

      source env/bin/activate
      git clone https://github.com/wtsi-hgi/osdataproc.git
      cd osdataproc
      pip install -e .

- Make sure you have created an SSH keypair with `ssh-keygen` if you have not done so before. The default options are OK. Read the notes below if your private key has a passphrase.

- Download your OpenStack project's `openrc.sh` file. You can find the specific file for your project at Project > API Access, and then Download OpenStack RC File > OpenStack RC File on the right.

- Source your `openrc.sh` file:

      source <project-name>-openrc.sh
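
The Terraform version requirement from the setup steps can be checked before running anything else. The following is a minimal sketch, not part of `osdataproc`: the helper names `parse_version`, `meets_minimum` and `installed_terraform_version` are hypothetical, and the last one assumes that the `terraform` binary on your path supports `terraform version -json`.

```python
import json
import subprocess


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as '1.4.6' or 'v1.4.6' into a tuple of ints."""
    core = text.strip().lstrip("v").split("-")[0]  # drop any pre-release suffix
    return tuple(int(part) for part in core.split("."))


def meets_minimum(version: str, minimum: tuple[int, ...] = (1, 4)) -> bool:
    """Return True if `version` is at least `minimum` (1.4, per the setup steps)."""
    return parse_version(version) >= minimum


def installed_terraform_version() -> str:
    """Ask the Terraform binary on PATH for its version (assumes `-json` support)."""
    result = subprocess.run(
        ["terraform", "version", "-json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)["terraform_version"]
```

Calling `meets_minimum(installed_terraform_version())` from inside the activated venv should return `True` when the Terraform you unzipped is recent enough.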
You can then run the `osdataproc` command as shown in the examples below. `osdataproc --help`, `osdataproc create --help`, etc. will show all possible arguments.
Once run, it will ask you for a password. This is for access to the web interfaces, including Jupyter Lab. It is also the password for an encrypted NFS volume (see the NFS documentation). When you first access your cluster via a browser you will be asked for said password.
osdataproc create [--num-workers] <Number of desired worker nodes>
[--public-key] <Path to public key file>
[--flavour] <OpenStack flavour to use>
[--network-name] <OpenStack network to use>
[--lustre-network] <OpenStack Lustre provider network to use>
[--image-name] <OpenStack image to use - Ubuntu images only>
[--nfs-volume] <Name/ID of volume to attach or create as NFS shared volume>
[--volume-size] <Size of OpenStack volume to create>
[--device-name] <Device mountpoint name of volume>
[--floating-ip] <OpenStack floating IP to associate to master node - will automatically create one if not specified>
<cluster_name>
NOTE: Ensure that the image used has python3.9 as the default version of python. The focal images should work.
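
If you create clusters from a script, the argument list can be assembled programmatically. This is a rough sketch that only builds the command line from a subset of the flags listed above; `build_create_command` is a hypothetical helper, not part of `osdataproc`, and the flavour name in the usage note is a made-up example.

```python
from typing import Optional


def build_create_command(
    cluster_name: str,
    num_workers: Optional[int] = None,
    public_key: Optional[str] = None,
    flavour: Optional[str] = None,
    nfs_volume: Optional[str] = None,
    volume_size: Optional[int] = None,
) -> list[str]:
    """Build the argv list for `osdataproc create`, skipping unset options."""
    options = {
        "--num-workers": num_workers,
        "--public-key": public_key,
        "--flavour": flavour,
        "--nfs-volume": nfs_volume,
        "--volume-size": volume_size,
    }
    command = ["osdataproc", "create"]
    for flag, value in options.items():
        if value is not None:
            command += [flag, str(value)]
    # The cluster name is the final positional argument.
    command.append(cluster_name)
    return command
```

For example, `build_create_command("my-cluster", num_workers=4, flavour="m1.large")` yields the list for `osdataproc create --num-workers 4 --flavour m1.large my-cluster`. Running that command will still prompt for the password interactively.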
`osdataproc create` will output the public IP of your master node once the node has been created. You can SSH into it using the private key that matches the public key you provided:

    ssh ubuntu@<public_ip>
Note that it will take a few minutes for the configuration to complete. The following services can then be accessed from your browser:
Service | URL |
---|---|
Jupyter Lab | https://<public_ip>/jupyter |
Spark | https://<public_ip>/spark |
Spark History | https://<public_ip>/sparkhist |
HDFS | https://<public_ip>/hdfs |
YARN | https://<public_ip>/yarn |
MapReduce History | https://<public_ip>/mapreduce |
Netdata Metrics | https://<public_ip>/netdata |
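
The table above can be turned into a small lookup, which is handy for printing or opening the links once the public IP is known. This is a minimal sketch; `SERVICE_PATHS` and `service_urls` are hypothetical names, and the paths are copied from the table.

```python
# URL path of each web service, as listed in the table above.
SERVICE_PATHS = {
    "Jupyter Lab": "jupyter",
    "Spark": "spark",
    "Spark History": "sparkhist",
    "HDFS": "hdfs",
    "YARN": "yarn",
    "MapReduce History": "mapreduce",
    "Netdata Metrics": "netdata",
}


def service_urls(public_ip: str) -> dict[str, str]:
    """Map each service name to its URL on the master node."""
    return {
        name: f"https://{public_ip}/{path}"
        for name, path in SERVICE_PATHS.items()
    }
```

Pass the IP printed by `osdataproc create` and iterate over the result to list every link.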
You can attach a volume to your cluster as an NFS share, either by creating a new volume or by attaching an existing one. The volume is mounted on the `data` directory of your master node, and that `data` directory is in turn mounted on all of the worker nodes as a shared volume over NFS. See the NFS documentation for details and creation options.
For Lustre support, you must provide the name of the Lustre provider network that exists in your tenant and an image configured to mount Lustre from this network. For Sanger users, please check with ISG for the details.
osdataproc destroy <cluster_name>
You will be prompted to type 'yes' if you are happy with the printed destroy plan.
There is a vars.yml file where default options for creating the cluster can be saved and where Spark and Hadoop configuration items can be tuned. Additional packages and Python modules to install on the cluster can also be specified there.
- Check that the installed version of jinja2 is 3.0.3; otherwise you may see a warning like the following:

      [WARNING]: Skipping plugin (/lib/python3.9/site-packages/ansible/plugins/filter/mathstuff.py) as it seems to be invalid: cannot import name 'environmentfilter' from 'jinja2.filters' (/lib/python3.9/site-packages/jinja2/filters.py)

- Your cluster name should never contain underscore characters; valid choices are alphanumeric characters and dashes.

- If your private key has a passphrase, Ansible will not be able to connect to the created instances unless you add your key to `ssh-agent` first:

      eval $(ssh-agent)
      ssh-add

- You can check the provisioning status of a worker node by connecting to it through the master node and checking `/var/log/user_data.log`. For example:

      ssh -J ubuntu@<public_ip> \
          ubuntu@<user>-<cluster_name>-worker-<index> \
          tail -f /var/log/user_data.log

- `osdataproc` is configured to use Kryo serialization for use with Hail for up to 10x faster data serialization. However, not all `Serializable` types are supported, so it may be necessary to change `$SPARK_HOME/conf/spark-defaults.conf` by commenting out or removing the `spark.serializer` configuration option. This option can also be removed by default in vars.yml when creating a cluster.

- For Sanger users, check the appropriate FCE capacity dashboard under the Sanger metrics.

- If `spark-submit` is not on PATH (`spark-submit: command not found`), activate the venv:

      source /home/ubuntu/venv/bin/activate

- If running `spark-submit` gives the error `TypeError: 'JavaPackage' object is not callable`, check that `SPARK_HOME` is set to `/opt/spark`:

      export SPARK_HOME=/opt/spark
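
Two of the notes above lend themselves to a quick check in code: the cluster-name rule (alphanumeric characters and dashes only) and the worker hostname pattern used in the `ssh -J` example. The sketch below is an illustration under those stated rules; `is_valid_cluster_name` and `worker_hostname` are hypothetical helpers, and the index is left as a parameter because the notes do not say at which number it starts.

```python
import re

# Alphanumeric characters and dashes only; underscores are rejected.
_VALID_NAME = re.compile(r"[A-Za-z0-9-]+")


def is_valid_cluster_name(name: str) -> bool:
    """Return True if `name` uses only alphanumeric characters and dashes."""
    return _VALID_NAME.fullmatch(name) is not None


def worker_hostname(user: str, cluster_name: str, index: int) -> str:
    """Build a worker hostname of the form <user>-<cluster_name>-worker-<index>."""
    return f"{user}-{cluster_name}-worker-{index}"
```

Validating the name before calling `osdataproc create` catches the underscore mistake early, and `worker_hostname` gives the second hop for the `ssh -J` command.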
You can contribute by submitting pull requests to this repository. If you create a fork, you will need to update the `REPO` and `BRANCH` variables in terraform/user-data.sh.tpl to the new repository location for the changes you make to be reflected in the created cluster.
Variables containing the main package versions are defined in vars.yml.
The JDK version is defined in ansible/roles/common/default.
All Terraform and Ansible logs are located directly in the `/var/log` folder.
By default, Ansible writes data to syslog.
Logs for the master node are also saved to ${OSDataProc_HOME}/terraform/terraform.tfstate.d/<cluster_name>/ansible-master.log.
To run the Ansible playbooks on the worker nodes, Terraform clones them from the public Git repository.
To apply updated playbooks on cluster creation, you need to push your changes to a Git branch and reference that branch in the terraform/user-data.sh.tpl script.
- Refactor/enhance Python CLI
  - Move from `openrc.sh` to `clouds.yaml`
  - Incorporate `run` script into `osdataproc` Python CLI
  - Allow setting of password by environment variable
  - Allow multiple public SSH keys
  - Allow setting of DNS name servers
  - Use exit codes productively if Terraform/Ansible fail
  - Machine-readable output to communicate status
  - Allow non-interactive destruction
  - Correct distribution to recognise multiple Python modules
- Refactor Ansible playbooks
- Update to JupyterLab 3
- Resize functionality (larger flavour/more nodes)
- Support for other Linux distributions