This configuration is only valid on `fs.dslc.liacs.nl`. To access `fs.dslc.liacs.nl` you must first be logged into one of the university's services and SSH into the cluster from a secure environment.
```bash
# First, SSH to the silver server
ssh <YOUR_STUDENT_NUMBER>@silver.liacs.nl

# Then SSH to the DS cluster from the silver server
ssh fs.dslc.liacs.nl
```
This repository contains a `dsctl.sh` script that sets up a standalone Spark cluster. You can clone this repo in the root of your cluster's master node by running:
```bash
git clone https://github.com/maminio/dsctl
```
If `git` is not present, you can fetch the script directly:

```bash
wget https://raw.githubusercontent.com/maminio/dsctl/master/dsctl.sh -O dsctl.sh
```
To view all the available commands, run:

```bash
bash ./dsctl.sh -h
```
```
Usage: ./dsctl.sh [option]

init            Set up all initial software and packages.
python_setup    Build and install Python 3.7.4 from source.
start_master    Start the master node of the cluster.
start_worker    Start one worker; to start all workers at once, run start_workers.
start_workers   Start all worker nodes.
start           Start both the master and worker nodes.
stop_all        Stop all running nodes.
```
Navigate to the `dsctl` folder and run the command below to install all the prerequisite packages and software.
```bash
bash ./dsctl.sh init
```
This command will prompt you for your password in order to set up and transfer the shared files between the master and worker nodes.
To work with PySpark or Jupyter, you need to install Python. This script will only install Python 3.7.4.
To start the installation, run:

```bash
bash ./dsctl.sh python_setup
```
This command will take a while to run, as it builds Python from source.
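When it finishes, you can verify the interpreter with a trivial check (this assumes `python` on your PATH now points at the newly built interpreter):

```python
import sys

# Should report 3.7.4 if python_setup completed successfully.
print(sys.version)
```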
To start the cluster with the master node and all available worker nodes, you can simply run:

```bash
bash ./dsctl.sh start
```
To stop the cluster and all running nodes:
```bash
bash ./dsctl.sh stop_all
```
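Once the cluster is started, you can check that the master is actually listening before wiring up any notebooks. The sketch below is just a plain TCP probe; it assumes it runs on `fs.dslc.liacs.nl` and that the master uses port 8085, as elsewhere in this setup:

```python
import socket

# Probe the Spark master port (8085 in this setup); run this on
# fs.dslc.liacs.nl after `bash ./dsctl.sh start`.
try:
    with socket.create_connection(("localhost", 8085), timeout=5):
        print("Spark master is accepting connections on port 8085")
except OSError as exc:
    print(f"Master is not reachable: {exc}")
```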
There are two options available to connect to the cluster:

- SSH port-forward the whole master node port (8085).
- Start Jupyter Notebook on the cluster and port-forward only the Jupyter Notebook port.
- SSH to `fs.dslc.liacs.nl`.
- Install Jupyter:

```bash
python -m pip install jupyter
```

- Run Jupyter Notebook on the cluster:

```bash
python -m jupyter notebook --no-browser --port=9995
```

- From your local terminal, SSH port-forward to the cluster:

```bash
# On your local machine
ssh -L 9995:localhost:2366 <YOUR_STUDENT_NUMBER>@silver.liacs.nl
# Then on silver
ssh -4 -N -L 2366:localhost:9995 <YOUR_STUDENT_NUMBER>@fs.dslc.liacs.nl
```

- Open Jupyter in your local browser: `localhost:9995`
```python
# Check whether a previous Spark session is already active
from pyspark import SparkContext
from pyspark.sql import SQLContext
import warnings

try:
    # Create a SparkContext that connects to the cluster's master
    sc = SparkContext(appName="SDDM", master='spark://fs.dslc.liacs.nl:8085')
    print("Just created a SparkContext")
    sqlContext = SQLContext(sc)
    print("Just created a SQLContext")
except ValueError:
    warnings.warn("SparkContext already exists in this scope")
```
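If the context was created successfully, a quick smoke test confirms that work is actually shipped to the workers. This is only a sketch; the job itself is arbitrary, and it assumes the `sc` created above:

```python
# Sum 0..99 on the cluster; any small job would do.
rdd = sc.parallelize(range(100))
print(rdd.sum())               # expected: 4950
print(sc.defaultParallelism)   # total cores the cluster exposes
```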
- Open a terminal and SSH port-forward to `silver`:

```bash
ssh -L 8085:localhost:2348 <YOUR_STUDENT_NUMBER>@silver.liacs.nl
```

- SSH port-forward from `silver` to `fs.dslc.liacs.nl`:

```bash
ssh -4 -N -L 2348:fs.dslc.liacs.nl:8085 <YOUR_STUDENT_NUMBER>@fs.dslc.liacs.nl
```

- Connect your local Jupyter Notebook to the cluster:
```python
from pyspark import SparkContext
from pyspark.sql import SQLContext
import warnings

try:
    # The cluster's master port (8085) is forwarded to localhost
    sc = SparkContext(appName="SDDM", master='spark://localhost:8085')
    sqlContext = SQLContext(sc)
except ValueError:
    warnings.warn("SparkContext already exists in this scope")
```
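As before, a short end-to-end check verifies the forwarded connection. This is a sketch only; the example rows are made up for illustration, and it assumes the `sc` and `sqlContext` created above:

```python
# Hypothetical two-row DataFrame, just to exercise the connection.
df = sqlContext.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])
df.show()            # prints the two rows if the cluster responds
print(df.count())    # expected: 2
```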