Scalable implementation for the course project of Design of Parallel and High Performance Computing HS19
You need mpicxx and mpiexec (from any MPI implementation) as well as cmake (on Ubuntu/Debian, install these with sudo apt install cmake g++ openmpi-bin libopenmpi-dev).
First, run cmake:
cmake .
Compile the program:
make
Run the program:
mpiexec -np <number of processes> ./Distributed_Genetic_Algorithm
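For example, to run it with four processes:
mpiexec -np 4 ./Distributed_Genetic_Algorithm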
Note: you need to be inside the ETH network (use the VPN if you are not). Log in to the Leonhard cluster with:
ssh <NETHZ>@login.leonhard.ethz.ch
To copy your SSH key to the cluster, use one of the following commands.
On Linux and OSX:
ssh-copy-id <NETHZ>@login.leonhard.ethz.ch
On Windows:
cat ~/.ssh/id_rsa.pub | ssh <NETHZ>@login.leonhard.ethz.ch "cat >> ~/.ssh/authorized_keys"
Once logged in, load the required modules:
module load gcc/8.2.0
module load openmpi/4.0.1
To avoid typing this in at every login, you can also add it to your .bashrc file using the following commands (you only need to do this once):
cd ~
echo "module load gcc/8.2.0" >> .bashrc
echo "module load openmpi/4.0.1" >> .bashrc
(If you ever need to remove them again, just open .bashrc and delete those two lines.)
Build on the cluster just like locally:
cmake .
make
For a quick sanity check you can run the program directly, but do not do this excessively, as it runs on the login node:
mpirun -np 2 ./Distributed_Genetic_Algorithm
Use the following command to submit a job on Leonhard:
bsub -n 4 -W 00:10 -o log_test -R "rusage[mem=1024]" mpirun ./Distributed_Genetic_Algorithm
- -n specifies the number of cores to use. Note that you do not need to specify -np in the mpirun call; the core count is picked up from the batch system.
- -W 00:10 specifies that the job may run for at most 10 minutes (after which the process gets killed). As we should not submit jobs that run longer than 30 minutes, keep this at most at 00:30; a lower value might get scheduled faster.
- -o log_test specifies the name of the logfile. If unspecified, the cluster will create a cryptic name.
- -R "rusage[mem=1024]" specifies how much memory (in MB) is requested PER CORE.
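For example, to request 8 cores for the maximum allowed 30 minutes (log_scaling is just an arbitrary logfile name):
bsub -n 8 -W 00:30 -o log_scaling -R "rusage[mem=1024]" mpirun ./Distributed_Genetic_Algorithm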
bjobs -a lists all jobs for your user. You can also use bbjobs -a to get more information.
bpeek -f shows the console output of the job (if only one job is active, you do not need to specify the job ID).
Once the job has finished, you can inspect the logfile, e.g.:
vi log_test
Write a .json file that adheres to the specification below to describe what kind of jobs the cluster should run.
To avoid the frustration of jobs not running correctly, there is an easy way to preview which jobs will be scheduled:
python run_experiments.py --dry_run -e experiment.json
where -e is followed by a list of experiment files to run. The script simply prints all the jobs that would be submitted. If you are happy with the result, you can proceed to actually scheduling the jobs.
As this Python script will run for a long time, we recommend using nohup so that it is not interrupted when your SSH session closes.
nohup python run_experiments.py -e experiment.json >> output.log &
This script will also create a folder in logs for each experiment and put any outputs in output.log.
When you execute the command, it prints the PID of the job. You will need this if you ever have to kill the script.
If you ever need to kill the experiment script, use kill <PID>, where <PID> is the PID that was printed when you started the experiment. In case you don't know it anymore, run htop and search for python to find a process that runs under your own username and use that PID.
After an experiment has run to completion, all the logfiles are automatically written to logs/<EXPERIMENT-DIRECTORY>. You should find the right folder first (using ls logs), then go to that directory (using cd logs/<EXPERIMENT-DIRECTORY>).
Find the current working directory (using pwd) and copy it to your clipboard.
Log out from SSH (using Ctrl-D/Cmd-D) and navigate to the directory where you want to copy the logfiles to. Usually this should be /path/to/Scalable_Genetic_Algorithm/logs.
Using scp, copy all the files to the current directory:
scp -r <NETHZ>@login.leonhard.ethz.ch:<WORKING-DIR> .
where <WORKING-DIR> is the directory that you saved to your clipboard before. This might take a while.
To plot the results on your local machine, navigate to the logs directory and start a Jupyter notebook (using jupyter notebook; make sure you have it installed, e.g. with pip install jupyter).
Open the file plot_results.ipynb and install the necessary libraries if needed (by running the first cell).
You can extract the fitness values using the following line:
experiments, dataframes = extract_all_run_values(<EXPERIMENT-DIRECTORY>)
This returns in experiments a list of the names of the experiments that were run, and in dataframes a list of the corresponding dataframes. Each element of dataframes is a dataframe with the following columns:
- epoch: the epoch at which the value was recorded
- fitness: the minimum fitness across all ranks
- run: the repetition the values are from
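For instance, to see the best (lowest) fitness that each repetition reached, you can aggregate a dataframe (a minimal sketch, assuming these are pandas dataframes as described above):
best_per_run = dataframes[0].groupby("run")["fitness"].min()
print(best_per_run)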
Plot the values using multiple lines:
ax = sns.lineplot(x="epoch", y="fitness", hue="run", data=dataframes[0])
ax.set_title(experiments[0])
ax
Plot the values using confidence intervals:
ax = sns.lineplot(x="epoch", y="fitness", data=dataframes[0])
ax.set_title(experiments[0])
ax
You can save plots as you usually would with matplotlib (e.g. using plt.savefig("fig.png")).
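If you want one saved figure per experiment, a loop like the following works (a minimal sketch, assuming the notebook's usual imports, i.e. seaborn as sns and matplotlib.pyplot as plt):

for name, df in zip(experiments, dataframes):
    ax = sns.lineplot(x="epoch", y="fitness", data=df)  # aggregates runs with a confidence interval
    ax.set_title(name)
    plt.savefig(name + ".png")
    plt.clf()  # clear the figure before plotting the next experiment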
When writing an experiment specification, you need to follow this format:
{
"name" : <EXPERIMENT_NAME>,
"repetitions" : <NR_REPETITIONS>,
"fixed_params" : {
<FIXED_PARAM_1_NAME> : <FIXED_PARAM_1_VALUE>,
...
<FIXED_PARAM_N_NAME> : <FIXED_PARAM_N_VALUE>
},
"variable_params" : {
<VAR_PARAM_1_NAME> : <VAR_PARAM_1_VALUE>,
...
<VAR_PARAM_M_NAME> : <VAR_PARAM_M_VALUE>
}
}
- <EXPERIMENT_NAME> is any identifier for the experiment. Whitespace will be turned into underscores for the folder name.
- <NR_REPETITIONS> is the number of repetitions of the experiment
- <FIXED_PARAM_i_NAME> is the string of the argument identifier that should be fixed in the experiment.
- <FIXED_PARAM_i_VALUE> is the integer or string value of that parameter.
- <VAR_PARAM_j_NAME> is the string of the argument identifier that should be varied in the experiment.
- <VAR_PARAM_j_VALUE> is a dict of the elements that should be varied.
- It can be a range from <MIN> (inclusive) to <MAX> (exclusive) with stride <STRIDE> (if the stride is not specified, the value 1 is assumed):
{ "type" : "range", "min" : <MIN>, "max" : <MAX>, "stride" : <STRIDE> }
- It can be a list of values <VAL_1>, ..., <VAL_K>:
{ "type" : "list", "list" : [ <VAL_1>, ... <VAL_K> ] }
- It can be a list of tuples (<VAL_1_1>, <VAL_2_1>, ..., <VAL_L_1>), ..., (<VAL_1_K>, <VAL_2_K>, ..., <VAL_L_K>). Note that the parameter name will be ignored:
{ "type" : "tuple", "names" : [ <NAME_1>, ... <NAME_L> ], "values" : [ { "value" : [ <VAL_1_1>, ... <VAL_L_1> ] }, ... { "value" : [ <VAL_1_K>, ... <VAL_L_K> ] } ] }
In the following example we try different numbers of islands with different population sizes. We try every combination of k islands, where 1 <= k < 4, and population size p, where p ∈ {100, 200, 400, 800}, for 10 repetitions each.
{
"name" : "try scaling and population size",
"repetitions" : 10,
"fixed_params" : {
"mode" : "island"
},
"variable_params" : {
"-n": {
"type": "range",
"min": 1,
"max": 4,
"stride": 1
},
"--population": {
"type": "list",
"list": [
100,
200,
400,
800
]
}
}
}
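To get an intuition for how such a spec is expanded, here is a minimal Python sketch of how the range and list types could be turned into concrete parameter combinations (hypothetical illustration only; the actual logic lives in run_experiments.py, and the tuple type is omitted for brevity):

import itertools
import json

def expand_param(spec):
    # Turn one variable_params entry into a list of values.
    if spec["type"] == "range":
        stride = spec.get("stride", 1)  # stride defaults to 1
        return list(range(spec["min"], spec["max"], stride))  # max is exclusive
    if spec["type"] == "list":
        return spec["list"]
    raise ValueError("unsupported type: " + spec["type"])

with open("experiment.json") as f:
    exp = json.load(f)

names = list(exp["variable_params"])
values = [expand_param(exp["variable_params"][n]) for n in names]
for combo in itertools.product(*values):
    # One job per combination; each job is repeated exp["repetitions"] times.
    params = dict(exp["fixed_params"])
    params.update(zip(names, combo))
    print(params)

For the example above this would print 3 * 4 = 12 parameter combinations, each of which would be run 10 times.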