New guides added for scaling R packages
sash19 committed Jan 22, 2025
1 parent 52e9fa0 commit 8bfe870
Showing 3 changed files with 328 additions and 1 deletion.
230 changes: 230 additions & 0 deletions docs/helps_demo.rst
@@ -0,0 +1,230 @@
A Complete Process of Adding & Scaling a New Application
=========================================================

This is another demo, in which a new container with a new R package is added
and then scaled using Scalable.

The first step is to add a new container with the new R package. This is done
by adding a new target to the Dockerfile, which is located in the home
directory where scalable_bootstrap is launched. In this case, the
`HELPS <https://github.com/JGCRI/HELPS>`_ package is added as a target to
make a new container. Since HELPS is an R package, and rpy2 is the primary
method of interacting with R applications when using Scalable, it is good
practice to install HELPS through rpy2. This is the new target:

.. code-block:: dockerfile

    # Base target shipped with the default Dockerfile
    FROM ubuntu:22.04 AS build_env
    ENV DEBIAN_FRONTEND=noninteractive
    RUN apt-get -y update && apt-get install -y \
        wget unzip openjdk-11-jdk-headless build-essential libtbb-dev \
        libboost-dev libboost-filesystem-dev libboost-system-dev \
        python3 python3-pip libboost-python-dev libboost-numpy-dev \
        ssh nano locate curl net-tools netcat-traditional git python3 python3-pip \
        python3-dev gcc libboost-python1.74 libboost-numpy1.74 openjdk-11-jre-headless libtbb12 rsync
    RUN apt-get -y update && apt -y upgrade
    RUN pip3 install --upgrade dask[complete] dask-jobqueue dask_mpi pyyaml joblib versioneer tomli xarray
    RUN apt-get -y update && apt -y upgrade
    RUN mkdir -p /usr/lib/R/site-library
    ENV R_LIBS_USER /usr/lib/R/site-library
    RUN chmod a+w /usr/lib/R/site-library
    RUN apt-get install -y --no-install-recommends r-base r-base-dev
    RUN apt-get install -y --no-install-recommends python3-rpy2
    RUN python3 -m pip install -U pip

    # New target for the HELPS package
    FROM build_env AS helps
    RUN apt-get -y update && apt -y upgrade && apt-get -y -f install
    RUN apt install --fix-missing -y software-properties-common
    RUN add-apt-repository ppa:ubuntugis/ppa
    RUN apt-get update
    ENV R_LIBS_USER /usr/lib/R/site-library
    RUN chmod a+w /usr/lib/R/site-library
    # System libraries needed to build the R dependencies of HELPS
    RUN apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libudunits2-dev libproj-dev libavfilter-dev \
        libbz2-dev liblzma-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libgeos-dev libgdal-dev
    RUN git clone https://github.com/JGCRI/HELPS.git /HELPS
    # Step 1: install the CRAN prerequisites through rpy2
    RUN echo "import rpy2.robjects.packages as rpackages" >> /install_script.py \
        && echo "utils = rpackages.importr('utils')" >> /install_script.py \
        && echo "utils.install_packages('devtools')" >> /install_script.py \
        && echo "utils.install_packages('arrow')" >> /install_script.py \
        && echo "utils.install_packages('assertthat')" >> /install_script.py
    RUN python3 /install_script.py
    # Step 2: install HELPS from GitHub. The first echo overwrites the script
    # (">") so that the CRAN installs from step 1 are not repeated.
    RUN echo "import rpy2.robjects.packages as rpackages" > /install_script.py \
        && echo "devtools = rpackages.importr('devtools')" >> /install_script.py \
        && echo "devtools.install_github('JGCRI/HELPS', dependencies=True)" >> /install_script.py
    RUN python3 /install_script.py
    # Step 3: verify that HELPS can be imported through rpy2
    RUN echo "import rpy2.robjects.packages as rpackages" > /install_script.py \
        && echo "helps = rpackages.importr('HELPS')" >> /install_script.py
    RUN python3 /install_script.py
    COPY --from=scalable /scalable /scalable
    RUN pip3 install /scalable/.
    RUN pip3 install --force-reinstall xarray==2024.5.0
    RUN pip3 install --force-reinstall numpy==1.26.4

The ``build_env`` target is included in the base Dockerfile, which is
downloaded automatically. It is recommended to use ``build_env`` as the base
target for all new containers. The ``helps`` target is added manually. Note
that rpy2 isn't needed to download or install the HELPS package; it is used
here for the sake of consistency. To learn more about converting R code to
Python code using rpy2, see the :doc:`rpy2` guide.
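
The ``echo`` chains in the ``helps`` target assemble small rpy2 scripts line by
line. As a rough sketch of an alternative, the same script text could be
generated by (or simply kept in) a regular Python file and copied into the
image. The helper name ``build_install_script`` is hypothetical and not part of
Scalable; the rpy2 calls it emits are the ones already used in the Dockerfile.

.. code-block:: python

    def build_install_script(cran_packages, github_repo):
        """Return the text of an rpy2 script that installs R packages.

        Hypothetical helper for illustration only. ``cran_packages`` should
        include 'devtools', since the generated script imports it afterwards.
        """
        lines = [
            "import rpy2.robjects.packages as rpackages",
            "utils = rpackages.importr('utils')",
        ]
        # One install_packages call per CRAN package, in the given order
        for name in cran_packages:
            lines.append(f"utils.install_packages('{name}')")
        # Then install the target package from GitHub through devtools
        lines += [
            "devtools = rpackages.importr('devtools')",
            f"devtools.install_github('{github_repo}', dependencies=True)",
        ]
        return "\n".join(lines) + "\n"


    if __name__ == "__main__":
        print(build_install_script(["devtools", "arrow", "assertthat"], "JGCRI/HELPS"))

Generating the script only produces text, so it can be inspected without R or
rpy2 being installed; the packages are installed only when the resulting file
is run inside the container.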

Once the target is added, it can be selected when running scalable_bootstrap.
It will be built locally and uploaded automatically to the remote system.

The next step is to write functions that actually run the HELPS package.
This is relatively simple: launch python3 in the terminal where the
bootstrap lands and run the following code:

.. code-block:: python

    from scalable import *

    def run_heatstress(hurs_file_name, tas_file_name, rsds_file_name, target_year):
        import rpy2.robjects.packages as rpackages
        helps = rpackages.importr('HELPS')
        # generate a heat stress raster brick for the desired resolution
        heat_stress_raster = helps.HeatStress(
            TempRes = "month",
            SECTOR = "SUNF_R",
            HS = helps.WBGT_ESI,
            YEAR_INPUT = target_year,
            a=hurs_file_name,
            b=tas_file_name,
            c=rsds_file_name
        )
        return heat_stress_raster

    def run_physical_work_capacity(heat_stress_raster):
        import rpy2.robjects.packages as rpackages
        helps = rpackages.importr('HELPS')
        # generate physical work capacity raster brick
        physical_work_capacity_raster = helps.PWC(
            WBGT = heat_stress_raster,
            LHR = helps.LHR_Hothaps,
            workload = "high"
        )
        return physical_work_capacity_raster

    def run_annualized_physical_work_capacity(physical_work_capacity_raster):
        import rpy2.robjects.packages as rpackages
        helps = rpackages.importr('HELPS')
        # aggregate physical work capacity to annual values and reformat to a data frame
        annualized_physical_work_capacity_df = helps.MON2ANN(
            input_rack = physical_work_capacity_raster,
            SECTOR = "SUNF_R"
        )
        return annualized_physical_work_capacity_df

    def run_country_physical_work_capacity(annualized_physical_work_capacity_df):
        import rpy2.robjects.packages as rpackages
        import rpy2.robjects as robjects
        helps = rpackages.importr('HELPS')
        country_raster = robjects.r['country_raster']
        # map annual physical work capacity to gridded countries
        country_physical_work_capacity_df = helps.G2R(
            grid_annual_value = annualized_physical_work_capacity_df,
            SECTOR = "SUNF_R",
            rast_boundary = country_raster
        )
        return country_physical_work_capacity_df

    def run_basin_physical_work_capacity(annualized_physical_work_capacity_df):
        import rpy2.robjects.packages as rpackages
        import rpy2.robjects as robjects
        helps = rpackages.importr('HELPS')
        reg_WB_raster = robjects.r['reg_WB_raster']
        # map annual physical work capacity to gridded water basins
        basin_physical_work_capacity_df = helps.G2R(
            grid_annual_value = annualized_physical_work_capacity_df,
            SECTOR = "SUNF_R",
            rast_boundary = reg_WB_raster
        )
        return basin_physical_work_capacity_df

    if __name__ == "__main__":
        hurs_file_name = "path/to/hurs_file"
        tas_file_name = "path/to/tas_file"
        rsds_file_name = "path/to/rsds_file"

        # all the functions need to be run for all the target years.
        # in this case, target years would be 2015 - 2100 in 5 year increments
        target_years = list(range(2015, 2105, 5))
        num_years = len(target_years)

        # Creating a SlurmCluster object with the required parameters
        cluster = SlurmCluster(queue='short', walltime='02:00:00', account='GCIMS', silence_logs=False)

        # Adding the helps container specifications (can be changed as needed)
        cluster.add_container(tag="helps", cpus=4, memory="8G", dirs={"/path1":"/path1", "/path2":"/path2"})

        # Adding workers to the cluster
        cluster.add_workers(n=num_years, tag="helps")

        # Making a client to submit jobs
        sc_client = ScalableClient(cluster)

        # run helps
        # note that the map function is used here because multiple instances of
        # the same function are being run with different inputs. This is
        # essentially parallelization. To make it possible, pass in multiple
        # lists of arguments to the map function. If the target function has 2
        # arguments then 2 lists of the same size should be passed. The size of
        # the lists will be the same as the number of instances of the target
        # function that will be run.
        # n = 1 is used as the value for n because n specifies the number of
        # workers needed for a single instance of the target function.
        heatstress_futures = sc_client.map(run_heatstress, [hurs_file_name]*num_years, [tas_file_name]*num_years,
                                           [rsds_file_name]*num_years, target_years, n=1, tag="helps")
        pwc_futures = sc_client.map(run_physical_work_capacity, heatstress_futures, n=1, tag="helps")
        annualized_pwc_futures = sc_client.map(run_annualized_physical_work_capacity, pwc_futures, n=1, tag="helps")
        country_pwc_futures = sc_client.map(run_country_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps")
        basin_pwc_futures = sc_client.map(run_basin_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps")

        # now the results can be gathered and then printed or written to a file
        heatstress_results = sc_client.gather(heatstress_futures)
        pwc_results = sc_client.gather(pwc_futures)
        annualized_pwc_results = sc_client.gather(annualized_pwc_futures)
        country_pwc_results = sc_client.gather(country_pwc_futures)
        basin_pwc_results = sc_client.gather(basin_pwc_futures)

This code will run the HELPS package on the remote HPC system. This guide has
walked through the process of adding a new container with a new R package and
scaling it using Scalable. For further help, please feel free to reach out or
open an issue on the Scalable GitHub repository.
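
The argument-list convention described in the code comments (one list per
function argument, all of the same length) can be tried out locally before
submitting anything to a cluster. The sketch below uses the standard library's
``ThreadPoolExecutor`` purely as a stand-in for ``ScalableClient``; it only
illustrates how the lists are paired up, and does not model the ``n`` or
``tag`` parameters. The function ``describe_run`` and the file names are
placeholders.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    def describe_run(hurs_file, tas_file, rsds_file, target_year):
        # Stand-in for run_heatstress: report which inputs this call received
        return f"{target_year}: {hurs_file}, {tas_file}, {rsds_file}"


    target_years = list(range(2015, 2105, 5))  # 2015, 2020, ..., 2100
    num_years = len(target_years)

    with ThreadPoolExecutor(max_workers=4) as pool:
        # One list per function argument; call i receives the i-th element
        # of every list, so all lists must have the same length.
        results = list(pool.map(
            describe_run,
            ["hurs.nc"] * num_years,
            ["tas.nc"] * num_years,
            ["rsds.nc"] * num_years,
            target_years,
        ))

    print(num_years)
    print(results[0])

Repeating the constant file names ``num_years`` times is what lets every
instance see the same input files while only the target year varies.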


4 changes: 3 additions & 1 deletion docs/index.rst
@@ -44,11 +44,13 @@ Contents:

cache_hash
container
rpy2

.. toctree::
:caption: Demo
:caption: Demos

demo
helps_demo

.. toctree::
:caption: Common Issues
95 changes: 95 additions & 0 deletions docs/rpy2.rst
@@ -0,0 +1,95 @@
How to convert R code to Python code using rpy2
===============================================

When using Scalable, a model that needs to be scaled may be written in R, or
may only be usable from R. While it is possible to scale R code by writing it
all in an Rscript and calling the script from Python, it is generally
recommended to convert the R code to Python code using rpy2. This allows for
better integration of such code with Python and makes it much easier to modify
inputs, get precise outputs, and debug the code.
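
For contrast, the Rscript route mentioned above might look roughly like the
following sketch. The script name ``run_helps.R`` and both helper functions are
hypothetical; the point is only that every input has to be flattened into a
command-line string and every output has to be recovered from text or files.

.. code-block:: python

    import subprocess


    def build_rscript_command(script_path, *args):
        # Every input must be converted to a command-line string
        return ["Rscript", script_path, *[str(a) for a in args]]


    def run_rscript(script_path, *args):
        # Requires R to be installed. The only outputs available to Python
        # are the text streams and whatever files the script writes.
        completed = subprocess.run(
            build_rscript_command(script_path, *args),
            capture_output=True,
            text=True,
            check=True,
        )
        return completed.stdout


    if __name__ == "__main__":
        # Only builds the command; nothing is executed here
        print(build_rscript_command("run_helps.R", 2024, "month"))

With rpy2, by comparison, arguments and return values stay as objects that can
be passed directly between functions.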

The full documentation for rpy2 can be found on this
`website <https://rpy2.github.io/doc.html>`_. This guide covers only the
methods that matter most when converting R code to Python code, which is all
that is needed in most cases.

Let's look at an example with R code to convert. The
`HELPS <https://github.com/JGCRI/HELPS>`_ package will be used to demonstrate
the conversion. The R code is as follows:


.. code-block:: r

    library(HELPS)
    # test
    esi.mon <- HeatStress(TempRes = "month", SECTOR = "SUNF_R", HS = WBGT_ESI, YEAR_INPUT = 2024,
                          "path/hurs_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015.nc",
                          "path/tas_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc",
                          "path/rsds_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015.nc")
    pwc.mon.hothaps <- PWC(WBGT = esi.mon, LHR = LHR_Hothaps, workload = "high")
    pwc.hothaps.ann <- MON2ANN(input_rack = pwc.mon.hothaps, SECTOR = "SUNF_R")
    ctry_pwc <- G2R(grid_annual_value = pwc.hothaps.ann, SECTOR = "SUNF_R", rast_boundary = country_raster)
    glu_pwc <- G2R(grid_annual_value = pwc.hothaps.ann, SECTOR = "SUNF_R", rast_boundary = reg_WB_raster)

A simple conversion of the above code to Python code using rpy2 is as follows:

.. code-block:: python

    import rpy2.robjects.packages as rpackages
    import rpy2.robjects as robjects

    helps = rpackages.importr('HELPS')
    esi_mon = helps.HeatStress(TempRes = "month", SECTOR = "SUNF_R", HS = helps.WBGT_ESI, YEAR_INPUT = 2024,
                               a="hurs_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc",
                               b="tas_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc",
                               c="rsds_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc")
    pwc_mon_hothaps = helps.PWC(WBGT = esi_mon, LHR = helps.LHR_Hothaps, workload = "high")
    pwc_hothaps_ann = helps.MON2ANN(input_rack = pwc_mon_hothaps, SECTOR = "SUNF_R")
    country_raster = robjects.r['country_raster']
    ctry_pwc = helps.G2R(grid_annual_value = pwc_hothaps_ann, SECTOR = "SUNF_R", rast_boundary = country_raster)
    reg_WB_raster = robjects.r['reg_WB_raster']
    glu_pwc = helps.G2R(grid_annual_value = pwc_hothaps_ann, SECTOR = "SUNF_R", rast_boundary = reg_WB_raster)

A few important things to note about this conversion:

* The HELPS package is imported with ``helps = rpackages.importr('HELPS')``.
  This is similar to importing a package in R using ``library(HELPS)``.
  Unlike in R, the functions in the package are not automatically brought into
  scope. Instead, an object is created whose attributes correspond to the
  functions in that package.

* If the R code uses functions or values that are not defined in any
  particular package, are defined in the global environment, or whose location
  is simply unknown, the ``robjects.r`` object can be used to access them. For
  example, in the above code, the ``country_raster`` and ``reg_WB_raster``
  objects are not defined in the HELPS package, and without any knowledge of
  where they may reside, the ``robjects.r`` object can be used to access them
  by looking up the name used in the R code. Similarly, if ``pi`` is used in
  the R code, it can be accessed in Python using ``pi = robjects.r['pi']``.

* A more subtle point to note is that the ``HeatStress`` function in R takes
  three positional arguments at the end. These are matched to the ``...``
  argument in the R code. However, to avoid confusion, no positional arguments
  should be passed to the rpy2 function in Python. To make sure that the
  extra arguments get matched with ``...`` internally, arbitrary names can
  be used for the arguments. In the above code, such arguments are named
  ``a``, ``b``, and ``c``.
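
The last point can be pictured with a pure-Python analogy. This is not rpy2 or
R code, and R's actual argument matching has more rules than shown here; the
function ``heat_stress`` is a hypothetical stand-in. Keyword arguments that do
not match a named parameter are collected, in the order they were passed, much
like the extra arguments that end up in ``...``:

.. code-block:: python

    def heat_stress(TempRes, SECTOR, **dots):
        # Named parameters are matched by name. Everything else lands in
        # ``dots`` in call order; the keys (a, b, c) are never inspected.
        return list(dots.values())


    files = heat_stress(
        TempRes="month",
        SECTOR="SUNF_R",
        a="hurs.nc",
        b="tas.nc",
        c="rsds.nc",
    )
    print(files)

Because only the order of the extra arguments matters in this analogy, any
names that do not collide with the function's real parameters would do.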

In most cases, these three points are all that needs attention when converting
R code to Python code using rpy2. The rest of the code can be converted
directly by replacing the R syntax with Python syntax. If any issues arise
or any help is needed converting a specific piece of code, please feel free
to open an issue on the Scalable GitHub repository
`here <https://github.com/JGCRI/scalable/issues>`_.
