New guides added for scaling R packages

JGCRI · Jan 22, 2025 · 8bfe870 · 8bfe870
1 parent 52e9fa0
commit 8bfe870
Show file tree

Hide file tree

Showing 3 changed files with 328 additions and 1 deletion.
diff --git a/docs/helps_demo.rst b/docs/helps_demo.rst
@@ -0,0 +1,230 @@
+A Complete Process of Adding & Scaling a New Application
+=========================================================
+
+This is another demo where a new container with a new R package is added to be 
+scaled appropriately using scalable. 
+
+The first step would be to add a new container with the new R package. This 
+should be done by adding a new target to the Dockerfile which is located in the 
+home directory where scalable_bootstrap is launched. In this case, the 
+`HELPS <https://github.com/JGCRI/HELPS>`_ package is added as a target to 
+make a new container. Since the HELPS package is a R package, and rpy2 is the 
+primary method of interacting with R applications when using Scalable, it would 
+be good practice to try adding HELPS through rpy2. This is the new target:
+
+.. code-block:: dockerfile
+
+    FROM ubuntu:22.04 AS build_env
+
+    ENV DEBIAN_FRONTEND=noninteractive
+    RUN apt-get -y update && apt-get install -y \ 
+        wget unzip openjdk-11-jdk-headless build-essential libtbb-dev \
+        libboost-dev libboost-filesystem-dev libboost-system-dev \
+        python3 python3-pip libboost-python-dev libboost-numpy-dev \
+        ssh nano locate curl net-tools netcat-traditional git python3 python3-pip \
+        python3-dev gcc libboost-python1.74 libboost-numpy1.74 openjdk-11-jre-headless libtbb12 rsync
+    RUN apt-get -y update && apt -y upgrade
+    RUN pip3 install --upgrade dask[complete] dask-jobqueue dask_mpi pyyaml joblib versioneer tomli xarray
+    RUN apt-get -y update && apt -y upgrade
+    RUN mkdir -p /usr/lib/R/site-library
+    ENV R_LIBS_USER /usr/lib/R/site-library
+    RUN chmod a+w /usr/lib/R/site-library
+    RUN apt-get install -y --no-install-recommends r-base r-base-dev
+    RUN apt-get install -y --no-install-recommends python3-rpy2
+    RUN python3 -m pip install -U pip
+
+    
+    FROM build_env AS helps
+    RUN apt-get -y update && apt -y upgrade && apt-get -y -f install
+    RUN apt install --fix-missing -y software-properties-common
+    RUN add-apt-repository ppa:ubuntugis/ppa
+    RUN apt-get update
+    ENV R_LIBS_USER /usr/lib/R/site-library
+    RUN chmod a+w /usr/lib/R/site-library
+    RUN apt-get install -y libcurl4-openssl-dev libssl-dev libxml2-dev libudunits2-dev libproj-dev libavfilter-dev \
+        libbz2-dev liblzma-dev libfontconfig1-dev libharfbuzz-dev libfribidi-dev libgeos-dev libgdal-dev
+    RUN git clone https://github.com/JGCRI/HELPS.git /HELPS
+    RUN    echo "import rpy2.robjects.packages as rpackages" >> /install_script.py \
+        && echo "utils = rpackages.importr('utils')" >> /install_script.py \
+        && echo "utils.install_packages('devtools')" >> /install_script.py \
+        && echo "utils.install_packages('arrow')" >> /install_script.py \
+        && echo "utils.install_packages('assertthat')" >> /install_script.py 
+    RUN python3 /install_script.py
+    RUN    echo "import rpy2.robjects.packages as rpackages" >> /install_script.py \
+        && echo "devtools = rpackages.importr('devtools')" >> /install_script.py \
+        && echo "devtools.install_github('JGCRI/HELPS', dependencies=True)" >> /install_script.py 
+    RUN python3 /install_script.py
+    RUN    echo "import rpy2.robjects.packages as rpackages" > /install_script.py \
+        && echo "helps = rpackages.importr('HELPS')" >> /install_script.py 
+    RUN python3 /install_script.py
+    COPY --from=scalable /scalable /scalable
+    RUN pip3 install /scalable/.
+    RUN pip3 install --force-reinstall xarray==2024.5.0
+    RUN pip3 install --force-reinstall numpy==1.26.4    
+
+The build_env target is included in the base Dockerfile which is automatically 
+downloaded. It is recommended to use build_env as the base target for all new 
+containers. The helps target is manually added. Note that rpy2 isn't needed to 
+download or install the helps package, it's done here for the sake of 
+consistency. To learn more about how to convert R code to Python code using 
+rpy2, this guide :doc:`rpy2` can be referred to.
+
+Now, once the the target is added, it can be selected by running the 
+scalable_bootstrap and selecting the new target. It will be built locally and 
+uploaded automatically to the remote system. 
+
+The next step would be to make functions to actually run the helps package. 
+This is relatively simple, just launch python3 in the terminal where the 
+bootstrap lands and run the following code:
+
+.. code-block:: python
+
+    from scalable import *
+
+    def run_heatstress(hurs_file_name, tas_file_name, rsds_file_name, target_year):
+        import rpy2.robjects.packages as rpackages
+
+        helps = rpackages.importr('HELPS')
+
+        # generate a heat stress raster brick for the desired resolution
+        heat_stress_raster = helps.HeatStress(
+            TempRes = "month", 
+            SECTOR = "SUNF_R", 
+            HS = helps.WBGT_ESI, 
+            YEAR_INPUT = target_year, 
+            a=hurs_file_name, 
+            b=tas_file_name, 
+            c=rsds_file_name
+        )
+
+        return heat_stress_raster
+
+
+    def run_physical_work_capacity(heat_stress_raster):
+        import rpy2.robjects.packages as rpackages
+
+        helps = rpackages.importr('HELPS')
+
+        # generate physical work capacity raster brick
+        physical_work_capacity_raster = helps.PWC(
+            WBGT = heat_stress_raster,  
+            LHR = helps.LHR_Hothaps, 
+            workload = "high"
+        )
+
+        return physical_work_capacity_raster
+
+
+    def run_annualized_physical_work_capacity(physical_work_capacity_raster):
+        import rpy2.robjects.packages as rpackages
+
+        helps = rpackages.importr('HELPS')
+
+        # aggregate physical work capacity to annual values and reformat to a data frame
+        annualized_physical_work_capacity_df = helps.MON2ANN(
+            input_rack = physical_work_capacity_raster, 
+            SECTOR = "SUNF_R"
+        )
+
+        return annualized_physical_work_capacity_df
+
+
+    def run_country_physical_work_capacity(annualized_physical_work_capacity_df):
+        import rpy2.robjects.packages as rpackages
+        import rpy2.robjects as robjects
+
+        helps = rpackages.importr('HELPS')
+        country_raster = robjects.r['country_raster']
+
+        # map annual physical work capacity to gridded countries
+        country_physical_work_capacity_df = helps.G2R(
+            grid_annual_value = annualized_physical_work_capacity_df, 
+            SECTOR = "SUNF_R", 
+            rast_boundary = country_raster
+        )
+
+        return country_physical_work_capacity_df
+
+
+    def run_basin_physical_work_capacity(annualized_physical_work_capacity_df):
+        import rpy2.robjects.packages as rpackages
+        import rpy2.robjects as robjects
+
+        helps = rpackages.importr('HELPS')
+        reg_WB_raster = robjects.r['reg_WB_raster']
+        
+        # map annual physical work capacity to gridded water basins
+        basin_physical_work_capacity_df = helps.G2R(
+            grid_annual_value = annualized_physical_work_capacity_df, 
+            SECTOR = "SUNF_R", 
+            rast_boundary = reg_WB_raster
+        )
+
+        return basin_physical_work_capacity_df
+
+
+    if __name__ == "__main__":
+
+        hurs_file_name = "path/to/hurs_file"
+        tas_file_name = "path/to/tas_file"
+        rsds_file_name = "path/to/rsds_file"
+
+        # all the functions need to be ran for all the target years.
+        # in this case, target years would be 2015 - 2100 in 5 year increments
+
+        target_years = list(range(2015, 2105, 5))
+
+        num_years = len(target_years)
+
+        ## Creating a SlurmCluster object with the required parameters
+
+        cluster = SlurmCluster(queue='short', walltime='02:00:00', account='GCIMS', silence_logs=False)
+
+        ## Adding the helps container specifications (can be changed as needed)        
+        cluster.add_container(tag="helps", cpus=4, memory="8G", dirs={"/path1":"/path1", "/path2":"/path2"})
+
+        ## Adding workers to the cluster
+        cluster.add_workers(n=num_years, tag="helps")
+
+        # Making a client to submit jobs
+        sc_client = ScalableClient(cluster)
+
+
+        # run helps
+
+        # note that the map function is used here as multiple instances of the 
+        # same function is being ran with different inputs. This is essentially
+        # parallelization. To make it possible, pass in multiple lists of 
+        # arguments to the map function. If the target function has 2 arguments 
+        # then 2 lists of the same size should be passed. The size of the list 
+        # will be the same as the number of instances of the target function 
+        # that will be ran.
+
+        # n = 1 is used as the value for n because n specifies the number of 
+        # workers needed for a single instance of the target function.
+        heatstress_futures = sc_client.map(run_heatstress, [hurs_file_name]*num_years, [tas_file_name]*num_years, 
+                                           [rsds_file_name]*num_years, target_years, n=1, tag="helps")
+
+        pwc_futures = sc_client.map(run_physical_work_capacity, heatstress_futures, n=1, tag="helps")
+
+        annualized_pwc_futures = sc_client.map(run_annualized_physical_work_capacity, pwc_futures, n=1, tag="helps")
+
+        country_pwc_futures = sc_client.map(run_country_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps")
+
+        basin_pwc_futures = sc_client.map(run_basin_physical_work_capacity, annualized_pwc_futures, n=1, tag="helps")
+
+        # now the results can be gathered and then printed or written to a file
+
+        heatstress_results = sc_client.gather(heatstress_futures)
+        pwc_results = sc_client.gather(pwc_futures)
+        annualized_pwc_results = sc_client.gather(annualized_pwc_futures)
+        country_pwc_results = sc_client.gather(country_pwc_futures)
+        basin_pwc_results = sc_client.gather(basin_pwc_futures)
+
+
+This code will run the HELPS package on the remote HPC system. The entire guide 
+highlights the process of adding a new container with a new R package and 
+scaling it using Scalable. Please feel free to reach out for any more help 
+regarding the same or open an issue on the Scalable GitHub repository. 
+
+
diff --git a/docs/index.rst b/docs/index.rst
@@ -44,11 +44,13 @@ Contents:
 
    cache_hash
    container
+   rpy2
 
 .. toctree::
-   :caption: Demo
+   :caption: Demos
 
    demo
+   helps_demo
 
 .. toctree::
    :caption: Common Issues

diff --git a/docs/rpy2.rst b/docs/rpy2.rst
@@ -0,0 +1,95 @@
+How to convert R code to Python code using rpy2
+===============================================
+
+While using scalable, there may be multiple instances where a model which needs 
+to be scaled is written in R or can only be interacted with in R. While it is 
+possible to scale R code by writing it all in an Rscript and calling the script 
+from Python, it is generally recommended to the R code to Python code using 
+rpy2. This is to allow for better integration of such code with Python and make 
+it much easier to modify inputs, get precise outputs, and debug the code.
+
+The full documentation for rpy2 can be found on this
+`website <https://rpy2.github.io/doc.html>`_. In this guide, only the important 
+methods for converting R code to Python code will be discussed which is all 
+that is needed in most of the cases. 
+
+Let's look at an example with R code to convert. The 
+`HELPS <https://github.com/JGCRI/HELPS>`_ package will be used to demonstrate 
+the conversion. The R code is as follows:
+
+
+.. code-block:: r
+
+    library(HELPS)
+
+    # test
+    esi.mon <- HeatStress(TempRes = "month", SECTOR = "SUNF_R", HS = WBGT_ESI, YEAR_INPUT = 2024,
+                        "path/hurs_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015.nc",
+                        "path/tas_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc",
+                        "path/rsds_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015.nc")
+
+
+    pwc.mon.hothaps <- PWC(WBGT = esi.mon,  LHR = LHR_Hothaps, workload = "high")
+    pwc.hothaps.ann <- MON2ANN(input_rack = pwc.mon.hothaps, SECTOR = "SUNF_R")
+    ctry_pwc <- G2R(grid_annual_value = pwc.hothaps.ann, SECTOR = "SUNF_R", rast_boundary = country_raster)
+    glu_pwc <- G2R(grid_annual_value = pwc.hothaps.ann, SECTOR = "SUNF_R", rast_boundary = reg_WB_raster)
+
+
+A simple conversion of the above code to Python code using rpy2 is as follows:
+
+.. code-block:: python
+
+    import rpy2.robjects.packages as rpackages
+    import rpy2.robjects as robjects
+
+    helps = rpackages.importr('HELPS')
+    
+    esi_mon = helps.HeatStress(TempRes = "month", SECTOR = "SUNF_R", HS = helps.WBGT_ESI, YEAR_INPUT = 2024, 
+                                a="hurs_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc",
+                                b="tas_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc", 
+                                c="rsds_mon_basd_CanESM5_W5E5v2_GCAM_ref_2015-2100.nc")
+
+    pwc_mon_hothaps = helps.PWC(WBGT = esi_mon,  LHR = helps.LHR_Hothaps, workload = "high")
+    pwc_hothaps_ann = helps.MON2ANN(input_rack = pwc_mon_hothaps, SECTOR = "SUNF_R")
+    
+    country_raster = robjects.r['country_raster']
+    ctry_pwc = helps.G2R(grid_annual_value = pwc_hothaps_ann, SECTOR = "SUNF_R", rast_boundary = country_raster)
+    
+    reg_WB_raster = robjects.r['reg_WB_raster']
+    glu_pwc = helps.G2R(grid_annual_value = pwc_hothaps_ann, SECTOR = "SUNF_R", rast_boundary = reg_WB_raster)
+
+
+The above code is a simple conversion of the R code to Python code using rpy2. 
+A couple of important things to note are:
+
+*   The HELPS package is imported at ``helps = rpackages.importr('HELPS')``. 
+    This is similar to importing a package in R using ``library(HELPS)``. 
+    Unlike R, all the functions in a package are not automatically imported. 
+    Instead an object is created whose attributes corresponds to the methods in 
+    that package. 
+
+*   If there are functions or values in the R code which are either not defined 
+    in any particular package, are defined in the global environment, or it's 
+    simply unknown where exactly they may reside then the ``robjects.r`` object
+    can be used to access them. For example, in the above code, the 
+    ``country_raster`` and ``reg_WB_raster`` functions are not defined in the 
+    HELPS package, and without any knowledge where they may reside, the 
+    ``robjects.r`` object can be used to access by just looking up the name of 
+    the function or value in the R code. Another example is if ``pi`` is used 
+    in the R code, then it can be accessed in Python using 
+    ``pi = robjects.r['pi']``.
+
+*   A more subtle point to note is that the ``HeatStress`` function in R takes 
+    three positional arguments at the end. These are matched to the ``...`` 
+    argument in the R code. However, no positional arguments should be passed 
+    to the rpy2 function in python to avoid confusion. To make sure that the 
+    extra arguments get matched with ``...`` internally, arbitrary names can 
+    be used for the arguments. In the above code, such arguments are named 
+    ``a``, ``b``, and ``c``.
+
+These are the only two important things to note when converting R code to 
+Python code using rpy2 in most cases. The rest of the code can be converted 
+directly by just replacing the R syntax with Python syntax. If any issues arise 
+or any help is needed converting a specific piece of code, please feel free 
+to open an issue on the scalable github repo 
+`here <https://github.com/JGCRI/scalable/issues>`_.