Merge pull request #12 from JGCRI/develop
Releasing v0.6.2
sash19 authored Dec 24, 2024
2 parents 70f27d9 + c9c2012 commit e0aa6a6
Showing 5 changed files with 73 additions and 5 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -1,8 +1,12 @@
# Scalable
[v0.6.1](https://github.com/JGCRI/scalable/tree/0.6.1)
[v0.6.2](https://github.com/JGCRI/scalable/tree/0.6.2)

Scalable is a Python library which aids in running complex workflows on HPCs by orchestrating multiple containers, submitting appropriate HPC job requests to the scheduler, and providing a Python environment for distributed computing. It is designed primarily for use with JGCRI Climate Models but can be easily adapted for arbitrary use cases.

## Documentation

The documentation for Scalable is hosted on [readthedocs](https://scalable.readthedocs.io).

## Installation

Use the package manager [pip](https://pip.pypa.io/en/stable/) to install scalable.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -15,7 +15,7 @@
project = 'Scalable'
copyright = '2024, Joint Global Change Research Institute'
author = 'Shashank Lamba, Pralit Patel'
release = '0.6.0'
release = '0.6.2'

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
Binary file added docs/images/error1.png
8 changes: 5 additions & 3 deletions docs/index.rst
Expand Up @@ -33,22 +33,24 @@ Contents:
---------

.. toctree::
:maxdepth: 1
:caption: API

workers
caching
functions

.. toctree::
:maxdepth: 1
:caption: How-tos

cache_hash
container

.. toctree::
:maxdepth: 1
:caption: Demo

demo

.. toctree::
:caption: Common Issues

issues
62 changes: 62 additions & 0 deletions docs/issues.rst
@@ -0,0 +1,62 @@
Common Issues
=============

This page outlines some of the common problems and caveats still present in
the current version of scalable. While some of them are being worked on, others
may be inherent to dask.

Pickling Error
--------------

To start with, let's look at a function which is already used in the
:doc:`demo`:

.. code-block:: python

    @cacheable
    def run_stitches(recipe, output_path):
        import stitches
        import dask
        ## The dask config is set to synchronous to avoid any issues.
        with dask.config.set(scheduler="synchronous"):
            outputs = stitches.gridded_stitching(output_path, recipe)
        return outputs

The above code is a simple function that runs
`stitches <https://github.com/JGCRI/stitches>`_. The line which actually runs
stitches is executed under the ``dask.config.set`` context manager, with the
scheduler set to ``"synchronous"``. The alternative would have been to write
this function as:

.. code-block:: python

    @cacheable
    def run_stitches(recipe, output_path):
        import stitches
        outputs = stitches.gridded_stitching(output_path, recipe)
        return outputs

The above code looks like it should work just as well. However, the following
error is thrown when the function is called through a dask client (scalable):

.. image:: images/error1.png
:align: center

The error shown above is a pickling error. It happens because dask tries to
use several different workers to build the dask task graph. Since our workers
have different environments, however, the ``run_stitches`` task cannot be
pickled by the other workers. Whenever this issue is encountered, it is
therefore recommended to set the scheduler to ``"synchronous"``, which makes
dask pickle the task and run it on the single specified worker.
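
As a rough sketch, a call to the function above might look like the snippet
below, which submits ``run_stitches`` through a plain dask ``Client``. The
client setup and the worker name ``"stitches_worker"`` are illustrative
assumptions here, not part of the scalable API:

.. code-block:: python

    # A minimal sketch, assuming an already-running cluster object and a
    # hypothetical worker name; adjust both to your own setup.
    from dask.distributed import Client

    client = Client(cluster)  # cluster is assumed to exist already
    future = client.submit(run_stitches, recipe, output_path,
                           workers=["stitches_worker"])
    outputs = future.result()

Because ``run_stitches`` sets ``scheduler="synchronous"`` internally, the
nested stitches work then runs entirely on that one worker instead of being
re-pickled across workers.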

General Errors
--------------

There can also be general errors which are either thrown by dask directly or
manifest as workers which never connected, or as Slurm errors. There are
mechanisms within Scalable which should warn about any workers which couldn't
connect for whatever reason. However, as a rule of thumb, restarting the
cluster and the workflow is the best way to resolve any one-time errors. HPC
systems can be unreliable and sometimes throw unknown errors. As always,
please feel free to open an issue
`here <https://github.com/JGCRI/scalable/issues>`_ for any persistent issues.
