We use reinforcement learning techniques to optimize continuous controls of chemical systems modelled as ODE systems. The controls are free parameters of the ODE system, and the reward is the value of one of the system's variables after some time of evolution, representing a by-product of the chemical reaction.
ODE systems are approximations of real-world chemical systems. Here we study an approach that optimizes the controls of a simple ODE model first, to later ease the search for optimal controls of a more complex yet closely related ODE model.
The objective is to find the controls that maximize the reward at the final time. The initial state is fixed, and each model (the simple one and the complex one) comes with its own fixed set of parameters.
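As a minimal sketch of this setup, an episode amounts to integrating the ODE under the chosen controls and reading off the by-product variable at the final time. The right-hand side below (`reaction_rhs`) and the initial state are generic stand-ins, not the actual model equations:

```python
import numpy as np
from scipy.integrate import solve_ivp

def reaction_rhs(t, x, u):
    """Placeholder two-variable reaction model; x[1] plays the by-product role."""
    return [-(u + 0.5 * u**2) * x[0], u * x[0]]

def episode_reward(controls, x0=(1.0, 0.0), t_final=1.0):
    """Apply piecewise-constant controls over equal fractions of the horizon."""
    x, n = np.asarray(x0, dtype=float), len(controls)
    for i, u in enumerate(controls):
        t_span = (i * t_final / n, (i + 1) * t_final / n)
        x = solve_ivp(reaction_rhs, t_span, x, args=(u,)).y[:, -1]
    return x[1]  # reward: by-product value at the final time
```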
We use the classic REINFORCE policy gradient algorithm with a baseline, as well as the more recent Proximal Policy Optimization (PPO) variation, to optimize the controls via stochastic sampling of trajectories. We sample the continuous controls from a Beta distribution to enforce the interval constraints on the controls.
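A minimal sketch of this affine Beta sampling, assuming PyTorch and taking the two Beta concentration parameters directly as the policy outputs (the mean and variance returned by the policy map onto these parameters):

```python
import math
import torch
from torch.distributions import Beta

U_LOW, U_HIGH = 0.0, 5.0  # acceptable control interval

def sample_control(alpha, beta):
    """Sample a control in [U_LOW, U_HIGH] from an affine Beta distribution."""
    dist = Beta(alpha, beta)          # Beta support is (0, 1)
    z = dist.rsample()                # reparameterized sample in (0, 1)
    u = U_LOW + (U_HIGH - U_LOW) * z  # affine map onto the control interval
    # change-of-variables correction for the affine transformation
    log_prob = dist.log_prob(z) - math.log(U_HIGH - U_LOW)
    return u, log_prob
```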
- The evolution of the model between subsequent states corresponds to integrating the ODE system over a fixed fraction of the time horizon.
- Each state is composed of the current ODE variables.
- Policies take states and return the mean and variance of a probability distribution over the possible continuous actions available at each time.
- An affine Beta distribution is used over the acceptable interval $U \in [0,5]$.
- Relatively shallow Neural Networks serve as policies.
- Simple Neural Networks include in the state the time left until $t=1$.
- Recurrent Neural Networks take previous hidden states instead.
- Several sampled episodes are used to estimate the policy gradient.
- REINFORCE algorithm with the mean reward as baseline.
- PPO with the deviation from the mean reward as advantage function (see the sketch after this list).
- The starting policy is pretrained so that its mean follows a random Chebyshev polynomial, with fixed variance.
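As a rough illustration of the two estimators above (a sketch, not the repo's exact implementation), with the batch mean reward serving as baseline and advantage respectively:

```python
import torch

def reinforce_loss(log_probs, rewards):
    """REINFORCE with a mean-reward baseline.

    log_probs: (batch, steps) log-probabilities of the sampled controls.
    rewards:   (batch,) terminal reward of each sampled episode.
    """
    advantage = rewards - rewards.mean()              # baseline = mean reward
    return -(log_probs.sum(dim=1) * advantage).mean()

def ppo_loss(new_log_probs, old_log_probs, rewards, clip=0.2):
    """PPO clipped surrogate, advantage = deviation from the mean reward."""
    advantage = (rewards - rewards.mean()).unsqueeze(1)
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip)
    return -torch.minimum(ratio * advantage, clipped * advantage).mean()
```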
A simple transfer-learning technique is explored: the inner weights of the policy learned on the simpler model are reused to reduce the training needed for the related but more complex model.
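A minimal sketch of that reuse in PyTorch, assuming the policy is an `nn.Sequential` whose last child is the output layer:

```python
import torch
import torch.nn as nn

def prepare_for_transfer(policy: nn.Sequential):
    """Freeze all pretrained weights, then unfreeze only the last layer."""
    for param in policy.parameters():
        param.requires_grad = False           # keep inner weights fixed
    for param in list(policy.children())[-1].parameters():
        param.requires_grad = True            # retrain the output layer only
    return [p for p in policy.parameters() if p.requires_grad]

# usage: optimizer = torch.optim.Adam(prepare_for_transfer(policy), lr=1e-3)
```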
A conda environment file is provided for easy reproduction of results:

```bash
conda env create -f environment.yml
conda activate batch-reactor
```
Run `python src/main.py` to execute the whole logic (use `--help` for a list of available parameters):
- Pretrain the policy to yield predefined function forms that satisfy the constraints (a sketch of the Chebyshev targets follows this list).
- Train the policy for a large number of iterations on the simpler model.
- Freeze the inner weights of the policy and retrain the last layers on the complex model with fewer iterations.
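For the pretraining step, a sketch of how a random Chebyshev target for the policy mean could be generated (degree and scaling are illustrative assumptions):

```python
import numpy as np
from numpy.polynomial import chebyshev

def random_chebyshev_target(n_points=50, degree=4, lo=0.0, hi=5.0):
    """Random Chebyshev polynomial rescaled into the control interval [lo, hi]."""
    coeffs = np.random.uniform(-1.0, 1.0, size=degree + 1)
    t = np.linspace(-1.0, 1.0, n_points)       # natural Chebyshev domain
    y = chebyshev.chebval(t, coeffs)
    y = (y - y.min()) / (y.max() - y.min())    # normalize to [0, 1]
    return lo + (hi - lo) * y                  # target means within [lo, hi]
```

The policy mean is then regressed onto such a target while the variance is held fixed.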
Relevant data and plots are stored in `results/_execution_datetime_/...`. The hyperparameters of the run are stored in a YAML config file inside.
Main parameters (more are available in the command-line interface of `main.py`) are:
- `method`: `'ppo'` or `'reinforce'`
- `episode-batch`: number of sampled episodes used to estimate the loss function
- `chained-steps`: gradient descent steps taken after episode sampling (usually 1 for REINFORCE and ~3 for PPO)
- `iterations`: repetitions of the sampling-and-optimization step
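For instance, a hypothetical invocation combining these parameters (flag spellings may differ; check `--help`):

```bash
python src/main.py --method ppo --episode-batch 64 --chained-steps 3 --iterations 500
```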
Kudos: gifs made with the blazing-fast gifski and transformed via ImageOptim API!