Update to GPflow v2 and include safety extensions (#45)
* No special subsequent calls to scipy optimiser

* Added safety specific code in safe_pilco folder

* Added master_alg, experiments and post_process for more systematic evaluation (still under development)

* First 4 experiments run successfully

* Linear Cars done too

* Syncing utils

* Cleaned main repo

* Re-enabled rendering in examples, deleted a few unnecessary things

* Running safe_cars when run directly, deleted bash script

* Updated requirements/minor fixes

* Updated mgpr to 2.0

* Updated smgpr, test_sparse_predictions passes

* Test cascade passes, updated pilco, rewards and controllers

* Fixed test_rewards

* Fixed controllers and test_controllers

* Fixed rbf controller bug. Mountain car and inverted pendulum run successfully with occasional numerical errors

* Got rid of the autoflow wrapper functions, which are no longer needed (see the migration sketch after this list)

* Updated inverted pendulum and pendulum swingup examples. Fixed noise variance for RBF controllers. Modified mgpr priors

* Updated swimmer and double pendulum. Fix in rbf controllers.

* Minor fixes, ready to test all plain pilco envs

* Small updates to improve numerical stability and identify operations that could cause failures

* Updated safe pilco - linear cars seem to work fine now

* Cleaned up unnecessary comments and logging

* Updated requirements

* Adding matplotlib dependency

Apparently required by TensorFlow; see the Travis output on this branch before this commit.

* Remove comments

* Various fixes

* Change coverage settings

* Attempt to clean up swimmer

* Remove safe pilco include

* Remove self.t from LinearController

* Move examples, utils to examples folder.

* Remove utils import

* Updated imports in safe PILCO examples. Renamed safe-pilco-extension to safe_pilco_extension

* README update for the new version

* Update READMEs

* Update README
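
For readers migrating their own scripts, the GPflow v2 idioms that recur throughout the diff below can be collected in one place. The following is a minimal sketch assembled from calls that appear in this commit (the toy random data and dimensions are invented for illustration; it is not an excerpt of any single file):

```python
import numpy as np
from gpflow import set_trainable
from pilco.models import PILCO
from pilco.controllers import RbfController

# Toy stand-in for rollout data: 40 transitions of a 3-dimensional
# state with a 1-dimensional control input.
state_dim, control_dim = 3, 1
X = np.random.rand(40, state_dim + control_dim)  # [state, action] pairs
Y = np.random.rand(40, state_dim)                # state differences

controller = RbfController(state_dim=state_dim, control_dim=control_dim,
                           num_basis_functions=10)

# Training data is now passed as a tuple instead of separate X, Y arguments.
pilco = PILCO((X, Y), controller=controller, horizon=40)

# GPflow v2 parameters are set with .assign() and frozen with set_trainable(),
# replacing direct attribute assignment and `.trainable = False`.
for model in pilco.mgpr.models:
    model.likelihood.variance.assign(0.001)
    set_trainable(model.likelihood.variance, False)

# The dataset is updated with set_data() instead of set_XY().
pilco.mgpr.set_data((X, Y))
```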

Co-authored-by: kyr-pol <[email protected]>
nrontsis and kyr-pol authored Apr 30, 2020
1 parent 7125590 commit 3d96dcd
Showing 29 changed files with 994 additions and 501 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,5 +1,5 @@
[report]
omit = *tests*, *examples*, setup.py
omit = *tests*, *examples*, setup.py, *safe-pilco-extension*
exclude_lines =
pragma: no cover
raise AssertionError
24 changes: 11 additions & 13 deletions README.md
@@ -2,39 +2,37 @@
[![Build Status](https://travis-ci.org/nrontsis/PILCO.svg?branch=master)](https://travis-ci.org/nrontsis/PILCO)
[![codecov](https://codecov.io/gh/nrontsis/PILCO/branch/master/graph/badge.svg)](https://codecov.io/gh/nrontsis/PILCO)

A modern \& clean implementation of the [PILCO](https://ieeexplore.ieee.org/abstract/document/6654139/) Algorithm in `TensorFlow`.
A modern \& clean implementation of the [PILCO](https://ieeexplore.ieee.org/abstract/document/6654139/) Algorithm in `TensorFlow v2`.

Unlike PILCO's [original implementation](http://mlg.eng.cam.ac.uk/pilco/) which was written as a self-contained package of `MATLAB`, this repository aims to provide a clean implementation by heavy use of modern machine learning libraries.

In particular, we use `TensorFlow` to avoid the need for hardcoded gradients and scale to GPU architectures. Moreover, we use [`GPflow`](https://github.com/GPflow/GPflow) for Gaussian Process Regression.
In particular, we use `TensorFlow v2` to avoid the need for hardcoded gradients and scale to GPU architectures. Moreover, we use [`GPflow v2`](https://github.com/GPflow/GPflow) for Gaussian Process Regression.

The core functionality is tested against the original `MATLAB` implementation.

## Example of usage
Before using, or installing, PILCO, you need to have `Tensorflow 1.13.1` installed (either the gpu or the cpu version). It is recommended to install everything in a fresh `conda` environment with `python>=3.7`. Given `Tensorflow`, PILCO can be installed as follows
Before using `PILCO` you have to install it by running:
```
git clone https://github.com/nrontsis/PILCO && cd PILCO
python setup.py develop
```
It is recommended to install everything in a fresh conda environment with `python>=3.7`

The examples included in this repo use [`OpenAI gym 0.15.3`](https://github.com/openai/gym#installation) and [`mujoco-py 2.0.2.7`](https://github.com/openai/mujoco-py#install-mujoco). Once these dependencies are installed, you can run one of the examples as follows
The examples included in this repo use [`OpenAI gym 0.15.3`](https://github.com/openai/gym#installation) and [`mujoco-py 2.0.2.7`](https://github.com/openai/mujoco-py#install-mujoco). These dependencies should be installed manually. Then, you can run one of the examples as follows
```
python examples/inverted_pendulum.py
```
While running an example, `Tensorflow` might print a lot of warnings, [some of which are deprecated](https://github.com/tensorflow/tensorflow/issues/25996). If necessary, you can suppress them by running
```python
tf.logging.set_verbosity(tf.logging.ERROR)
```
right after including `TensorFlow` in Python.

## Example Extension: Safe PILCO
As an example of the extensibility of the framework, we include in the folder `safe_pilco_extension` an extension of the standard PILCO algorithm that takes safety constraints (defined on the environment's state space) into account as in [https://arxiv.org/abs/1712.05556](https://arxiv.org/pdf/1712.05556.pdf). The `safe_swimmer_run.py` and `safe_cars_run.py` in the `examples` folder demonstrate the use of this extension.
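
For example, assuming the dependencies above are installed, the safe multi-lane cars example can be run with:
```
python examples/safe_cars_run.py
```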

## Credits:

The following people have been involved in the development of this package:
* [Nikitas Rontsis](https://github.com/nrontsis)
* [Kyriakos Polymenakos](https://github.com/kyr-pol)
* [Kyriakos Polymenakos](https://github.com/kyr-pol/)

## References

See the following publications for a description of the algorithm: [1](https://ieeexplore.ieee.org/abstract/document/6654139/), [2](http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf),
[3](https://pdfs.semanticscholar.org/c9f2/1b84149991f4d547b3f0f625f710750ad8d9.pdf)

60 changes: 29 additions & 31 deletions examples/inv_double_pendulum.py
@@ -5,8 +5,8 @@
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from tensorflow import logging
from utils import rollout, policy
from gpflow import set_trainable
np.random.seed(0)

# Introduces a simple wrapper for the gym environment
@@ -29,7 +29,7 @@ def state_trans(self, s):

def step(self, action):
ob, r, done, _ = self.env.step(action)
if np.abs(ob[0])> 0.98 or np.abs(ob[-3]) > 0.1 or np.abs(ob[-2]) > 0.1 or np.abs(ob[-1]) > 0.1:
if np.abs(ob[0])> 0.90 or np.abs(ob[-3]) > 0.15 or np.abs(ob[-2]) > 0.15 or np.abs(ob[-1]) > 0.15:
done = True
return self.state_trans(ob), r, done, {}

@@ -41,32 +41,32 @@ def render(self):
self.env.render()


SUBS = 1
bf = 40
maxiter=80
state_dim = 6
control_dim = 1
max_action=1.0 # actions for these environments are discrete
target = np.zeros(state_dim)
weights = 3.0 * np.eye(state_dim)
weights[0,0] = 0.5
weights[3,3] = 0.5
m_init = np.zeros(state_dim)[None, :]
S_init = 0.01 * np.eye(state_dim)
T = 40
J = 1
N = 12
T_sim = 130
restarts=True
lens = []

with tf.Session() as sess:
if __name__=='__main__':
SUBS = 1
bf = 40
maxiter=10
state_dim = 6
control_dim = 1
max_action=1.0 # actions for these environments are discrete
target = np.zeros(state_dim)
weights = 5.0 * np.eye(state_dim)
weights[0,0] = 1.0
weights[3,3] = 1.0
m_init = np.zeros(state_dim)[None, :]
S_init = 0.005 * np.eye(state_dim)
T = 40
J = 5
N = 12
T_sim = 130
restarts=True
lens = []

env = DoublePendWrapper()

# Initial random rollouts to generate a dataset
X,Y = rollout(env, None, timesteps=T, random=True, SUBS=SUBS)
X, Y, _, _ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, render=True)
for i in range(1,J):
X_, Y_ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, verbose=True)
X_, Y_, _, _ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, verbose=True, render=True)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))

@@ -77,21 +77,19 @@ def render(self):

R = ExponentialReward(state_dim=state_dim, t=target, W=weights)

pilco = PILCO(X, Y, controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)
pilco = PILCO((X, Y), controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)

# for numerical stability
for model in pilco.mgpr.models:
# model.kern.lengthscales.prior = gpflow.priors.Gamma(1,10) priors have to be included before
# model.kern.variance.prior = gpflow.priors.Gamma(1.5,2) before the model gets compiled
model.likelihood.variance = 0.001
model.likelihood.variance.trainable = False
model.likelihood.variance.assign(0.001)
set_trainable(model.likelihood.variance, False)

for rollouts in range(N):
print("**** ITERATION no", rollouts, " ****")
pilco.optimize_models(maxiter=maxiter, restarts=2)
pilco.optimize_policy(maxiter=maxiter, restarts=2)

X_new, Y_new = rollout(env, pilco, timesteps=T_sim, verbose=True, SUBS=SUBS)
X_new, Y_new, _, _ = rollout(env, pilco, timesteps=T_sim, verbose=True, SUBS=SUBS, render=True)

# Since we have decided on the various parameters of the reward function
# we might want to verify that it behaves as expected by inspection
@@ -102,7 +100,7 @@

# Update dataset
X = np.vstack((X, X_new[:T, :])); Y = np.vstack((Y, Y_new[:T, :]))
pilco.mgpr.set_XY(X, Y)
pilco.mgpr.set_data((X, Y))

lens.append(len(X_new))
print(len(X_new))
55 changes: 25 additions & 30 deletions examples/inverted_pendulum.py
@@ -4,41 +4,36 @@
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from tensorflow import logging
from gpflow import set_trainable
# from tensorflow import logging
np.random.seed(0)

from utils import rollout, policy

with tf.Session(graph=tf.Graph()) as sess:
env = gym.make('InvertedPendulum-v2')
# Initial random rollouts to generate a dataset
X,Y = rollout(env=env, pilco=None, random=True, timesteps=40)
for i in range(1,3):
X_, Y_ = rollout(env=env, pilco=None, random=True, timesteps=40)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))
env = gym.make('InvertedPendulum-v2')
# Initial random rollouts to generate a dataset
X,Y, _, _ = rollout(env=env, pilco=None, random=True, timesteps=40, render=True)
for i in range(1,5):
X_, Y_, _, _ = rollout(env=env, pilco=None, random=True, timesteps=40, render=True)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))


state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=5)
#controller = LinearController(state_dim=state_dim, control_dim=control_dim)
state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=10)
# controller = LinearController(state_dim=state_dim, control_dim=control_dim)

pilco = PILCO(X, Y, controller=controller, horizon=40)
# Example of user provided reward function, setting a custom target state
# R = ExponentialReward(state_dim=state_dim, t=np.array([0.1,0,0,0]))
# pilco = PILCO(X, Y, controller=controller, horizon=40, reward=R)
pilco = PILCO((X, Y), controller=controller, horizon=40)
# Example of user provided reward function, setting a custom target state
# R = ExponentialReward(state_dim=state_dim, t=np.array([0.1,0,0,0]))
# pilco = PILCO(X, Y, controller=controller, horizon=40, reward=R)

# Example of fixing a parameter, optional, for a linear controller only
#pilco.controller.b = np.array([[0.0]])
#pilco.controller.b.trainable = False

for rollouts in range(3):
pilco.optimize_models()
pilco.optimize_policy()
import pdb; pdb.set_trace()
X_new, Y_new = rollout(env=env, pilco=pilco, timesteps=100)
print("No of ops:", len(tf.get_default_graph().get_operations()))
# Update dataset
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new))
pilco.mgpr.set_XY(X, Y)
for rollouts in range(3):
pilco.optimize_models()
pilco.optimize_policy()
import pdb; pdb.set_trace()
X_new, Y_new, _, _ = rollout(env=env, pilco=pilco, timesteps=100, render=True)
# Update dataset
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new))
pilco.mgpr.set_data((X, Y))
34 changes: 34 additions & 0 deletions examples/linear_cars_env.py
@@ -0,0 +1,34 @@
import numpy as np
from gym import spaces
from gym.core import Env

class LinearCars(Env):
def __init__(self):
self.action_space = spaces.Box(low=-0.4, high=0.4, shape=(1,))
self.observation_space = spaces.Box(low=-100, high=100, shape=(4,))
self.M = 1 # car mass [kg]
self.b = 0.001 # friction coef [N/m/s]
self.Dt = 0.50 # timestep [s]

self.A = np.array([[0, self.Dt, 0, 0],
[0, -self.b*self.Dt/self.M, 0, 0],
[0, 0, 0, self.Dt],
[0, 0, 0, 0]])

self.B = np.array([0,self.Dt/self.M, 0, 0]).reshape((4,1))

self.initial_state = np.array([-6.0, 1.0, -5.0, 1.0]).reshape((4,1))

def step(self, action):
self.state += self.A @ self.state + self.B * action
#0.1 * np.random.normal(scale=[[1e-3], [1e-3], [1e-3], [0.001]], size=(4,1))

if self.state[0] < 0:
reward = -1
else:
reward = 1
return np.reshape(self.state[:], (4,)), reward, False, None

def reset(self):
self.state = self.initial_state + 0.03 * np.random.normal(size=(4,1))
return np.reshape(self.state[:], (4,))
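
A quick, hypothetical sanity check (not part of this commit) that steps the new LinearCars environment with random actions sampled from its action space:

```python
import numpy as np
from linear_cars_env import LinearCars  # assumes examples/ is on the path

env = LinearCars()
state = env.reset()
total_reward = 0.0
for _ in range(25):
    action = env.action_space.sample()        # random force in [-0.4, 0.4]
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("final state:", state, "total reward:", total_reward)
```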
69 changes: 69 additions & 0 deletions examples/mountain_car.py
@@ -0,0 +1,69 @@
import numpy as np
import gym
from pilco.models import PILCO
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from gpflow import set_trainable
np.random.seed(0)
from utils import policy, rollout, Normalised_Env


SUBS = 5
T = 25
env = gym.make('MountainCarContinuous-v0')
# Initial random rollouts to generate a dataset
X1,Y1, _, _ = rollout(env=env, pilco=None, random=True, timesteps=T, SUBS=SUBS, render=True)
for i in range(1,5):
X1_, Y1_,_,_ = rollout(env=env, pilco=None, random=True, timesteps=T, SUBS=SUBS, render=True)
X1 = np.vstack((X1, X1_))
Y1 = np.vstack((Y1, Y1_))
env.close()

env = Normalised_Env('MountainCarContinuous-v0', np.mean(X1[:,:2],0), np.std(X1[:,:2], 0))
X = np.zeros(X1.shape)
X[:, :2] = np.divide(X1[:, :2] - np.mean(X1[:,:2],0), np.std(X1[:,:2], 0))
X[:, 2] = X1[:,-1] # control inputs are not normalised
Y = np.divide(Y1 , np.std(X1[:,:2], 0))

state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
m_init = np.transpose(X[0,:-1,None])
S_init = 0.5 * np.eye(state_dim)
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=25)

R = ExponentialReward(state_dim=state_dim,
t=np.divide([0.5,0.0] - env.m, env.std),
W=np.diag([0.5,0.1])
)
pilco = PILCO((X, Y), controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)

best_r = 0
all_Rs = np.zeros((X.shape[0], 1))
for i in range(len(all_Rs)):
all_Rs[i,0] = R.compute_reward(X[i,None,:-1], 0.001 * np.eye(state_dim))[0]

ep_rewards = np.zeros((len(X)//T,1))

for i in range(len(ep_rewards)):
ep_rewards[i] = sum(all_Rs[i * T: i*T + T])

for model in pilco.mgpr.models:
model.likelihood.variance.assign(0.05)
set_trainable(model.likelihood.variance, False)

r_new = np.zeros((T, 1))
for rollouts in range(5):
pilco.optimize_models()
pilco.optimize_policy(maxiter=100, restarts=3)
import pdb; pdb.set_trace()
X_new, Y_new,_,_ = rollout(env=env, pilco=pilco, timesteps=T, SUBS=SUBS, render=True)

for i in range(len(X_new)):
r_new[i, 0] = R.compute_reward(X_new[i,None,:-1], 0.001 * np.eye(state_dim))[0]
total_r = sum(r_new)
_, _, r = pilco.predict(m_init, S_init, T)

print("Total ", total_r, " Predicted: ", r)
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new));
all_Rs = np.vstack((all_Rs, r_new)); ep_rewards = np.vstack((ep_rewards, np.reshape(total_r,(1,1))))
pilco.mgpr.set_data((X, Y))