Update to GPflow v2 and include safety extensions (#45)
* No special subsequent calls to scipy optimiser

* Added safety specific code in safe_pilco folder

* Added master_alg, experiments and post_process for more systematic evaluation (still under development)

* First 4 experiments run successfully

* Linear Cars done too

* Syncing utils

* Cleaned main repo

* Re-enabled rendering in examples, deleted a few unnecessary things

* Running safe_cars when run directly, deleted bash script

* Updated requirements/minor fixes

* Updated mgpr to 2.0

* Updated smgpr, test_sparse_predictions passes

* Test cascade passes, updated pilco, rewards and controllers

* Fixed test_rewards

* Fixed controllers and test_controllers

* Fixed rbf controller bug. Mountain car and inverted pendulum run successfully with occasional numerical errors

* Got rid of the autoflow wrapper functions, which are no longer needed (see the migration sketch after this list)

* Updated inverted pendulum and pendulum swingup examples. Fixed noise variance for RBF controllers. Modified mgpr priors

* Updated swimmer and double pendulum. Fix in rbf controllers.

* Minor fixes, ready to test all plain pilco envs

* Small updates to improve numerical stability and identify operations that could cause failures

* Updated safe pilco - linear cars seem to work fine now

* Cleaned up unnecessary comments and logging

* Updated requirements

* Adding matplotlib dependency

Apparently required by TensorFlow; see the Travis output on this branch before this commit.

* Remove comments

* Various fixes

* Change coverage settings

* Attempt to clean up swimmer

* Remove safe pilco include

* Remove self.t from LinearController

* Move examples, utils to examples folder.

* Remove utils import

* Updated imports in safe PILCO examples. Renamed safe-pilco-extension to safe_pilco_extension

* README update for the new version

* Update READMEs

* Update README
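
For readers migrating their own scripts, the GPflow v2 idioms that recur throughout the diff below can be collected in one place. The following is a minimal sketch assembled from calls that appear in this commit (the toy random data and dimensions are invented for illustration; it is not an excerpt of any single file):

```python
import numpy as np
from gpflow import set_trainable
from pilco.models import PILCO
from pilco.controllers import RbfController

# Toy stand-in for rollout data: 40 transitions of a 3-dimensional
# state with a 1-dimensional control input.
state_dim, control_dim = 3, 1
X = np.random.rand(40, state_dim + control_dim)  # [state, action] pairs
Y = np.random.rand(40, state_dim)                # state differences

controller = RbfController(state_dim=state_dim, control_dim=control_dim,
                           num_basis_functions=10)

# Training data is now passed as a tuple instead of separate X, Y arguments.
pilco = PILCO((X, Y), controller=controller, horizon=40)

# GPflow v2 parameters are set with .assign() and frozen with set_trainable(),
# replacing direct attribute assignment and `.trainable = False`.
for model in pilco.mgpr.models:
    model.likelihood.variance.assign(0.001)
    set_trainable(model.likelihood.variance, False)

# The dataset is updated with set_data() instead of set_XY().
pilco.mgpr.set_data((X, Y))
```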

Co-authored-by: kyr-pol <[email protected]>
nrontsis and kyr-pol authored Apr 30, 2020
1 parent 7125590 commit 3d96dcd
Showing 29 changed files with 994 additions and 501 deletions.
2 changes: 1 addition & 1 deletion .coveragerc
@@ -1,5 +1,5 @@
[report]
omit = *tests*, *examples*, setup.py
omit = *tests*, *examples*, setup.py, *safe-pilco-extension*
exclude_lines =
pragma: no cover
raise AssertionError
24 changes: 11 additions & 13 deletions README.md
@@ -2,39 +2,37 @@
[![Build Status](https://travis-ci.org/nrontsis/PILCO.svg?branch=master)](https://travis-ci.org/nrontsis/PILCO)
[![codecov](https://codecov.io/gh/nrontsis/PILCO/branch/master/graph/badge.svg)](https://codecov.io/gh/nrontsis/PILCO)

A modern \& clean implementation of the [PILCO](https://ieeexplore.ieee.org/abstract/document/6654139/) Algorithm in `TensorFlow`.
A modern \& clean implementation of the [PILCO](https://ieeexplore.ieee.org/abstract/document/6654139/) Algorithm in `TensorFlow v2`.

Unlike PILCO's [original implementation](http://mlg.eng.cam.ac.uk/pilco/) which was written as a self-contained package of `MATLAB`, this repository aims to provide a clean implementation by heavy use of modern machine learning libraries.

In particular, we use `TensorFlow` to avoid the need for hardcoded gradients and scale to GPU architectures. Moreover, we use [`GPflow`](https://github.com/GPflow/GPflow) for Gaussian Process Regression.
In particular, we use `TensorFlow v2` to avoid the need for hardcoded gradients and scale to GPU architectures. Moreover, we use [`GPflow v2`](https://github.com/GPflow/GPflow) for Gaussian Process Regression.

The core functionality is tested against the original `MATLAB` implementation.

## Example of usage
Before using, or installing, PILCO, you need to have `Tensorflow 1.13.1` installed (either the gpu or the cpu version). It is recommended to install everything in a fresh `conda` environment with `python>=3.7`. Given `Tensorflow`, PILCO can be installed as follows
Before using `PILCO` you have to install it by running:
```
git clone https://github.com/nrontsis/PILCO && cd PILCO
python setup.py develop
```
It is recommended to install everything in a fresh conda environment with `python>=3.7`

The examples included in this repo use [`OpenAI gym 0.15.3`](https://github.com/openai/gym#installation) and [`mujoco-py 2.0.2.7`](https://github.com/openai/mujoco-py#install-mujoco). Once these dependencies are installed, you can run one of the examples as follows
The examples included in this repo use [`OpenAI gym 0.15.3`](https://github.com/openai/gym#installation) and [`mujoco-py 2.0.2.7`](https://github.com/openai/mujoco-py#install-mujoco). These dependencies should be installed manually. Then, you can run one of the examples as follows
```
python examples/inverted_pendulum.py
```
While running an example, `Tensorflow` might print a lot of warnings, [some of which are deprecated](https://github.com/tensorflow/tensorflow/issues/25996). If necessary, you can suppress them by running
```python
tf.logging.set_verbosity(tf.logging.ERROR)
```
right after including `TensorFlow` in Python.

## Example Extension: Safe PILCO
As an example of the extensibility of the framework, we include in the folder `safe_pilco_extension` an extension of the standard PILCO algorithm that takes safety constraints (defined on the environment's state space) into account as in [https://arxiv.org/abs/1712.05556](https://arxiv.org/pdf/1712.05556.pdf). The `safe_swimmer_run.py` and `safe_cars_run.py` in the `examples` folder demonstrate the use of this extension.
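
For example, assuming the dependencies above are installed, the safe multi-lane cars example can be run with:
```
python examples/safe_cars_run.py
```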

## Credits:

The following people have been involved in the development of this package:
* [Nikitas Rontsis](https://github.com/nrontsis)
* [Kyriakos Polymenakos](https://github.com/kyr-pol)
* [Kyriakos Polymenakos](https://github.com/kyr-pol/)

## References

See the following publications for a description of the algorithm: [1](https://ieeexplore.ieee.org/abstract/document/6654139/), [2](http://mlg.eng.cam.ac.uk/pub/pdf/DeiRas11.pdf),
[3](https://pdfs.semanticscholar.org/c9f2/1b84149991f4d547b3f0f625f710750ad8d9.pdf)

60 changes: 29 additions & 31 deletions examples/inv_double_pendulum.py
@@ -5,8 +5,8 @@
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from tensorflow import logging
from utils import rollout, policy
from gpflow import set_trainable
np.random.seed(0)

# Introduces a simple wrapper for the gym environment
@@ -29,7 +29,7 @@ def state_trans(self, s):

def step(self, action):
ob, r, done, _ = self.env.step(action)
if np.abs(ob[0])> 0.98 or np.abs(ob[-3]) > 0.1 or np.abs(ob[-2]) > 0.1 or np.abs(ob[-1]) > 0.1:
if np.abs(ob[0])> 0.90 or np.abs(ob[-3]) > 0.15 or np.abs(ob[-2]) > 0.15 or np.abs(ob[-1]) > 0.15:
done = True
return self.state_trans(ob), r, done, {}

@@ -41,32 +41,32 @@ def render(self):
self.env.render()


SUBS = 1
bf = 40
maxiter=80
state_dim = 6
control_dim = 1
max_action=1.0 # actions for these environments are discrete
target = np.zeros(state_dim)
weights = 3.0 * np.eye(state_dim)
weights[0,0] = 0.5
weights[3,3] = 0.5
m_init = np.zeros(state_dim)[None, :]
S_init = 0.01 * np.eye(state_dim)
T = 40
J = 1
N = 12
T_sim = 130
restarts=True
lens = []

with tf.Session() as sess:
if __name__=='__main__':
SUBS = 1
bf = 40
maxiter=10
state_dim = 6
control_dim = 1
max_action=1.0 # actions for these environments are discrete
target = np.zeros(state_dim)
weights = 5.0 * np.eye(state_dim)
weights[0,0] = 1.0
weights[3,3] = 1.0
m_init = np.zeros(state_dim)[None, :]
S_init = 0.005 * np.eye(state_dim)
T = 40
J = 5
N = 12
T_sim = 130
restarts=True
lens = []

env = DoublePendWrapper()

# Initial random rollouts to generate a dataset
X,Y = rollout(env, None, timesteps=T, random=True, SUBS=SUBS)
X, Y, _, _ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, render=True)
for i in range(1,J):
X_, Y_ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, verbose=True)
X_, Y_, _, _ = rollout(env, None, timesteps=T, random=True, SUBS=SUBS, verbose=True, render=True)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))

@@ -77,21 +77,19 @@ def render(self):

R = ExponentialReward(state_dim=state_dim, t=target, W=weights)

pilco = PILCO(X, Y, controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)
pilco = PILCO((X, Y), controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)

# for numerical stability
for model in pilco.mgpr.models:
# model.kern.lengthscales.prior = gpflow.priors.Gamma(1,10) priors have to be included before
# model.kern.variance.prior = gpflow.priors.Gamma(1.5,2) before the model gets compiled
model.likelihood.variance = 0.001
model.likelihood.variance.trainable = False
model.likelihood.variance.assign(0.001)
set_trainable(model.likelihood.variance, False)

for rollouts in range(N):
print("**** ITERATION no", rollouts, " ****")
pilco.optimize_models(maxiter=maxiter, restarts=2)
pilco.optimize_policy(maxiter=maxiter, restarts=2)

X_new, Y_new = rollout(env, pilco, timesteps=T_sim, verbose=True, SUBS=SUBS)
X_new, Y_new, _, _ = rollout(env, pilco, timesteps=T_sim, verbose=True, SUBS=SUBS, render=True)

# Since we have decided on the various parameters of the reward function
# we might want to verify that it behaves as expected by inspection
@@ -102,7 +100,7 @@

# Update dataset
X = np.vstack((X, X_new[:T, :])); Y = np.vstack((Y, Y_new[:T, :]))
pilco.mgpr.set_XY(X, Y)
pilco.mgpr.set_data((X, Y))

lens.append(len(X_new))
print(len(X_new))
55 changes: 25 additions & 30 deletions examples/inverted_pendulum.py
@@ -4,41 +4,36 @@
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from tensorflow import logging
from gpflow import set_trainable
# from tensorflow import logging
np.random.seed(0)

from utils import rollout, policy

with tf.Session(graph=tf.Graph()) as sess:
env = gym.make('InvertedPendulum-v2')
# Initial random rollouts to generate a dataset
X,Y = rollout(env=env, pilco=None, random=True, timesteps=40)
for i in range(1,3):
X_, Y_ = rollout(env=env, pilco=None, random=True, timesteps=40)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))
env = gym.make('InvertedPendulum-v2')
# Initial random rollouts to generate a dataset
X,Y, _, _ = rollout(env=env, pilco=None, random=True, timesteps=40, render=True)
for i in range(1,5):
X_, Y_, _, _ = rollout(env=env, pilco=None, random=True, timesteps=40, render=True)
X = np.vstack((X, X_))
Y = np.vstack((Y, Y_))


state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=5)
#controller = LinearController(state_dim=state_dim, control_dim=control_dim)
state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=10)
# controller = LinearController(state_dim=state_dim, control_dim=control_dim)

pilco = PILCO(X, Y, controller=controller, horizon=40)
# Example of user provided reward function, setting a custom target state
# R = ExponentialReward(state_dim=state_dim, t=np.array([0.1,0,0,0]))
# pilco = PILCO(X, Y, controller=controller, horizon=40, reward=R)
pilco = PILCO((X, Y), controller=controller, horizon=40)
# Example of user provided reward function, setting a custom target state
# R = ExponentialReward(state_dim=state_dim, t=np.array([0.1,0,0,0]))
# pilco = PILCO(X, Y, controller=controller, horizon=40, reward=R)

# Example of fixing a parameter, optional, for a linear controller only
#pilco.controller.b = np.array([[0.0]])
#pilco.controller.b.trainable = False

for rollouts in range(3):
pilco.optimize_models()
pilco.optimize_policy()
import pdb; pdb.set_trace()
X_new, Y_new = rollout(env=env, pilco=pilco, timesteps=100)
print("No of ops:", len(tf.get_default_graph().get_operations()))
# Update dataset
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new))
pilco.mgpr.set_XY(X, Y)
for rollouts in range(3):
pilco.optimize_models()
pilco.optimize_policy()
import pdb; pdb.set_trace()
X_new, Y_new, _, _ = rollout(env=env, pilco=pilco, timesteps=100, render=True)
# Update dataset
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new))
pilco.mgpr.set_data((X, Y))
34 changes: 34 additions & 0 deletions examples/linear_cars_env.py
@@ -0,0 +1,34 @@
import numpy as np
from gym import spaces
from gym.core import Env

class LinearCars(Env):
def __init__(self):
self.action_space = spaces.Box(low=-0.4, high=0.4, shape=(1,))
self.observation_space = spaces.Box(low=-100, high=100, shape=(4,))
self.M = 1 # car mass [kg]
self.b = 0.001 # friction coef [N/m/s]
self.Dt = 0.50 # timestep [s]

self.A = np.array([[0, self.Dt, 0, 0],
[0, -self.b*self.Dt/self.M, 0, 0],
[0, 0, 0, self.Dt],
[0, 0, 0, 0]])

self.B = np.array([0,self.Dt/self.M, 0, 0]).reshape((4,1))

self.initial_state = np.array([-6.0, 1.0, -5.0, 1.0]).reshape((4,1))

def step(self, action):
self.state += self.A @ self.state + self.B * action
#0.1 * np.random.normal(scale=[[1e-3], [1e-3], [1e-3], [0.001]], size=(4,1))

if self.state[0] < 0:
reward = -1
else:
reward = 1
return np.reshape(self.state[:], (4,)), reward, False, None

def reset(self):
self.state = self.initial_state + 0.03 * np.random.normal(size=(4,1))
return np.reshape(self.state[:], (4,))
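
A quick, hypothetical sanity check (not part of this commit) that steps the new LinearCars environment with random actions sampled from its action space:

```python
import numpy as np
from linear_cars_env import LinearCars  # assumes examples/ is on the path

env = LinearCars()
state = env.reset()
total_reward = 0.0
for _ in range(25):
    action = env.action_space.sample()        # random force in [-0.4, 0.4]
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("final state:", state, "total reward:", total_reward)
```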
69 changes: 69 additions & 0 deletions examples/mountain_car.py
@@ -0,0 +1,69 @@
import numpy as np
import gym
from pilco.models import PILCO
from pilco.controllers import RbfController, LinearController
from pilco.rewards import ExponentialReward
import tensorflow as tf
from gpflow import set_trainable
np.random.seed(0)
from utils import policy, rollout, Normalised_Env


SUBS = 5
T = 25
env = gym.make('MountainCarContinuous-v0')
# Initial random rollouts to generate a dataset
X1,Y1, _, _ = rollout(env=env, pilco=None, random=True, timesteps=T, SUBS=SUBS, render=True)
for i in range(1,5):
X1_, Y1_,_,_ = rollout(env=env, pilco=None, random=True, timesteps=T, SUBS=SUBS, render=True)
X1 = np.vstack((X1, X1_))
Y1 = np.vstack((Y1, Y1_))
env.close()

env = Normalised_Env('MountainCarContinuous-v0', np.mean(X1[:,:2],0), np.std(X1[:,:2], 0))
X = np.zeros(X1.shape)
X[:, :2] = np.divide(X1[:, :2] - np.mean(X1[:,:2],0), np.std(X1[:,:2], 0))
X[:, 2] = X1[:,-1] # control inputs are not normalised
Y = np.divide(Y1 , np.std(X1[:,:2], 0))

state_dim = Y.shape[1]
control_dim = X.shape[1] - state_dim
m_init = np.transpose(X[0,:-1,None])
S_init = 0.5 * np.eye(state_dim)
controller = RbfController(state_dim=state_dim, control_dim=control_dim, num_basis_functions=25)

R = ExponentialReward(state_dim=state_dim,
t=np.divide([0.5,0.0] - env.m, env.std),
W=np.diag([0.5,0.1])
)
pilco = PILCO((X, Y), controller=controller, horizon=T, reward=R, m_init=m_init, S_init=S_init)

best_r = 0
all_Rs = np.zeros((X.shape[0], 1))
for i in range(len(all_Rs)):
all_Rs[i,0] = R.compute_reward(X[i,None,:-1], 0.001 * np.eye(state_dim))[0]

ep_rewards = np.zeros((len(X)//T,1))

for i in range(len(ep_rewards)):
ep_rewards[i] = sum(all_Rs[i * T: i*T + T])

for model in pilco.mgpr.models:
model.likelihood.variance.assign(0.05)
set_trainable(model.likelihood.variance, False)

r_new = np.zeros((T, 1))
for rollouts in range(5):
pilco.optimize_models()
pilco.optimize_policy(maxiter=100, restarts=3)
import pdb; pdb.set_trace()
X_new, Y_new,_,_ = rollout(env=env, pilco=pilco, timesteps=T, SUBS=SUBS, render=True)

for i in range(len(X_new)):
r_new[i, 0] = R.compute_reward(X_new[i,None,:-1], 0.001 * np.eye(state_dim))[0]
total_r = sum(r_new)
_, _, r = pilco.predict(m_init, S_init, T)

print("Total ", total_r, " Predicted: ", r)
X = np.vstack((X, X_new)); Y = np.vstack((Y, Y_new));
all_Rs = np.vstack((all_Rs, r_new)); ep_rewards = np.vstack((ep_rewards, np.reshape(total_r,(1,1))))
pilco.mgpr.set_data((X, Y))