{"title": "Safe Model-based Reinforcement Learning with Stability Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 908, "page_last": 918, "abstract": "Reinforcement learning is a powerful paradigm for learning optimal policies from experimental data. However, to find optimal policies, most reinforcement learning algorithms explore all possible actions, which may be harmful for real-world systems. As a consequence, learning algorithms are rarely applied on safety-critical systems in the real world. In this paper, we present a learning algorithm that explicitly considers safety, defined in terms of stability guarantees. Specifically, we extend control-theoretic results on Lyapunov stability verification and show how to use statistical models of the dynamics to obtain high-performance control policies with provable stability certificates. Moreover, under additional regularity assumptions in terms of a Gaussian process prior, we prove that one can effectively and safely collect data in order to learn about the dynamics and thus both improve control performance and expand the safe region of the state space. In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down.", "full_text": "Safe Model-based Reinforcement Learning with\n\nStability Guarantees\n\nFelix Berkenkamp\n\nDepartment of Computer Science\n\nETH Zurich\n\nbefelix@inf.ethz.ch\n\nAngela P. Schoellig\n\nInstitute for Aerospace Studies\n\nUniversity of Toronto\n\nschoellig@utias.utoronto.ca\n\nMatteo Turchetta\n\nDepartment of Computer Science,\n\nETH Zurich\n\nmatteotu@inf.ethz.ch\n\nAndreas Krause\n\nDepartment of Computer Science\n\nETH Zurich\n\nkrausea@ethz.ch\n\nAbstract\n\nReinforcement learning is a powerful paradigm for learning optimal policies from\nexperimental data. 
However, to \ufb01nd optimal policies, most reinforcement learning\nalgorithms explore all possible actions, which may be harmful for real-world sys-\ntems. As a consequence, learning algorithms are rarely applied on safety-critical\nsystems in the real world. In this paper, we present a learning algorithm that\nexplicitly considers safety, de\ufb01ned in terms of stability guarantees. Speci\ufb01cally,\nwe extend control-theoretic results on Lyapunov stability veri\ufb01cation and show\nhow to use statistical models of the dynamics to obtain high-performance control\npolicies with provable stability certi\ufb01cates. Moreover, under additional regularity\nassumptions in terms of a Gaussian process prior, we prove that one can effectively\nand safely collect data in order to learn about the dynamics and thus both improve\ncontrol performance and expand the safe region of the state space. In our experi-\nments, we show how the resulting algorithm can safely optimize a neural network\npolicy on a simulated inverted pendulum, without the pendulum ever falling down.\n\n1\n\nIntroduction\n\nWhile reinforcement learning (RL, [1]) algorithms have achieved impressive results in games, for\nexample on the Atari platform [2], they are rarely applied to real-world physical systems (e.g., robots)\noutside of academia. The main reason is that RL algorithms provide optimal policies only in the\nlong-term, so that intermediate policies may be unsafe, break the system, or harm their environment.\nThis is especially true in safety-critical systems that can affect human lives. Despite this, safety in RL\nhas remained largely an open problem [3].\nConsider, for example, a self-driving car. While it is desirable for the algorithm that drives the\ncar to improve over time (e.g., by adapting to driver preferences and changing environments), any\npolicy applied to the system has to guarantee safe driving. 
Thus, it is not possible to learn about the\nsystem through random exploratory actions, which almost certainly lead to a crash. In order to avoid\nthis problem, the learning algorithm needs to consider its ability to safely recover from exploratory\nactions. In particular, we want the car to be able to recover to a safe state, for example, driving at a\nreasonable speed in the middle of the lane. This ability to recover is known as asymptotic stability\nin control theory [4]. Speci\ufb01cally, we care about the region of attraction of the closed-loop system\nunder a policy. This is a subset of the state space that is forward invariant so that any state trajectory\nthat starts within this set stays within it for all times and converges to a goal state eventually.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn this paper, we present an RL algorithm for continuous state-action spaces that provides these kinds\nof high-probability safety guarantees for policies. In particular, we show how, starting from an initial,\nsafe policy we can expand our estimate of the region of attraction by collecting data inside the safe\nregion and adapt the policy to both increase the region of attraction and improve control performance.\nRelated work Safety is an active research topic in RL and different de\ufb01nitions of safety exist [5, 6].\nDiscrete Markov decision processes (MDPs) are one class of tractable models that have been analyzed.\nIn risk-sensitive RL, one speci\ufb01es risk-aversion in the reward [7]. For example, [8] de\ufb01ne risk as\nthe probability of driving the agent to a set of known, undesirable states. Similarly, robust MDPs\nmaximize rewards when transition probabilities are uncertain [9, 10]. Both [11] and [12] introduce\nalgorithms to safely explore MDPs so that the agent never gets stuck without safe actions. 
All these\nmethods require an accurate probabilistic model of the system.\nIn continuous state-action spaces, model-free policy search algorithms have been successful. These\nupdate policies without a system model by repeatedly executing the same task [13]. In this set-\nting, [14] introduces safety guarantees in terms of constraint satisfaction that hold in expectation.\nHigh-probability worst-case safety guarantees are available for methods based on Bayesian optimiza-\ntion [15] together with Gaussian process models (GP, [16]) of the cost function. The algorithms\nin [17] and [18] provide high-probability safety guarantees for any parameter that is evaluated on\nthe real system. These methods are used in [19] to safely optimize a parametric control policy on a\nquadrotor. However, resulting policies are task-speci\ufb01c and require the system to be reset.\nIn the model-based RL setting, research has focused on safety in terms of state constraints. In\n[20, 21], a priori known, safe global backup policies are used, while [22] learns to switch between\nseveral safe policies. However, it is not clear how one may \ufb01nd these policies in the \ufb01rst place.\nOther approaches use model predictive control with constraints, a model-based technique where the\ncontrol actions are optimized online. For example, [23] models uncertain environmental constraints,\nwhile [24] uses approximate uncertainty propagation of GP dynamics along trajectories. In this\nsetting, robust feasibility and constraint satisfaction can be guaranteed for a learned model with\nbounded errors using robust model predictive control [25]. The method in [26] uses reachability\nanalysis to construct safe regions in the state space. The theoretical guarantees depend on the solution\nto a partial differential equation, which is approximated.\nTheoretical stability guarantees exist for the more tractable problem of stability analysis and veri\ufb01cation\nunder a \ufb01xed control policy. 
In control, stability of a known system can be veri\ufb01ed using a Lyapunov\nfunction [27]. A similar approach is used by [28] for deterministic, but unknown dynamics that are\nmodeled as a GP, which allows for provably safe learning of regions of attraction for \ufb01xed policies.\nSimilar results are shown in [29] for stochastic systems that are modeled as a GP. They use Bayesian\nquadrature to compute provably accurate estimates of the region of attraction. These approaches do\nnot update the policy.\nOur contributions We introduce a novel algorithm that can safely optimize policies in continuous\nstate-action spaces while providing high-probability safety guarantees in terms of stability. Moreover,\nwe show that it is possible to exploit the regularity properties of the system in order to safely learn\nabout the dynamics and thus improve the policy and increase the estimated safe region of attraction\nwithout ever leaving it. Speci\ufb01cally, starting from a policy that is known to stabilize the system\nlocally, we gather data at informative, safe points and improve the policy safely based on the improved\nmodel of the system and prove that any exploration algorithm that gathers data at these points reaches\na natural notion of full exploration. We show how the theoretical results transfer to a practical\nalgorithm with safety guarantees and apply it to a simulated inverted pendulum stabilization task.\n\n2 Background and Assumptions\n\nWe consider a deterministic, discrete-time dynamic system\n\nxt+1 = f (xt, ut) = h(xt, ut) + g(xt, ut),\n\n(1)\nwith states x \u2208 X \u2282 Rq and control actions u \u2208 U \u2282 Rp and a discrete time index t \u2208 N. The true\ndynamics f : X \u00d7 U \u2192 X consist of two parts: h(xt, ut) is a known, prior model that can be\nobtained from \ufb01rst principles, while g(xt, ut) represents a priori unknown model errors. 
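As a concrete toy instance of the decomposition in (1), the sketch below separates a known prior model h from an unknown model error g, and exposes the true dynamics f only through noisy measurements. All coefficients are hypothetical stand-ins, not the paper's benchmark system:

```python
import numpy as np

# Toy instance of (1): x_{t+1} = f(x_t, u_t) = h(x_t, u_t) + g(x_t, u_t),
# where h is a known prior model and g an a priori unknown model error.
# All coefficients are hypothetical, not the paper's benchmark system.

def h(x, u):
    # known prior model, e.g. obtained from first principles
    return 0.9 * x + 0.1 * u

def g(x, u):
    # unknown model error (unmodeled nonlinearity), only observable via measurements
    return 0.05 * np.sin(x)

def f(x, u):
    # true dynamics: prior model plus model error
    return h(x, u) + g(x, u)

def measure_f(x, u, rng, noise_std=0.01):
    # noisy measurement of f obtained by driving the system to x and applying u
    return f(x, u) + rng.normal(0.0, noise_std)
```

Here the origin is an equilibrium of the toy system, and averaging repeated measurements recovers f at a queried state-action pair.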
While the\nmodel errors are unknown, we can obtain noisy measurements of f (x, u) by driving the system to\nthe state x and taking action u. We want this system to behave in a certain way, e.g., the car driving\n\n2\n\n\fon the road. To this end, we need to specify a control policy \u03c0 : X \u2192 U that, given the current state,\ndetermines the appropriate control action that drives the system to some goal state, which we set as\nthe origin without loss of generality [4]. We encode the performance requirements of how to drive\nthe system to the origin through a positive cost r(x, u) that is associated with states and actions and\nhas r(0, 0) = 0. The policy aims to minimize the cumulative, discounted costs for each starting state.\nThe goal is to safely learn about the dynamics from measurements and adapt the policy for perfor-\nmance, without encountering system failures. Speci\ufb01cally, we de\ufb01ne the safety constraint on the\nstate divergence that occurs when leaving the region of attraction. This means that adapting the\npolicy is not allowed to decrease the region of attraction and exploratory actions to learn about the\ndynamics f (\u00b7) are not allowed to drive the system outside the region of attraction. The region of\nattraction is not known a priori, but is implicitly de\ufb01ned through the system dynamics and the choice\nof policy. Thus, the policy not only de\ufb01nes performance as in typical RL, but also determines safety\nand where we can obtain measurements.\nModel assumptions In general, this kind of safe learning is impossible without further assumptions.\nFor example, in a discontinuous system even a slight change in the control policy can lead to drastically\ndifferent behavior. Moreover, to expand the safe set we need to generalize learned knowledge about\nthe dynamics to (potentially unsafe) states that we have not visited. 
To this end, we restrict ourselves\nto the general and practically relevant class of models that are Lipschitz continuous. This is a typical\nassumption in the control community [4]. Additionally, to ensure that the closed-loop system remains\nLipschitz continuous when the control policy is applied, we restrict policies to the rich class of\nL\u03c0-Lipschitz continuous functions \u03a0L, which also contains certain types of neural networks [30].\nAssumption 1 (continuity). The dynamics h(\u00b7) and g(\u00b7) in (1) are Lh- and Lg Lipschitz continuous\nwith respect to the 1-norm. The considered control policies \u03c0 lie in a set \u03a0L of functions that\nare L\u03c0-Lipschitz continuous with respect to the 1-norm.\n\nTo enable safe learning, we require a reliable statistical model. While we commit to GPs for the\nexploration analysis, for safety any suitable, well-calibrated model is applicable.\nAssumption 2 (well-calibrated model). Let \u00b5n(\u00b7) and \u03a3n(\u00b7) denote the posterior mean and covari-\nance matrix functions of the statistical model of the dynamics (1) conditioned on n noisy measurements.\nWith \u03c3n(\u00b7) = trace(\u03a31/2\nn (\u00b7)), there exists a \u03b2n > 0 such that with probability at least (1\u2212 \u03b4) it holds\nfor all n \u2265 0, x \u2208 X , and u \u2208 U that (cid:107)f (x, u) \u2212 \u00b5n(x, u)(cid:107)1 \u2264 \u03b2n\u03c3n(x, u).\nThis assumption ensures that we can build con\ufb01dence intervals on the dynamics that, when scaled by\nan appropriate constant \u03b2n, cover the true function with high probability. We introduce a speci\ufb01c\nstatistical model that ful\ufb01lls both assumptions under certain regularity assumptions in Sec. 3.\nLyapunov function To satisfy the speci\ufb01ed safety constraints for safe learning, we require a tool\nto determine whether individual states and actions are safe. 
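The calibration condition in Assumption 2 can be checked numerically for a given model; a minimal sketch, where the posterior mean, covariance, and the scaling beta are placeholder values rather than a fitted GP:

```python
import numpy as np

# Numerical form of Assumption 2 with sigma_n(.) = trace(Sigma_n(.)^{1/2}):
# the model is well calibrated at (x, u) if
# ||f(x,u) - mu_n(x,u)||_1 <= beta_n * sigma_n(x,u).
# mu, Sigma and beta below are illustrative placeholders, not a fitted model.

def sqrtm_psd(S):
    # symmetric PSD matrix square root via eigendecomposition
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def well_calibrated(f_xu, mu_xu, Sigma_xu, beta):
    sigma = float(np.trace(sqrtm_psd(Sigma_xu)))
    return float(np.sum(np.abs(f_xu - mu_xu))) <= beta * sigma
```

Larger beta widens the confidence intervals, so calibration can always be restored at the cost of looser bounds.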
In control theory, this safety is de\ufb01ned\nthrough the region of attraction, which can be computed for a \ufb01xed policy using Lyapunov func-\ntions [4]. Lyapunov functions are continuously differentiable functions v : X \u2192 R\n\u22650 with v(0) = 0\nand v(x) > 0 for all x \u2208 X \\ {0}. The key idea behind using Lyapunov functions to show stability\nof the system (1) is similar to that of gradient descent on strictly quasiconvex functions: if one can\nshow that, given a policy \u03c0, applying the dynamics f on the state maps it to strictly smaller values\non the Lyapunov function (\u2018going downhill\u2019), then the state eventually converges to the equilibrium\npoint at the origin (minimum). In particular, the assumptions in Theorem 1 below imply that v is\nstrictly quasiconvex within the region of attraction if the dynamics are Lipschitz continuous. As a\nresult, the one step decrease property for all states within a level set guarantees eventual convergence\nto the origin.\nTheorem 1 ([4]). Let v be a Lyapunov function, f Lipschitz continuous dynamics, and \u03c0 a policy. If\nv(f (x, \u03c0(x))) < v(x) for all x within the level set V(c) = {x \u2208 X \\ {0}| v(x) \u2264 c}, c > 0, then\nV(c) is a region of attraction, so that x0 \u2208 V(c) implies xt \u2208 V(c) for all t > 0 and limt\u2192\u221e xt = 0.\nIt is convenient to characterize the region of attraction through a level set of the Lyapunov function,\nsince it replaces the challenging test for convergence with a one-step decrease condition on the\nLyapunov function. For the theoretical analysis in this paper, we assume that a Lyapunov function is\ngiven to determine the region of attraction. For ease of notation, we also assume \u2202v(x)/\u2202x (cid:54)= 0 for\nall x \u2208 X \\ 0, which ensures that level sets V(c) are connected if c > 0. 
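The one-step decrease condition of Theorem 1 is straightforward to test on a grid once a model is fixed; a sketch with an illustrative one-dimensional system (the Lyapunov function and closed-loop dynamics are toy choices, not the paper's):

```python
import numpy as np

# Grid-based check of Theorem 1: if v(f(x, pi(x))) < v(x) for all grid states
# in the level set V(c) \ {0}, then V(c) is a region of attraction.
# The Lyapunov candidate and dynamics below are illustrative.

def decrease_holds(v, f_cl, c, grid):
    inside = grid[(v(grid) <= c) & (v(grid) > 0.0)]
    return bool(np.all(v(f_cl(inside)) < v(inside)))

v = lambda x: x ** 2            # quadratic Lyapunov candidate
f_stable = lambda x: 0.5 * x    # contractive closed-loop dynamics
grid = np.linspace(-2.0, 2.0, 401)
```

For the contractive toy dynamics the decrease holds on V(1), whereas an expanding map such as x -> 1.1 x fails the same test.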
Since Lyapunov functions\nare continuously differentiable, they are Lv-Lipschitz continuous over the compact set X .\n\n3\n\n\fIn general, it is not easy to \ufb01nd suitable Lyapunov functions. However, for physical models, like the\nprior model h in (1), the energy of the system (e.g., kinetic and potential for mechanical systems) is a\ngood candidate Lyapunov function. Moreover, it has recently been shown that it is possible to compute\nsuitable Lyapunov functions [31, 32]. In our experiments, we exploit the fact that value functions in\nRL are Lyapunov functions if the costs are strictly positive away from the origin. This follows directly\nfrom the de\ufb01nition of the value function, where v(x) = r(x, \u03c0(x)) + v(f (x, \u03c0(x))) \u2265 v(f (x, \u03c0(x))).\nThus, we can obtain Lyapunov candidates as a by-product of approximate dynamic programming.\nInitial safe policy Lastly, we need to ensure that there exists a safe starting point for the learning\nprocess. Thus, we assume that we have an initial policy \u03c00 that renders the origin of the system in (1)\nasymptotically stable within some small set of states S x\n0 . For example, this policy may be designed\nusing the prior model h in (1), since most models are locally accurate but deteriorate in quality as\nstate magnitude increases. This policy is explicitly not safe to use throughout the state space X \\ S x\n0 .\n3 Theory\n\nIn this section, we use these assumptions for safe reinforcement learning. We start by computing the\nregion of attraction for a \ufb01xed policy under the statistical model. Next, we optimize the policy in order\nto expand the region of attraction. Lastly, we show that it is possible to safely learn about the dynamics\nand, under additional assumptions about the model and the system\u2019s reachability properties, that this\napproach expands the estimated region of attraction safely. 
We consider an idealized algorithm that is\namenable to analysis, which we convert to a practical variant in Sec. 4. See Fig. 1 for an illustrative\nrun of the algorithm and examples of the sets de\ufb01ned below.\nRegion of attraction We start by computing the region of attraction for a \ufb01xed policy. This is an\nextension of the method in [28] to discrete-time systems. We want to use the Lyapunov decrease condi-\ntion in Theorem 1 to guarantee safety for the statistical model of the dynamics. However, the posterior\nuncertainty in the statistical model of the dynamics means that one-step predictions about v(f (\u00b7)) are\nuncertain too. We account for this by constructing high-probability con\ufb01dence intervals on v(f (x, u)):\nQn(x, u) := [v(\u00b5n\u22121(x, u)) \u00b1 Lv\u03b2n\u03c3n\u22121(x, u)]. From Assumption 2 together with the Lipschitz\nproperty of v, we know that v(f (x, u)) is contained in Qn(x, u) with probability at least (1\u2212 \u03b4). For\nour exploration analysis, we need to ensure that safe state-actions cannot become unsafe; that is, an\ninitial safe set S0 remains safe (de\ufb01ned later). To this end, we intersect the con\ufb01dence intervals:\nCn(x, u) := Cn\u22121 \u2229 Qn(x, u), where the set C is initialized to C0(x, u) = (\u2212\u221e, v(x) \u2212 L\u2206v\u03c4 )\nwhen (x, u) \u2208 S0 and C0(x, u) = R otherwise. Note that v(f (x, u)) is contained in Cn(x, u) with\nthe same (1 \u2212 \u03b4) probability as in Assumption 2. The upper and lower bounds on v(f (\u00b7)) are de\ufb01ned\nas un(x, u) := maxCn(x, u) and ln(x, u) := minCn(x, u).\nGiven these high-probability con\ufb01dence intervals, the system is stable according to Theorem 1 if\nv(f (x, u)) \u2264 un(x, u) < v(x) for all x \u2208 V(c). 
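The running intersection Cn(x, u) = Cn−1 ∩ Qn(x, u) is plain interval arithmetic; a sketch with illustrative constants (nothing below comes from a fitted model):

```python
# Interval bookkeeping for v(f(x, u)): Q_n is the fresh confidence interval
# v(mu_{n-1}) -/+ L_v * beta_n * sigma_{n-1}, and C_n = C_{n-1} intersected
# with Q_n, so the bounds u_n = max C_n and l_n = min C_n can only tighten.
# All constants are illustrative.

def Q(v_mu, L_v, beta, sigma):
    half = L_v * beta * sigma
    return (v_mu - half, v_mu + half)

def intersect(C_prev, Q_new):
    return (max(C_prev[0], Q_new[0]), min(C_prev[1], Q_new[1]))

C = (float("-inf"), float("inf"))      # C_0 outside the initial safe set
for sigma in (1.0, 0.5, 0.25):         # posterior uncertainty shrinks with data
    C = intersect(C, Q(v_mu=0.3, L_v=2.0, beta=2.0, sigma=sigma))
l_n, u_n = C
```

Because each new interval is intersected with the previous one, the bounds are monotone even if a later posterior is briefly wider.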
However, it is intractable to verify this condition\ndirectly on the continuous domain without additional, restrictive assumptions about the model.\nInstead, we consider a discretization of the state space X\u03c4 \u2282 X into cells, so that (cid:107)x \u2212 [x]\u03c4(cid:107)1 \u2264 \u03c4\nholds for all x \u2208 X . Here, [x]\u03c4 denotes the point in X\u03c4 with the smallest l1 distance to x. Given this\ndiscretization, we bound the decrease variation on the Lyapunov function for states in X\u03c4 and use the\nLipschitz continuity to generalize to the continuous state space X .\nTheorem 2. Under Assumptions 1 and 2 with L\u2206v := LvLf (L\u03c0 + 1) + Lv, let X\u03c4 be a discretiza-\ntion of X such that (cid:107)x \u2212 [x]\u03c4(cid:107)1 \u2264 \u03c4 for all x \u2208 X . If, for all x \u2208 V(c) \u2229 X\u03c4 with c > 0, u = \u03c0(x),\nand for some n \u2265 0 it holds that un(x, u) < v(x) \u2212 L\u2206v\u03c4, then v(f (x, \u03c0(x))) < v(x) holds for all\nx \u2208 V(c) with probability at least (1 \u2212 \u03b4) and V(c) is a region of attraction for (1) under policy \u03c0.\nThe proof is given in Appendix A.1. Theorem 2 states that, given con\ufb01dence intervals on the statistical\nmodel of the dynamics, it is suf\ufb01cient to check the stricter decrease condition in Theorem 2 on the\ndiscretized domain X\u03c4 to guarantee the requirements for the region of attraction in the continuous\ndomain in Theorem 1. The bound in Theorem 2 becomes tight as the discretization constant \u03c4\nand |v(f (\u00b7)) \u2212 un(\u00b7)| go to zero. Thus, the discretization constant trades off computation costs for\naccuracy, while un approaches v(f (\u00b7)) as we obtain more measurement data and the posterior model\nuncertainty about the dynamics, \u221a\u03b2n\u03c3n decreases. The con\ufb01dence intervals on v(f (x, \u03c0(x)) \u2212 v(x)\nand the corresponding estimated region of attraction (red line) can be seen in the bottom half of Fig. 
1.\nPolicy optimization So far, we have focused on estimating the region of attraction for a \ufb01xed policy.\nSafety is a property of states under a \ufb01xed policy. This means that the policy directly determines\n\n4\n\n\f(a) Initial safe set (in red).\n\n(b) Exploration: 15 data points.\n\n(c) Final policy after 30 evaluations.\n\nFigure 1: Example application of Algorithm 1. Due to input constraints, the system becomes unstable\nfor large states. We start from an initial, local policy \u03c00 that has a small, safe region of attraction (red\nlines) in Fig. 1(a). The algorithm selects safe, informative state-action pairs within Sn (top, white\nshaded), which can be evaluated without leaving the region of attraction V(cn) (red lines) of the\ncurrent policy \u03c0n. As we gather more data (blue crosses), the uncertainty in the model decreases\n(top, background) and we use (3) to update the policy so that it lies within Dn (top, red shaded) and\nful\ufb01lls the Lyapunov decrease condition. The algorithm converges to the largest safe set in Fig. 1(c).\nIt improves the policy without evaluating unsafe state-action pairs and thereby without system failure.\n\nwhich states are safe. Speci\ufb01cally, to form a region of attraction all states in the discretization X\u03c4\nwithin a level set of the Lyapunov function need to ful\ufb01ll the decrease condition in Theorem 2 that\ndepends on the policy choice. The set of all state-action pairs that ful\ufb01ll this decrease condition is\ngiven by\n\nDn =(cid:8)(x, u) \u2208 X\u03c4 \u00d7 U | un(x, u) \u2212 v(x) < \u2212L\u2206v\u03c4(cid:9),\n\n(2)\nsee Fig. 1(c) (top, red shaded). In order to estimate the region of attraction based on this set, we\nneed to commit to a policy. Speci\ufb01cally, we want to pick the policy that leads to the largest possible\nregion of attraction according to Theorem 2. 
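For a fixed policy, checking membership in Dn and searching for the largest level set that passes the decrease test reduce to simple array operations; a sketch with illustrative numbers (the search assumes the decrease test is the only active constraint):

```python
import numpy as np

# Sketch of the decrease set D_n in (2) and, for a fixed policy, the largest
# level c_n such that every discretized state with v(x) <= c_n passes the
# strict decrease test.  All numbers are illustrative.

def in_Dn(u_n_xu, v_x, L_dv, tau):
    # (x, u) is in D_n  iff  u_n(x, u) - v(x) < -L_dv * tau
    return u_n_xu - v_x < -L_dv * tau

def largest_safe_level(v_states, u_policy, L_dv, tau):
    # scan discrete states by increasing v(x); stop at the first violation
    c = 0.0
    for i in np.argsort(v_states):
        if not in_Dn(u_policy[i], v_states[i], L_dv, tau):
            break
        c = v_states[i]
    return c
```

Scanning states in order of increasing v(x) mirrors how level sets nest: once one state violates the decrease condition, no larger level set can qualify.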
This requires that for each discrete state in X\u03c4 the\ncorresponding state-action pair under the policy must be in the set Dn. Thus, we optimize the policy\naccording to\n\n\u03c0n, cn = argmax\n\u03c0\u2208\u03a0L,c\u2208R>0\n\nc,\n\nsuch that for all x \u2208 V(c) \u2229 X\u03c4 : (x, \u03c0(x)) \u2208 Dn.\n\n(3)\n\nThe region of attraction that corresponds to the optimized policy \u03c0n according to (3) is given\nby V(cn), see Fig. 1(b). It is the largest level set of the Lyapunov function for which all state-action\npairs (x, \u03c0n(x)) that correspond to discrete states within V(cn) \u2229 X\u03c4 are contained in Dn. This\nmeans that these state-action pairs ful\ufb01ll the requirements of Theorem 2 and V(cn) is a region of\nattraction of the true system under policy \u03c0n. The following theorem is thus a direct consequence\nof Theorem 2 and (3).\nTheorem 3. Let R\u03c0n be the true region of attraction of (1) under the policy \u03c0n. For any \u03b4 \u2208 (0, 1),\nwe have with probability at least (1 \u2212 \u03b4) that V(cn) \u2286 R\u03c0n for all n > 0.\nThus, when we optimize the policy subject to the constraint in (3) the estimated region of attraction is\nalways an inner approximation of the true region of attraction. However, solving the optimization\nproblem in (3) is intractable in general. We approximate the policy update step in Sec. 4.\nCollecting measurements Given these stability guarantees, it is natural to ask how one might obtain\ndata points in order to improve the model of g(\u00b7) and thus ef\ufb01ciently increase the region of attraction.\nThis question is dif\ufb01cult to answer in general, since it depends on the property of the statistical model.\nIn particular, for general statistical models it is often not clear whether the con\ufb01dence intervals\ncontract suf\ufb01ciently quickly. In the following, we make additional assumptions about the model and\nreachability within V(cn) in order to provide exploration guarantees. 
These assumptions allow us to\nhighlight fundamental requirements for safe data acquisition and that safe exploration is possible.\n\n5\n\n\fWe assume that the unknown model errors g(\u00b7) have bounded norm in a reproducing kernel Hilbert\nspace (RKHS, [33]) corresponding to a differentiable kernel k, \u2016g(\u00b7)\u2016k \u2264 Bg. These are a class of\nwell-behaved functions of the form g(z) = \u2211\u221ei=0 \u03b1ik(zi, z), de\ufb01ned through representer points zi\nand weights \u03b1i that decay suf\ufb01ciently fast with i. This assumption ensures that g satis\ufb01es the\nLipschitz property in Assumption 1, see [28]. Moreover, with \u03b2n = Bg + 4\u03c3\u221a(\u03b3n + 1 + ln(1/\u03b4)) we\ncan use GP models for the dynamics that ful\ufb01ll Assumption 2 if the state is fully observable and the\nmeasurement noise is \u03c3-sub-Gaussian (e.g., bounded in [\u2212\u03c3, \u03c3]), see [34]. Here \u03b3n is the information\ncapacity. It corresponds to the amount of mutual information that can be obtained about g from nq\nmeasurements, a measure of the size of the function class encoded by the model. The information\ncapacity has a sublinear dependence on n for common kernels and upper bounds can be computed\nef\ufb01ciently [35]. More details about this model are given in Appendix A.2.\nIn order to quantify the exploration properties of our algorithm, we consider a discrete action\nspace U\u03c4 \u2282 U. We de\ufb01ne exploration as the number of state-action pairs in X\u03c4 \u00d7 U\u03c4 that we can\nsafely learn about without leaving the true region of attraction. Note that despite this discretization,\nthe policy takes values on the continuous domain. 
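Once an upper bound on the information capacity γn is available, the confidence scaling βn can be evaluated directly; a sketch with placeholder constants (the γn value would come from a kernel-specific bound, not from this snippet):

```python
import numpy as np

# Evaluation of the confidence scaling
# beta_n = B_g + 4 * sigma * sqrt(gamma_n + 1 + ln(1/delta)).
# B_g (RKHS norm bound), sigma (noise level), gamma_n (information capacity)
# and delta are placeholder inputs.

def beta_scaling(gamma_n, B_g, sigma, delta):
    return B_g + 4.0 * sigma * np.sqrt(gamma_n + 1.0 + np.log(1.0 / delta))
```

The scaling grows only with the square root of the information capacity, which is what makes the sublinear growth of γn useful for exploration bounds.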
Moreover, instead of using the con\ufb01dence intervals\ndirectly as in (3), we consider an algorithm that uses the Lipschitz constants to slowly expand the safe\nset. We use this in our analysis to quantify the ability to generalize beyond the current safe set. In\npractice, nearby states are suf\ufb01ciently correlated under the model to enable generalization using (2).\nSuppose we are given a set S0 of state-action pairs about which we can learn safely. Speci\ufb01cally, this\nmeans that we have a policy such that, for any state-action pair (x, u) in S0, if we apply action u in\nstate x and then apply actions according to the policy, the state converges to the origin. Such a set\ncan be constructed using the initial policy \u03c00 from Sec. 2 as S0 = {(x, \u03c00(x))| x \u2208 S x\n0 }. Starting\nfrom this set, we want to update the policy to expand the region of attraction according to Theorem 2.\nTo this end, we use the con\ufb01dence intervals on v(f (\u00b7)) for states inside S0 to determine state-action\npairs that ful\ufb01ll the decrease condition. We thus rede\ufb01ne Dn for the exploration analysis to\n\nDn = \u222a(x,u)\u2208Sn\u22121 {z\u2032 \u2208 X\u03c4 \u00d7 U\u03c4 | un(x, u) \u2212 v(x) + L\u2206v\u2016z\u2032 \u2212 (x, u)\u20161 < \u2212L\u2206v\u03c4}. (4)\n\nThis formulation is equivalent to (2), except that it uses the Lipschitz constant to generalize safety.\nGiven Dn, we can again \ufb01nd a region of attraction V(cn) by committing to a policy according to (3).\nIn order to expand this region of attraction effectively we need to decrease the posterior model\nuncertainty about the dynamics of the GP by collecting measurements. However, to ensure safety\nas outlined in Sec. 2, we are not only restricted to states within V(cn), but also need to ensure that\nthe state after taking an action is safe; that is, the dynamics map the state back into the region of\nattraction V(cn). 
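The Lipschitz generalization in (4) is a pointwise certificate check against the previous safe set; a sketch with illustrative dictionaries standing in for un and v (the values are hypothetical, not model outputs):

```python
# Pointwise form of the Lipschitz generalization in (4): a candidate pair
# z' = (x', u') is certified if some pair (x, u) in the previous safe set
# satisfies the decrease condition even after paying the distance penalty
# L_dv * ||z' - (x, u)||_1.  The dictionaries are illustrative stand-ins.

def dist1(z1, z2):
    # l1 distance between state-action pairs z = (x, u)
    return abs(z1[0] - z2[0]) + abs(z1[1] - z2[1])

def in_Dn_lipschitz(z_cand, safe_pairs, u_n, v_map, L_dv, tau):
    return any(
        u_n[z] - v_map[z[0]] + L_dv * dist1(z_cand, z) < -L_dv * tau
        for z in safe_pairs
    )
```

Candidates close to a strongly certified safe pair pass the test, while distant candidates fail because the Lipschitz penalty outweighs the certified decrease.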
We again use the Lipschitz constant in order to determine this set,\n\nSn = \u222az\u2208Sn\u22121 {z\u2032 \u2208 V(cn) \u2229 X\u03c4 \u00d7 U\u03c4 | un(z) + LvLf\u2016z \u2212 z\u2032\u20161 \u2264 cn}. (5)\n\nThe set Sn contains state-action pairs that we can safely evaluate under the current policy \u03c0n without\nleaving the region of attraction, see Fig. 1 (top, white shaded).\nWhat remains is to de\ufb01ne a strategy for collecting data points within Sn to effectively decrease model\nuncertainty. We speci\ufb01cally focus on the high-level requirements for any exploration scheme without\ncommitting to a speci\ufb01c method. In practice, any (model-based) exploration strategy that aims to\ndecrease model uncertainty by driving the system to speci\ufb01c states may be used. Safety can be\nensured by picking actions according to \u03c0n whenever the exploration strategy reaches the boundary\nof the safe region V(cn); that is, when un(x, u) > cn. This way, we can use \u03c0n as a backup policy\nfor exploration.\nThe high-level goal of the exploration strategy is to shrink the con\ufb01dence intervals at state-action\npairs in Sn in order to expand the safe region. Speci\ufb01cally, the exploration strategy should aim to visit\nstate-action pairs in Sn at which we are the most uncertain about the dynamics; that is, where the\ncon\ufb01dence interval is the largest:\n\n(xn, un) = argmax(x,u)\u2208Sn un(x, u) \u2212 ln(x, u). (6)\n\nAs we keep collecting data points according to (6), we decrease the uncertainty about the dynamics\nfor different actions throughout the region of attraction and adapt the policy, until eventually we\nhave gathered enough information in order to expand it. While (6) implicitly assumes that any\nstate within V(cn) can be reached by the exploration policy, it achieves the high-level goal of any\nexploration algorithm that aims to reduce model uncertainty. In practice, any safe exploration scheme\nis limited by unreachable parts of the state space.\n\n6\n\n\fAlgorithm 1 SAFELYAPUNOVLEARNING\n1: Input: Initial safe policy \u03c00, dynamics model GP(\u00b5(z), k(z, z\u2032))\n2: for all n = 1, . . . do\n3: Compute policy \u03c0n via SGD on (7)\n4: cn = argmaxc c, such that \u2200x \u2208 V(cn) \u2229 X\u03c4 : un(x, \u03c0n(x)) \u2212 v(x) < \u2212L\u2206v\u03c4\n5: Sn = {(x, u) \u2208 V(cn) \u00d7 U\u03c4 | un(x, u) \u2264 cn}\n6: Select (xn, un) within Sn using (6) and drive system there with backup policy \u03c0n\n7: Update GP with measurements f (xn, un) + \u03b5n\n\nWe compare the active learning scheme in (6) to an oracle baseline that starts from the same initial\nsafe set S0 and knows v(f (x, u)) up to \u03b5 accuracy within the safe set. The oracle also uses knowledge\nabout the Lipschitz constants and the optimal policy in \u03a0L at each iteration. We denote the set that this\nbaseline manages to determine as safe with R\u03b5(S0) and provide a detailed de\ufb01nition in Appendix A.3.\nTheorem 4. Assume \u03c3-sub-Gaussian measurement noise and that the model error g(\u00b7) in (1) has\nRKHS norm smaller than Bg. Under the assumptions of Theorem 2, with \u03b2n = Bg + 4\u03c3\u221a(\u03b3n + 1 + ln(1/\u03b4)),\nand with measurements collected according to (6), let n\u2217 be the smallest positive integer so that\nn\u2217/(\u03b22n\u2217\u03b3n\u2217) \u2265 Cq(|R(S0)| + 1)/(L2v\u03b52), where C = 8/ log(1 + \u03c3\u22122).\nLet R\u03c0 be the true region of attraction of (1) under a policy \u03c0. 
For any \u0001 > 0, and \u03b4 \u2208 (0, 1), the\nfollowing holds jointly with probability at least (1 \u2212 \u03b4) for all n > 0:\n(i) V(cn) \u2286 R\u03c0n\nTheorem 4 states that, when selecting data points according to (6), the estimated region of attrac-\ntion V(cn) is (i) contained in the true region of attraction under the current policy and (ii) selected\ndata points do not cause the system to leave the region of attraction. This means that any exploration\nmethod that considers the safety constraint (5) is able to safely learn about the system without leaving\nthe region of attraction. The last part of Theorem 4, (iii), states that after a \ufb01nite number of data\npoints n\u2217 we achieve at least the exploration performance of the oracle baseline, while we do not\nclassify unsafe state-action pairs as safe. This means that the algorithm explores the largest region\nof attraction possible for a given Lyapunov function with residual uncertaint about v(f (\u00b7)) smaller\nthan \u0001. Details of the comparison baseline are given in the appendix. In practice, this means that any\nexploration method that manages to reduce the maximal uncertainty about the dynamics within Sn is\nable to expand the region of attraction.\nAn example run of repeatedly evaluating (6) for a one-dimensional state-space is shown in Fig. 1. It\ncan be seen that, by only selecting data points within the current estimate of the region of attraction,\nthe algorithm can ef\ufb01ciently optimize the policy and expand the safe region over time.\n\n(ii) f (x, u) \u2208 R\u03c0n \u2200(x, u) \u2208 Sn.\n\n(iii) R\u0001(S0) \u2286 Sn \u2286 R0(S0).\n\n\u03b22\n\nL2\n\nv\u00012\n\n4 Practical Implementation and Experiments\n\nIn the previous section, we have given strong theoretical results on safety and exploration for an\nidealized algorithm that can solve (3). In this section, we provide a practical variant of the theoretical\nalgorithm in the previous section. 
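As a concrete illustration of the idealized scheme above, the safe-set test from Algorithm 1 (Line 5) and the selection rule (6) can be sketched with a standard Gaussian process posterior. This is a minimal numpy sketch under our own simplifications, not the paper's implementation: `rbf`, `gp_confidence`, and `most_uncertain_safe_index` are illustrative names, the GP posterior bounds stand in for the confidence intervals ln and un, and the simple membership test un ≤ cn replaces the full Lipschitz construction of (5).

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between inputs a (n, d) and b (m, d)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_confidence(z, Z_obs, y_obs, noise_var, beta):
    """Standard GP posterior mean +/- beta * std at query points z; these
    bounds play the role of the confidence intervals l_n, u_n in the text."""
    K = rbf(Z_obs, Z_obs) + noise_var * np.eye(len(Z_obs))
    L = np.linalg.cholesky(K)
    k_star = rbf(z, Z_obs)                                   # (m, n)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = k_star @ alpha
    v = np.linalg.solve(L, k_star.T)
    var = np.diag(rbf(z, z)) - np.sum(v ** 2, axis=0)
    std = np.sqrt(np.maximum(var, 0.0))
    return mean - beta * std, mean + beta * std              # (l_n, u_n)

def most_uncertain_safe_index(l, u, c_n):
    """Algorithm 1, Lines 5-6: among candidates whose upper bound stays below
    the level c_n, pick the one with the widest confidence interval, as in (6).
    Returns None if no candidate is certified safe."""
    safe = u <= c_n
    if not np.any(safe):
        return None
    width = np.where(safe, u - l, -np.inf)
    return int(np.argmax(width))
```

Intuitively, points near previous observations have tight intervals and are never selected, so data collection is steered toward the least-explored safe state-action pairs.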
While we retain safety guarantees, this practical variant sacrifices exploration guarantees to obtain a more tractable algorithm; the result is summarized in Algorithm 1.
The policy optimization problem in (3) is intractable to solve and only considers safety, rather than a performance metric. We propose to use an approximate policy update that maximizes approximate performance while providing stability guarantees. It proceeds by optimizing the policy first and then computing the region of attraction V(cn) for the new, fixed policy. This does not impact safety, since data is still only collected inside the region of attraction. Moreover, should the optimization fail and the region of attraction decrease, one can always revert to the previous policy, which is guaranteed to be safe.

Figure 2: Optimization results for an inverted pendulum. (a) Estimated safe set: the initial safe set (yellow) under the policy π0, and the green region representing the estimated region of attraction under the optimized neural network policy, which is contained within the true region of attraction (white). (b) State trajectory (lower is better): the improved performance of the safely learned policy over the policy for the prior model.

In our experiments, we use approximate dynamic programming [36] to capture the performance of the policy. Given a policy πθ with parameters θ, we compute an estimate of the cost-to-go Jπθ(·) for the mean dynamics µn based on the cost r(x, u) ≥ 0. At each state, Jπθ(x) is the sum of γ-discounted rewards encountered when following the policy πθ. The goal is to adapt the parameters of the policy for minimum cost as measured by Jπθ, while ensuring that the safety constraint on the worst-case decrease of the Lyapunov function in Theorem 2 is not violated. A Lagrangian formulation of this constrained optimization problem is

πn = argmin_{πθ ∈ ΠL} ∫_{x ∈ X} r(x, πθ(x)) + γ Jπθ(µn−1(x, πθ(x))) + λ ( un(x, πθ(x)) − v(x) + LΔv τ ) dx,   (7)

where the first term measures the long-term cost-to-go and λ ≥ 0 is a Lagrange multiplier for the safety constraint from Theorem 2. In our experiments, we use the value function as a Lyapunov function candidate, v = J with r(·, ·) ≥ 0, and set λ = 1. In this case, (7) corresponds to a high-probability upper bound on the cost-to-go given the uncertainty in the dynamics. This is similar to worst-case performance formulations found in robust MDPs [9, 10], which consider worst-case value functions given parametric uncertainty in the MDP transition model. Moreover, since LΔv depends on the Lipschitz constant of the policy, this simultaneously serves as a regularizer on the parameters θ.
To verify safety, we use the GP confidence intervals ln and un directly, as in (2). We also use the confidence intervals to compute Sn for the active learning scheme, see Algorithm 1, Line 5. In practice, we do not need to compute the entire set Sn to solve (3), but can use a global optimization method or even a random sampling scheme within V(cn) to find suitable state-actions. Moreover, measurements for actions that are far away from the current policy are unlikely to expand V(cn), see Fig. 1(c). As we optimize (7) via gradient descent, the policy changes only locally.
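The sampled version of the Lagrangian objective (7), with the integral over X replaced by an average over a finite set of states, might look as follows in a minimal Python sketch. All callables and the name `lagrangian_objective` are illustrative stand-ins for the paper's quantities (πθ, r, Jπθ, µn−1, un, v, and the slack LΔv τ), not its actual code.

```python
def lagrangian_objective(states, policy, cost, value, mean_dynamics,
                         u_n, v, gamma=0.98, lam=1.0, L_dv_tau=0.0):
    """Sampled stand-in for (7): average over sampled states of the
    performance term plus a weighted safety term."""
    total = 0.0
    for x in states:
        a = policy(x)
        # performance: immediate cost plus discounted cost-to-go of the
        # mean model's next state
        performance = cost(x, a) + gamma * value(mean_dynamics(x, a))
        # safety: worst-case Lyapunov decrease from Theorem 2,
        # u_n(x, pi(x)) - v(x) + L_dv * tau, weighted by the multiplier lam
        safety = u_n(x, a) - v(x) + L_dv_tau
        total += performance + lam * safety
    return total / len(states)
```

Differentiating this average with respect to the policy parameters (e.g., with an automatic-differentiation framework) gives the stochastic gradient step used in Algorithm 1, Line 3.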
Since the policy changes only locally, we can achieve better data-efficiency by restricting the exploratory actions u with (x, u) ∈ Sn to be close to πn; that is, u ∈ [πn(x) − ū, πn(x) + ū] for some constant ū.
Computing the region of attraction by verifying the stability condition on a discretized domain suffers from the curse of dimensionality. However, it is not necessary to update policies in real time. In particular, since any policy that is returned by the algorithm is provably safe within some level set, any of these policies can be used safely for an arbitrary number of time steps. To scale this method to higher-dimensional systems, one would have to consider an adaptive discretization for the verification as in [27].

Experiments. A Python implementation of Algorithm 1 and the experiments, based on TensorFlow [37] and GPflow [38], is available at https://github.com/befelix/safe_learning.
We verify our approach on an inverted pendulum benchmark problem. The true, continuous-time dynamics are given by ml²ψ̈ = gml sin(ψ) − λψ̇ + u, where ψ is the angle, m the mass, g the gravitational constant, λ the friction coefficient, and u the torque applied to the pendulum. The control torque is limited, so that the pendulum necessarily falls down beyond a certain angle. We use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that assumes a wrong, lower mass and neglects friction. As a result, the optimal policy for the mean dynamics does not perform well and has a small region of attraction, as it underactuates the system. We use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors.
For the policy, we use a neural network with two hidden layers of 32 neurons each and ReLU activations. We compute a conservative estimate of the Lipschitz constant as in [30]. We use standard approximate dynamic programming with a quadratic, normalized cost r(x, u) = xᵀQx + uᵀRu, where Q and R are positive definite, to compute the cost-to-go Jπθ. Specifically, we use a piecewise-linear triangulation of the state space to approximate Jπθ, see [39]. This allows us to quickly verify the assumptions that we made about the Lyapunov function in Sec. 2 using a graph search. In practice, one may use other function approximators. We optimize the policy via stochastic gradient descent on (7), where we sample a finite subset of X and replace the integral in (7) with a sum.
The theoretical confidence intervals for the GP model are conservative. To enable more data-efficient learning, we fix βn = 2. This corresponds to a high-probability decrease condition per state, rather than jointly over the state space. Moreover, we use local Lipschitz constants of the Lyapunov function rather than the global one. While this does not affect the guarantees, it greatly speeds up exploration.
For the initial policy, we use approximate dynamic programming to compute the optimal policy for the prior mean dynamics. This policy is unstable for large deviations from the initial state and has poor performance, as shown in Fig. 2(b). Under this initial, suboptimal policy, the system is stable within a small region of the state space, see Fig. 2(a). Starting from this initial safe set, the algorithm proceeds to collect safe data points and improve the policy. As the uncertainty about the dynamics decreases, the policy improves and the estimated region of attraction increases.
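For reference, the true pendulum dynamics used in this experiment can be simulated with a simple Euler step. This is our own illustrative sketch: the parameter values, step size, and saturation limit `u_max` are made-up choices, not the ones used in the experiments.

```python
import numpy as np

def pendulum_step(psi, psi_dot, torque, dt=0.01, m=0.15, l=0.5,
                  g=9.81, friction=0.1, u_max=1.0):
    """One Euler step of  m*l^2 * psi_dd = g*m*l*sin(psi) - friction*psi_dot + u,
    with the input torque saturated at +/- u_max. The saturation is what makes
    the pendulum necessarily fall beyond a certain angle."""
    u = float(np.clip(torque, -u_max, u_max))
    psi_dd = (g * m * l * np.sin(psi) - friction * psi_dot + u) / (m * l ** 2)
    return psi + dt * psi_dot, psi_dot + dt * psi_dd
```

Linearizing this map around the upright equilibrium (ψ = 0) with a deliberately lower mass and zero friction yields the kind of wrong prior mean model described above, whose residual error the GP then has to capture.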
The region of attraction after 50 data points is shown in Fig. 2(a). The resulting set V(cn) is contained within the true safe region of the optimized policy πn. At the same time, the control performance improves drastically relative to the initial policy, as can be seen in Fig. 2(b). Overall, the approach enables safe learning about dynamic systems, since all data points gathered during learning are collected safely under the current policy.

5 Conclusion

We have shown how classical reinforcement learning can be combined with safety constraints in terms of stability. Specifically, we showed how to safely optimize policies and give stability certificates based on statistical models of the dynamics. Moreover, we provided theoretical safety and exploration guarantees for an algorithm that can drive the system to desired state-action pairs during learning. We believe that our results present an important first step towards safe reinforcement learning algorithms that are applicable to real-world problems.

Acknowledgments
This research was supported by SNSF grant 200020_159557, the Max Planck ETH Center for Learning Systems, NSERC grant RGPIN-2014-04634, and the Ontario Early Researcher Award.

References
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: an introduction. MIT Press, 1998.
[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[3] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565 [cs], 2016.
[4] Hassan K. Khalil and J. W. Grizzle.
Nonlinear systems, volume 3. Prentice Hall, 1996.
[5] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.
[6] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research (JMLR), 16:1437–1480, 2015.
[7] Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.
[8] Peter Geibel and Fritz Wysotzki. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research (JAIR), 24:81–108, 2005.
[9] Aviv Tamar, Shie Mannor, and Huan Xu. Scaling up robust MDPs by reinforcement learning. In Proc. of the International Conference on Machine Learning (ICML), 2014.
[10] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2012.
[11] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In Proc. of the International Conference on Machine Learning (ICML), pages 1711–1718, 2012.
[12] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 4305–4313, 2016.
[13] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2219–2225, 2006.
[14] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proc. of the International Conference on Machine Learning (ICML), 2017.
[15] Jonas Mockus. Bayesian approach to global optimization, volume 37 of Mathematics and Its Applications.
Springer, Dordrecht, 1989.
[16] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, Cambridge MA, 2006.
[17] Jens Schreiter, Duy Nguyen-Tuong, Mona Eberts, Bastian Bischoff, Heiner Markert, and Marc Toussaint. Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, number 9286, pages 133–149. Springer International Publishing, 2015.
[18] Yanan Sui, Alkis Gotovos, Joel W. Burdick, and Andreas Krause. Safe exploration for optimization with Gaussian processes. In Proc. of the International Conference on Machine Learning (ICML), pages 997–1005, 2015.
[19] Felix Berkenkamp, Angela P. Schoellig, and Andreas Krause. Safe controller optimization for quadrotors with Gaussian processes. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), pages 493–496, 2016.
[20] Javier García and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
[21] Alexander Hans, Daniel Schneegaß, Anton Maximilian Schäfer, and Steffen Udluft. Safe exploration for reinforcement learning. In Proc. of the European Symposium on Artificial Neural Networks (ESANN), pages 143–148, 2008.
[22] Theodore J. Perkins and Andrew G. Barto. Lyapunov design for safe reinforcement learning. The Journal of Machine Learning Research, 3:803–832, 2003.
[23] Dorsa Sadigh and Ashish Kapoor. Safe control under uncertainty with probabilistic signal temporal logic. In Proc. of Robotics: Science and Systems, 2016.
[24] Chris J. Ostafew, Angela P. Schoellig, and Timothy D. Barfoot. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. The International Journal of Robotics Research (IJRR), 35(13):1547–1566, 2016.
[25] Anil Aswani, Humberto Gonzalez, S.
Shankar Sastry, and Claire Tomlin. Provably safe and robust learning-based model predictive control. Automatica, 49(5):1216–1226, 2013.
[26] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the IEEE Conference on Decision and Control (CDC), pages 1424–1431, 2014.
[27] Ruxandra Bobiti and Mircea Lazar. A sampling approach to finding Lyapunov functions for nonlinear discrete-time systems. In Proc. of the European Control Conference (ECC), pages 561–566, 2016.
[28] Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. Safe learning of regions of attraction in nonlinear systems with Gaussian processes. In Proc. of the Conference on Decision and Control (CDC), pages 4661–4666, 2016.
[29] Julia Vinogradska, Bastian Bischoff, Duy Nguyen-Tuong, Henner Schmidt, Anne Romer, and Jan Peters. Stability of controllers for Gaussian process forward models. In Proc. of the International Conference on Machine Learning (ICML), pages 545–554, 2016.
[30] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In Proc. of the International Conference on Learning Representations (ICLR), 2014.
[31] Huijuan Li and Lars Grüne. Computation of local ISS Lyapunov functions for discrete-time systems via linear programming. Journal of Mathematical Analysis and Applications, 438(2):701–719, 2016.
[32] Peter Giesl and Sigurdur Hafstein. Review on computational methods for Lyapunov functions. Discrete and Continuous Dynamical Systems, Series B, 20(8):2291–2337, 2015.
[33] Bernhard Schölkopf. Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive computation and machine learning.
MIT Press, Cambridge, Mass, 2002.
[34] Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In Proc. of the International Conference on Machine Learning (ICML), pages 844–853, 2017.
[35] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
[36] Warren B. Powell. Approximate dynamic programming: solving the curses of dimensionality. John Wiley & Sons, 2007.
[37] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467 [cs], 2016.
[38] Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: a Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
[39] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. In Proc. of the Conference on Neural Information Processing Systems (NIPS), pages 1005–1011, 1996.
[40] Andreas Christmann and Ingo Steinwart. Support Vector Machines. Information Science and Statistics.
Springer, New York, NY, 2008.