Solving Minimum-Cost Reach Avoid using Reinforcement Learning

Oswin So¹*, Cheng Ge¹*, Chuchu Fan¹
¹ Massachusetts Institute of Technology
* Both authors contributed equally to this work

Minimum-Cost Reach Avoid Problem

We tackle the minimum-cost reach-avoid problem: learn a policy that minimizes the cumulative cost subject to reaching a goal and avoiding unsafe states.
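As a rough formalization (with the cost \( c \), goal set \( \mathcal{G} \), and horizon \( T \) introduced here only for illustration; \( \mathcal{A} \) denotes the avoid set defined below), the problem can be written as

\( \displaystyle \min_{\pi}\; \sum_{t=0}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_T \in \mathcal{G}, \quad x_t \notin \mathcal{A} \;\; \text{for all } t \leq T. \)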

Safety Filter Problem

A safety filter minimally modifies a given test policy so that safety is maintained. In this work, we construct such a safety filter using Control Barrier Functions (CBFs).

Using Cubic Splines to Approximate the Maximum over Time

Constructing a CBF for arbitrary input-constrained systems is hard. For systems with high relative degree, a common approach is to use Higher-Order CBFs (HOCBFs). However, even on the simplest example of a double integrator with bounded accelerations, many HOCBF candidate functions fail to satisfy the CBF conditions and are unsafe.

For example, consider a double integrator (\(\dot{p} = v, \dot{v} = u \)) with box-constrained accelerations \(\lvert u \rvert \leq 1\) and a safety constraint for the position to be positive (\( p \geq 0 \)). The HOCBF candidate \( B(x) = -v - \alpha p \) is valid if and only if \( \alpha = 0 \), which deems all negative velocities as unsafe and is overly conservative. Other choices of \( \alpha \) will result in safety violations for some regions of the state space.
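A quick numerical check makes this concrete (the value \( \alpha = 1 \) and the initial state are chosen here for illustration and are not from the paper): the state \( (p, v) = (10, -10) \) lies on the boundary \( B = 0 \) of the candidate's safe set, yet even maximal braking \( u = +1 \) needs a stopping distance of \( v^2 / 2 = 50 > 10 \), so the position must become negative.

```python
# Double integrator: p_dot = v, v_dot = u, with |u| <= 1 and safety constraint p >= 0.
# HOCBF candidate: B(x) = -v - alpha * p; states with B <= 0 are deemed safe.
# Illustrative check (alpha and initial state chosen for this sketch, not from the paper):
# start on the boundary B = 0 and apply maximal braking u = +1.

alpha = 1.0
p, v = 10.0, -10.0          # B = -v - alpha * p = 10 - 10 = 0, so deemed "safe"
dt, u = 1e-3, 1.0           # brake as hard as the input bound allows

min_p = p
while v < 0.0:              # once v >= 0, the position can no longer decrease
    p += v * dt
    v += u * dt
    min_p = min(min_p, p)

print(f"minimum position reached: {min_p:.1f}")  # ~ -40.0, i.e., p >= 0 is violated
```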

Policy CBFs: Constructing CBFs from the Policy Value Function

In this work, we use the insight that the maximum-over-time value function is a CBF for any choice of nominal policy \(\pi\).

\( \displaystyle V^{h,\pi}(x) \coloneqq \sup_{t \geq 0}\, h(x_t^\pi), \)

where \( x_t^\pi \) denotes the state reached at time \( t \) by following \( \pi \) from \( x \), and the avoid set \( \mathcal{A} \) is described as the superlevel set of some continuous function \( h \):

\( \displaystyle \mathcal{A} = \{ x \mid h(x) > 0 \} \).
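In discrete time, this value can be estimated along a rollout of \( \pi \) by taking the running maximum of \( h \). A minimal sketch, assuming generic `step`, `policy`, and `h` functions (names chosen here for illustration) and a finite horizon:

```python
def policy_value(x0, policy, step, h, horizon=1000):
    """Monte Carlo estimate of V^{h,pi}(x0) = max_t h(x_t^pi) over a finite horizon.

    policy(x) -> u, step(x, u) -> x_next, and h(x) -> float (avoid set = {x : h(x) > 0})
    are placeholders for the actual dynamics and constraint function.
    """
    x, value = x0, h(x0)
    for _ in range(horizon):
        x = step(x, policy(x))
        value = max(value, h(x))   # running maximum of h over the trajectory
    # For learning, the same quantity satisfies the recursion V(x) = max(h(x), V(x_next)).
    return value
```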

Learning the policy value function \(V^{h,\pi}\) for a nominal policy \(\pi\) can be interpreted as policy distillation: the zero-sublevel set of \(V^{h,\pi}\) is an invariant set in which \(\pi\) remains safe, so \(V^{h,\pi}\) can be used as a safety filter for another (potentially unsafe) policy.
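As one concrete (hypothetical) illustration of such a filter, a simple switching scheme accepts the task policy's action only when the predicted next state stays in the zero-sublevel set of \( V^{h,\pi} \), and otherwise falls back to the nominal policy \( \pi \); this is a sketch of a least-restrictive filter, not necessarily the construction used in the paper.

```python
def filtered_action(x, task_policy, nominal_policy, V, step):
    """Switching-style safety filter built from a learned policy value function V.

    task_policy(x) -> u (possibly unsafe), nominal_policy(x) -> u (the policy V was
    trained on), V(x) -> float with {x : V(x) <= 0} the estimated safe/invariant set,
    and step(x, u) -> x_next are all placeholders for this sketch.
    """
    u_task = task_policy(x)
    if V(step(x, u_task)) <= 0.0:    # task action keeps the next state in the safe set
        return u_task
    return nominal_policy(x)         # otherwise defer to the policy that V certifies
```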

Simulation Experiments

F16 Fighter Jet

Avoid: crashing into the ground and extreme angles of attack.

Abstract

Current reinforcement-learning methods cannot directly learn policies for the minimum-cost reach-avoid problem, that is, minimizing the cumulative cost subject to reaching a goal and avoiding unsafe states, because the structure of this optimization problem is incompatible with existing methods. Instead, a surrogate problem is typically solved in which all objectives are combined with a weighted sum. However, this surrogate objective results in suboptimal policies that do not directly minimize the cumulative cost.

In this work, we propose RCPPO, a reinforcement-learning-based method that solves the minimum-cost reach-avoid problem by leveraging connections to Hamilton-Jacobi reachability. Empirical results demonstrate that RCPPO learns policies with goal-reaching rates comparable to existing methods while achieving up to 57% lower cumulative cost on a suite of minimum-cost reach-avoid benchmarks in the MuJoCo simulator.