Abstract:
Safe reinforcement learning (Safe RL) seeks to learn policies that maximize cumulative reward while satisfying stringent safety constraints during both training and deployment. Most existing solutions, e.g., Lyapunov- and barrier-based methods, either lack the flexibility to handle nonlinear dynamics or rely on analytically hand-crafted safety certificates. To overcome these limitations, we introduce Neural-barrier Lyapunov-constrained Proximal Policy Optimization (NBLC-PPO), a general architecture that combines data-driven neural control barrier functions, Lyapunov stability filters, and trust-region policy updates within PPO. The approach enforces safe actions at every step while providing stability and constraint-satisfaction guarantees in nonlinear environments. NBLC-PPO learns safety certificates and policy parameters simultaneously, enforcing dynamic feasibility through differentiable constraints embedded in the optimization loop. Empirical evaluations show that NBLC-PPO attains state-of-the-art safety-performance trade-offs in constrained control tasks. It achieves a cumulative reward of ∼24, outperforming Lyapunov-PPO (∼19) and baseline PPO (∼17.5), while keeping the average constraint violation to only 0.04–0.06. It also achieves a safety rate above 98.5%, a training stability score of nearly 0.95, and converges 33% faster than baseline PPO. Furthermore, it delivers a reward-to-constraint ratio of over 500, a 66% improvement over Lyapunov-PPO and 2.5× that of baseline PPO. These findings confirm the effectiveness of NBLC-PPO in enabling safe, stable, and high-performing RL in real-world constrained environments.
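To make the idea of "differentiable constraints in the optimization loop" concrete, the sketch below shows one plausible way learned barrier and Lyapunov penalties could be folded into a clipped PPO objective. This is a minimal illustrative sketch, not the authors' implementation: the network shapes, the discrete-time barrier and Lyapunov decrease conditions, and the penalty weights (`alpha`, `beta`, `lam_cbf`, `lam_lyap`) are hypothetical placeholders, and the per-step safe-action filter mentioned in the abstract is omitted.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small feed-forward network used here for the policy mean, barrier h(s), and Lyapunov V(s)."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy with a state-independent log standard deviation."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = MLP(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

def nblc_ppo_loss(policy, barrier, lyapunov,
                  obs, next_obs, actions, old_log_probs, advantages,
                  clip_eps=0.2, alpha=0.1, beta=0.1,
                  lam_cbf=1.0, lam_lyap=1.0):
    """Clipped PPO surrogate augmented with differentiable barrier/Lyapunov penalties (illustrative)."""
    # Standard PPO clipped surrogate objective.
    dist = policy(obs)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )

    # Discrete-time barrier condition: h(s') >= (1 - alpha) * h(s); penalize violations.
    h, h_next = barrier(obs).squeeze(-1), barrier(next_obs).squeeze(-1)
    cbf_violation = torch.relu((1.0 - alpha) * h - h_next)

    # Lyapunov decrease condition: V(s') <= (1 - beta) * V(s); penalize violations.
    v, v_next = lyapunov(obs).squeeze(-1), lyapunov(next_obs).squeeze(-1)
    lyap_violation = torch.relu(v_next - (1.0 - beta) * v)

    # Minimize the negative surrogate plus the two constraint penalties.
    return (-surrogate.mean()
            + lam_cbf * cbf_violation.mean()
            + lam_lyap * lyap_violation.mean())
```

In a full training loop, this loss would be minimized with the usual PPO rollout and advantage-estimation machinery, and a per-step safety filter over sampled actions (as suggested by the abstract's "per-step safe action enforcement") would sit on top of this objective.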