Achieving Zero Constraint Violation for Concave Utility Constrained Reinforcement Learning via Primal-Dual Approach
Abstract
Reinforcement learning (RL) is widely used in applications that require sequential decision-making while interacting with the environment. The standard RL problem with safety constraints is typically modeled as a constrained Markov decision process (CMDP), which is linear in both objective and constraints in the occupancy-measure space; the problem becomes challenging when the model is unknown a priori. It becomes harder still when the decision-maker must optimize a concave utility while satisfying nonlinear safety constraints. To solve such a nonlinear problem, we propose a conservative stochastic primal-dual algorithm (CSPDA) based on a randomized primal-dual approach. By leveraging a generative model, we prove that CSPDA not only exhibits Õ(1/ε²) sample complexity but also achieves zero constraint violation for the concave utility CMDP. In comparison, the best previously available sample complexity for CMDP with zero constraint violation is Õ(1/ε⁵). Hence, the proposed algorithm provides a significant improvement over the state of the art.
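
For context, a minimal sketch of the formulation the abstract refers to; the notation (occupancy measure λ, utility f, constraint functions g_i) is standard in this line of work and is not taken verbatim from the paper:

% Tabular CMDP written over the occupancy measure \lambda(s,a):
% both the objective and the constraints are linear in \lambda.
\max_{\lambda \in \Lambda} \ \sum_{s,a} \lambda(s,a)\, r(s,a)
\quad \text{s.t.} \quad \sum_{s,a} \lambda(s,a)\, c_i(s,a) \ge b_i, \quad i = 1, \dots, I,

% where \Lambda denotes the set of valid (flow-conserving) occupancy measures.
% The concave-utility variant replaces the linear forms with a general
% concave utility f and concave constraint functions g_i:
\max_{\lambda \in \Lambda} \ f(\lambda)
\quad \text{s.t.} \quad g_i(\lambda) \ge 0, \quad i = 1, \dots, I.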