rPLBP: Temporal Logic for Reasoning about Safety and Rewards of Bounded Policies under Uncertainty
Summary
Temporal logics can be used to reason about how environments change over time. When an environment involves uncertainty, agents with the power to make decisions, or opportunities to gain rewards, reasoning about these factors requires a temporal logic with the vocabulary to express them.
We propose rPLBP, a reward-based extension of the Probabilistic Logic of Bounded Policies (PLBP) that can reason about the rewards gained by an agent in a probabilistic environment. The logic includes expressions for the reward accumulated along a finite path, as well as for the expected reward under a finite strategy (a kind of playbook prescribing the actions the agent will take over a bounded number of future steps). We show that the model checking problem for rPLBP is PSPACE-complete, and we prove that its satisfiability problem is in 2EXPSPACE.
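To make these two quantities concrete, the following is a rough, illustrative sketch (not taken from the paper, and using made-up states, actions, and numbers) of how the reward along a finite path and the expected reward of a finite strategy are computed for a small Markov decision process.

    # Hypothetical two-state MDP: transition probabilities P[s][a] = {s': prob}
    # and rewards R[s][a]; all values are invented example data.
    P = {
        "low":  {"wait": {"low": 1.0},  "risky": {"high": 0.6, "low": 0.4}},
        "high": {"wait": {"high": 1.0}, "risky": {"high": 0.5, "low": 0.5}},
    }
    R = {
        "low":  {"wait": 0.0, "risky": 1.0},
        "high": {"wait": 2.0, "risky": 3.0},
    }

    def path_reward(path):
        # Reward accumulated along a finite path [(s0, a0), (s1, a1), ...].
        return sum(R[s][a] for s, a in path)

    def expected_reward(state, policy, step=0):
        # Expected cumulative reward of a finite strategy: policy[t][s] is the
        # action the agent commits to in state s at step t, for a bounded horizon.
        if step == len(policy):
            return 0.0
        a = policy[step][state]
        return R[state][a] + sum(
            p * expected_reward(nxt, policy, step + 1)
            for nxt, p in P[state][a].items()
        )

    # A two-step "playbook": play "risky" first, then "wait".
    policy = [
        {"low": "risky", "high": "risky"},
        {"low": "wait",  "high": "wait"},
    ]
    print(path_reward([("low", "risky"), ("high", "wait")]))  # 1.0 + 2.0 = 3.0
    print(expected_reward("low", policy))                     # 1.0 + 0.6*2.0 = 2.2

The paper's formulas quantify over such bounded strategies symbolically; the code above only evaluates the underlying semantic quantities for one fixed strategy.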
We explore applications of rPLBP in the field of safe reinforcement learning. In shielded learning, safety constraints are used to block unsafe actions during the learning phase of an agent. Unlike the temporal logics usually used for such constraints, e.g. Linear Temporal Logic and Probabilistic Computation Tree Logic, rPLBP allows us to express to what extent safety guidelines may be relaxed based on expected reward.
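As a minimal sketch of that shielding idea, under assumptions not taken from the paper: the hypothetical helpers unsafe_prob and expected_reward stand in for quantities a model checker would supply, and the shield vetoes an action when its risk exceeds a threshold that is relaxed, up to a cap, when the action's expected reward is high. This is only meant to illustrate the kind of reward-dependent relaxation the summary above attributes to rPLBP.

    # Illustrative shield, not the paper's construction: filter candidate
    # actions by a safety threshold that expected reward can relax slightly.
    def shield(state, candidate_actions, unsafe_prob, expected_reward,
               base_threshold=0.05, relax_per_reward=0.01, max_threshold=0.2):
        allowed = []
        for a in candidate_actions:
            # Higher expected reward buys a bounded relaxation of the safety bound.
            threshold = min(max_threshold,
                            base_threshold + relax_per_reward * expected_reward(state, a))
            if unsafe_prob(state, a) <= threshold:
                allowed.append(a)
        return allowed

    # Example usage with toy estimates (made-up numbers).
    probs   = {("s0", "risky"): 0.08, ("s0", "wait"): 0.0}
    rewards = {("s0", "risky"): 5.0,  ("s0", "wait"): 1.0}
    print(shield("s0", ["risky", "wait"],
                 unsafe_prob=lambda s, a: probs[(s, a)],
                 expected_reward=lambda s, a: rewards[(s, a)]))
    # ['risky', 'wait']: "risky" passes because its 0.08 risk is within the
    # relaxed bound 0.05 + 0.01 * 5.0 = 0.10.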