Pure Past Action Masking for Safe Multi-Agent Reinforcement Learning
Summary
Reinforcement Learning (RL) is a machine learning paradigm for solving sequential decision-making problems in non-deterministic environments. However, vanilla RL algorithms, which rely on trial-and-error learning, exhibit unsafe behaviour: an unsafe action may be executed repeatedly during the exploration phase before the agent learns to avoid it. In Multi-Agent Reinforcement Learning (MARL), ensuring safety is further complicated, as it requires coordination among multiple agents. In this thesis, we study how temporal logics, such as Linear Temporal Logic (LTL), can be used to design safer algorithms, and how safe algorithms can be extended to the multi-agent setting.

This thesis introduces Multi-Agent Pure Past Action Masking, a novel approach to provably safe MARL that leverages Pure Past Linear Temporal Logic (PPLTL) to specify and enforce non-Markovian safety constraints. Our contribution is twofold: first, we synthesise a centralised mask that uses PPLTL formulas to define the set of safe joint actions; second, we propose a decomposition algorithm that enables decentralised, communication-free execution by the individual agents. Finally, we formally prove that the individual masks generated by the decomposition algorithm preserve the safety guarantees of the centralised mask, and we validate our results with an experimental evaluation.
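To illustrate the core idea of action masking under a pure-past constraint, the following is a minimal, hypothetical sketch (not the thesis's actual algorithm). A key property of PPLTL is that the truth of a past formula over the history so far can be tracked with a finite amount of state, updated once per step; the mask then forbids any action whose execution would violate the constraint. Here the example constraint is "action `b` is allowed only if `a` has occurred at some earlier step" (i.e. `Once(a)`); the action names and class are illustrative assumptions.

```python
class PPLTLMask:
    """Illustrative mask for the pure-past constraint: b only after Once(a)."""

    def __init__(self):
        # Truth value of the past subformula Once(a) over the history so far.
        self.once_a = False

    def update(self, action):
        # Advance the past-formula state after an action is executed.
        if action == "a":
            self.once_a = True

    def allowed(self, actions):
        # Keep only actions whose execution cannot violate the constraint.
        return [act for act in actions if act != "b" or self.once_a]


mask = PPLTLMask()
print(mask.allowed(["a", "b", "c"]))  # -> ['a', 'c']: "b" is masked out
mask.update("a")
print(mask.allowed(["a", "b", "c"]))  # -> ['a', 'b', 'c']: Once(a) now holds
```

In the multi-agent setting described above, the centralised version of such a mask would operate on joint actions, and the decomposition algorithm would produce one per-agent mask of this shape that each agent evaluates locally, without communication.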