Temporal Logic Monitoring Rewards via Transducers

Giuseppe De Giacomo; Marco Favorito; Luca Iocchi; Fabio Patrizi; Alessandro Ronca

doi:10.24963/kr.2020/89

KR2020

Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning

Rhodes, Greece. September 12-18, 2020.

ISSN: 2334-1033
ISBN: 978-0-9992411-7-2

Giuseppe De Giacomo (University of Rome "La Sapienza")
Marco Favorito (University of Rome "La Sapienza")
Luca Iocchi (University of Rome "La Sapienza")
Fabio Patrizi (University of Rome "La Sapienza")
Alessandro Ronca (University of Rome "La Sapienza")

https://doi.org/10.24963/kr.2020/89

Keywords

Symbolic reinforcement learning-General
Reasoning about actions and change, action languages-General

Abstract

In Markov Decision Processes (MDPs), rewards are assigned by a function of the last state and action. This is often limiting when the domain is not naturally Markovian and becomes so only after careful engineering of an extended state space, whose states record enough information about the past to assign rewards by looking at just the last state and action. Non-Markovian Reward Decision Processes (NMRDPs) extend MDPs by allowing non-Markovian rewards, which depend on the history of states and actions. Non-Markovian rewards can be specified in temporal logics on finite traces, such as LTLf/LDLf, with the advantage of greater abstraction and succinctness; they can then be automatically compiled into an MDP with an extended state space. We contribute to the techniques for handling temporal rewards and to the solutions for engineering them. We first present an approach to compiling temporal rewards that merges the formula automata into a single transducer, sometimes saving up to an exponential number of states. We then define monitoring rewards, which add a further level of abstraction to temporal rewards by adopting the four-valued conditions of runtime monitoring; we argue that our compilation technique allows monitoring rewards to be handled efficiently. Finally, we discuss the application to reinforcement learning.
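
To make the idea of a reward transducer with four-valued monitoring conditions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the construction from the paper: the automaton, the symbol names ("on", "off", "goal", "lava"), the status labels, and the reward values are all hypothetical. It classifies each automaton state as temporarily true, temporarily false, permanently true, or permanently false by checking which states remain reachable, and emits a reward for the status reached after every step.

```python
# Hypothetical hand-written automaton for a finite-trace property:
# "stay on the path; reaching the goal settles it for good, lava violates it for good".
SYMBOLS = ("on", "off", "goal", "lava")
ACCEPTING = {"on_path", "won"}
SINKS = {"won", "lost"}


def delta(state, symbol):
    """Deterministic transition function; sink states absorb every symbol."""
    if state in SINKS:
        return state
    return {"on": "on_path", "off": "off_path", "goal": "won", "lava": "lost"}[symbol]


def reachable(state):
    """All states reachable from `state` in zero or more steps."""
    seen, frontier = {state}, [state]
    while frontier:
        current = frontier.pop()
        for symbol in SYMBOLS:
            nxt = delta(current, symbol)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def status(state):
    """Four-valued monitoring condition of an automaton state."""
    reach = reachable(state)
    if reach <= ACCEPTING:
        return "perm_true"   # every continuation keeps the property satisfied
    if not (reach & ACCEPTING):
        return "perm_false"  # no continuation can ever satisfy the property
    return "temp_true" if state in ACCEPTING else "temp_false"


# Assumed reward attached to each condition (illustrative values only).
REWARD = {"temp_true": 1.0, "temp_false": 0.0, "perm_true": 10.0, "perm_false": -10.0}


def run_transducer(trace, state="on_path"):
    """Read a trace symbol by symbol and emit one reward per step."""
    rewards = []
    for symbol in trace:
        state = delta(state, symbol)
        rewards.append(REWARD[status(state)])
    return state, rewards


final_state, rewards = run_transducer(["on", "off", "on", "goal", "lava"])
print(final_state, rewards)
```

In this sketch the reward at each step depends only on the current transducer state and the incoming symbol, so pairing the environment state with the transducer state yields a Markovian reward over the extended state space. As a simplification, the sketch keeps emitting the permanent reward on every step after a sink is reached; a real design would have to decide whether such rewards are paid once or repeatedly.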