Learning General Policies with Policy Gradient Methods

Simon Ståhlberg; Blai Bonet; Hector Geffner

doi:10.24963/kr.2023/63

KR2023

Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning

Rhodes, Greece. September 2-8, 2023.

Edited by

ISSN: 2334-1033
ISBN: 978-1-956792-02-7

Learning General Policies with Policy Gradient Methods

Simon Ståhlberg(Linköping University)
Blai Bonet(Universitat Pompeu Fabra)
Hector Geffner(RWTH Aachen University, Linköping University)

PDF

BibTeX

https://doi.org/10.24963/kr.2023/63

Keywords

Reasoning about actions and change, action languages
Learning action theories
Symbolic reinforcement learning

Abstract

While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning where provable correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular, policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states, and in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the DRL methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result from the well-understood expressive limitations of GNNs, and the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.