KR2023: Proceedings of the 20th International Conference on Principles of Knowledge Representation and Reasoning

Rhodes, Greece. September 2-8, 2023.

ISSN: 2334-1033
ISBN: 978-1-956792-02-7

Copyright © 2023 International Joint Conferences on Artificial Intelligence Organization

Learning General Policies with Policy Gradient Methods

  1. Simon Ståhlberg (Linköping University)
  2. Blai Bonet (Universitat Pompeu Fabra)
  3. Hector Geffner (RWTH Aachen University, Linköping University)


  1. Reasoning about actions and change, action languages
  2. Learning action theories
  3. Symbolic reinforcement learning


While reinforcement learning methods have delivered remarkable results in a number of settings, generalization, i.e., the ability to produce policies that generalize in a reliable and systematic way, has remained a challenge. The problem of generalization has been addressed formally in classical planning, where provably correct policies that generalize over all instances of a given domain have been learned using combinatorial methods. The aim of this work is to bring these two research threads together to illuminate the conditions under which (deep) reinforcement learning approaches, and in particular policy optimization methods, can be used to learn policies that generalize like combinatorial methods do. We draw on lessons learned from previous combinatorial and deep learning approaches, and extend them in a convenient way. From the former, we model policies as state transition classifiers, as (ground) actions are not general and change from instance to instance. From the latter, we use graph neural networks (GNNs) adapted to deal with relational structures for representing value functions over planning states and, in our case, policies. With these ingredients in place, we find that actor-critic methods can be used to learn policies that generalize almost as well as those obtained using combinatorial approaches while avoiding the scalability bottleneck and the use of feature pools. Moreover, the limitations of the deep reinforcement learning (DRL) methods on the benchmarks considered have little to do with deep learning or reinforcement learning algorithms, and result instead from the well-understood expressive limitations of GNNs and from the tradeoff between optimality and generalization (general policies cannot be optimal in some domains). Both of these limitations are addressed without changing the basic DRL methods by adding derived predicates and an alternative cost structure to optimize.
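
The idea of modeling a policy as a state transition classifier can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the policy is taken to be a softmax distribution over the successor states of the current state, and the hypothetical function `score` stands in for whatever learned model (a GNN in the paper) would rate a state. The toy domain, where a state is an integer distance to the goal, is likewise invented here.

```python
import math
from typing import Callable, List, Sequence


def transition_policy(
    successors: Sequence[int],
    score: Callable[[int], float],
) -> List[float]:
    """Return a softmax distribution over successor states.

    The policy ranks the states reachable in one step instead of
    ranking ground actions, so the same scoring function can be
    reused on instances with different objects and actions.
    """
    if not successors:
        raise ValueError("state has no successors (dead end)")
    logits = [score(s) for s in successors]
    # Subtract the maximum logit for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def successors_of(state: int) -> List[int]:
    """Hypothetical toy domain: move one step toward or away from the goal."""
    return [state - 1, state + 1]


def toy_score(state: int) -> float:
    """Hand-written stand-in for a learned scorer: closer to the goal is better."""
    return -float(state)


if __name__ == "__main__":
    probs = transition_policy(successors_of(3), toy_score)
    for nxt, p in zip(successors_of(3), probs):
        print(f"P(next state = {nxt}) = {p:.3f}")
```

In this sketch the probability mass concentrates on the successor that is closer to the goal. An actor-critic method would adjust the parameters behind `score` rather than use a fixed function as done here.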