Dec 10, 2023
Many problems in Reinforcement Learning (RL) have an optimal policy that is stochastic; these include problems in randomized allocation of resources, such as placement of security resources, emergency response units, etc. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large. Existing RL methods do not perform well in such large categorical action spaces. These problems also require validity of the realized action (allocation), a constraint that is often difficult to express compactly in closed mathematical form. In this work, we address these issues by (1) using a (state-)conditional normalizing flow to compactly represent the stochastic policy; the compactness arises because the network produces only a single sampled action and the log probability of that action, which is then used by an actor-critic method; and (2) using an invalid action rejection method (based on a valid action oracle) to modify the base policy. The action rejection is enabled by a modified policy gradient that we derive. Our experiments show the scalability of our approach compared to prior methods and its ability to enforce arbitrary state-conditional constraints on the support of the action distribution in any state.
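To make the two ingredients of the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' code): a state-conditional normalizing flow layer that returns a single sampled action together with its log-probability, and a rejection step that resamples until an assumed validity oracle `is_valid(state, action)` accepts the action. The flow architecture, the rounding used to obtain a categorical action, and the oracle are all illustrative assumptions.

```python
# Hypothetical sketch of a conditional normalizing-flow policy with
# invalid-action rejection; names and architecture are assumptions.
import math
import torch
import torch.nn as nn


class ConditionalAffineFlow(nn.Module):
    """One state-conditional affine flow layer: z ~ N(0, I), a = z * exp(s(x)) + t(x)."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim),
        )
        self.action_dim = action_dim

    def forward(self, state):
        # Predict the scale s(x) and shift t(x) of the affine map from the state.
        s, t = self.net(state).chunk(2, dim=-1)
        z = torch.randn(state.shape[0], self.action_dim)
        a = z * torch.exp(s) + t
        # Change of variables: log p(a|x) = log N(z; 0, I) - sum(s)
        base_logp = (-0.5 * z ** 2 - 0.5 * math.log(2 * math.pi)).sum(-1)
        log_prob = base_logp - s.sum(-1)
        return a, log_prob


def sample_valid_action(policy, state, is_valid, max_tries=100):
    """Rejection step: resample from the flow until the oracle accepts the action."""
    for _ in range(max_tries):
        a, log_prob = policy(state)
        action = a.round().long()  # illustrative mapping of flow output to a categorical action
        if is_valid(state, action):
            return action, log_prob
    raise RuntimeError("no valid action found within max_tries")
```

In an actor-critic loop, the returned `log_prob` would enter a loss such as `-(advantage.detach() * log_prob).mean()`; the modified policy gradient that the paper derives to account for the rejection step is not reproduced here.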