Conservative Exploration in Bandits and Reinforcement Learning

by · Jul 17, 2020 · 35 views ·

ICML 2020

A major challenge in deploying machine learning algorithms for decision-making problems is the lack of guarantee for the performance of their resulting policies, especially those generated during the initial exploratory phase of these algorithms. Online decision-making algorithms, such as those in bandits and reinforcement learning (RL), learn a policy while interacting with the real system. Although these algorithms will eventually learn a good or an optimal policy, there is no guarantee for the performance of their intermediate policies, especially at the very beginning, when they perform a large amount of exploration. Thus, in order to increase their applicability, it is important to control their exploration and to make it more conservative. To address this issue, we define a notion of safety that we refer to as safety w.r.t. a baseline. In this definition, a policy considered to be safe if it performs at least as well as a baseline, which is usually the current strategy of the company. We formulate this notion of safety in bandits and RL and show how it can be integrated into these algorithms as a constraint that must be satisfied uniformly in time. We derive contextual linear bandits and RL algorithms that minimize their regret, while ensure that at any given time, their expected sum of rewards remains above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager of the system is willing to take. We prove regret bounds for our algorithms and show that the cost of satisfying the constraint (conservative exploration) can be controlled. Finally, we report experimental results to validate our theoretical analysis. We conclude the talk by discussing a few other constrained bandit formulations.