# Reinforcement Learning Theory

by

• ## zhengyao jiang

· Jun 11, 2019 · 458 views ·

## ICML 2019

Safe Policy Improvement with Baseline Bootstrapping This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed to perform at least as well as the baseline policy used to collect the data. Our approach, called SPI with Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows paradigm: it bootstraps the trained policy with the baseline when the uncertainty is high. Our first algorithm, $\Pib−SPI BB,comes with SPI the oretical guarantees.We also implement a variant, \Pi{\leq b}$-SPIBB, that is even more efficient in practice. We apply our algorithms to a motivational stochastic gridworld domain and further demonstrate on randomly generated MDPs the superiority of SPIBB with respect to existing algorithms, not only in safety but also in mean performance. Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with deep RL implementation called SPIBB-DQN, which is, to the best of our knowledge, the first RL algorithm relying on a neural network representation able to train efficiently and reliably from batch data, without any interaction with the environment. Distributional Reinforcement Learning for Efficient Exploration In distributional reinforcement learning (RL), the estimated distribution of value functions model both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method achieves 483 % average gain across 49 games in cumulative rewards over QR-DQN. We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves nearoptimal safety rewards twice faster than QRDQN. Optimistic Policy Optimization via Multiple Importance Sampling Policy Search (PS) is an effective approach to Reinforcement Learning for solving control tasks with continuous state-action spaces. In this paper, we address the exploration-exploitation trade-off in PS by proposing an approach based on Optimism in Face of Uncertainty. We cast the PS problem as a suitable Multi Armed Bandit problem, defined over the policy parameter space, and we propose a class of algorithms that effectively exploit the problem structure, by leveraging Multiple Importance Sampling to perform an off-policy estimation of expected return. We show that the regret of the proposed approach is bounded by ˜O(√T) for both discrete and continuous parameter spaces. Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms. Neural Logic Reinforcement Learning Deep reinforcement learning (DRL) has achieved significant breakthroughs in various tasks. However, most DRL algorithms suffer a problem of generalising the learned policy which makes the learning performance largely affected even by minor modifications of the training environment. Except that, the use of deep neural networks makes the learned policies hard to be interpretable. To tackle these two challenges, we propose a novel algorithm named Neural Logic Reinforcement Learning (NLRL) to represent the policies in the reinforcement learning by first order logic. NLRL is based on policy gradient methods and differentiable inductive logic programming that have demonstrated significant advantages in terms of interpretability and generalisability in supervised tasks. Extensive experiments conducted on cliff-walking and blocks manipulation tasks demonstrate that NLRL can induce interpretable policies achieving near-optimal performance, while demonstrating good generalisability to environments of different initial states and problem sizes. Learning to Collaborate in Markov Decision Processes We consider a two-agent MDP framework where agents repeatedly solve a task in a collaborative setting. We study the problem of designing a learning algorithm for the first agent (A1) that facilitates a successful collaboration even in cases when the second agent (A2) is adapting its policy in an unknown way. The key challenge in our setting is that the presence of the second agent leads to non-stationarity and non-obliviousness of rewards and transitions for the first agent. We design novel online learning algorithms for agent A1 whose regret decays as O(T1−37⋅α) with T learning episodes provided that the magnitude of agent A2's policy changes between any two consecutive episodes are upper bounded by O(T−α). Here, the parameter α is assumed to be strictly greater than 0, and we show that this assumption is necessary provided that the {\em learning parity with noise} problem is computationally hard. We show that sub-linear regret of agent A1 further implies near-optimality of the agents' joint return for MDPs that manifest the properties of a {\em smooth} game. Predictor-Corrector Policy Optimization We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new PicCoLOed'' algorithm optimizes a policy by recursively repeating two steps: In the Prediction Step, the learner uses a model to predict the unseen future gradient and then applies the predicted estimate to update the policy; in the Correction Step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and hence does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that the convergence rate of several first-order model-free algorithms can be improved by PicCoLO. Learning a Prior over Intent via Meta-Inverse Reinforcement Learning A significant challenge for the practical application of reinforcement learning to real world problems is the need to specify an oracle reward function that correctly defines a task. Inverse reinforcement learning (IRL) seeks to avoid this challenge by instead inferring a reward function from expert demonstrations. While appealing, it can be impractically expensive to collect datasets of demonstrations that cover the variation common in the real world (e.g. opening any type of door). Thus in practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function. In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a ''prior'' that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations. We demonstrate that our method can efficiently recover rewards from images for novel tasks and provide intuition as to how our approach is analogous to learning a prior. DeepMDP: Learning Continuous Latent Space Models for Representation Learning Many reinforcement learning tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a \textit{DeepMDP}, a Markov Decision Process (MDP) parameterized by neural networks that is able to recover these representations. We mathematically develop several desirable notions of similarity between the original MDP and the DeepMDP based on two main objectives: (1) modeling the dynamics of an MDP, and (2) learning a useful abstract representation of the states of an MDP. While the motivation for each of these notions is distinct, we find that they are intimately related. Specifically, we derive tractable training objectives of the DeepMDP components which simultaneously and provably encourage \textit{all} notions of similarity. We validate our theoretical findings by showing that we are able to learn DeepMDPs and recover the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari domain leads to large performance improvements. Importance Sampling Policy Evaluation with an Estimated Behavior Policy We consider the problem of off-policy evaluation in Markov decision processes. Off-policy evaluation is the task of evaluating the expected return of one policy with data generated by a different, behavior policy. Importance sampling is a technique for off-policy evaluation that re-weights off-policy returns to account for differences in the likelihood of the returns between the two policies. In this paper, we study importance sampling with an estimated behavior policy where the behavior policy estimate comes from the same set of data used to compute the importance sampling estimate. We find that this estimator often lowers the mean squared error of off-policy evaluation compared to importance sampling with the true behavior policy or using a behavior policy that is estimated from a separate data set. Intuitively, estimating the behavior policy in this way corrects for error due to sampling in the action-space. Our empirical results also extend to other popular variants of importance sampling and show that estimating a non-Markovian behavior policy can further lower large-sample mean squared error even when the true behavior policy is Markovian. Learning from a Learner In this paper, we propose a novel setting for Inverse Reinforcement Learning (IRL), namely "Learning from a Learner" (LfL). As opposed to standard IRL, it does not consist in learning a reward by observing an optimal agent but from observations of another learning (and thus sub-optimal) agent. To do so, we leverage the fact that the observed agent's policy is assumed to improve over time. The ultimate goal of this approach is to recover the actual environment's reward and to allow the observer to outperform the learner. To recover that reward in practice, we propose methods based on the entropy-regularized policy iteration framework. We discuss different approaches to learn solely from trajectories in the state-action space. We demonstrate the genericity of our method by observing agents implementing various reinforcement learning algorithms. Finally, we show that, on both discrete and continuous state/action tasks, the observer's performance (that optimizes the recovered reward) can surpass those of the observed agent.