December 10, 2023
We study distributionally robust offline reinforcement learning (RL), which seeks to learn, purely from an offline dataset, an optimal robust policy that performs well in perturbed environments. We propose a generic algorithmic framework, Doubly Pessimistic Model-based Policy Optimization (P^2MPO), for robust offline RL, which features a novel combination of a flexible model estimation subroutine and a doubly pessimistic policy optimization step. The double pessimism principle is crucial for overcoming the distribution shift incurred by (i) the mismatch between the behavior policy and the family of target policies, and (ii) the perturbation of the nominal model. Under certain accuracy assumptions on the model estimation subroutine, we show that P^2MPO is provably sample-efficient with robust partial coverage data, meaning that the offline dataset has good coverage of the distributions induced by the optimal robust policy and the perturbed models around the nominal model. By tailoring specific model estimation subroutines to concrete examples, including the tabular Robust Markov Decision Process (RMDP), factored RMDP, and RMDP with kernel and neural function approximations, we show that P^2MPO enjoys an 𝒪̃(n^{-1/2}) convergence rate, where n is the number of trajectories in the offline dataset. Notably, these models, except for the tabular case, are first identified and proven tractable by this paper. To the best of our knowledge, we are the first to propose a general learning principle, double pessimism, for robust offline RL and to show that it is provably efficient in the context of general function approximations.
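To make the double pessimism principle concrete, here is a minimal toy sketch in a tabular setting: the chosen policy maximizes the worst-case value taken over both (i) a confidence set of models consistent with the offline data and (ii) perturbations around each candidate model. The tiny two-state MDP, the function names, and the specific perturbation maps are all illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

def policy_value(P, R, policy, gamma=0.9):
    """Exact value of a deterministic policy in a tabular MDP (start state 0).

    P: transitions of shape (S, A, S); R: rewards of shape (S, A).
    """
    n_states = R.shape[0]
    # Transition matrix and reward vector induced by the policy.
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])
    r_pi = np.array([R[s, policy[s]] for s in range(n_states)])
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v[0]

def doubly_pessimistic_choice(policies, model_confidence_set, perturbations, R):
    """argmax over pi of: min over P in the confidence set and P' around P
    of V(pi; P') -- the double pessimism principle in miniature."""
    best_pi, best_val = None, -np.inf
    for pi in policies:
        worst = min(
            policy_value(perturb(P), R, pi)
            for P in model_confidence_set
            for perturb in perturbations
        )
        if worst > best_val:
            best_pi, best_val = pi, worst
    return best_pi, best_val

# Toy 2-state, 2-action MDP: action 0 in state 0 stays (reward 1);
# action 1 moves to the absorbing, zero-reward state 1.
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]
P[0, 1] = [0.0, 1.0]
P[1, :] = [0.0, 1.0]
R = np.array([[1.0, 0.0], [0.0, 0.0]])

# (i) confidence set: nominal model plus a nearby smoothed model;
# (ii) perturbation set U(P): identity plus mixing with the uniform kernel.
confidence_set = [P, 0.95 * P + 0.05 * 0.5]
perturbations = [lambda Q: Q, lambda Q: 0.9 * Q + 0.1 * 0.5]

policies = [np.array([0, 0]), np.array([1, 0])]
best_pi, worst_case_value = doubly_pessimistic_choice(
    policies, confidence_set, perturbations, R
)
```

Here the staying policy is selected: it keeps a strictly positive worst-case value under every model in both sets, whereas the switching policy earns zero reward under all of them.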