Adaptive Reward-Free Exploration

March 9, 2021

Speakers

About the presentation

Reward-free exploration is a reinforcement learning setting recently studied by Jin et al. (2020), who address it by running several algorithms with regret guarantees in parallel. In our work, we instead propose a more natural adaptive approach to reward-free exploration which directly reduces upper bounds on the maximum MDP estimation error. We show that, interestingly, our reward-free UCRL algorithm can be seen as a variant of an algorithm proposed by Fiechter in 1994 for a different objective that we call best-policy identification. We prove that RF-UCRL needs of order (SAH^4/\epsilon^2)(\log(1/\delta) + S) episodes to output, with probability 1 - \delta, an \epsilon-approximation of the optimal policy for any reward function. This bound improves over existing sample complexity bounds in both the small \epsilon and the small \delta regimes. We further investigate the relative complexities of reward-free exploration and best-policy identification.
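The abstract sketches the core idea of RF-UCRL: maintain high-probability upper bounds on the estimation error of the empirical MDP and explore greedily so as to shrink them, stopping once the bound at the initial state falls below \epsilon/2. The following is a minimal Python sketch of that loop under stated assumptions, not the authors' implementation: it assumes a tabular episodic MDP with S states, A actions and horizon H, a hypothetical environment interface env.reset() / env.step(s, a), and a placeholder confidence function beta of the right order rather than the paper's exact constants.

```python
# Hedged sketch of an RF-UCRL-style reward-free exploration loop.
# Assumptions (not from the source page): env.reset() -> initial state index,
# env.step(s, a) -> next state index; beta() is a placeholder confidence term.

import numpy as np


def rf_ucrl(env, S, A, H, epsilon, delta, max_episodes=100_000):
    """Collect reward-free data until the error upper bound drops below epsilon/2."""
    counts = np.zeros((S, A), dtype=np.int64)            # visit counts n(s, a)
    trans_counts = np.zeros((S, A, S), dtype=np.int64)   # transition counts n(s, a, s')

    def beta(n):
        # Placeholder of order log(1/delta) + S; not the paper's exact constants.
        return np.log(2 * S * A * H / delta) + S * np.log(8 * np.e * (n + 1))

    for episode in range(max_episodes):
        # Empirical transition model (uniform where no data has been collected yet).
        p_hat = np.where(
            counts[..., None] > 0,
            trans_counts / np.maximum(counts[..., None], 1),
            1.0 / S,
        )

        # Backward recursion for the error upper bounds W_h(s, a).
        bonus = H * np.sqrt(2 * beta(counts) / np.maximum(counts, 1))
        W = np.zeros((H + 1, S, A))
        for h in range(H - 1, -1, -1):
            next_val = (p_hat * W[h + 1].max(axis=-1)[None, None, :]).sum(axis=-1)
            W[h] = np.minimum(H, bonus + next_val)
            W[h][counts == 0] = H  # unvisited pairs keep the trivial bound

        s = env.reset()
        if W[0][s].max() <= epsilon / 2:
            return counts, trans_counts, episode  # stopping rule met

        # Explore greedily with respect to the error upper bounds.
        for h in range(H):
            a = int(np.argmax(W[h][s]))
            s_next = env.step(s, a)
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            s = s_next

    return counts, trans_counts, max_episodes
```

After the stopping rule triggers, the collected counts define an empirical MDP from which a near-optimal policy can be computed for any reward function supplied afterwards, which is what the reward-free guarantee in the abstract refers to.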

Organizer

Category

About the organizer (ALT 2021)

The 32nd International Conference on Algorithmic Learning Theory
