Horizon-Free Reinforcement Learning in Polynomial Time: the Power of Stationary Policies

Jul 2, 2022

About

This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDP) that enjoys a regret bound independent of the planning horizon. Specifically, we consider a tabular MDP with S states, A actions, a planning horizon H, total reward bounded by 1, and an agent that plays for K episodes. We design an algorithm that achieves an O(poly(S,A,log K)√K) regret, in contrast to existing bounds, which either carry an additional polylog(H) factor or have an exponential dependency on S. Our key technical contributions are (1) a new explicit exploration algorithm and (2) a sequence of new results establishing the approximation power, stability, and concentration properties of stationary policies, which may be of independent interest.
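To make the setting concrete, here is a minimal sketch of the tabular episodic MDP model described in the abstract: S states, A actions, horizon H, and per-episode total reward bounded by 1, with a stationary policy (one that ignores the step index h) evaluated by backward induction. All names (P, r, policy) are illustrative and not from the paper; this only illustrates the setting, not the paper's algorithm.

```python
import numpy as np

S, A, H = 5, 3, 100
rng = np.random.default_rng(0)

# Random transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

# Rewards scaled so every trajectory's total reward is at most 1,
# matching the "total reward bounded by 1" normalization.
r = rng.uniform(0, 1.0 / H, size=(S, A))

# A stationary policy maps each state to a single action,
# independent of the step h (non-stationary policies may depend on h).
policy = rng.integers(0, A, size=S)

# Evaluate the stationary policy over horizon H by backward induction.
V = np.zeros(S)
for _ in range(H):
    V = r[np.arange(S), policy] + P[np.arange(S), policy] @ V

print("Value of state 0 under the stationary policy:", V[0])
```

The paper's results concern how well such stationary policies can approximate the optimal (generally non-stationary) policy in this normalized setting; the sketch above only fixes the objects involved.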

Organizer

About COLT

The conference has been held annually since 1988 and has become the leading conference on learning theory by maintaining a highly selective review process for submissions. It is committed to publishing high-quality articles on all theoretical aspects of machine learning and related topics.
