Jul 2, 2022
This paper gives the first polynomial-time algorithm for tabular Markov Decision Processes (MDPs) whose regret is independent of the planning horizon. Specifically, we consider a tabular MDP with S states, A actions, planning horizon H, and total reward bounded by 1, where the agent plays for K episodes. We design an algorithm that achieves an O(poly(S,A,log K)√K) regret, in contrast to existing bounds, which either carry an additional polylog(H) factor or depend exponentially on S. Our key technical contributions are (1) a new explicit exploration algorithm and (2) a sequence of new results establishing the approximation power, stability, and concentration properties of stationary policies, which may be of independent interest.
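The setting can be illustrated with a minimal sketch of the episodic tabular interaction and the regret being bounded. The toy MDP, the uniform-random policy, and all names below are hypothetical illustrations of the standard formalism, not the paper's algorithm:

```python
import numpy as np

# Hypothetical tiny deterministic MDP: S states, A actions, horizon H.
# Per-step rewards lie in [0, 1/H], so each episode's total reward is <= 1,
# matching the "total reward bounded by 1" normalization in the abstract.
S, A, H, K = 3, 2, 5, 100
rng = np.random.default_rng(0)
P = rng.integers(0, S, size=(S, A))          # deterministic next-state table
R = rng.uniform(0, 1.0 / H, size=(S, A))     # deterministic per-step rewards

# Optimal value of the start state via backward induction over H steps.
V = np.zeros(S)
for _ in range(H):
    V = np.max(R + V[P], axis=1)             # Bellman optimality backup
v_star = V[0]                                # optimal H-step value from state 0

# Play K episodes with a fixed uniform-random (stationary) policy.
total_return = 0.0
for _ in range(K):
    s = 0
    for _ in range(H):
        a = int(rng.integers(A))
        total_return += R[s, a]
        s = P[s, a]

# Regret over K episodes: optimal cumulative value minus realized return.
regret = K * v_star - total_return
print(round(regret, 3))
```

Because the toy MDP is deterministic, no trajectory can return more than v_star per episode, so the computed regret is nonnegative and at most K; the paper's algorithm drives this quantity to O(poly(S,A,log K)√K) with no H dependence.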
The conference has been held annually since 1988 and has become the leading conference on learning theory by maintaining a highly selective process for submissions. It is committed to high-quality articles in all theoretical aspects of machine learning and related topics.