The Emphatic Approach to Average-Reward Policy Evaluation

Dec 2, 2022

Speakers

About

Off-policy policy evaluation has been a longstanding problem in reinforcement learning. This paper looks at this problem under the average-reward formulation with function approximation. Differential temporal-difference (TD) learning has been proposed recently and has shown great potential compared to previous average-reward learning algorithms. In the tabular setting, off-policy differential TD is guaranteed to converge. However, the convergence guarantee cannot be carried through the function approximation setting. To address the instability of off-policy differential TD, we investigate the emphatic approach proposed for the discounted formulation. Specifically, we introduce average emphatic trace for average-reward off-policy learning. We further show that without any variance reduction techniques, the new trace suffers from slow learning due to high variance of importance sampling ratios. Finally, we show that differential emphatic TD(β), extended from the discounted setting, can save us from the high variance while introducing bias. Experimental results on a counterexample show that differential emphatic TD(β) performs better than an existing competitive off-policy algorithm.

Organizer

Like the format? Trust SlidesLive to capture your next event!

Professional recording and live streaming, delivered globally.

Sharing

Recommended Videos

Presentations on similar topic, category or speaker

Interested in talks like this? Follow NeurIPS 2022