Apr 14, 2021
We investigate the misspecified linear contextual bandit (MLCB) problem, a generalization of the linear contextual bandit (LCB) problem. The MLCB problem is a sequential decision-making problem in which, in each round, a learner observes $d$-dimensional feature vectors, called arms, chooses one of them, and then receives a reward for the chosen arm. The learner aims to maximize the sum of the rewards over $T$ rounds. In contrast to the LCB problem, the rewards in the MLCB problem may not be represented by a linear function of the feature vectors, but are approximated by a linear function with additive approximation parameter $\varepsilon \geq 0$. In this paper, we propose an algorithm that achieves the regret bound $\tilde{O}(d\sqrt{T} + \varepsilon\sqrt{d}T)$, where $\tilde{O}(\cdot)$ ignores polylogarithmic factors in $d$ and $T$. This is the first algorithm for the MLCB problem that works without knowing the approximation parameter $\varepsilon$ in advance. The obtained regret bound matches the bound of Lattimore et al. (2020), which requires $\varepsilon$ to be known in advance. We remark that the proposed algorithm is optimal when $\varepsilon = 0$, i.e., in the case of the LCB problem.
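To make the setting concrete, the following is a minimal sketch of the MLCB interaction loop: rewards are linear in the features up to a bounded perturbation of magnitude at most $\varepsilon$. The learner shown here is a standard ridge-regression-based optimistic (LinUCB-style) baseline, not the algorithm proposed in the paper; all sizes, the perturbation function, and the confidence width `beta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T, eps = 5, 10, 2000, 0.05  # illustrative dimensions and horizon, not from the paper

# Unknown parameter vector defining the linear part of the reward.
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

def reward(x):
    # Misspecified linear reward: linear term plus a bounded perturbation
    # of magnitude at most eps (hypothetical misspecification), plus noise.
    return x @ theta + eps * np.sin(10.0 * x.sum()) + 0.1 * rng.normal()

# Ridge-regression-based optimism (a generic LinUCB-style sketch;
# the paper's algorithm differs and adapts to the unknown eps).
lam, beta = 1.0, 1.0
A = lam * np.eye(d)   # regularized Gram matrix of played arms
b = np.zeros(d)       # accumulated reward-weighted features
regret = 0.0          # cumulative regret w.r.t. the linear part only
for t in range(T):
    # Fresh set of K unit-norm arms each round.
    arms = rng.normal(size=(K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    theta_hat = np.linalg.solve(A, b)          # ridge estimate of theta
    Ainv = np.linalg.inv(A)
    # Optimistic index: estimated reward plus exploration bonus ||x||_{A^{-1}}.
    ucb = arms @ theta_hat + beta * np.sqrt(
        np.einsum('kd,dc,kc->k', arms, Ainv, arms))
    x = arms[np.argmax(ucb)]
    r = reward(x)
    A += np.outer(x, x)
    b += r * x
    regret += (arms @ theta).max() - x @ theta

print(f"cumulative linear-part regret after {T} rounds: {regret:.1f}")
```

Because the perturbation is bounded by $\varepsilon$, any learner necessarily incurs an extra regret term that scales linearly in $T$, which is why the bound above contains the $\varepsilon\sqrt{d}T$ term alongside the usual $d\sqrt{T}$ term.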
The 24th International Conference on Artificial Intelligence and Statistics was held virtually from Tuesday, 13 April 2021 to Thursday, 15 April 2021.