Jul 24, 2023
We consider a contextual combinatorial bandit problem where in each round a learning agent selects a subset of arms and receives feedback on the selected arms according to their scores. The score of an arm is an unknown function of the arm's feature. Approximating this unknown score function with deep neural networks, we propose two algorithms: Combinatorial Neural UCB (CN-UCB) and Combinatorial Neural Thompson Sampling (CN-TS). We prove that CN-UCB achieves 𝒪̃(d̃√(T)) or 𝒪̃(√(d̃ T K)) regret, where d̃ is the effective dimension of a neural tangent kernel matrix, K is the size of a subset of arms, and T is the time horizon. For CN-TS, we adapt an optimistic sampling technique to ensure the optimism of the sampled combinatorial action, and establish a worst-case (frequentist) regret of 𝒪̃(d̃√(TK)). To the best of our knowledge, these are the first combinatorial neural bandit algorithms with regret performance guarantees. In particular, CN-TS is the first Thompson sampling algorithm with worst-case regret guarantees for the general contextual combinatorial bandit problem. Numerical experiments demonstrate the superior performance of our proposed algorithms.
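To make the selection rule concrete, below is a minimal Python/PyTorch sketch of a NeuralUCB-style top-K selection step in the spirit of CN-UCB: a neural network approximates the unknown score function, a confidence bonus is computed from the network's gradient features, and the agent picks the K arms with the largest optimistic scores. The network architecture, the constant beta, and the identity initialization of the gradient covariance Z_inv are illustrative assumptions, not details from the paper.

```python
# Sketch of a NeuralUCB-style top-K selection step (illustrative, not the paper's exact algorithm).
import torch
import torch.nn as nn

torch.manual_seed(0)

d = 8          # arm feature dimension (assumed)
n_arms = 20    # number of candidate arms in this round (assumed)
K = 4          # size of the selected subset, i.e., the super arm (assumed)
beta = 1.0     # exploration scale; in the analysis this would come from the confidence radius

# Small fully connected network approximating the unknown score function.
score_net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
p = sum(param.numel() for param in score_net.parameters())

# Inverse of the regularized gradient covariance matrix Z = lambda*I + sum g g^T.
# Initialized to the identity here purely for illustration; in practice it is
# updated with the gradient features of the arms played in previous rounds.
Z_inv = torch.eye(p)

def grad_features(x):
    """Flattened gradient of the network output w.r.t. its parameters."""
    score_net.zero_grad()
    out = score_net(x)          # scalar output for a single arm feature vector
    out.backward()
    return torch.cat([param.grad.reshape(-1) for param in score_net.parameters()])

# Contexts for this round: one feature vector per candidate arm.
contexts = torch.randn(n_arms, d)

ucb_scores = torch.empty(n_arms)
for a in range(n_arms):
    x = contexts[a]
    g = grad_features(x)
    with torch.no_grad():
        mean = score_net(x).item()                      # estimated score
    bonus = beta * torch.sqrt(g @ Z_inv @ g).item()     # confidence width from gradient features
    ucb_scores[a] = mean + bonus

# Combinatorial action: the top-K arms by optimistic score.
super_arm = torch.topk(ucb_scores, K).indices.tolist()
print("Selected super arm:", super_arm)
```

After observing the feedback on the selected arms, one would update Z_inv with the gradient features of those arms and retrain the network on the accumulated data; the CN-TS variant would instead sample perturbed scores (optimistically, taking the maximum of several samples) rather than adding a deterministic bonus.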