10. Prosinec 2023
We consider the adversarial linear contextual bandit setting, whichallows for the loss functions associated with each of K arms to changeover time without restriction. Assuming the d-dimensional contexts aredrawn from a fixed known distribution, the worst-case expected regretover the course of T rounds is known to scale as Õ(√(KdT)). Under the additional assumption that the density of the contextsis log-concave, we obtain a second-order bound of order (K√(d V_T)) in terms of the cumulative second moment of thelearner's losses V_T, and a closely related first-order bound of orderÕ(K√(d L_T^*)) in terms of the cumulative loss of the bestpolicy L_T^*. Since V_T or L_T^* may be significantly smaller thanT, these improve over the worst-case regret whenever the environmentis relatively benign. Our results are obtained using a truncated versionof the continuous exponential weights algorithm over the probabilitysimplex, which we analyse by exploiting a novel connection to the linearbandit setting without contexts.
Pro uložení prezentace do věčného trezoru hlasovalo 0 diváků, což je 0.0 %
Prezentace na podobné téma, kategorii nebo přednášejícího