Dez 10, 2023
We consider the adversarial linear contextual bandit setting, whichallows for the loss functions associated with each of K arms to changeover time without restriction. Assuming the d-dimensional contexts aredrawn from a fixed known distribution, the worst-case expected regretover the course of T rounds is known to scale as Õ(√(KdT)). Under the additional assumption that the density of the contextsis log-concave, we obtain a second-order bound of order (K√(d V_T)) in terms of the cumulative second moment of thelearner's losses V_T, and a closely related first-order bound of orderÕ(K√(d L_T^*)) in terms of the cumulative loss of the bestpolicy L_T^*. Since V_T or L_T^* may be significantly smaller thanT, these improve over the worst-case regret whenever the environmentis relatively benign. Our results are obtained using a truncated versionof the continuous exponential weights algorithm over the probabilitysimplex, which we analyse by exploiting a novel connection to the linearbandit setting without contexts.
Ewigspeicher-Fortschrittswert: 0 = 0.0%
Präsentationen, deren Thema, Kategorie oder Sprecher:in ähnlich sind