# Reinforcement Learning and Bandits

by

• ## Zheng Wen

· Jun 12, 2019 · 327 views ·

## ICML 2019

Optimal Algorithms for Lipschitz Bandits with Heavy-tailed Rewards We study Lipschitz bandits, where a learner repeatedly plays one arm from an infinite arm set and then receives a stochastic reward whose expectation is a Lipschitz function of the chosen arm. Most of existing work assume the reward distributions are bounded or at least sub-Gaussian, and thus do not apply to heavy-tailed rewards arising in many real-world scenarios such as web advertising and financial markets. To address this limitation, in this paper we relax the assumption on rewards to allow arbitrary distributions that have finite (1+ϵ)-th moments for some ϵ∈(0,1], and propose algorithms that enjoy a sublinear regret of ~O(T(dϵ+1)/(dϵ+ϵ+1)) where T is the time horizon and d is the zooming dimension. The key idea is to exploit the Lipschitz property of the expected reward function by adaptively discretizing the arm set, and employ upper confidence bound policies with robust mean estimators designed for heavy-tailed distributions. Furthermore, we present a lower bound for Lipschitz bandits with heavy-tailed rewards, and show that our algorithms are optimal in terms of T. Finally, we conduct numerical experiments to demonstrate the effectiveness of our algorithms. Target Tracking for Contextual Bandits: Application to Demand Side Management We propose a contextual-bandit approach for demand side management by offering price incentives. More precisely, a target mean consumption is set at each round and the mean consumption is modeled as a complex function of the distribution of prices sent and of some contextual variables such as the temperature, weather, and so on. The performance of our strategies is measured in quadratic losses through a regret criterion. We offer √T upper bounds on this regret (up to poly-logarithmic terms), for strategies inspired by standard strategies for contextual bandits (like LinUCB, see Li et al., 2010). Simulations on a real data set gathered by UK Power Networks, in which price incentives were offered, show that our strategies are effective and may indeed manage demand response by suitably picking the price levels. Correlated bandits or: How to minimize mean-squared error online While the objective in traditional multi-armed bandit problems is to find the arm with the highest mean, in many settings, finding an arm that best captures information about other arms is of interest. This objective, however, requires learning the underlying correlation structure and not just the means. Sensors placement for industrial surveillance and cellular network monitoring are a few applications, where the underlying correlation structure plays an important role. Motivated by such applications, we formulate the correlated bandit problem, where the objective is to find the arm with the lowest mean-squared error (MSE) in estimating all the arms. To this end, we derive first an MSE estimator based on sample variances/covariances and show that our estimator exponentially concentrates around the true MSE. Under a best-arm identification framework, we propose a successive rejects type algorithm and provide bounds on the probability of error in identifying the best arm. Using minimax theory, we also derive fundamental performance limits for the correlated bandit problem. Stay With Me: Lifetime Maximization Through Heteroscedastic Linear Bandits With Reneging Sequential decision making for lifetime maximization is a critical problem in many real-world applications, such as medical treatment and portfolio selection. In these applications, a reneging'' phenomenon, where participants may disengage from future interactions after observing an unsatisfiable outcome, is rather prevalent. To address the above issue, this paper proposes a model of heteroscedastic linear bandits with reneging. The model allows each participant to have a distinctsatisfaction level," with any interaction outcome falling short of that level resulting in that participant reneging. Moreover, it allows the variance of the outcome to be context-dependent. Based on this model, we develop a UCB-type policy, called HR-UCB, and prove that with high probability it achieves O(√T(log(T))3/2) regret. Finally, we validate the performance of HR-UCB via simulations. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits We propose a bandit algorithm that explores by randomizing its history of observations. The algorithm estimates the value of the arm from a non-parametric bootstrap sample of its history, which is augmented with pseudo observations. Our novel design of pseudo observations guarantees that the bootstrap estimates are optimistic with a high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and prove a O(KΔ−1logn) bound on its n-round regret, where K is the number of arms and Δ is the difference in the expected rewards of the optimal and the best suboptimal arms. The key advantage of our exploration design is that it can be easily applied to structured problems. To show this, we propose contextual Giro with an arbitrary non-linear reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that Giro performs well. Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously We develop the first general semi-bandit algorithm that simultaneously achieves O(logT) regret for stochastic environments and O(√T) regret for adversarial environments without knowledge of the regime or the number of rounds T. The leading problem-dependent constants of our bounds are not only optimal in some worst-case sense studied previously, but also optimal for two concrete instances of semi-bandit problems. Our algorithm and analysis extend the recent work of (Zimmert & Seldin, 2019) for the special case of multi-armed bandit, but importantly requires a novel hybrid regularizer designed specifically for semi-bandit. Experimental results on synthetic data show that our algorithm indeed performs well uniformly over different environments. We finally provide a preliminary extension of our results to the full bandit feedback. Bilinear Bandits with Low-rank Structure We introduce the bilinear bandit problem with low-rank structure in which an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. The problem is motivated by numerous applications in which the learner must recommend two different entity types as a single action, such as a male / female pair in an online dating service. The unknown in the problem is a $d1byd2matrix\mathbf{\Theta}^thatdefinesthereward,andhaslowrakr \ll \min{d_1,d_2}. Determinationof \mathbf{\Theta}^$ with this low-rank structure poses a significant challenge in finding the right exploration-exploitation tradeoff. In this work, we propose a new two-stage algorithm called Explore-Subspace-Then-Refine'' (ESTR). The first stage is an explicit subspace exploration, while the second stage is a linear bandit algorithm calledalmost-low-dimensional OFUL'' (LowOFUL) that exploits and further refines the estimated subspace via a regularization technique. We show that the regret of ESTR is $\widetilde{\mathcal{O}}((d1+d2)^{3/2} \sqrt{r T})where\widetilde{\mathcal{O}}hideslogarithmicfactorsandTisthetimehorizon.Thisimprovesupontheregretof \widetilde{\mathcal{O}}(d1d2\sqrt{T})$ attained for a na\"ive linear bandit reduction. We conjecture that the regret bound of ESTR is unimprovable up to polylogarithmic factors. A preliminary experiment shows that ESTR outperforms a na\"ive linear bandit reduction. Online Learning to Rank with Features We introduce a new model for online ranking in which the click probability factors into an examination and attractiveness function and the attractiveness function is a linear function of a feature vector and an unknown parameter. Only relatively mild assumptions are made on the examination function. A novel algorithm for this setup is analysed, showing that the dependence on the number of items is replaced by a dependence on the dimension, allowing the new algorithm to handle a large number of items. When reduced to the orthogonal case, the regret of the algorithm improves on the state-of-the-art. On the Design of Estimators for Bandit Off-Policy Evaluation Off-policy evaluation is the problem of estimating the value of a target policy using data collected under a different policy. Given a base estimator for bandit off-policy evaluation and a parametrized class of control variates, we address the problem of computing a control variate in that class that reduces the risk of the base estimator. We derive the population risk as a function of the class parameters and we establish conditions that guarantee risk improvement. We present our main results in the context of multi-armed bandits, and we propose a simple design for contextual bandits that gives rise to an estimator that is shown to perform well in multi-class cost-sensitive classification datasets. Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem Motivated by the phenomenon that companies introduce new products to keep abreast with customers' rapidly changing tastes, we consider a novel online learning setting where a profit-maximizing seller needs to learn customers' preferences through offering recommendations, which may contain existing products and new products that are launched in the middle of a selling period. We propose a sequential multinomial logit (SMNL) model to characterize customers' behavior when product recommendations are presented in tiers. For the offline version with known customers' preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. For the online problem, we propose a learning algorithm and quantify its regret bound. Moreover, we extend the setting to incorporate a constraint which ensures every new product is learned to a given accuracy. Our results demonstrate the tier structure can be used to mitigate the risks associated with learning new products.