Learning Linear-Quadratic Regulators Efficiently with only √T Regret We present the first computationally-efficient algorithm with ˜O(√T) regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesvari (2011) and Dean,Mania, Matni, Recht, and Tu (2018). Learning from Delayed Outcomes via Proxies with Applications to Recommender Systems Predicting delayed outcomes is an important problem in recommender systems (e.g., will customers finish reading an ebook?). We formalize the problem as an adversarial, delayed online learning problem and consider how a proxy for the delayed outcome (e.g., if customers read a third of the book in 24 hours) can help minimize regret, even though the proxy is not available when making a prediction. Motivated by our regret analysis, we propose two neural network architectures: Factored Forecaster (FF) which is ideal if the proxy is informative of the outcome in hindsight, and Residual Factored Forecaster (RFF) that is robust to a non-informative proxy. Experiments on two real world datasets for predicting human behavior show that RFF outperforms both FF and a direct forecaster that does not make use of the proxy. Our results suggest that exploiting proxies by factorization is a promising way to mitigate the impact of long delays in human behavior prediction tasks. Adaptive Regret of Convex and Smooth Functions We investigate online convex optimization in changing environments, and choose the adaptive regret as the performance measure. The goal is to achieve a small regret over every interval so that the comparator is allowed to change over time. Different from previous works that only utilize the convexity condition, this paper further exploits smoothness to improve the adaptive regret. To this end, we develop novel adaptive algorithms for convex and smooth functions, and establish problem-dependent regret bounds over any interval. Our regret bounds are comparable to existing results in the worst case, and become much tighter when the comparator has a small loss. Online Adaptive Principal Component Analysis and Its extensions We propose algorithms for online principal component analysis (PCA) and variance minimization for adaptive settings. Previous literature has focused on upper bounding the static adversarial regret, whose comparator is the optimal fixed action in hindsight. However, static regret is not an appropriate metric when the underlying environment is changing. Instead, we adopt the adaptive regret metric from the previous literature and propose online adaptive algorithms for PCA and variance minimization, that have sub-linear adaptive regret guarantees. We demonstrate both theoretically and experimentally that the proposed algorithms can adapt to the changing environments. POLITEX: Regret Bounds for Policy Iteration using Expert Prediction We present POLITEX (POLicy ITeration using EXperts), a model-free reinforcement learning (RL) algorithm that uses linear function approximation for continuing RL problems. POLITEX can be thought of as a “soft” variant of policy iteration, where the policy in each iteration corresponds to a Boltzmann distribution over the sum of previous action-value functions. We show that in uniformly mixing Markov Decision Processes (MDPs), for a time-horizon of T and a worst-case value function approximation error ε where linear function approximation is used with d-dimensional features, the regret of POLITEX scales as O˜(d^(1/2)T^(3/4) + εT). Under a uniform mixing assumption, we provide the first regret result for a practical model-free method that uses function approximation and where the regret does not scale with the size of the underlying MDP. We also provide a new finite sample analysis of the LSPE algorithm, used by POLITEX to estimate the value functions, which may be of independent interest. Experimental results on a queuing problem confirm that POLITEX is competitive with some of its alternatives, while preliminary results on Ms Pacman (one of the standard Atari benchmark problems) confirm the viability of POLITEX beyond linear function approximation. Anytime Online-to-Batch, Optimism and Acceleration A standard way to obtain convergence guarantees in stochastic convex optimization is to run an online learning algorithm and then output the average of its iterates: the actual iterates of the online learning algorithm do not come with individual guarantees. We close this gap by introducing a black-box modification to any online learning algorithm whose iterates converge to the optimum in stochastic scenarios. We then consider the case of smooth losses, and show that combining our approach with optimistic online learning algorithms immediately yields a fast convergence rate of O(L/T3/2+σ/√T) on L-smooth problems with σ2 variance in the gradients. Finally, we provide a reduction that converts any adaptive online algorithm into one that obtains the optimal accelerated rate of ~O(L/T2+σ/√T), while still maintaining ~O(1/√T) convergence in the non-smooth setting. Importantly, these algorithms adapt to L and σ automatically: they do not need to know either to obtain these rates. Cautious Regret Minimization: Online Optimization with Long-Term Budget Constraints We study a class of online convex optimization problems with long-term budget constraints that arise naturally as reliability guarantees or total consumption constraints. In this general setting, prior work by Mannor et al. (2009) has shown that achieving no regret is impossible if the functions defining the agent's budget are chosen by an adversary. To overcome this obstacle, we refine the agent's regret metric by introducing the notion of a "K-benchmark", i.e., a comparator which meets the problem's allotted budget over any window of length K. The impossibility analysis of Mannor et al. (2009) is recovered when K=T; however, for K=o(T), we show that it is possible to minimize regret while still meeting the problem's long-term budget constraints. We achieve this via an online learning policy based on Cautious Online Lagrangiant Descent (COLD) for which we derive explicit bounds, in terms of both the incurred regret and the residual budget violations. Optimal Kronecker-Sum Approximation of Real Time Recurrent Learning One of the central goals of Recurrent Neural Networks (RNNs) is to learn long-term dependencies in sequential data. Nevertheless, the most popular training method, Truncated Backpropagation through Time (TBPTT), categorically forbids learning dependencies beyond the truncation horizon. In contrast, the online training algorithm Real Time Recurrent Learning (RTRL) provides untruncated gradients, with the disadvantage of impractically large computational costs. Recently published approaches reduce these costs by providing noisy approximations of RTRL. We present a new approximation algorithm of RTRL, Optimal Kronecker-Sum Approximation (OK). We prove that OK is optimal for a class of approximations of RTRL, which includes all approaches published so far. Additionally, we show that OK has empirically negligible noise: Unlike previous algorithms it matches TBPTT in a real world task (character-level Penn TreeBank) and can exploit online parameter updates to outperform TBPTT in a synthetic string memorization task. Adaptive Sensor Placement for Continuous Spaces We consider the problem of adaptively placing sensors along an interval to detect stochastically-generated events. We present a new formulation of the problem as a continuum-armed bandit problem with feedback in the form of partial observations of realisations of an inhomogeneous Poisson process. We design a solution method by combining Thompson sampling with nonparametric inference via increasingly granular Bayesian histograms and derive an ~O(T2/3) bound on the Bayesian regret in T rounds. This is coupled with the design of an efficent optimisation approach to select actions in polynomial time. In simulations we demonstrate our approach to have substantially lower and less variable regret than competitor algorithms. Scale-free adaptive planning for deterministic dynamics & discounted rewards We address the problem of planning in an environment with deterministic dynamics and stochastic discounted rewards under a limited numerical budget where the ranges of both rewards and noise are unknown. We introduce \platypoos, an adaptive, robust and efficient alternative to the \OLOP (open-loop optimistic planning) algorithm. Whereas \OLOP requires apriori knowledge of the ranges of both rewards and noise, \platypoos dynamically adapts its behavior to both. This allows \platypoos to be immune to two vulnerabilities of \OLOP: failure when given underestimated ranges of noise and rewards and inefficiency when these are overestimated. \Platypoos additionally adapts to the global smoothness of the value function. We assess \platypoos’s performance in terms of the simple regret, the expected loss resulting from choosing our algorithm’s recommended action rather than an optimal one. We show that \platypoos acts in a provably more efficient manner vs \OLOP when \OLOP is given an overestimated reward and show that in the case of no noise, \platypoos learns exponentially faster than \OLOP.