Monge blunts Bayes: Hardness Results for Adversarial Training The last few years have seen a staggering number of empirical studies of the robustness of neural networks in a model of adversarial perturbations of their inputs. Most rely on an adversary which carries out local modifications within prescribed balls. None however has so far questioned the broader picture: how to frame a \textit{resource-bounded} adversary so that it can be \textit{severely detrimental} to learning, a non-trivial problem which entails at a minimum the choice of loss and classifiers. We suggest a formal answer for losses that satisfy the minimal statistical requirement of being \textit{proper}. We pin down a simple sufficient property for any given class of adversaries to be detrimental to learning, involving a central measure of harmfulness'' which generalizes the well-known class of integral probability metrics. A key feature of our result is that it holds for \textit{all} proper losses, and for a popular subset of these, the optimisation of this central measure appears to be \textit{independent of the loss}. When classifiers are Lipschitz -- a now popular approach in adversarial training --, this optimisation resorts to \textit{optimal transport} to make a low-budget compression of class marginals. Toy experiments reveal a finding recently separately observed: training against a sufficiently budgeted adversary of this kind \textit{improves} generalization. Better generalization with less data using robust gradient descent For learning tasks where the data (or losses) may be heavy-tailed, algorithms based on empirical risk minimization may require a substantial number of observations in order to perform well off-sample. In pursuit of stronger performance under weaker assumptions, we propose a technique which uses a cheap and robust iterative estimate of the risk gradient, which can be easily fed into any steepest descent procedure. Finite-sample risk bounds are provided under weak moment assumptions on the loss gradient. The algorithm is simple to implement, and empirical tests using simulations and real-world data illustrate that more efficient and reliable learning is possible without prior knowledge of the loss tails. Near optimal finite time identification of arbitrary linear dynamical systems We derive finite time error bounds for estimating general linear time-invariant (LTI) systems from a single observed trajectory using the method of least squares. We provide the first analysis of the general case when eigenvalues of the LTI system are arbitrarily distributed in three regimes: stable, marginally stable, and explosive. Our analysis yields sharp upper bounds for each of these cases separately. We observe that although the underlying process behaves quite differently in each of these three regimes, the systematic analysis of a self--normalized martingale difference term helps bound identification error up to logarithmic factors of the lower bound. On the other hand, we demonstrate that the least squares solution may be statistically inconsistent under certain conditions even when the signal-to-noise ratio is high. Lossless or Quantized Boosting with Integer Arithmetic In supervised learning, efficiency often starts with the choice of a good loss: support vector machines popularised Hinge loss, Adaboost popularised the exponential loss, etc. Recent trends in machine learning have highlighted the necessity for training routines to meet tight requirements on communication, bandwidth, energy, operations, encoding, among others. Fitting the often decades-old state of the art training routines into these new constraints does not go without pain and uncertainty or reduction in the original guarantees. Our paper starts with the design of a new strictly proper canonical, twice differentiable loss called the Q-loss. Importantly, its mirror update over (arbitrary) rational inputs uses only integer arithmetics -- more precisely, the sole use of +,−,/,×,|.|+,−,/,×,|.|. We build a learning algorithm which is able, under mild assumptions, to achieve a lossless boosting-compliant training. We give conditions for a quantization of its main memory footprint, weights, to be done while keeping the whole algorithm boosting-compliant. Experiments display that the algorithm can achieve a fast convergence during the early boosting rounds compared to AdaBoost, even with a weight storage that can be 30+ times smaller. Lastly, we show that the Bayes risk of the Q-loss can be used as node splitting criterion for decision trees and guarantees optimal boosting convergence. Orthogonal Random Forest for Causal Inference We propose orthogonal random forest, an algorithm that incorporates double machine learning---a method of using Neyman-orthogonal moments to reduce sensitivity with respect to nuisance parameters to estimate the target parameter---with generalized random forests---a flexible non-parametric method for statistical estimation of conditional moment models using random forests. We provide a consistency rate and establish asymptotic normality for our estimator. We show that under mild assumption on the consistency rate of the nuisance estimator, we can achieve the same error rate as an oracle with a priori knowledge of these nuisance parameters. We show that when the nuisance functions have a locally sparse parametrization, then a local ℓ1ℓ1-penalized regression achieves the required rate. We apply our method to estimate heterogeneous treatment effects from observational data with discrete treatments or continuous treatments, and we show that, unlike prior work, our method provably allows to control for a high-dimensional set of variables under standard sparsity conditions. We also provide a comprehensive empirical evaluation of our algorithm on both synthetic data and real data, and show that it consistently outperforms baseline approaches. MONK -- Outlier-Robust Mean Embedding Estimation by Median-of-Means Mean embeddings provide an extremely flexible and powerful tool in machine learning and statistics to represent probability distributions and define a semi-metric (MMD, maximum mean discrepancy; also called N-distance or energy distance), with numerous successful applications. The representation is constructed as the expectation of the feature map defined by a kernel. As a mean, its classical empirical estimator, however, can be arbitrary severely affected even by a single outlier in case of unbounded features. To the best of our knowledge, unfortunately even the consistency of the existing few techniques trying to alleviate this serious sensitivity bottleneck is unknown. In this paper, we show how the recently emerged principle of median-of-means can be used to design estimators for kernel mean embedding and MMD with excessive resistance properties to outliers, and optimal sub-Gaussian deviation bounds under mild assumptions. The advantages of multiple classes for reducing overfitting from test set reuse Excessive reuse of holdout data can lead to overfitting. Yet, there is no concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks. Known results show that, in the worst-case, revealing the accuracy of kk adaptively chosen classifiers on a data set of size nn allows to create a classifier with bias of Θ(√k/n)Θ(k/n) for any binary prediction problem. We show a new upper bound of ~O(max√klog(n)/(mn),k/n)O~(maxklog⁡(n)/(mn),k/n) on the bias that any attack with k≥~Ω(m)k≥Ω~(m) queries can achieve in a prediction problem with mm classes. Moreover, we show a natural attack that, under plausible technical condition, achieves the nearly matching bias of Ω(√k/(mn))Ω(k/(mn)). Complementing our theoretical work, we give new practical attacks to stress test multiclass benchmarks by aiming to create as large a bias as possible with a given number of queries. Through extensive experiments, we show that the additional uncertainty of prediction with a large number of classes indeed mitigates the effect of our best attacks. Our work extends important developments in understanding of overfitting in adaptive data analysis to multiclass prediction problems. In addition it bears out the surprising fact that multiclass prediction problems are significantly more robust to overfitting from reusing the test set. This helps to explain why popular multiclass prediction benchmarks, such as ImageNet, may enjoy a longer lifespan than what intuition from the binary case would have suggested. On the statistical rate of nonlinear recovery in generative models with heavy-tailed data We consider estimating a high-dimensional vector from non-linear measurements where the unknown vector is represented by a generative model G:Rk→RdG:Rk→Rd with k≪dk≪d. Such a model poses structural priors on the unknown vector without having a dedicated basis, and in particular allows new and efficient approaches solving recovery problems with number of measurements far less than the ambient dimension of the vector. While progresses have been made recently regarding theoretical understandings on the linear Gaussian measurements, much less is known when the model is possibly misspecified and the measurements are non-Gaussian. In this paper, we make a step towards such a direction by considering the scenario where the measurements are non-Gaussian, subject to possibly unknown nonlinear transformations and the responses are heavy-tailed. We then propose new estimators via score functions based on the first and second order Stein's identity, and prove the sample size bound of m=O(kε−2log(L/ε))m=O(kε−2log⁡(L/ε)) achieving an εεerror in the form of exponential concentration inequalities. Furthermore, for the special case of multi-layer ReLU generative model, we improve the sample bound by a logarithm factor to m=O(kε−2log(d))m=O(kε−2log⁡(d)), matching the state-of-art statistical rate in compressed sensing for estimating kk-sparse vectors. On the technical side, we develop new chaining methods bounding heavy-tailed processes, which could be of independent interest. Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size! How does missing data affect our ability to learn signal structures? It has been shown that learning signal structure in terms of principal components is dependent on the ratio of sample size and dimensionality and that a critical number of observations is needed before learning starts (Biehl and Mietzner, 1993). Here we generalize this analysis to include missing data. Probabilistic principal component analysis is regularly used for estimating signal structures in datasets with missing data. Our analytic result suggest that the effect of missing data is to effectively reduce signal-to-noise ratio rather than - as generally believed - to reduce sample size. The theory predicts a phase transition in the learning curves and this is indeed found both in simulation data and in real datasets. On Medians of (Randomized) Pairwise Means Tournament procedures, recently introduced in \cite{lugosi2016risk}, offer an appealing alternative, from a theoretical perspective at least, to the principle of \textit{Empirical Risk Minimization} in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of candidate decision rules on each data block: that with highest performance on the majority of the blocks is declared as the winner. In the context of nonparametric regression, functions having won all their duels have been shown to outperform empirical risk minimizers w.r.t. the mean squared error under minimal assumptions, while exhibiting robustness properties. It is the purpose of this paper to extend this approach, in order to address other learning problems in particular, for which the performance criterion takes the form of an expectation over pairs of observations rather than over one single observation, as may be the case in pairwise ranking, clustering or metric learning. Precisely, it is proved here that the bounds achieved by MoM are essentially conserved when the blocks are built by means of independent sampling without replacement schemes instead of a simple segmentation. These results are next extended to situations where the risk is related to a pairwise loss function and its empirical counterpart is of the form of a U-statistic. Beyond theoretical results guaranteeing the performance of the learning/estimation methods proposed, some numerical experiments provide empirical evidence of their relevance in practice.