Statistical Learning Theory

Jun 11, 2019



Natural Analysts in Adaptive Data Analysis Adaptive data analysis is frequently criticized for its pessimistic generalization guarantees. The source of these pessimistic bounds is a model that permits arbitrary, possibly adversarial analysts that optimally use information to bias results. While being a central issue in the field, still lacking are notions of natural analysts that allow for more optimistic bounds faithful to the reality that typical analysts aren't adversarial. In this work, we propose notions of natural analysts that smoothly interpolate between the optimal non-adaptive bounds and the best-known adaptive generalization bounds. To accomplish this, we model the analyst's knowledge as evolving according to the rules of an unknown dynamical system that takes in revealed information and outputs new statistical queries to the data. This allows us to restrict the analyst through different natural control-theoretic notions. One such notion corresponds to a recency bias, formalizing an inability to arbitrarily use distant information. Another complementary notion formalizes an anchoring bias, a tendency to weight initial information more strongly. Both notions come with quantitative parameters that smoothly interpolate between the non-adaptive case and the fully adaptive case, allowing for a rich spectrum of intermediate analysts that are neither non-adaptive nor adversarial. Natural not only from a cognitive perspective, we show that our notions also capture standard optimization methods, like gradient descent in various settings. This gives a new interpretation to the fact that gradient descent tends to overfit much less than its adaptive nature might suggest. CapsAndRuns: An Improved Method for Approximately Optimal Algorithm Configuration We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution, a problem of major interest in solver autoconfiguration. Following previous work, we focus on designing algorithms that find a configuration with near-optimal expected capped runtime while doing the least amount of work, with the cap chosen in a configuration-specific way so that most instances are solved. In this paper we present a new algorithm, CapsAndRuns, which finds a near-optimal configuration while using time that scales (in a problem dependent way) with the optimal expected capped runtime, significantly strengthening previous results which could only guarantee a bound that scaled with the potentially much larger optimal expected uncapped runtime. The new algorithm is simpler and more intuitive than the previous methods: first it estimates the optimal runtime cap for each configuration, then it uses a Bernstein race to find a near optimal configuration given the caps. Experiments verify that our method can significantly outperform its competitors. Leveraging Low-Rank Relations Between Surrogate Tasks in Structured Prediction We study the interplay between surrogate methods for structured prediction and techniques from multitask learning designed to leverage relationships between surrogate outputs. We propose an efficient algorithm based on trace norm regularization which, differently from previous methods, does not require explicit knowledge of the coding/decoding functions of the surrogate framework. As a result, our algorithm can be applied to the broad class of problems in which the surrogate space is large or even infinite dimensional. We study excess risk bounds for trace norm regularized structured prediction proving the consistency and learning rates for our estimator. We also identify relevant regimes in which our approach can enjoy better generalization performance than previous methods. Numerical experiments on ranking problems indicate that enforcing low-rank relations among surrogate outputs may indeed provide a significant advantage in practice. Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints Classifiers can be trained with data-dependent constraints to satisfy fairness goals, reduce churn, achieve a targeted false positive rate, or other policy goals. We study the generalization performance for such constrained optimization problems, in terms of how well the constraints are satisfied at evaluation time, given that they are satisfied at training time. To improve generalization, we frame the problem as a two-player game where one player optimizes the model parameters on a training dataset, and the other player enforces the constraints on an independent validation dataset. We build on recent work in two-player constrained optimization to show that if one uses this two-dataset approach, then constraint generalization can be significantly improved. As we illustrate experimentally, this approach works not only in theory, but also in practice. Optimality Implies Kernel Sum Classifiers are Statistically Efficient We propose a novel combination of optimization tools with learning theory bounds in order to analyze the sample complexity of optimal classifiers. This contrasts the typical learning theoretic results which hold for all (potentially suboptimal) classifiers. Our work also justifies assumptions made in prior work on multiple kernel learning. As a byproduct of this analysis, we provide a new form of Rademacher hypothesis sets for considering optimal classifiers. The Implicit Fairness Criterion of Unconstrained Learning We clarify what fairness guarantees we can and cannot expect to follow from unconstrained machine learning. Specifically, we show that in many settings, unconstrained learning on its own implies group calibration, that is, the outcome variable is conditionally independent of group membership given the score. A lower bound confirms the optimality of our upper bound. Moreover, we prove that as the excess risk of the learned score decreases, the more strongly it violates separation and independence, two other standard fairness criteria. Our results challenge the view that group calibration necessitates an active intervention, suggesting that often we ought to think of it as a byproduct of unconstrained machine learning. Weak Detection of Signal in the Spiked Wigner Model We consider the problem of detecting the presence of the signal in a rank-one signal-plus-noise data matrix. In case the signal-to-noise ratio is under the threshold below which a reliable detection is impossible, we propose a hypothesis test based on the linear spectral statistics of the data matrix. The error of the proposed test is optimal as it matches the error of the likelihood ratio test that minimizes the sum of the Type-I and Type-II errors. The test is data-driven and does not depend on the distribution of the signal or the noise. If the density of the noise is known, it can be further improved by an entrywise transformation to lower the error of the test. Rademacher Complexity for Adversarially Robust Generalization Many machine learning models are vulnerable to adversarial attacks; for example, adding adversarial perturbations that are imperceptible to humans can often make machine learning models produce wrong predictions with high confidence. Moreover, although we may obtain robust models on the training dataset via adversarial training, in some problems the learned models cannot generalize well to the test data. In this paper, we focus on $\ell\inftyattacks,andstudytheadversariallyrobustgeneralizationproblemthroughthelensofRademachercomplexity.Forbinarylinearclassifiers,weprovetightboundsfortheadversarialRademachercomplexity,andshowthattheadversarialRademachercomplexityisneversmallerthanitsnaturalcounterpart,andithasanunavoidabledimensiondependence,unlesstheweightvectorhasboundedattacks,andstudytheadversariallyrobustgeneralizationproblemthroughthelensofRademachercomplexity.Forbinarylinearclassifiers,weprovetightboundsfortheadversarialRademachercomplexity,andshowthattheadversarialRademachercomplexityisneversmallerthanitsnaturalcounterpart,andithasanunavoidabledimensiondependence,unlesstheweightvectorhasbounded\ell1norm.Theresultsalsoextendtomulti−classlinearclassifiers.For(nonlinear)neuralnetworks,weshowthatthedimensiondependenceintheadversarialRademachercomplexityalsoexists.Wefurtherconsiderasurrogateadversariallossforone−hiddenlayerReLUnetworkandprovemarginboundsforthissetting.Ourresultsindicatethathavingnorm.Theresultsalsoextendtomulti−classlinearclassifiers.For(nonlinear)neuralnetworks,weshowthatthedimensiondependenceintheadversarialRademachercomplexityalsoexists.Wefurtherconsiderasurrogateadversariallossforone−hiddenlayerReLUnetworkandprovemarginboundsforthissetting.Ourresultsindicatethathaving\ell_1$ norm constraints on the weight matrices might be a potential way to improve generalization in the adversarial setting. We demonstrate experimental results that validate our theoretical findings. Provably efficient RL with Rich Observations via Latent State Decoding We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to latent states inductively through a sequence of regression and clustering steps---where previously decoded latent states provide labels for later regression problems---and use it to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over Q-learning with na\"ive exploration, even when Q-learning has cheating access to latent states. Information-Theoretic Considerations in Batch Reinforcement Learning Value-function approximation methods that operate in batch mode have foundational importance to reinforcement learning (RL). Finite sample guarantees for these methods often crucially rely on two types of assumptions: (1) mild distribution shift, and (2) representation conditions that are stronger than realizability. However, the necessity (“why do we need them?”) and the naturalness (“when do they hold?”) of such assumptions have largely eluded the literature. In this paper, we revisit these assumptions and provide theoretical results towards answering the above questions, and make steps towards a deeper understanding of value-function approximation.



About ICML 2019

The International Conference on Machine Learning (ICML) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence known as machine learning. ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics. ICML is one of the fastest growing artificial intelligence conferences in the world. Participants at ICML span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.

Store presentation

Should this presentation be stored for 1000 years?

How do we store presentations

Total of 1 viewers voted for saving the presentation to eternal vault which is 0.1%


Recommended Videos

Presentations on similar topic, category or speaker

Interested in talks like this? Follow ICML 2019