Transfer and Multitask Learning

Jun 12, 2019

About

Domain Agnostic Learning with Disentangled Representations

Unsupervised model transfer has the potential to greatly improve the generalizability of deep models to novel domains. Yet the current literature assumes that the separation of target data into distinct domains is known a priori. In this paper, we propose the task of Domain-Agnostic Learning (DAL): How to transfer knowledge from a labeled source domain to unlabeled data from arbitrary target domains? To tackle this problem, we devise a novel Deep Adversarial Disentangled Autoencoder (DADA) capable of disentangling domain-specific features from class identity. We demonstrate experimentally that when the target domain labels are unknown, DADA leads to state-of-the-art performance on several image classification datasets.

Composing Value Functions in Reinforcement Learning

An important property for lifelong-learning agents is the ability to combine existing skills to solve new unseen tasks. In general, however, it is unclear how to compose existing skills in a principled manner. We show that optimal value function composition can be achieved in entropy-regularised reinforcement learning (RL), and then extend this result to the standard RL setting. Composition is demonstrated in a high-dimensional video game environment, where an agent with an existing library of skills is immediately able to solve new tasks without the need for further learning.

Fast Context Adaptation via Meta-Learning

We propose CAVIA, a meta-learning method for fast adaptation that is scalable, flexible, and easy to implement. CAVIA partitions the model parameters into two parts: context parameters that serve as additional input to the model and are adapted on individual tasks, and shared parameters that are meta-trained and shared across tasks. At test time, the context parameters are updated with one or several gradient steps on a task-specific loss that is backpropagated through the shared part of the network.
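
The test-time adaptation described above can be illustrated with a toy sketch. This is not the authors' implementation: the linear model, the names `predict`, `task_loss` and `adapt_context`, the hand-picked "meta-trained" weights, and the choice to start the context parameter at zero are all assumptions made for illustration. Only the inner loop is shown; meta-training of the shared weights is omitted.

```python
# Toy CAVIA-style adaptation: the shared weights stay fixed at test time
# and only the context parameter phi is updated on the new task.

def predict(x, phi, shared):
    """Linear model whose input is the feature x plus the context value phi."""
    w_x, w_phi, bias = shared
    return w_x * x + w_phi * phi + bias

def task_loss(data, phi, shared):
    """Mean squared error over one task's (x, y) pairs."""
    return sum((predict(x, phi, shared) - y) ** 2 for x, y in data) / len(data)

def adapt_context(data, shared, inner_lr=0.25, steps=20):
    """Run a few gradient steps on phi only (phi starts at 0 by assumption)."""
    _, w_phi, _ = shared
    phi = 0.0
    for _ in range(steps):
        # d(loss)/d(phi): the error signal passes through the shared weight w_phi.
        grad = sum(2.0 * (predict(x, phi, shared) - y) * w_phi
                   for x, y in data) / len(data)
        phi -= inner_lr * grad
    return phi

shared = (2.0, 1.0, 0.0)  # hypothetical stand-in for meta-trained weights
task = [(x, 2.0 * x + 3.0) for x in (-1.0, 0.0, 1.0)]  # new task: offset of 3

phi = adapt_context(task, shared)
loss_before = task_loss(task, 0.0, shared)
loss_after = task_loss(task, phi, shared)
```

In this toy setting the task differs from the shared model only by an offset, so adapting the single context value is enough to fit it while the shared weights are left untouched.
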
Compared to approaches that adjust all parameters on a new task (e.g., MAML), CAVIA can be scaled up to larger networks without overfitting on a single task, is easier to implement, and is more robust to the inner-loop learning rate. We show empirically that CAVIA outperforms MAML on regression, classification, and reinforcement learning problems.

Provable Guarantees for Gradient-Based Meta-Learning

We study the problem of meta-learning through the lens of online convex optimization, developing a meta-algorithm bridging the gap between popular gradient-based meta-learning and classical regularization-based multi-task transfer methods. Our method is the first to simultaneously satisfy good sample efficiency guarantees in the convex setting, with generalization bounds that improve with task-similarity, while also being computationally scalable to modern deep learning architectures and the many-task setting. Despite its simplicity, the algorithm matches, up to a constant factor, a lower bound on the performance of any such parameter-transfer method under natural task similarity assumptions. We use experiments in both convex and deep learning settings to verify and demonstrate the applicability of our theory.

Towards Understanding Knowledge Distillation

Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation:

* data geometry -- geometric properties of the data distribution, in particular class separation, have a direct influence on the convergence speed of the risk;
* optimization bias -- gradient descent optimization finds a very favorable minimum of the distillation objective; and
* strong monotonicity -- the expected risk of the student classifier always decreases when the size of the training set grows.

Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers

Domain adaptation enables knowledge transfer from a labeled source domain to an unlabeled target domain. A mainstream approach is adversarial feature adaptation, which learns domain-invariant representations through aligning the feature distributions of both domains. However, a theoretical prerequisite of domain adaptation is the adaptability measured by the expected risk of an ideal joint hypothesis over the source and target domains. In this respect, adversarial feature adaptation may potentially deteriorate the adaptability, since it distorts the original feature distributions when suppressing domain-specific variations. To this end, we propose transferable adversarial training (TAT) to enable the adaptation of deep classifiers. The approach generates transferable examples to fill in the gap between the source and target domains, and adversarially trains the deep classifiers to make consistent predictions over transferable examples. Without learning domain-invariant representations at the expense of distorting the feature distributions, the adaptability in the theoretical learning bound is algorithmically guaranteed.
A series of experiments validates that our approach advances the state of the art on a variety of domain adaptation tasks in vision and NLP, including object recognition, learning from synthetic to real, and sentiment classification.

Transferability vs. Discriminability: Batch Spectral Penalization for Adversarial Domain Adaptation

Adversarial domain adaptation has made remarkable advances in learning transferable representations for knowledge transfer across domains. While adversarial learning strengthens the feature transferability which the community focuses on, its impact on the feature discriminability has not been fully explored. In this paper, a series of experiments based on spectral analysis of the feature representations have been conducted, revealing an unexpected deterioration of the discriminability while learning transferable features adversarially. Our key finding is that the eigenvectors with the largest singular values will dominate the feature transferability. As a consequence, the transferability is enhanced at the expense of over-penalization of other eigenvectors that embody rich structures crucial for discriminability. To address this problem, we present Batch Spectral Penalization (BSP), a general approach to penalizing the largest singular values so that other eigenvectors can be relatively strengthened to boost the feature discriminability. Experiments show that the approach significantly improves upon representative adversarial domain adaptation methods to achieve state-of-the-art results.

Learning-to-Learn Stochastic Gradient Descent with Biased Regularization

We study the problem of learning-to-learn: inferring a learning algorithm that works well on tasks sampled from an unknown distribution. As the class of algorithms, we consider Stochastic Gradient Descent on the true risk regularized by the squared Euclidean distance to a bias vector. We present an average excess risk bound for such a learning algorithm.
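
As a rough illustration of the within-task algorithm named above, the following sketch runs SGD on a squared loss with an added penalty (lam / 2) * ||w - bias||^2. It is a sketch under assumptions rather than the authors' algorithm: the function name `sgd_biased`, the squared loss, the step size, the use of a finite sample in place of the true risk, and starting the iterate at the bias vector are all illustrative choices.

```python
import random

def sgd_biased(samples, bias, lam=0.1, lr=0.05, epochs=200, seed=0):
    """SGD on 0.5 * (w.x - y)^2 + (lam / 2) * ||w - bias||^2, started at bias."""
    rng = random.Random(seed)
    w = list(bias)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            # Loss gradient (err * x) plus the pull toward the bias vector.
            w = [wj - lr * (err * xj + lam * (wj - hj))
                 for wj, xj, hj in zip(w, x, bias)]
    return w

# One sample that only constrains the first coordinate: the second
# coordinate receives no data signal and stays at the bias value.
samples = [((1.0, 0.0), 1.0)]
bias = [0.9, 2.0]
w = sgd_biased(samples, bias)
```

With `lam = 0` the update reduces to plain SGD; larger values of `lam` keep the solution closer to the bias vector, which is the quantity a meta-algorithm would estimate across tasks.
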
This result quantifies the potential benefit of using a bias vector with respect to the unbiased case. We then address the problem of estimating the bias from a sequence of tasks. We propose a meta-algorithm which incrementally updates the bias as new tasks are observed. The low space and time complexity of this approach makes it appealing in practice. We provide guarantees on the learning ability of the meta-algorithm. A key feature of our results is that, when the number of tasks grows and their variance is relatively small, our learning-to-learn approach has a significant advantage over learning each task in isolation by Stochastic Gradient Descent without a bias term. We report on numerical experiments which demonstrate the effectiveness of our approach.

BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning

Multi-task learning allows the sharing of useful information between multiple related tasks. In natural language processing several recent approaches have successfully leveraged unsupervised pre-training on large amounts of data to perform well on various tasks, such as those in the GLUE benchmark. These results are based on fine-tuning on each task separately. We explore the multi-task learning setting for the recent BERT model on the GLUE benchmark, and how to best add task-specific parameters to a pre-trained BERT network, with a high degree of parameter sharing between tasks. We introduce new adaptation modules, PALs or ‘projected attention layers’, which use a low-dimensional multi-head attention mechanism, based on the idea that it is important to include layers with inductive biases useful for the input domain. By using PALs in parallel with BERT layers, we match the performance of fine-tuned BERT on the GLUE benchmark with ≈7 times fewer parameters, and obtain state-of-the-art results on the Recognizing Textual Entailment dataset.
Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation

Wed Jun 12th 03:15 -- 03:20 PM @ Room 201 in Transfer and Multitask Learning

Deep unsupervised domain adaptation (Deep UDA) methods successfully leverage readily-accessible labeled source data to boost the performance on relevant but unlabeled target data. However, algorithm comparison is cumbersome in Deep UDA due to the lack of a satisfying and standardized model selection method, posing an obstacle to further advances in the field. Existing model selection methods for Deep UDA are either highly biased, constrained, unstable, or controversial (requiring labeled target data). To this end, we propose Deep Embedded Validation (DEV), which embeds adapted feature representation into the validation procedure to obtain unbiased target risk estimation with bounded variance. Variance is further reduced by the technique of control variate. The effectiveness of the proposed method is validated both theoretically and empirically.
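
The abstract does not spell out the estimator, so the following is only a generic sketch of an importance-weighted validation risk with a control-variate correction, in the spirit of the description above. The function name `control_variate_risk` is hypothetical, and the density ratios (`weights`) are assumed to be given; how they would be estimated from the adapted feature representation is not shown.

```python
def mean(values):
    return sum(values) / len(values)

def control_variate_risk(losses, weights):
    """Importance-weighted risk estimate with a control-variate correction.

    losses  -- per-example losses on held-out labeled source data
    weights -- density ratios target/source for the same examples
               (assumed given; their expectation under the source is 1)
    """
    weighted = [w * l for w, l in zip(weights, losses)]
    mean_w, mean_wl = mean(weights), mean(weighted)
    var_w = mean([(w - mean_w) ** 2 for w in weights])
    if var_w == 0.0:
        return mean_wl  # constant weights: nothing to correct
    cov = mean([(wl - mean_wl) * (w - mean_w)
                for wl, w in zip(weighted, weights)])
    eta = -cov / var_w
    # For a fixed eta, the correction term has zero expectation because
    # E[weights] = 1; it cancels noise correlated with the weights.
    return mean_wl + eta * (mean_w - 1.0)

# With a constant loss of 2.0, the plain weighted mean is thrown off by the
# sampled weights not averaging to 1, while the corrected estimate is not.
losses = [2.0, 2.0, 2.0]
weights = [0.5, 1.0, 2.0]
plain = mean([w * l for w, l in zip(weights, losses)])
corrected = control_variate_risk(losses, weights)
```

In this example the uncorrected weighted mean is about 2.33, whereas the control-variate estimate recovers the constant loss of 2.0, which shows the kind of variance the correction is meant to remove.
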

About ICML 2019

The International Conference on Machine Learning (ICML) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence known as machine learning. ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics. ICML is one of the fastest growing artificial intelligence conferences in the world. Participants at ICML span a wide range of backgrounds, from academic and industrial researchers, to entrepreneurs and engineers, to graduate students and postdocs.
