Doubly Robust Joint Learning for Recommendation on Data Missing Not at Random
In recommender systems, a user's ratings for most items are usually missing, and a critical problem is that the missing ratings are often missing not at random (MNAR) in reality. It is widely acknowledged that MNAR ratings make it difficult to accurately predict the ratings and unbiasedly estimate the performance of rating prediction. Recent approaches use imputed errors to recover the prediction errors for missing ratings, or weight observed ratings with the propensities of being observed. These approaches can still be severely biased in performance estimation or suffer from the variance of the propensities. To overcome these limitations, we first propose an estimator that integrates the imputed errors and propensities in a doubly robust way to obtain unbiased performance estimation and alleviate the effect of the propensity variance. To achieve good performance guarantees, based on this estimator, we propose joint learning of rating prediction and error imputation, which outperforms the state-of-the-art approaches on four real-world datasets.

Linear-Complexity Data-Parallel Earth Mover's Distance Approximations
The Earth Mover's Distance (EMD) is a state-of-the-art metric for comparing discrete probability distributions. The high distinguishability offered by the EMD comes at a high cost in computational complexity. Therefore, linear-complexity approximation algorithms have been proposed to improve its scalability. However, these algorithms are either limited to vector spaces with only a few dimensions or they become ineffective when the degree of overlap between the probability distributions is high. We propose novel approximation algorithms that overcome both of these limitations, yet still achieve linear time complexity. All our algorithms are data parallel, and therefore, we can take advantage of massively parallel computing engines, such as Graphics Processing Units (GPUs).
On the popular text-based 20 Newsgroups dataset, the new algorithms are four orders of magnitude faster than a multi-threaded CPU implementation of Word Mover's Distance and match its search accuracy. On MNIST images, the new algorithms are four orders of magnitude faster than Cuturi's GPU implementation of Sinkhorn's algorithm while offering a slightly higher search accuracy.

Model Comparison for Semantic Grouping
We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate the task of semantic similarity as a model comparison task in which we contrast a generative model which jointly models two sentences versus one that does not. We illustrate how this framework can be used for the Semantic Textual Similarity tasks using clear assumptions about how the embeddings of words are generated. We apply model comparison that utilises information criteria to address some of the shortcomings of Bayesian model comparison, whilst still penalising model complexity. We achieve competitive results by applying the proposed framework with an appropriate choice of likelihood on the STS datasets.

RaFM: Rank-Aware Factorization Machines
Factorization machines (FM) are a popular model class to learn pairwise interactions by a low-rank approximation. Unlike existing FM-based approaches, which use a fixed rank for all features, this paper proposes a Rank-Aware FM (RaFM) model which adopts pairwise interactions from FMs with different ranks. On the one hand, the proposed model achieves a better performance on real-world datasets where different features usually have significantly varying frequencies of occurrence. On the other hand, we prove that the RaFM model can be stored, evaluated, and trained as efficiently as one single FM, and under some reasonable conditions it can be even significantly more efficient than FM.
RaFM improves the performance of FMs in both regression tasks and classification tasks while incurring less computational burden, and therefore also has attractive potential in industrial applications.

CAB: Continuous Adaptive Blending for Policy Evaluation and Learning
The ability to perform offline A/B-testing and off-policy learning using logged contextual bandit feedback is highly desirable in a broad range of applications, including recommender systems, search engines, ad placement, and personalized health care. Both offline A/B-testing and off-policy learning require a counterfactual estimator that evaluates how some new policy would have performed, if it had been used instead of the logging policy. In this paper, we identify a family of counterfactual estimators which subsumes most such estimators proposed to date. Our analysis of this family identifies a new estimator, called Continuous Adaptive Blending (CAB), which enjoys many advantageous theoretical and practical properties. In particular, it can be substantially less biased than clipped Inverse Propensity Score (IPS) weighting and the Direct Method, and it can have less variance than Doubly Robust and IPS estimators. In addition, it is sub-differentiable such that it can be used for learning, unlike the SWITCH estimator. Experimental results show that CAB provides excellent evaluation accuracy and outperforms other counterfactual estimators in terms of learning performance.

MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
Adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize evaluation metrics of a target task, and thus, may not always guide the generator in a GAN to generate data with improved metric scores. To overcome this issue, we propose a novel MetricGAN approach with an aim to optimize the generator with respect to one or multiple evaluation metrics.
Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable to verify the proposed approach because there are multiple metrics measuring different aspects of speech signals. Moreover, these metrics are generally complex and cannot be fully optimized by Lp or conventional adversarial losses.

Neural Separation of Observed and Unobserved Distributions
Separating mixed distributions is a long-standing challenge for machine learning and signal processing. Most current methods either rely on making strong assumptions on the source distributions or rely on having training samples of each source in the mixture. In this work, we introduce a new method, Neural Egg Separation, to tackle the scenario of extracting a signal from an unobserved distribution additively mixed with a signal from an observed distribution. Our method iteratively learns to separate the known distribution from progressively finer estimates of the unknown distribution. In some settings, Neural Egg Separation is initialization sensitive; we therefore introduce Latent Mixture Masking, which ensures a good initialization. Extensive experiments on audio and image separation tasks show that our method outperforms current methods that use the same level of supervision, and often achieves similar performance to full supervision.

Almost Unsupervised Text to Speech and Automatic Speech Recognition
Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing and both achieve impressive performance thanks to the recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages.
In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired data samples and extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text y into speech x̂, and the ASR model leverages the transformed pair (x̂, y) for training, and vice versa, to boost the accuracy of the two tasks; (3) bidirectional sequence modeling, which addresses the error propagation problem, especially in long speech and text sequences, when training with few paired data; (4) a unified model structure, which combines all the above components for TTS and ASR based on the Transformer model. Our method achieves 99.84% in terms of word-level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR on the LJSpeech dataset, by leveraging only 200 paired speech and text data (about 20 minutes of audio), together with extra unpaired speech and text data.

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
Despite the progress in voice conversion, many-to-many voice conversion trained on non-parallel data, as well as zero-shot voice conversion, remains under-explored. Deep style transfer algorithms, generative adversarial networks (GAN) in particular, are being applied as new solutions in this field. However, GAN training is very sophisticated and difficult, and there is no strong evidence that its generated speech is of good perceptual quality. In this paper, we propose a new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck. We formally show that this scheme can achieve distribution-matching style transfer by training only on self-reconstruction loss.
Based on this scheme, we propose AutoVC, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data, and which is the first to perform zero-shot voice conversion.

A fully differentiable beam search decoder
We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g. acoustic and language models). It also handles an arbitrary number of target sequence candidates, making it suitable in a context where labeled data is not aligned to input sequences. We demonstrate that our approach scales by applying it to speech recognition, jointly training acoustic and word-level language models. The system is end-to-end, with gradients flowing through the whole architecture from the word-level transcriptions. Recent research efforts have shown that deep neural networks with attention-based mechanisms are powerful enough to successfully train an acoustic model from the final transcription, while implicitly learning a language model. Instead, we show that it is possible to discriminatively train an acoustic model jointly with an explicit and possibly pre-trained language model.
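
As a rough illustration of what "differentiable through the inference procedure" can mean, the sketch below replaces a hard max over beam hypotheses with a log-sum-exp, whose gradient is a softmax over the hypotheses. This is a toy example under stated assumptions and not the decoder from the talk: the abstract does not say how the decoder is made differentiable, and the function names, the scores, and the `lm_weight` parameter are all hypothetical.

```python
import math

def logsumexp(scores):
    """Smooth, differentiable stand-in for max over hypothesis scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def softmax(scores):
    """Gradient of logsumexp with respect to each individual score."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combined_scores(acoustic, language, lm_weight):
    """Hypothesis score = acoustic score + lm_weight * language-model score."""
    return [a + lm_weight * l for a, l in zip(acoustic, language)]

# Hypothetical log-scores for three hypotheses kept in the beam.
acoustic = [-1.0, -1.5, -3.0]
language = [-2.0, -0.5, -1.0]
lm_weight = 0.5

scores = combined_scores(acoustic, language, lm_weight)
smooth_best = logsumexp(scores)

# Every hypothesis in the beam receives a share of the gradient,
# whereas a hard max would pass gradient only to the single best one.
grad_wrt_scores = softmax(scores)

# Chain rule: d(smooth_best)/d(lm_weight) = sum_i softmax_i * language_i
grad_wrt_lm_weight = sum(g * l for g, l in zip(grad_wrt_scores, language))
```

Because the smooth score depends on both the acoustic and the language-model terms, a gradient step can adjust either model (or the weight between them) from the same sequence-level objective, which is the kind of joint training the abstract describes.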

