Supervised learning: no loss no cry

by · Jul 12, 2020 · 201 views ·

Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an ad hoc manner. In hopes of making this procedure more principled, the problem of learning the loss function for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature. In this paper, we revisit the SLIsotron algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast SLIsotron as learning a loss from a family of composite square losses. By interpreting this through the lens of proper losses, we derive a generalisation of SLIsotron based on Bregman divergences. The resulting BregmanTron algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the BregmanTron significantly outperforms the SLIsotron, and that the loss it learns can be minimized by other algorithms for different tasks, thereby opening the interesting problem of loss transfer between domains.