Apr 4, 2021
Accurate hardware performance models are critical to efficient code generation. Compilers can use them to make heuristic decisions, superoptimizers can use them as a minimization objective, and autotuners can use them to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for a heavily used deep learning accelerator. We train a neural network over kernel-level sub-graphs from the corpus and find that the learned model outperforms a heavily optimized analytical performance model used in the production XLA compiler on the tile-size selection task. We also contribute a new performance model for the XLA fusion autotuner, which reduces tuning time on the hardware accelerator.
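As a rough illustration of how an autotuner might consult a performance model for tile-size selection, the sketch below ranks candidate tile sizes by predicted cost and keeps the cheapest. It is not the method from the talk: `toy_cost_model`, its constants, the scratchpad limit, and the candidate list are all hypothetical stand-ins, and a hand-written formula takes the place of the learned neural network.

```python
from typing import Callable, Iterable, Tuple

TileSize = Tuple[int, int]


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive integers."""
    return (a + b - 1) // b


def toy_cost_model(tile: TileSize, rows: int = 1024, cols: int = 1024) -> float:
    """Hypothetical stand-in for a learned model: returns a relative cost.

    All constants are made up for illustration only.
    """
    tile_h, tile_w = tile
    # Number of tiles needed to cover a rows x cols output.
    n_tiles = ceil_div(rows, tile_h) * ceil_div(cols, tile_w)
    per_tile_overhead = 50.0
    work_per_tile = tile_h * tile_w
    # Assumed penalty when a tile exceeds an invented scratchpad capacity.
    spill_penalty = 10.0 if work_per_tile > 128 * 128 else 1.0
    return n_tiles * (per_tile_overhead + work_per_tile * spill_penalty)


def select_tile_size(
    candidates: Iterable[TileSize],
    cost_model: Callable[[TileSize], float],
) -> TileSize:
    """Return the candidate with the lowest predicted cost."""
    return min(candidates, key=cost_model)


candidates = [(8, 128), (32, 128), (128, 128), (256, 256)]
best = select_tile_size(candidates, toy_cost_model)
print(best)
```

Because the selection only compares predicted costs, a model for this task needs to rank candidates correctly rather than predict absolute runtimes.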
The Conference on Machine Learning and Systems targets research at the intersection of machine learning and systems. The conference aims to elicit new connections amongst these fields, including identifying best practices and design principles for learning systems, as well as developing novel learning methods and theory tailored to practical machine learning workflows.