Can Machine Learning Pipelines Be Better Configured?

Dec 5, 2023

About

A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline’s performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as poor execution time and memory usage, numeric errors, and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance when the versions of its configured libraries, or the combination of these libraries, change. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. There is no prior systematic study of the pervasiveness, impact, and root causes of PLC issues. A systematic understanding of these issues helps in configuring effective ML pipelines and identifying misconfigured ones. In this paper, we conduct the first empirical study of PLC issues. To investigate the problem in depth, we propose Piecer, an infrastructure that automatically generates a set of pipeline variants by varying the version combinations of ML libraries and compares their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243, and 440 pipelines can achieve better competition scores, execution time, and memory usage, respectively, by adopting a different configuration. Based on our empirical findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we captured PLC issues in 309 real-world ML pipelines.
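The variant-generation idea described in the abstract can be illustrated with a small sketch. This is not Piecer's implementation: the function names, the 20% relative threshold, and the metric names are assumptions made for illustration, and the measurements are synthetic placeholders rather than results from the study. The sketch enumerates every combination of candidate library versions and flags variants whose metrics deviate from a baseline by more than the threshold.

```python
from itertools import product


def enumerate_variants(candidate_versions):
    """Yield one {library: version} dict per combination of candidate versions."""
    libraries = sorted(candidate_versions)
    for combo in product(*(candidate_versions[lib] for lib in libraries)):
        yield dict(zip(libraries, combo))


def flag_inconsistent(baseline, measurements, rel_threshold=0.2):
    """Return (variant, metric, relative_change) for every metric that deviates
    from the baseline by more than rel_threshold (an assumed cutoff; the
    abstract does not define what counts as "significant")."""
    flagged = []
    for variant, metrics in measurements:
        for name, value in metrics.items():
            base = baseline[name]
            if base == 0:
                continue  # relative change is undefined for a zero baseline
            change = (value - base) / abs(base)
            if abs(change) > rel_threshold:
                flagged.append((variant, name, change))
    return flagged


# Illustrative version lists only; a real study would draw these from the
# releases of each library that the pipeline can be installed with.
candidate_versions = {
    "numpy": ["1.19.5", "1.21.6"],
    "scikit-learn": ["0.24.2", "1.0.2"],
}
variants = list(enumerate_variants(candidate_versions))

# Synthetic placeholder measurements (hypothetical metric names).
baseline = {"time_s": 10.0, "memory_mb": 500.0}
measurements = [
    (variants[0], {"time_s": 10.5, "memory_mb": 510.0}),
    (variants[1], {"time_s": 14.0, "memory_mb": 505.0}),
]

for variant, metric, change in flag_inconsistent(baseline, measurements):
    print(variant, metric, f"{change:+.0%}")
```

In practice each variant would be installed in an isolated environment and the pipeline executed there to collect its score, execution time, and memory usage; the comparison step above only covers what happens once those numbers are available.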

Interested in talks like this? Follow ESEC-FSE