Oral: In-network Aggregation for Shared Machine Learning Clusters

Apr 4, 2021

Speakers

About

We present PANAMA, a network architecture for machine learning (ML) workloads on shared clusters where a variety of training jobs co-exist.PANAMA consists of two key components: (i) an efficient in-network hardware accelerator designed to accelerate large data-parallel training transfers; and (ii) a lightweight congestion control protocol to enable fair sharing of network resources across different flows. Our congestion control protocol exploits the unique communication pattern in training to ensure large in-network aggregation transfers do not negatively impact short latency-sensitive flows. To evaluate the feasibility of PANAMA, we build an FPGA-based prototype with 10 Gbps transceivers and show that our hardware datapath achieves line-rate aggregation. Our large-scale simulations demonstrate that PANAMA improves the mean and 99%-tile completion time of latency-sensitive short flows by a factor of 2–4.5 while reducing the average training time of large jobs by a factor of 1.25.

Organizer

Categories

About MLSys 2021

The Conference on Machine Learning and Systems targets research at the intersection of machine learning and systems. The conference aims to elicit new connections amongst these fields, including identifying best practices and design principles for learning systems, as well as developing novel learning methods and theory tailored to practical machine learning workflows.

Store presentation

Should this presentation be stored for 1000 years?

How do we store presentations

Total of 0 viewers voted for saving the presentation to eternal vault which is 0.0%

Sharing

Recommended Videos

Presentations on similar topic, category or speaker

Interested in talks like this? Follow MLSys 2021