Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference

Shuqing Luo, Pingzhi Li, Jie Peng, Hanrui Wang, Yang (Katie) Zhao, Yu (Kevin) Cao, Yu Cheng, Tianlong Chen
UNC, UCLA, University of Minnesota Twin Cities
(* indicates equal contribution)



Abstract

Mixture-of-experts (MoE) architectures achieve impressive computational efficiency through expert parallelism, which relies heavily on all-to-all communication across devices. Unfortunately, this communication overhead typically constitutes a significant portion of the total runtime, hampering the scalability of distributed training and inference for modern MoE models (consuming over 40% of the runtime in large-scale training). In this paper, we first define collaborative communication to illustrate this intrinsic limitation, and then propose system- and algorithm-level innovations to reduce communication costs. Specifically, when a pair of experts is co-activated by the same token, we call them collaborated; this comprises two cases, intra- and inter-collaboration, depending on whether the two experts reside on the same device. Our pilot investigations reveal that increasing the proportion of intra-collaboration accelerates expert parallelism at scale. This motivates us to strategically Optimize Collaborative CommUnication for acceLeraTed MoE training and inference, dubbed Occult. Our designs either deliver exact results with reduced communication cost, or controllably minimize the cost via collaboration pruning, realized through modified fine-tuning. Comprehensive experiments on various MoE-LLMs demonstrate that Occult is faster than popular state-of-the-art inference and training frameworks (over 50% speedup across multiple tasks and models) while achieving quality comparable or superior to standard fine-tuning.
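To make the definition concrete, here is a minimal sketch (not the paper's implementation) of how the intra-collaboration ratio could be measured from a router's top-k expert choices and an expert-to-device placement. The function name, tensor shapes, and the toy placement below are illustrative assumptions, not Occult's actual API.

```python
import itertools

import torch


def intra_collaboration_ratio(topk_experts: torch.Tensor,
                              expert_to_device: torch.Tensor) -> float:
    """Fraction of collaborated expert pairs that are intra-collaborations.

    topk_experts:     (num_tokens, k) expert indices chosen by the router.
    expert_to_device: (num_experts,)  id of the device hosting each expert.
    """
    intra, total = 0, 0
    for token_experts in topk_experts.tolist():
        # Every pair of experts co-activated by this token is "collaborated".
        for e_i, e_j in itertools.combinations(token_experts, 2):
            total += 1
            # Intra-collaboration: both experts live on the same device,
            # so this pair adds no cross-device all-to-all traffic.
            if int(expert_to_device[e_i]) == int(expert_to_device[e_j]):
                intra += 1
    return intra / max(total, 1)


# Toy setting: 8 experts split across 2 devices, top-2 routing for 4 tokens.
placement = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])        # expert -> device
routing = torch.tensor([[0, 3], [1, 5], [4, 7], [2, 6]])  # token -> top-2 experts
print(intra_collaboration_ratio(routing, placement))      # 0.5 for this routing
```

Under this view, raising the ratio (e.g., by choosing a better expert placement) keeps more collaborated pairs on one device, which is the quantity the abstract argues correlates with faster expert parallelism.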

Video

Citation



Acknowledgment

Team Members