Abstract
Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inherently memory-friendly architecture that requires only a few activated experts to reside in memory during inference, current MoE architectures fail to realize this advantage and incur intolerable inference latencies on memory-constrained devices. Our investigation pinpoints the essential cause as the pronounced temporal inconsistency of inter-token expert activations, which generates overly frequent expert swapping demands that dominate the latency. To this end, we propose a novel MoE architecture, Oracle-MoE, to fulfill the real on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the oracle space, to maintain semantic locality across consecutive tokens and thereby reduce expert activation variations, eliminating massive swapping demands. Theoretical analysis proves that Oracle-MoE is bound to provide routing decisions with better semantic locality and, therefore, better expert activation consistency. Experiments on pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and on downstream tasks demonstrate that, without compromising task performance, Oracle-MoE achieves state-of-the-art inference speeds across varying memory budgets, revealing its substantial potential for LLM deployments in industry.
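To make the link between activation consistency and swapping concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single MoE layer with top-1 routing and a hypothetical least-recently-used (LRU) expert cache that holds only `cache_size` experts in memory. The function name `count_expert_loads` and both token-to-expert sequences are illustrative assumptions, not data from the paper.

```python
from collections import OrderedDict


def count_expert_loads(activations, cache_size):
    """Count how often an expert must be loaded into a fixed-size LRU cache.

    activations: expert index chosen for each consecutive token (top-1 routing assumed).
    cache_size: number of experts that fit in memory at once (hypothetical budget).
    """
    cache = OrderedDict()  # keys are resident expert ids, ordered by recency
    loads = 0
    for expert in activations:
        if expert in cache:
            # Expert already resident: mark it as most recently used.
            cache.move_to_end(expert)
        else:
            # Expert missing: it must be fetched from slower storage.
            loads += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return loads


# Hypothetical routing decisions for 12 consecutive tokens over 4 experts.
inconsistent = [0, 3, 1, 2, 0, 3, 1, 2, 0, 3, 1, 2]  # expert changes at every token
consistent = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]    # neighbouring tokens share an expert

loads_inconsistent = count_expert_loads(inconsistent, cache_size=2)
loads_consistent = count_expert_loads(consistent, cache_size=2)
print(loads_inconsistent, loads_consistent)
```

Under these assumptions, the inconsistent sequence triggers a load on every token, while the locality-preserving sequence loads each expert only once. This is the kind of gap that routing with better semantic locality aims to exploit.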
| Original language | English |
|---|---|
| Title of host publication | Proceedings of Machine Learning Research |
| Editors | A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, J. Zhu |
| Publisher | ML Research Press |
| Pages | 78633-78650 |
| Number of pages | 18 |
| Volume | 267 |
| Publication status | Published - 19 Jul 2025 |
| Event | 42nd International Conference on Machine Learning - Vancouver, Canada. Duration: 13 Jul 2025 → 19 Jul 2025. Conference number: 42. https://icml.cc/ |
Conference
| Conference | 42nd International Conference on Machine Learning |
|---|---|
| Abbreviated title | ICML 2025 |
| City | Vancouver |
| Period | 13/07/25 → 19/07/25 |
| Internet address | https://icml.cc/ |
Funding
Yujiang Wang was supported by a Basic Research Program of Jiangsu (BK20240414) and a Leadership Talent Program (Science and Education) of Suzhou Industrial Park (KJQ2024204).
Fingerprint
Dive into the research topics of 'Oracle-MoE: Locality-preserving Routing in the Oracle Space for Memory-constrained Large Language Model Inference'. Together they form a unique fingerprint.