TY - CPAPER
T1 - Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models
AU - Shi, Y.
AU - Chen, Y.
AU - Dong, M.
AU - Yang, X.
AU - Li, D.
AU - Wang, Y.
AU - Dick, R.
AU - Lv, Q.
AU - Zhao, Y.
AU - Yang, F.
AU - Lu, T.
AU - Gu, N.
AU - Shang, L.
PY - 2023/12/31
Y1 - 2023/12/31
N2 - Despite their prevalence in deep-learning communities, over-parameterized models demand high computational costs for proper training. This work studies the fine-grained, modular-level learning dynamics of over-parameterized models to attain a more efficient and effective training strategy. Empirical evidence reveals that, when scaling down to network modules such as heads in self-attention models, we can observe varying learning patterns implicitly associated with each module's trainability. To describe such modular-level learning capabilities, we introduce a novel concept dubbed modular neural tangent kernel (mNTK), and we demonstrate that the quality of a module's learning is tightly associated with its mNTK's principal eigenvalue λ_max. A large λ_max indicates that the module learns features with better convergence, while small ones may negatively impact generalization. Inspired by this finding, we propose a novel training strategy termed Modular Adaptive Training (MAT), which selectively updates only those modules whose λ_max exceeds a dynamic threshold, concentrating the model on learning common features and ignoring inconsistent ones. Unlike most existing training schemes that run a complete backpropagation (BP) cycle across all network modules, MAT saves substantial computation through its partial-update strategy and can further improve performance. Experiments show that MAT nearly halves the computational cost of model training and outperforms baselines in accuracy.
UR - http://www.scopus.com/inward/record.url?eid=2-s2.0-85191192275&partnerID=MN8TOARS
UR - https://www.scopus.com/pages/publications/85191192275
M3 - Chapter in a published conference proceedings
BT - Advances in Neural Information Processing Systems, NeurIPS 2023
PB - NeurIPS Proceedings
ER -