Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization

Martin Burger, Samira Kabri, Yury Korolev, Tim Roith, Lukas Weigand

Research output: Contribution to journal (article, peer-reviewed)

Abstract

The aim of this article is to provide a mathematical analysis of transformer architectures using a self-attention mechanism with layer normalization. In particular, observed patterns in such architectures resembling either clusters or uniform distributions pose a number of challenging mathematical questions. We focus on a special case that admits a gradient flow formulation in the spaces of probability measures on the unit sphere under a special metric, which allows us to give at least partial answers in a rigorous way. The arising mathematical problems resemble those recently studied in aggregation equations but with additional challenges emerging from restricting the dynamics to the sphere and the particular form of the interaction energy. We provide a rigorous framework for studying the gradient flow, which also suggests a possible metric geometry to study the general case (i.e. one that is not described by a gradient flow). We further analyse the stationary points of the induced self-attention dynamics. The latter are related to stationary points of the interaction energy in the Wasserstein geometry, and we further discuss energy minimizers and maximizers in different parameter settings. This article is part of the theme issue 'Partial differential equations in data science'.
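The abstract does not state the model equations, so the following is only an illustrative sketch of the kind of particle system such an analysis typically starts from. It is not taken from the article. It assumes (hypothetically) an identity interaction matrix, an inverse-temperature parameter `beta`, attention weights exp(beta <x_i, x_j>) without softmax normalization, and layer normalization modelled as projection back onto the unit sphere. Under these assumptions the drift is proportional to the tangential gradient of the empirical interaction energy, so an explicit Euler step with renormalization moves the particles up that energy; reversing the sign of the drift would give the descent direction. All function names are made up for this example.

```python
import numpy as np


def project_to_tangent(x, v):
    """Remove from each row of v its component along the matching row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def interaction_energy(x, beta=1.0):
    """Assumed empirical energy: (1 / (2 beta n^2)) * sum_ij exp(beta <x_i, x_j>)."""
    gram = x @ x.T
    return np.exp(beta * gram).mean() / (2.0 * beta)


def step(x, beta=1.0, dt=0.05):
    """One explicit Euler step of the assumed dynamics, then renormalize."""
    weights = np.exp(beta * (x @ x.T))   # (n, n) weights, no softmax normalization
    drift = weights @ x / x.shape[0]     # average of exp(beta <x_i, x_j>) x_j over j
    x_new = x + dt * project_to_tangent(x, drift)
    # Renormalizing each row plays the role of layer normalization here.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def simulate(n=32, d=3, beta=1.0, dt=0.05, steps=200, seed=0):
    """Run the toy system from random initial points on the unit sphere."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    energies = [interaction_energy(x, beta)]
    for _ in range(steps):
        x = step(x, beta, dt)
        energies.append(interaction_energy(x, beta))
    return x, np.array(energies)


if __name__ == "__main__":
    points, energies = simulate()
    print("initial energy:", energies[0])
    print("final energy:  ", energies[-1])
```

In this toy setting the particles stay on the sphere by construction, and comparing the first and last entries of `energies` gives a quick check of whether the configuration has moved towards a clustered state (high energy) or remained spread out.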

Original language: English
Article number: 20240233
Journal: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Volume: 383
Issue number: 2298
Early online date: 5 Jun 2025
Publication status: Published - 5 Jun 2025

Data Availability Statement

This article has no additional data.

Funding

M.B. and T.R. acknowledge funding by the German Ministry of Science and Technology (BMBF) under grant agreement No. 01IS24072A (COMFORT). M.B., S.K., T.R. and L.W. acknowledge support from DESY (Hamburg, Germany), a member of the Helmholtz Association HGF. This research was supported in part through the Maxwell computational resources operated at Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany. M.B. and S.K. acknowledge support from the German Research Foundation, project BU 2327/19-1. M.B. and L.W. acknowledge support from the German Research Foundation, project BU 2327/20-1. Y.K. acknowledges support from the German Research Foundation as visiting fellow within the priority programme Foundations of Deep Learning. Part of this study was carried out while S.K. and T.R. were visiting the California Institute of Technology, supported by the DAAD grant for project 57698811 'Bayesian Computations for Large-scale (Nonlinear) Inverse Problems in Imaging'. Y.K. acknowledges the support of the EPSRC (Fellowship EP/V003615/2 and Programme Grant EP/V026259/1). S.K. and Y.K. are grateful for the hospitality of the University of Bath during the workshop 'Machine Learning in Infinite Dimensions', sponsored by the ICMS, LMS, IMI Bath, ProbAI and Maths4DL, where part of this work was undertaken.

Funder: Engineering and Physical Sciences Research Council
Funder numbers: EP/V003615/2, EP/V026259/1

Keywords

  • gradient flows
  • interaction energies
  • self-attention dynamics
  • stationary states
  • transformer architectures

ASJC Scopus subject areas

  • General Mathematics
  • General Engineering
  • General Physics and Astronomy
