SpectFormer: Frequency and Attention is what you need in a Vision Transformer

Badri N. Patro, Vinay P. Namboodiri, Vijay S. Agneeswaran

Research output: Chapter or section in a book/report/conference proceeding › Chapter in a published conference proceeding

3 Citations (SciVal)

Abstract

Vision transformers have been applied successfully for image recognition tasks. There have been either multi-headed self-attention based (ViT [12], DeiT [54]) similar to the original work in textual models or more recently based on spectral layers (FNet [29], GFNet [46], AFNO [15]). We hypothesize that spectral layers capture high-frequency information such as lines and edges, while attention layers capture token interactions. We investigate this hypothesis through this work and observe that indeed mixing spectral and multi-headed attention layers provides a better transformer architecture. We thus propose the novel SpectFormer architecture for vision transformers that has initial spectral and deeper multi-headed attention layers. We believe that the resulting representation allows the transformer to capture the feature representation appropriately, and it yields improved performance over other transformer representations. For instance, it improves the top-1 accuracy by 2% on ImageNet compared to both GFNet-H and LiT. SpectFormer-H-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art for the small version). Further, SpectFormer-H-L achieves 85.7%, which is the state of the art for the comparable base version of the transformers. We further validated the SpectFormer performance in other scenarios such as transfer learning on standard datasets such as CIFAR-10, CIFAR-100, Oxford-IIIT Flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset and observe that SpectFormer shows consistent performance that is comparable to the best backbones and can be further optimized and improved. The source code is available at https://github.com/badripatro/SpectFormers.
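The abstract describes an architecture whose initial layers are spectral and whose deeper layers are multi-headed attention. Below is a minimal, illustrative PyTorch sketch of that layer arrangement, not the authors' implementation (the official code is at https://github.com/badripatro/SpectFormers); the filter shape, token grid, and the `spectformer_stage` helper are assumptions made for illustration, and the feed-forward sublayers are omitted for brevity.

```python
import torch
import torch.nn as nn


class SpectralGatingBlock(nn.Module):
    """Token mixing in the frequency domain via a learnable complex filter."""

    def __init__(self, dim, h=14, w=8):  # w = h // 2 + 1 to match rfft2 output
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Learnable complex filter stored as (real, imag) pairs.
        self.filter = nn.Parameter(torch.randn(h, w, dim, 2) * 0.02)

    def forward(self, x):  # x: (B, N, C), N = h * h patch tokens
        b, n, c = x.shape
        h = int(n ** 0.5)
        res = x
        x = self.norm(x).reshape(b, h, n // h, c)
        x = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")    # to frequency domain
        x = x * torch.view_as_complex(self.filter)          # spectral gating
        x = torch.fft.irfft2(x, s=(h, n // h), dim=(1, 2), norm="ortho")
        return res + x.reshape(b, n, c)


class AttentionBlock(nn.Module):
    """Standard pre-norm multi-headed self-attention block with residual."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        y = self.norm(x)
        return x + self.attn(y, y, y, need_weights=False)[0]


def spectformer_stage(dim, depth, alpha):
    """First `alpha` layers are spectral; the remaining ones are attention."""
    layers = [SpectralGatingBlock(dim) for _ in range(alpha)]
    layers += [AttentionBlock(dim) for _ in range(depth - alpha)]
    return nn.Sequential(*layers)


tokens = torch.randn(2, 196, 384)  # 14x14 patch tokens, embedding dim 384
out = spectformer_stage(384, depth=12, alpha=4)(tokens)
print(out.shape)  # torch.Size([2, 196, 384])
```

Here `alpha` stands in for the hypothesized split: spectral gating in the initial layers to capture high-frequency content such as lines and edges, and self-attention in the deeper layers to capture token interactions.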

Original language: English
Title of host publication: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
Place of Publication: U.S.A.
Publisher: IEEE
Pages: 9543-9554
Number of pages: 12
ISBN (Electronic): 9798331510831
Publication status: Published - 8 Apr 2025
Event: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025 - Tucson, United States
Duration: 28 Feb 2025 - 4 Mar 2025

Publication series

Name: Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025

Conference

Conference: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Country/Territory: United States
City: Tucson
Period: 28/02/25 - 4/03/25

Keywords

  • attention
  • fft
  • spectral gating network
  • transformer
  • vision transformer
  • vit

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Modelling and Simulation
  • Radiology Nuclear Medicine and imaging
