TY - GEN
T1 - SpectFormer: Frequency and Attention is What You Need in a Vision Transformer
T2 - 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
AU - Patro, Badri N.
AU - Namboodiri, Vinay P.
AU - Agneeswaran, Vijay S.
PY - 2025/4/8
Y1 - 2025/4/8
AB - Vision transformers have been applied successfully for image recognition tasks. There have been either multi-headed self-attention based (ViT [12], DeiT [54]), similar to the original work in textual models, or more recently based on spectral layers (FNet [29], GFNet [46], AFNO [15]). We hypothesize that spectral layers capture high-frequency information such as lines and edges, while attention layers capture token interactions. We investigate this hypothesis through this work and observe that mixing spectral and multi-headed attention layers indeed provides a better transformer architecture. We thus propose the novel SpectFormer architecture for vision transformers that has initial spectral and deeper multi-headed attention layers. We believe that the resulting representation allows the transformer to capture the feature representation appropriately, and it yields improved performance over other transformer representations. For instance, it improves the top-1 accuracy by 2% on ImageNet compared to both GFNet-H and LiT. SpectFormer-H-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art for the small version). Further, SpectFormer-H-L achieves 85.7%, which is the state of the art for the comparable base version of the transformers. We further validated the SpectFormer performance in other scenarios such as transfer learning on standard datasets such as CIFAR-10, CIFAR-100, Oxford-IIIT-Flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset and observe that SpectFormer shows consistent performance that is comparable to the best backbones and can be further optimized and improved. The source code is available at https://github.com/badripatro/SpectFormers.
KW - attention
KW - fft
KW - spectral gating network
KW - transformer
KW - vision transformer
KW - vit
UR - http://www.scopus.com/inward/record.url?scp=105003629518&partnerID=8YFLogxK
U2 - 10.1109/WACV61041.2025.00924
DO - 10.1109/WACV61041.2025.00924
M3 - Chapter in a published conference proceeding
AN - SCOPUS:105003629518
T3 - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
SP - 9543
EP - 9554
BT - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
PB - IEEE
CY - U.S.A.
Y2 - 28 February 2025 through 4 March 2025
ER -
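
Note: the abstract above describes SpectFormer's design at a high level, with initial FFT-based spectral gating layers and deeper multi-headed attention layers. Below is a minimal PyTorch sketch of that idea, assuming a GFNet-style spectral gating layer. The class names, the residual-only block structure, and the spectral/attention split parameter alpha are illustrative assumptions, not the authors' released implementation (see the GitHub link in the abstract); normalization and MLP sublayers are omitted for brevity.

import torch
import torch.nn as nn


class SpectralGatingLayer(nn.Module):
    """GFNet-style spectral gating: FFT -> learnable complex filter -> inverse FFT.

    Assumed sketch, not the authors' exact layer.
    """

    def __init__(self, h: int, w: int, dim: int):
        super().__init__()
        # One learnable complex weight per frequency bin and channel,
        # stored as two real components for view_as_complex.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) grid of patch tokens
        freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        freq = freq * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(freq, s=x.shape[1:3], dim=(1, 2), norm="ortho")


class SpectFormerSketch(nn.Module):
    """Stack with `alpha` spectral blocks first, attention blocks after,
    mirroring the initial-spectral / deeper-attention layout in the abstract.
    """

    def __init__(self, h: int = 14, w: int = 14, dim: int = 384,
                 depth: int = 12, alpha: int = 4, heads: int = 6):
        super().__init__()
        self.h, self.w = h, w
        self.spectral = nn.ModuleList(
            SpectralGatingLayer(h, w, dim) for _ in range(alpha)
        )
        self.attention = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(depth - alpha)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, h * w, dim) patch embeddings
        b, n, d = tokens.shape
        grid = tokens.view(b, self.h, self.w, d)
        for layer in self.spectral:      # frequency mixing in early layers
            grid = grid + layer(grid)    # residual connection
        x = grid.reshape(b, n, d)
        for attn in self.attention:      # token interactions in deeper layers
            x = x + attn(x, x, x, need_weights=False)[0]
        return x


# Usage example: a 14x14 token grid with 384-dim embeddings.
tokens = torch.randn(2, 14 * 14, 384)
out = SpectFormerSketch()(tokens)  # -> shape (2, 196, 384)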