Abstract
Visual question answering (VQA) is an important and challenging multimodal task in computer vision and photogrammetry. Recently, efforts have been made to bring the VQA task to aerial images, owing to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, progress in this domain is restricted by the large variation in the appearance, scale, and orientation of concepts in aerial images, along with the scarcity of well-annotated datasets. In this paper, we introduce a new dataset, HRVQA, which provides a collection of 53,512 aerial images of 1024 × 1024 pixels and 1,070,240 semi-automatically generated QA pairs. To benchmark the understanding capability of VQA models on aerial images, we evaluate recent methods on the HRVQA dataset. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially for attribute-related questions. Our method achieves superior performance compared with previous state-of-the-art approaches. The dataset and the source code are released at https://hrvqa.nl/
| Original language | English |
|---|---|
| Pages (from-to) | 65-81 |
| Number of pages | 17 |
| Journal | ISPRS Journal of Photogrammetry and Remote Sensing |
| Volume | 214 |
| Early online date | 14 Jun 2024 |
| DOIs | |
| Publication status | Published - 31 Aug 2024 |
Acknowledgements
We sincerely appreciate the valuable comments and suggestions from all reviewers, which helped us to improve the quality of the paper.

Keywords
- Benchmark dataset
- High-resolution aerial images
- Transformers
- Visual question answering
ASJC Scopus subject areas
- Computers in Earth Sciences
- Engineering (miscellaneous)
- Atomic and Molecular Physics, and Optics
- Computer Science Applications