TY - GEN
T1 - Multimodal Rationales for Explainable Visual Question Answering
AU - Li, Kun
AU - Vosselman, George
AU - Yang, Michael Ying
PY - 2025/6/10
Y1 - 2025/6/10
N2 - Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However, the inner reasoning behind the predictions is dis-regarded in such a 'black box' system, and we cannot ascertain the trustworthiness of the predictions. Even more concerning, in some cases, these models predict correct answers despite focusing on irrelevant visual regions or textual tokens. To develop an explainable and trustworthy answering system, we propose a novel model termed MRVQA (Multimodal Rationales for VQA), which provides visual and textual rationales to support its predicted answers. To measure the quality of generated rationales, a new metric vtS (visual-textual Similarity) score is introduced from both visual and textual perspectives. Considering the extra annotations distinct from standard VQA, MRVQA is trained and evaluated using samples synthesized from some existing datasets. Extensive experiments across three EVQA datasets demonstrate that MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the explainable VQA model. The code and the synthesized dataset are released under https://github.com/lik1996/MRVQA2025.
AB - Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However, the inner reasoning behind the predictions is dis-regarded in such a 'black box' system, and we cannot ascertain the trustworthiness of the predictions. Even more concerning, in some cases, these models predict correct answers despite focusing on irrelevant visual regions or textual tokens. To develop an explainable and trustworthy answering system, we propose a novel model termed MRVQA (Multimodal Rationales for VQA), which provides visual and textual rationales to support its predicted answers. To measure the quality of generated rationales, a new metric vtS (visual-textual Similarity) score is introduced from both visual and textual perspectives. Considering the extra annotations distinct from standard VQA, MRVQA is trained and evaluated using samples synthesized from some existing datasets. Extensive experiments across three EVQA datasets demonstrate that MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the explainable VQA model. The code and the synthesized dataset are released under https://github.com/lik1996/MRVQA2025.
KW - explanations
KW - multimodal reasoning
KW - visual question answering
UR - http://www.scopus.com/inward/record.url?scp=105017849290&partnerID=8YFLogxK
U2 - 10.1109/CVPRW67362.2025.00024
DO - 10.1109/CVPRW67362.2025.00024
M3 - Chapter in a published conference proceeding
AN - SCOPUS:105017849290
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 191
EP - 201
BT - Proceedings - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
PB - IEEE
CY - U. S. A.
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Y2 - 11 June 2025 through 12 June 2025
ER -