Multimodal retrieval and generation via structure enhanced context learning

  • Xi Tian

Student thesis: Doctoral Thesis (PhD)

Abstract

In recent years, the field of multimodal computer vision has experienced substantial growth, driven by the increasing diversity of multimodal data sources and the capabilities of deep learning techniques. A primary goal for deep learning researchers is to create a unified space that connects these disparate modalities. Nevertheless, the inherent heterogeneity of these modalities presents a significant challenge to their integration. For instance, image features are often continuous and high-dimensional, while text features tend to be sparse and discrete. Creating a multimodal representation that effectively captures both shared and complementary information across modalities therefore remains a difficult task.

Inspired by human perception, we investigate modality structures to advance multimodal representation learning. Humans possess an inherent ability to perceive and process the world structurally, allowing for a comprehensive understanding of different modalities such as visual, auditory, and textual information. Utilizing this structural reasoning, humans can quickly comprehend new modalities and form connections, leading to a more profound, contextual understanding of the world. Incorporating this human-like structural perception into AI systems could significantly improve their learning and reasoning abilities when dealing with multimodal data.

Structures manifest in many forms, extending beyond those found in the physical world. Physical structures include, for example, spatial relationships between objects in 2D images or part organizations in 3D shapes, while abstract structures include narrative elements in stories, film art, and linguistic representations. Our studies cover a range of structure types through three distinct tasks. Specifically, we begin by constructing and benchmarking a contextual retrieval dataset composed of structural movie storyboards to facilitate film story understanding; we then explore 2D image generation from layout structures, focusing on improving human synthesis quality; finally, we broaden our research domain to 3D colored shapes and examine structure-aware 3D shape generation from text.
Date of Award: 26 Jun 2024
Original language: English
Awarding Institution: University of Bath
Supervisors: Yongliang Yang & Peter Hall
