Abstract
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image-generation models, which limit their adaptability and scalability. In contrast to such methods, we provide a model-agnostic approach that uses intersections in diffusion trajectories, working only with the latent values. Because trajectory intersections alone could not provide localized frame-wise coherence and diversity, we adopt a grid-based approach. An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames. Based on these, we obtain a CLIP-based attention mask that controls the timing of switching the prompts for each grid cell. Earlier switching results in higher variance, while later switching results in more coherence. Our approach can therefore provide appropriate control over the trade-off between coherence and variance across frames. It achieves state-of-the-art performance while remaining flexible across diverse image-generation models. Empirical analysis using quantitative metrics and user studies confirms our model's superior temporal consistency, visual fidelity, and user satisfaction, thus providing a novel way to obtain training-free, image-based text-to-video generation. Further examples and code are available at https://djagpal02.github.io/EIDT-V/
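The per-cell prompt-switching mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (`switch_step`, `prompt_for_cell`, `grid_scores`) are hypothetical, and the change scores stand in for the CLIP-based attention mask values derived from inter-frame differences.

```python
# Hedged sketch: map a per-cell change score to a prompt-switch step.
# A high score (large inter-frame difference) switches early, giving
# more variance; a low score switches late, preserving coherence.
# All names here are illustrative, not from the paper's codebase.

TOTAL_STEPS = 50  # assumed number of diffusion denoising steps


def switch_step(change_score: float, total_steps: int = TOTAL_STEPS) -> int:
    """Step at which a cell switches from the previous frame's prompt
    to the current frame's prompt, given a change score in [0, 1]."""
    return round((1.0 - change_score) * total_steps)


def prompt_for_cell(step: int, change_score: float,
                    prev_prompt: str, next_prompt: str) -> str:
    """Prompt conditioning this grid cell at a given denoising step."""
    return next_prompt if step >= switch_step(change_score) else prev_prompt


# Toy 2x2 grid: the top-left cell changes a lot between frames,
# so it alone has already switched to the new prompt by step 10.
grid_scores = [[0.9, 0.1],
               [0.05, 0.2]]
step = 10
grid_prompts = [
    [prompt_for_cell(step, s, "frame k", "frame k+1") for s in row]
    for row in grid_scores
]
```

In this toy run, only the high-change cell (score 0.9, switch step 5) uses the new frame's prompt at step 10; the low-change cells keep the previous prompt, illustrating how the mask localizes variance to regions that actually change.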
| Original language | English |
|---|---|
| Pages (from-to) | 18219-18228 |
| Number of pages | 10 |
| Journal | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
| Early online date | 10 Jun 2025 |
| DOIs | |
| Publication status | Published - 13 Aug 2025 |
| Event | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States. Duration: 11 Jun 2025 → 15 Jun 2025 |
Funding
This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) under Grant No. EP/T518013/1 and Grant No. EP/Y021614/1. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising. The authors also acknowledge the University of Bath for providing access to the HEX high-performance computing (HPC) system, which was used for testing.
Keywords
- attention mechanisms
- diffusion models
- generative models
- large language models
- latent space
- model-agnostic methods
- prompt switching
- text-to-video
- unsupervised learning
- video generation
ASJC Scopus subject areas
- Software
- Computer Vision and Pattern Recognition