Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit https://kinam0252.github.io/TIC-FT/
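To make the key idea concrete, the following is a minimal sketch of how a training sequence might be assembled: condition frames, then buffer frames with progressively increasing noise, then target frames, all concatenated along the temporal axis. The abstract does not specify how the buffer frames are constructed, so this sketch assumes they are noised copies of the last condition frame with a linearly increasing noise scale; the function name `build_tic_ft_sequence` and all parameters are hypothetical, not the authors' implementation.

```python
import torch


def build_tic_ft_sequence(cond_frames, target_frames, num_buffer=4, max_sigma=1.0):
    """Assemble a [condition | buffer | target] sequence along the time axis.

    cond_frames   : (C, T_c, H, W) clean conditioning frames
    target_frames : (C, T_t, H, W) clean target frames to be generated
    num_buffer    : number of intermediate buffer frames (assumption)
    max_sigma     : noise scale of the last buffer frame (assumption)

    Buffer frames here are copies of the last condition frame corrupted with
    progressively increasing Gaussian noise, giving a smooth transition from
    the clean condition segment to the target segment.
    """
    last_cond = cond_frames[:, -1:]  # (C, 1, H, W)
    buffers = []
    for i in range(1, num_buffer + 1):
        sigma = max_sigma * i / num_buffer  # linearly increasing noise level
        buffers.append(last_cond + sigma * torch.randn_like(last_cond))
    buffer_frames = torch.cat(buffers, dim=1)  # (C, num_buffer, H, W)

    # Concatenate condition, buffer, and target frames along the temporal axis.
    return torch.cat([cond_frames, buffer_frames, target_frames], dim=1)


# Usage example: a 16-frame conditioning clip followed by a 16-frame target clip.
cond = torch.randn(3, 16, 64, 64)
target = torch.randn(3, 16, 64, 64)
seq = build_tic_ft_sequence(cond, target, num_buffer=4)
print(seq.shape)  # torch.Size([3, 36, 64, 64])
```

Because the concatenated sequence is just a longer video, a pretrained video diffusion model can be fine-tuned on it without architectural changes, which is the property the abstract emphasizes.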