Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation

This paper proposes Pix2Next, a novel image-to-image translation framework designed to address the challenge of generating high-quality Near-Infrared (NIR) images from RGB inputs. Our approach leverages a state-of-the-art Vision Foundation Model (VFM) within an encoder-decoder architecture, incorporating cross-attention mechanisms to enhance feature integration. This design captures detailed global representations and preserves essential spectral characteristics, treating RGB-to-NIR translation as more than a simple domain transfer problem. A multi-scale PatchGAN discriminator ensures realistic image generation at various detail levels, while carefully designed loss functions couple global context understanding with local feature preservation. We performed experiments on the RANUS dataset to demonstrate Pix2Next's advantages in quantitative metrics and visual quality, improving the FID score by 34.81% compared to existing methods. Furthermore, we demonstrate the practical utility of Pix2Next by showing improved performance on a downstream object detection task using generated NIR data to augment limited real NIR datasets. The proposed approach enables the scaling up of NIR datasets without additional data acquisition or annotation efforts, potentially accelerating advancements in NIR-based computer vision applications.
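The abstract does not specify implementation details, but the core idea of fusing VFM features into an encoder-decoder via cross-attention can be illustrated with a minimal PyTorch sketch. All names here (CrossAttentionFusion, the tensor shapes, the residual design) are illustrative assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse decoder tokens (queries) with VFM tokens
    (keys/values) through multi-head cross-attention, so the generator can
    draw on the foundation model's global representations."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, decoder_feats: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # decoder_feats: (B, N, C) flattened spatial tokens from the decoder
        # vfm_feats:     (B, M, C) tokens from the vision foundation model
        q = self.norm_q(decoder_feats)
        kv = self.norm_kv(vfm_feats)
        fused, _ = self.attn(q, kv, kv)          # queries attend to VFM tokens
        return decoder_feats + fused             # residual keeps local detail

# Usage with assumed token-grid sizes (purely illustrative):
fusion = CrossAttentionFusion(dim=256)
dec = torch.randn(2, 1024, 256)   # e.g. a 32x32 decoder feature grid
vfm = torch.randn(2, 256, 256)    # e.g. a 16x16 VFM token grid
out = fusion(dec, vfm)            # -> (2, 1024, 256)
```

The residual connection reflects a common design choice in such fusion blocks: attention injects global context from the VFM while the skip path preserves the decoder's local spectral features, consistent with the abstract's stated goal of coupling global context understanding with local feature preservation.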