JiYuan Wang Chunyu Lin Lei Sun Rongying Liu Lang Nie Mingxing Li Kang Liao Xiangxiang Chu Yao Zhao

Abstract
Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features, and ultimately achieve higher performance than their generative counterparts.

Based on these findings, we introduce FE2E, a framework that pioneeringly adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective, and we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high-precision demands of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other.

Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100× the data. The project page is available at https://amap-ml.github.io/FE2E/.
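To illustrate why logarithmic quantization helps with BFloat16's limited mantissa, here is a minimal sketch (not the paper's code; the function names and the exact encoding are assumptions). Encoding depth in log space before truncating to bf16 spreads the representable precision evenly in relative terms, so near and far depths lose a similar fraction of accuracy:

```python
import numpy as np

def to_bf16(x):
    # Emulate bfloat16 by truncating a float32 to its top 16 bits
    # (1 sign + 8 exponent + 7 mantissa bits).
    x32 = np.asarray(x, dtype=np.float32)
    bits = x32.view(np.uint32) & 0xFFFF0000
    return bits.view(np.float32)

def log_quantize(depth, eps=1e-6):
    # Store log-depth so bf16's coarse mantissa costs a roughly
    # constant *relative* error at every depth scale.
    return to_bf16(np.log(depth + eps))

def log_dequantize(encoded, eps=1e-6):
    return np.exp(encoded) - eps

# Depths spanning two orders of magnitude survive the round trip
# with small relative error at every scale.
depth = np.array([0.5, 2.0, 80.0], dtype=np.float32)
rec = log_dequantize(log_quantize(depth))
```

Storing raw metric depth in bf16 instead would concentrate the quantization error on large depth values, which is the precision conflict the abstract refers to.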