Xiamen University Lab Publishes 18 Papers at NeurIPS 2025
The Multimedia Trusted Perception and Efficient Computing Key Laboratory of the Ministry of Education at Xiamen University has had 18 papers accepted at NeurIPS 2025, one of the three top-tier international conferences in artificial intelligence and machine learning (alongside ICML and ICLR) and a CCF A-class conference. NeurIPS 2025 will be held in two locations: Mexico City from November 30 to December 5, 2025, and San Diego, USA, from December 2 to December 7, 2025. This year, the main track received 21,575 valid submissions and accepted 5,290 papers, an acceptance rate of 24.52%. The laboratory's 18 accepted papers cover cutting-edge topics spanning computer vision, multimodal learning, large language models, efficient computation, and more. Below is a summary of the papers, listed in alphabetical order by the first author's surname.

Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables
This work introduces Pan-LUT, a novel architecture that replaces complex deep neural network operations with learnable look-up tables (LUTs) to enable ultra-fast, high-resolution remote sensing image fusion under resource constraints. Using pixel intensity differences from the PAN image as index keys, the method constructs separate LUTs for spectral and local texture information, and a rotation-based data augmentation strategy during training expands the effective receptive field to enhance fine-grained detail recovery. Because LUT indexing and linear interpolation are far cheaper than convolutions or attention, Pan-LUT processes 8K-resolution images in under 1 millisecond while matching or surpassing neural network-based methods, making it highly practical for real-world deployment. Co-first authors: Zhongnan Cai (MSc, 2023), Yingying Wang (PhD, 2022); Corresponding author: Professor Xinghao Ding.

Unlocker: Disentangle the Deadlock of Learning from Label-noisy and Long-tailed Data
This paper addresses the "deadlock" in learning from long-tailed data with label noise: noise-correction methods rely on unbiased predictions, while long-tailed learning methods require accurate class distributions as priors, creating a circular dependency. To resolve it, the authors propose Unlocker, a bi-level optimization framework whose inner loop jointly optimizes noise identification and long-tail correction, while the outer loop adaptively balances the strength of bias correction. Extensive experiments show Unlocker significantly outperforms existing methods across multiple benchmarks. First author: Shu Chen (MSc, 2023); Corresponding author: Assistant Professor Yang Lu; co-authors include Xuhong Jun (BSc, 2023), Rui Chi Zhang (MSc, 2024), Dr. Mengke Li (Shenzhen University), Dr. Yonggang Zhang (HKUST), Prof. Bo Han (BNU-HKBU UIC), Prof. Xiaoming Zhang (BNU-HKBU UIC), and Prof. Hanzhi Wang.

PlanU: Large Language Model Decision Making through Planning under Uncertainty
PlanU is a planning framework for LLMs operating under uncertainty that integrates uncertainty modeling into Monte Carlo Tree Search (MCTS). It introduces two key innovations: (1) value distribution modeling, which represents returns as quantile distributions to capture uncertainty more precisely, and (2) a curiosity-driven Upper Confidence Bound (UCB) mechanism that quantifies node uncertainty to guide the search. Evaluated on the WebShop and TravelPlanner benchmarks, PlanU demonstrates superior performance, adaptability, and robustness across models. Co-first authors: Ziwei Deng (MSc, 2023), Mian Deng (MSc, 2023); Corresponding author: Long-term Faculty Member Siqu Shen; co-authors include Chenjing Liang (MSc, 2025), Zeming Gao, Ma Chenan (MSc graduate), Lin Chenxing, Zhang Haipeng, Dr. Songzhu Mei (NUDT), and Prof. Cheng Wang.
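To make the idea concrete, here is a minimal, illustrative sketch of a curiosity-augmented UCB selection rule over MCTS children that keep quantile value estimates. The function and parameter names (select_child, c_curiosity, the quantile and visit fields) are assumptions made for illustration, not taken from the paper, and PlanU's actual formulation may differ.

```python
import math
import numpy as np

def select_child(children, c_ucb=1.4, c_curiosity=0.5):
    """Pick the child node maximizing a curiosity-augmented UCB score.

    Each child dict is assumed to carry `quantiles` (estimated return quantiles),
    `visits`, and `parent_visits`. The mean of the quantiles serves as the value
    estimate; their spread serves as a simple uncertainty (curiosity) bonus.
    """
    def score(ch):
        value = float(np.mean(ch["quantiles"]))                    # expected return
        explore = c_ucb * math.sqrt(
            math.log(ch["parent_visits"] + 1) / (ch["visits"] + 1)
        )                                                           # classic UCB term
        curiosity = c_curiosity * float(np.std(ch["quantiles"]))   # uncertainty bonus
        return value + explore + curiosity

    return max(children, key=score)

# Toy usage: two candidate actions with different value spreads.
children = [
    {"quantiles": np.array([0.2, 0.5, 0.8]), "visits": 10, "parent_visits": 30},
    {"quantiles": np.array([0.4, 0.45, 0.5]), "visits": 20, "parent_visits": 30},
]
best = select_child(children)
```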
WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting
WarpGAN tackles single-image 3D GAN inversion for novel view synthesis. Prior methods rely solely on generative priors to fill occluded regions, and the limited capacity of low-bit latent codes degrades quality; this work instead integrates image inpainting into the inversion pipeline. It first projects the single input view into a 3D latent code, then warps the 3D representation to the target view using depth maps, and finally a style-based novel view inpainting network (SVINet) repairs occluded regions using symmetric priors and multi-view consistency. Quantitative and qualitative results confirm superior performance over state-of-the-art methods. First author: Kaitao Huang (MSc, 2024); Corresponding author: Professor Yan Yan; co-authors: Jing-Hao Xue (UCL), Prof. Hanzhi Wang.

Discovering Important Experts for Mixture-of-Experts Models Pruning Through a Theoretical Perspective
This paper proposes Shapley-MoE, a theoretically grounded, scalable method for pruning experts in Mixture-of-Experts (MoE) models. Inspired by cooperative game theory, it uses Shapley values to measure each expert's contribution without exhaustively evaluating every expert subset. To overcome the NP-hard complexity, it introduces a Monte Carlo approximation with two enhancements: early stopping for unstable small subsets and router-guided importance sampling that prioritizes high-impact subsets. Theoretical and empirical results show superior pruning performance and efficiency. First author: Weizhong Huang (PhD, 2025); Corresponding author: Professor Liujuan Cao; co-authors: Yuxin Zhang (PhD, 2022), Xianwu Zheng (Associate Prof.), Fei Chao (Associate Prof.), and Rongrong Ji.

DAMamba: Vision State Space Model with Dynamic Adaptive Scan
While state space models (SSMs) have gained attention in vision, most rely on fixed, hand-crafted scanning patterns that disrupt spatial relationships. This work proposes Dynamic Adaptive Scan (DAS), a data-driven method that learns the optimal scan order and regions. Built on DAS, DAMamba achieves state-of-the-art performance in image classification, object detection, instance segmentation, and semantic segmentation, surpassing many CNNs and ViTs while maintaining linear complexity. Co-first authors: Tanzhe Li (PhD, 2024), Caoshuo Li (MSc, 2023); Corresponding author: Associate Prof. Taesong Jin; co-authors: Prof. Baichang Zhang (BUAA), Prof. Rongrong Ji.

Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
To accelerate LLM inference, Spotlight Attention improves key-value (KV) cache retrieval by using non-linear hashing to better align the distributions of query and key embeddings. It introduces a lightweight training framework based on a Bradley-Terry ranking loss that trains in just 8 hours on a single 16GB GPU. Results show a fivefold reduction in hash code length and over 3× end-to-end throughput on a single A100 GPU, with retrieval taking under 100 microseconds for 512K tokens. First author: Wenhao Li (MSc, 2023); Corresponding author: Prof. Rongrong Ji; co-authors: Yuxin Zhang (PhD, 2022), Dr. Gen Luo (Shanghai AI Lab), Prof. Fei Chao, Haiyuan Wan (Tsinghua), Ziyang Gong (SJTU).
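As a rough illustration of hashing-based KV cache retrieval in the spirit of Spotlight Attention, the sketch below encodes cached keys and the current query into short binary codes and keeps only the closest keys by Hamming distance. The hash parameters W and b stand in for a learned non-linear hash; the paper's actual hash network and Bradley-Terry training objective are not reproduced here.

```python
import numpy as np

def hash_codes(x, W, b):
    """Map vectors to binary codes with a simple non-linear hash: sign(tanh(xW + b))."""
    return (np.tanh(x @ W + b) > 0).astype(np.uint8)

def retrieve_topk(query, keys, W, b, k=64):
    """Return indices of the k cached keys whose hash codes are closest
    (in Hamming distance) to the query's hash code."""
    q_code = hash_codes(query[None, :], W, b)[0]
    k_codes = hash_codes(keys, W, b)
    hamming = (k_codes != q_code).sum(axis=1)
    return np.argsort(hamming)[:k]

# Toy usage: 1,024 cached key vectors of dimension 64, hashed to 16-bit codes.
rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))
W, b = rng.normal(size=(64, 16)), rng.normal(size=(16,))
idx = retrieve_topk(rng.normal(size=64), keys, W, b, k=32)
# Attention would then be computed only over keys[idx] instead of the full cache.
```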
Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs
This paper addresses hallucinations in multi-image multimodal LLMs caused by poor cross-modal alignment. It proposes CcDPO, a hierarchical preference optimization framework that first corrects global context bias and then focuses on local visual cues. A new dataset, MultiScope-42k, supports scalable training. Experiments show a significant reduction in hallucinations and consistent gains across tasks. Co-first authors: Xudong Li (PhD, 2025), Mengdan Zhang (Tencent Youtu); Corresponding author: Engineer Yan Zhang; co-authors: Peixian Chen (Tencent Youtu), Xianwu Zheng, Xing Sun (Tencent Youtu), Prof. Rongrong Ji.

LTD-Bench: Evaluating Large Language Models by Letting Them Draw
LTD-Bench evaluates LLMs through drawing tasks, requiring models to generate shapes from text or recognize shapes from drawings. This shifts evaluation from opaque numeric metrics to intuitive visual outputs and exposes flaws in spatial reasoning. The benchmark includes three difficulty levels and two directions: language-to-space and space-to-language. Experiments show that even top models struggle, highlighting a critical gap in real-world applicability. Co-first authors: Liuhao Lin (MSc, 2023), Ke Li (Tencent Youtu); Corresponding author: Engineer Yan Zhang; co-authors: Zihan Xu (Tencent Youtu), Yuchen Shi (Tencent Youtu), Yulei Qin (Tencent Youtu), Xing Sun (Tencent Youtu), Prof. Rongrong Ji.

JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent
JarvisArt is an MLLM-powered agent that emulates professional retouchers. It uses a two-stage training process: chain-of-thought fine-tuning followed by Group Relative Policy Optimization with Reward (GRPO-R). The Agent-Lightroom Protocol enables seamless integration with Adobe Lightroom. On the new MMArt-Bench benchmark, JarvisArt outperforms GPT-4o by 60% in content fidelity while maintaining strong instruction following. Co-first authors: Yunlong Lin, Zixu Lin, Kunjie Lin (MSc, 2023-2024); Corresponding author: Prof. Xinghao Ding; co-authors include Prof. Shuicheng Yan (NUS).

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
CPPO accelerates GRPO training by pruning completions with low absolute advantage and dynamically reallocating the freed computation, achieving 8.32× and 3.51× speedups on the GSM8K and MATH datasets, respectively, without sacrificing accuracy. First author: Zhihang Lin (PhD, 2024); Corresponding author: Prof. Rongrong Ji; co-authors: Mingbao Lin (Rakuten), Prof. Yuan Xie (ECNU).

EPA: Boosting Event-based Video Frame Interpolation with Perceptually Aligned Learning
EPA replaces pixel-level supervision with perceptual alignment in a semantic feature space. It uses vision foundation models and a bidirectional event-guided module to generate high-quality interpolated frames that align with human perception. Experiments confirm its superiority in high-speed motion scenarios. First author: Yuhang Liu (PhD, 2025); Corresponding author: Associate Prof. Yongjian Deng (Beijing University of Technology); co-authors: Linghui Fu, Zhen Yang, Hao Chen (SEU), Youfu Li (CityU).

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Video-RAG enhances long video understanding by extracting visually aligned auxiliary text (audio transcription, scene text, and object detection results) from raw videos and retrieving the context relevant to each query. It boosts the performance of open-source models, and when combined with a 72B model it even surpasses commercial models such as GPT-4o and Gemini 1.5. First author: Yongdong Luo (MSc, 2023); Corresponding author: Associate Prof. Xianwu Zheng; co-authors: Jiayi Ji (Postdoc), Jinfa Huang (Rochester), Prof. Rongrong Ji.
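The retrieval step that Video-RAG describes can be pictured with a simple embedding-similarity sketch: auxiliary text chunks are embedded, the most relevant ones for a question are retrieved, and the result is prepended to the prompt. The embed function below is a random stand-in and the chunk formats are invented for illustration; the paper's actual pipeline, encoders, and thresholds may differ.

```python
import numpy as np

def cosine_topk(query_vec, doc_vecs, k=3):
    """Return indices of the k auxiliary-text chunks most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

def build_prompt(question, aux_chunks, embed):
    """Retrieve the most relevant auxiliary text (e.g. ASR, OCR, detected objects)
    and prepend it to the question before querying the video LLM."""
    doc_vecs = np.stack([embed(c) for c in aux_chunks])
    idx = cosine_topk(embed(question), doc_vecs, k=min(2, len(aux_chunks)))
    context = "\n".join(aux_chunks[i] for i in idx)
    return f"Auxiliary context:\n{context}\n\nQuestion: {question}"

# Toy usage with a random stand-in embedding (a real system would use a text encoder).
embed = lambda text: np.random.default_rng(abs(hash(text)) % (2**32)).normal(size=128)
aux_chunks = [
    "ASR: 'the chef adds two eggs'",
    "OCR: 'Step 3: whisk'",
    "Objects: bowl, whisk, egg",
]
print(build_prompt("How many eggs are used?", aux_chunks, embed))
```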
FRN: Fractal-Based Recursive Spectral Reconstruction Network
FRN treats spectral reconstruction as a fractal-inspired recursive process: it predicts the spectrum step by step from adjacent bands, exploiting the low-rank structure of spectral data. A band-aware state space model further suppresses interference from low-correlation spectral regions. FRN outperforms state-of-the-art methods across datasets. First author: Ge Meng (PhD, 2021); Corresponding author: Prof. Xinghao Ding.

L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery
L2RSI enables LiDAR place recognition across large urban areas (over 100 km²) using high-resolution remote sensing imagery. It aligns LiDAR bird's-eye views and satellite patches in a shared semantic space via semantic contrastive learning, and a spatio-temporal particle filter aggregates sequential information for precise localization. First author: Ziwei Shi (PhD, 2023); Corresponding author: Associate Prof. Yu Zang; co-authors: Xiaoran Zhang (MSc, 2024), Wenjing Xu, Prof. Yan Xia (USTC), Siqu Shen, Prof. Cheng Wang.

DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic 4D Worlds
DynamicVerse is a framework for reconstructing physical-scale, multimodal 4D scenes from internet videos. It extracts geometry, motion, masks, and descriptions using large-scale models and optimizes sequences via windowed bundle adjustment. The resulting dataset includes over 100K videos, 800K masks, and more than 10 million frames, and the method outperforms existing approaches in depth, pose, and camera-intrinsic estimation. Co-first authors: Kairun Wen, Yuzhi Huang (MSc, 2021); Corresponding author: Prof. Xinghao Ding.

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and Empirical Findings
This study identifies three inference phases in multimodal LLMs: early fusion, intra-modal modeling, and multimodal reasoning. It observes that visual tokens become redundant once the text tokens have received sufficient image context. Based on this, the authors propose DyVTE, a dynamic visual-token exit mechanism that removes the redundant visual tokens at that point, orthogonal and complementary to existing compression methods. First author: Qiong Wu (PhD, 2022); Corresponding author: Associate Prof. Yiyi Zhou; co-authors: Wenhao Lin (MSc, 2024), Weihao Ye (MSc, 2023), Zhanpeng Zeng (Assistant Prof.), Xiaoshuai Sun (Associate Prof.), Prof. Rongrong Ji.

GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization
GTR-Loc uses geospatial text (location and direction descriptions) to resolve localization ambiguities in geometrically similar scenes. A modality-removal distillation strategy transfers the text knowledge into the model, so high-accuracy localization no longer requires text input at inference time. It outperforms state-of-the-art methods on multiple outdoor datasets. First author: Shushang Yu (PhD, 2022, now Assistant Professor, Northeastern University); Corresponding author: Prof. Cheng Wang; co-authors: Wen Li (PhD), Xiaotian Sun, Zhimin Yuan (Nanyang Normal), Xin Wang (NEU), Sijie Wang (NTU), Rui She (BUAA).
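As a loose sketch of what a modality-removal distillation objective could look like, the snippet below combines a pose regression loss with a feature-matching term that pulls a LiDAR-only student toward a frozen text-assisted teacher. The function name, loss terms, and weighting are assumptions for illustration, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_pose, gt_pose, alpha=0.5):
    """Combine pose supervision with a feature-matching term that distills the
    text-assisted teacher's representation into the LiDAR-only student."""
    pose_loss = F.smooth_l1_loss(student_pose, gt_pose)        # localization supervision
    distill = F.mse_loss(student_feat, teacher_feat.detach())  # teacher is frozen
    return pose_loss + alpha * distill

# Toy usage with random tensors standing in for network outputs.
s_feat = torch.randn(8, 256, requires_grad=True)   # student features (LiDAR only)
t_feat = torch.randn(8, 256)                        # teacher features (LiDAR + text)
s_pose = torch.randn(8, 6, requires_grad=True)      # predicted 6-DoF pose
gt = torch.randn(8, 6)                              # ground-truth pose
loss = distillation_loss(s_feat, t_feat, s_pose, gt)
loss.backward()  # gradients flow into the student only
```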