35 Papers Accepted to the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) ---- Institute of Automation, Chinese Academy of Sciences
The Institute of Automation, Chinese Academy of Sciences has had 35 papers accepted to the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), reflecting advances across many areas of computer vision and pattern recognition. CVPR is one of the top three conferences in computer vision, and this year's event takes place in Vancouver, Canada. Below is a summary of the key contributions:

1. **High-Quality Video Generation Using Decomposed Diffusion Models**: This paper introduces a decomposed diffusion model that tackles video generation by exploiting the temporal correlation and content redundancy between video frames. Two jointly trained neural networks estimate a base noise shared across frames and a residual noise unique to each frame. The approach achieves superior quantitative results and supports text-to-video generation.
2. **Semantic Prompt for Few-Shot Learning**: The authors propose a semantic prompt method that improves the robustness and performance of few-shot classifiers. By using semantic information as a prompt to adaptively adjust the feature extractor, the method strengthens the discriminative power of features extracted from few-shot images, improving average classification accuracy by 3.67% across four few-shot learning benchmarks.
3. **Clothing-Change Feature Augmentation for Person Re-Identification**: This paper tackles the problem of re-identifying pedestrians who change their clothing across camera views. The proposed method, CCFA, generates augmented features in feature space that simulate clothing changes while preserving identity, improving accuracy and generalization in re-identification.
4. **DeltaEdit: Text-Free Training for Text-Driven Image Manipulation**: DeltaEdit is a framework for text-driven image editing that requires no text annotations during training. By mapping differences in CLIP visual features to StyleGAN editing directions, it enables flexible, high-quality text-driven image manipulation and outperforms existing methods in both training and inference flexibility.
5. **RiDDLE: Reversible and Diversified De-identification with Latent Encryptor**: RiDDLE is a face de-identification method that ensures both the security and the diversity of anonymized identities. A password guides the encryption process, and only the correct password can recover the original identity, strengthening privacy protection.
6. **FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection**: FrustumFormer is a 3D object detector that focuses on instance regions when transforming features from 2D to 3D space. By learning occupancy masks and reducing positional uncertainty, it achieves state-of-the-art performance in multi-view 3D detection.
7. **Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models**: This work proposes PHYCINE, a system inspired by developmental psychology that infers intrinsic physical properties of objects from visual observations. The model decouples physical concepts and improves causal reasoning about physical events.
8. **Sharpness-Aware Gradient Matching for Domain Generalization**: SAGM improves model generalization by aligning the gradients of the empirical risk and a perturbation loss, and it outperforms state-of-the-art domain generalization approaches across five benchmarks (a simplified training step in this spirit is sketched below).
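The following sketch illustrates the kind of sharpness-aware training step that item 8 builds on: the weights are perturbed along the gradient direction, a perturbation loss is computed at the shifted point, and the update mixes the two gradients. It is a minimal, generic sketch with illustrative names and hyperparameters (`rho`, `alpha`), not the exact SAGM objective, which also minimizes the gap between the empirical and perturbed losses.

```python
import torch

def sharpness_aware_step(model, loss_fn, x, y, optimizer, rho=0.05, alpha=0.5):
    """One simplified sharpness-aware training step (SAM-style sketch).

    `rho` controls the size of the weight perturbation and `alpha` how the two
    gradients are mixed; both values are illustrative, not SAGM's settings.
    """
    # Empirical loss and its gradient at the current weights
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]

    # Perturb the weights along the normalized gradient direction
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / grad_norm)

    # Perturbation loss and its gradient at the shifted weights
    optimizer.zero_grad()
    perturbed_loss = loss_fn(model(x), y)
    perturbed_loss.backward()

    # Restore the weights, then update with a mix of both gradients
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / grad_norm)
            p.grad = (1 - alpha) * g + alpha * p.grad

    optimizer.step()
    return loss.item(), perturbed_loss.item()
```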
9. **3D Video Object Detection with Learnable Object-Centric Global Optimization**: BA-Det is a 3D video object detector that exploits long-term visual correspondences through a learnable object-centric global optimization, achieving superior performance on the Waymo Open Dataset with minimal additional computational cost.
10. **BAEFormer: Bi-directional and Early Interaction Transformers for Bird's Eye View Semantic Segmentation**: BAEFormer is a Transformer framework for bird's-eye-view (BEV) semantic segmentation. Its bi-directional and early interaction mechanism effectively converts perspective views to BEV, outperforming existing methods on the nuScenes dataset.
11. **Hard Patches Mining for Masked Image Modeling**: HPM is a self-supervised masked image modeling framework that mines hard-to-reconstruct patches to create more challenging pretext tasks, improving the model's ability to generalize across diverse image content (a toy version of the patch-selection step is sketched after item 18).
12. **Blind Video Deflickering by Neural Filtering with a Flawed Atlas**: This paper presents a general framework for removing flicker from videos without task-specific guidance. The method combines a neural atlas with neural filtering to maintain temporal consistency while avoiding artifacts, performing strongly on a range of real-world videos.
13. **BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision**: BEVFormer v2 is a BEV detector that leverages perspective supervision to speed up convergence and adapt modern image backbones, achieving state-of-the-art results on the nuScenes dataset.
14. **Vision Transformer with Super Token Sampling**: SViT is a hierarchical Vision Transformer backbone that introduces super tokens to reduce redundancy and efficiently model global feature dependencies, outperforming existing models on various visual tasks.
15. **Mind the Label-shift for Augmentation-based Graph Out-of-distribution Generalization**: LiSA is a graph data augmentation strategy that addresses the label-shift problem by generating label-invariant subgraphs, significantly improving the out-of-distribution generalization of graph neural networks.
16. **OpenMix: Exploring Out-of-Distribution Samples for Misclassification Detection**: OpenMix exploits unlabeled out-of-distribution samples to make model confidence more reliable, enhancing the model's ability to detect and filter out misclassified samples and improving safety and reliability in high-risk applications.
17. **ViLEM: Visual-Language Error Modeling for Image-Text Retrieval**: ViLEM injects fine-grained image-text associations into the "dual-encoder" architecture by modeling errors between text and image representations, significantly outperforming state-of-the-art models in image-text retrieval.
18. **AUNet: Learning Relations Between Action Units for Face Forgery Detection**: AUNet learns relationships between facial action units to improve the generalizability of face forgery detection, achieving state-of-the-art performance in both in-dataset and cross-dataset evaluations.
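As referenced in item 11, the core decision in hard patches mining is which patches to hide. The toy sketch below assumes an auxiliary head has already predicted a per-patch reconstruction difficulty and simply mixes the predicted-hardest patches with random ones; the names, the 0.75 mask ratio, and the 50/50 split are illustrative, and the published framework is more elaborate (for example, the difficulty predictor is trained jointly with reconstruction).

```python
import torch

def select_hard_patches(pred_patch_loss, mask_ratio=0.75, alpha=0.5):
    """Choose which patches to mask by mixing random choices with the patches
    an auxiliary head predicts to be hardest to reconstruct.

    pred_patch_loss: (B, N) predicted reconstruction difficulty per patch.
    Returns a boolean mask of shape (B, N); True = patch is masked.
    """
    B, N = pred_patch_loss.shape
    n_mask = int(N * mask_ratio)
    n_hard = int(n_mask * alpha)   # hardest patches, by predicted difficulty
    n_rand = n_mask - n_hard       # the rest of the budget is random

    mask = torch.zeros(B, N, dtype=torch.bool)
    hard_idx = pred_patch_loss.topk(n_hard, dim=1).indices
    mask.scatter_(1, hard_idx, True)

    # Fill the remaining budget with random patches not already masked
    rand_scores = torch.rand(B, N).masked_fill(mask, -1.0)
    rand_idx = rand_scores.topk(n_rand, dim=1).indices
    mask.scatter_(1, rand_idx, True)
    return mask

mask = select_hard_patches(torch.rand(2, 196))
print(mask.sum(dim=1))  # tensor([147, 147]) -> 75% of the 196 patches are masked
```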
19. **Learning to Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization**: This paper proposes a sequential ISP hyperparameter prediction framework that exploits the order of processing modules and the similarities between parameters. It optimizes ISP hyperparameters for downstream tasks such as object detection and image segmentation, improving the efficiency and performance of image processing pipelines.
20. **High-Fidelity Clothed Avatar Reconstruction from a Single Image**: The proposed method reconstructs a 3D clothed avatar from a single image in two stages: the first stage produces a rough 3D model, and the second stage refines it with a meta-learned network, yielding high-fidelity, detailed reconstructions.
21. **OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering**: OTAvatar is a one-shot 3D talking face synthesis method that decouples pose and expression and uses a controllable tri-plane rendering scheme, enabling high-quality, identity-consistent talking-face animation.
22. **Graphics Capsule: Learning Hierarchical 3D Face Representations from 2D Images**: Graphics Capsule is an inverse-graphics capsule network that learns hierarchical 3D face representations from 2D images, assembling part-level capsules with interpretable graphics parameters (such as depth, albedo, and pose) into whole-face descriptions, and it performs strongly in face analysis.
23. **MOSO: Decomposing MOtion, Scene and Object for Video Prediction**: MOSO is a two-stage video prediction method that decomposes motion, scene, and object components, represents them with discrete codes, and models their evolution with a Transformer, achieving state-of-the-art performance on multiple benchmarks.
24. **ZBS: Zero-shot Background Subtraction via Instance-level Background Modeling and Foreground Selection**: ZBS is a zero-shot background subtraction method that uses a pre-trained vision model to detect and model objects at the instance level and then select foreground instances. The method is robust to background noise and weak lighting, significantly improving detection accuracy.
25. **Cascade Evidential Learning for Open-world Weakly-supervised Temporal Action Localization**: This paper introduces a cascaded evidential learning framework for weakly-supervised temporal action localization in open-world settings. Evidence collected from multi-scale temporal contexts and knowledge-guided prototypes improves the detection of both known and unknown actions.
26. **Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception**: The proposed method collects cross-modal presence-absence evidence to improve the localization and classification of audio-visual events under weak supervision. Leveraging the complementarity of the two modalities, it achieves state-of-the-art results on event-level visual and audio metrics.
27. **Visual-Language Prompt Tuning with Knowledge-guided Context Optimization**: KgCoOp is a prompt-tuning algorithm that uses knowledge-guided context optimization to improve the generalization of pre-trained vision-language models. By reducing the discrepancy between learnable prompts and hand-crafted prompts, it improves performance on unseen classes (a simplified loss of this form is sketched after item 28).
28. **VQACL: A Novel Visual Question Answering Continual Learning Setting**: VQACL is a new continual learning setting for visual question answering (VQA) that includes a nested task sequence and a generalization test. The accompanying method mitigates catastrophic forgetting and improves generalization, outperforming existing models on multiple benchmarks.
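The sketch referenced in item 27 shows the general shape of a knowledge-guided prompt-tuning objective: a standard classification loss computed with the learnable prompts, plus a regularizer that keeps their class embeddings close to those produced by hand-crafted prompts. Function and argument names, the temperature, and the weight `lam` are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def kg_prompt_loss(image_feat, learned_text_feat, fixed_text_feat, labels,
                   temperature=0.01, lam=8.0):
    """Knowledge-guided prompt-tuning loss (simplified sketch).

    image_feat:        (B, D) normalized image embeddings
    learned_text_feat: (C, D) class embeddings from learnable prompts
    fixed_text_feat:   (C, D) class embeddings from hand-crafted prompts
    """
    # Standard classification loss using the learnable prompts
    logits = image_feat @ learned_text_feat.t() / temperature
    ce = F.cross_entropy(logits, labels)

    # Keep the learnable prompts close to the hand-crafted (general-knowledge) ones
    reg = (1.0 - F.cosine_similarity(learned_text_feat, fixed_text_feat, dim=-1)).mean()
    return ce + lam * reg

# Toy usage with random features (D=512, C=10 classes, B=4 images)
img = F.normalize(torch.randn(4, 512), dim=-1)
learned = F.normalize(torch.randn(10, 512), dim=-1)
fixed = F.normalize(torch.randn(10, 512), dim=-1)
labels = torch.randint(0, 10, (4,))
print(kg_prompt_loss(img, learned, fixed, labels))
```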
29. **Inversion-Based Style Transfer with Diffusion Models**: This paper presents a style transfer method built on diffusion models with an attention-based inversion scheme. The approach captures the complete artistic style of a reference image and achieves state-of-the-art performance in style accuracy and model convergence.
30. **Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition**: AMFAR is a framework that actively explores multimodal complementarity to improve few-shot action recognition, using active sample selection and mutual distillation to enhance representation learning for reliable modalities.
31. **DPE: Disentanglement of Pose and Expression for General Video Portrait Editing**: DPE is a self-supervised framework that disentangles pose and expression for video portrait editing. A bidirectional cycle enables high-quality, consistent editing without paired data or a 3DMM.
32. **SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations**: SECAD-Net reconstructs 3D CAD models from raw geometric data in a self-supervised manner. By learning implicit 2D sketches and extrude operations, it produces more accurate and more compact reconstructions than existing methods.
33. **Bilateral Memory Consolidation for Continual Learning**: BiMeCo enhances memory interaction in continual learning by decoupling model parameters into short-term and long-term memory modules. Knowledge distillation and momentum-based updates between the two modules mitigate catastrophic forgetting, improving performance on challenging datasets (a minimal sketch of the momentum update appears at the end of this article).
34. **SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation**: SMOC-Net is a self-supervised monocular object pose estimation network that leverages camera pose to reduce the domain gap and training cost, outperforming existing methods in both performance and training efficiency.
35. **Open-set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework**: APF is an adversarial prototype framework for open-set semantic segmentation of 3D point clouds. It detects unseen object categories while maintaining performance on seen categories, achieving a better balance between open-set and closed-set segmentation.

Collectively, these papers represent significant advances in computer vision and pattern recognition, addressing challenges in areas such as video generation, few-shot learning, 3D reconstruction, and domain generalization. They introduce innovative techniques and frameworks that push the boundaries of current research while offering practical solutions to real-world problems.
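As a closing illustration, the momentum-based update mentioned in item 33 can be written in a few lines: a slow, long-term module tracks an exponential moving average of a fast, short-term module. The module names and the coefficient `m` below are illustrative; BiMeCo's full scheme additionally exchanges knowledge between the two modules through distillation.

```python
import copy
import torch

def momentum_update(short_term, long_term, m=0.999):
    """EMA-style consolidation of a fast (short-term) module into a slow
    (long-term) module; `m` is an illustrative momentum coefficient."""
    with torch.no_grad():
        for p_long, p_short in zip(long_term.parameters(), short_term.parameters()):
            p_long.mul_(m).add_(p_short, alpha=1.0 - m)

# Usage: the long-term module starts as a copy of the short-term one and is
# consolidated after each optimization step of the short-term module.
short_term = torch.nn.Linear(128, 10)
long_term = copy.deepcopy(short_term)
momentum_update(short_term, long_term, m=0.999)
```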