Command Palette
Search for a command to run...
Papers
Daily updated cutting-edge AI research papers to help you keep up with the latest AI trends

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation































Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation






























LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
FASTER: Rethinking Real-Time Flow VLAs
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Efficient Reasoning with Balanced Thinking
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Complementary Reinforcement Learning
Alignment Makes Language Models Normative, Not Descriptive
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
In-Context Watermarks for Large Language Models
WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Demystifing Video Reasoning
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
InCoder-32B: Code Foundation Model for Industrial Scenarios
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Mixture-of-Depths Attention
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
The Y-Combinator for LLMs: Solving Long-Context Rot with λ-Calculus
ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
FASTER: Rethinking Real-Time Flow VLAs
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Efficient Reasoning with Balanced Thinking
Look Before Acting: Enhancing Vision Foundation Representations for Vision-Language-Action Models
Complementary Reinforcement Learning
Alignment Makes Language Models Normative, Not Descriptive
MosaicMem: Hybrid Spatial Memory for Controllable Video World Models
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes
In-Context Watermarks for Large Language Models
WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Demystifing Video Reasoning
Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
InCoder-32B: Code Foundation Model for Industrial Scenarios
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
Mixture-of-Depths Attention