Baidu Makes a Move! Its OCR Model PaddleOCR-VL Breaks Through the Limitations of Pipeline and End-to-End Methods; the Facial Emotion Recognition Dataset Empowers AI to Understand Facial Expressions.

The complexity of modern document content presents greater challenges to parsing technologies: documents often incorporate lengthy texts, complex charts, professional formulas, multiple languages, and may have irregular layouts. Therefore, efficient and accurate document parsing has become an indispensable key technology.
Current research in document parsing mainly follows two technical paths. The first is a pipeline approach built from modular expert models. While these methods perform stably on specific tasks, their drawbacks are becoming increasingly apparent: the system architecture is complex, errors accumulate across processing stages, and their capabilities hit an inherent ceiling on highly complex documents. The second is an end-to-end approach based on multimodal large models. Although designed to simplify the workflow and enable global optimization, it often runs into problems in practice, such as disordered reading order and hallucinated content when handling long documents or complex layouts. Moreover, the enormous computational cost of long-sequence output limits its feasibility for deployment in real-world scenarios.
Facing these real-world challenges, Baidu has launched PaddleOCR-VL, a high-performance, resource-efficient document parsing model built on a visual language model. Its core component is the compact and powerful PaddleOCR-VL-0.9B, which integrates a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. The model efficiently supports 109 languages and excels at recognizing complex elements such as text, tables, formulas, and charts, while maintaining extremely low resource consumption.
Comprehensive evaluations show that PaddleOCR-VL achieves state-of-the-art (SOTA) performance in both page-level document parsing and element-level recognition. It is strongly competitive with top visual language models while being better suited to deployment and application in real-world scenarios.
The HyperAI website now features "PaddleOCR-VL: Multimodal Document Parsing," so give it a try!
Online use: https://go.hyper.ai/3OjbB
A quick overview of hyper.ai's official website updates from November 17th to November 21st:
* High-quality public datasets: 6
* Selection of high-quality tutorials: 3
* This week's recommended papers: 5
* Community article interpretations: 5
* Popular encyclopedia entries: 5
* Top conferences with December deadlines: 2
Visit the official website: hyper.ai
Selected Public Datasets
1. HumanSense Benchmark dataset
HumanSense Benchmark is a human perception evaluation benchmark dataset released by Xi'an Jiaotong University in conjunction with Ant Group. It aims to comprehensively measure the real-world interaction capabilities of models under the fusion of multimodal information such as vision, audio, and text.
Direct use: https://go.hyper.ai/9drzT
2. EditReward-Bench Image Editing Evaluation Dataset
EditReward-Bench is a systematic evaluation benchmark for image editing reward models, jointly released by the University of Science and Technology of China, the Institute of Automation of the Chinese Academy of Sciences, and the Beijing Academy of Artificial Intelligence. It aims to comprehensively evaluate the discriminative ability of reward models from three core dimensions: instruction compliance, consistency maintenance, and overall quality. The dataset contains 3,072 expert-annotated preference comparison data points, comprehensively covering common and complex real-world application scenarios.
Direct use: https://go.hyper.ai/OEVRn
3. UNO-Bench full-modal evaluation benchmark dataset
UNO-Bench, released by Meituan's LongCat team, is the first unified multimodal evaluation benchmark designed to efficiently assess unimodal and multimodal understanding capabilities. The dataset contains 1250 multimodal samples with 98% cross-modal solvability and 2480 unimodal samples, covering 44 task types and 5 modality combinations. The dataset also includes a general scoring model that supports automated evaluation of 6 question types, providing a unified evaluation standard for multimodal tasks.
Direct use: https://go.hyper.ai/gIcIK
4. VERA Speech Reasoning Evaluation Dataset
VERA is a large-scale, multi-task speech dataset released by Duke University in collaboration with Adobe. It is designed to evaluate the reasoning capabilities of large models in voice-native settings. All samples are presented in native speech form, with audio synthesized by Boson AI's Higgs Audio 2 to ensure consistent, clear, and high-quality speech.
Direct use: https://go.hyper.ai/AfgW5
5. Facial Emotion Recognition Dataset
Facial Emotion Recognition is a dataset for facial emotion classification, intended for training and evaluating emotion recognition models. It covers seven basic emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. The data is consolidated from the publicly available FER2013 and RAF-DB datasets; facial images were filtered with a HaarCascade detector (at roughly 0.8 confidence) and then denoised and quality-enhanced. A minimal filtering sketch is shown after the link below.
Direct use: https://go.hyper.ai/z5x5N
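The dataset card only states that faces were filtered with a HaarCascade detector, so here is a minimal sketch of what such a filtering step might look like using OpenCV. The cascade file, detector parameters, and directory layout are assumptions, and the ~0.8 confidence criterion mentioned above is not reproduced exactly.

```python
# Minimal sketch: keep only images in which OpenCV's Haar cascade finds a face.
# Paths and detector parameters are illustrative assumptions, not the dataset's recipe.
import cv2
from pathlib import Path

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def has_face(image_path: str) -> bool:
    """Return True if the Haar cascade detects at least one face in the image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return False
    faces = cascade.detectMultiScale(img, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0

# "raw_images" is a hypothetical folder of candidate face crops.
kept = [p for p in Path("raw_images").glob("*.jpg") if has_face(str(p))]
print(f"kept {len(kept)} images containing a detectable face")
```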

6. AutoDock-GPU_Output docking result dataset
AutoDock-GPU_Output is a sample docking output log (.dlg) generated by running AutoDock-GPU. It contains information such as binding energies, conformation clustering, and final ligand poses. It serves as a reference dataset for parsing docking results and for checking that an environment is configured correctly; a minimal parsing sketch is shown after the link below.
Direct use: https://go.hyper.ai/zz7wV
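As a rough illustration of what parsing such a log can look like, the sketch below scans a .dlg file for estimated binding free energies. The exact line format varies between AutoDock versions, so the regular expression and the file name are assumptions to adapt to the actual output.

```python
# Minimal sketch: extract estimated binding free energies from an AutoDock .dlg log.
# The "Estimated Free Energy of Binding" pattern and the file name are assumptions.
import re

def parse_binding_energies(dlg_path: str) -> list[float]:
    energies = []
    pattern = re.compile(r"Estimated Free Energy of Binding\s*=\s*([-+]?\d+\.\d+)")
    with open(dlg_path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            match = pattern.search(line)
            if match:
                energies.append(float(match.group(1)))
    return energies

energies = parse_binding_energies("example.dlg")  # hypothetical log file
if energies:
    print(f"{len(energies)} poses parsed, best energy: {min(energies):.2f} kcal/mol")
```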
Selected Public Tutorials
1. PaddleOCR-VL: Multimodal Document Parsing
PaddleOCR-VL is a state-of-the-art (SOTA) and resource-efficient model designed specifically for document parsing tasks. Its core component is PaddleOCR-VL-0.9B, a compact and powerful visual language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model, enabling accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements such as text, tables, formulas, and charts, while maintaining extremely low resource consumption.
Run online: https://go.hyper.ai/3OjbB
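For readers who prefer running the model locally rather than through the online tutorial, the sketch below follows the usage pattern documented for the PaddleOCR-VL pipeline in the paddleocr package. Treat the class and method names as assumptions to verify against the official docs, since they may change between releases.

```python
# Minimal local-usage sketch based on PaddleOCR-VL's documented pipeline;
# class/method names may differ by paddleocr version, so verify before use.
from paddleocr import PaddleOCRVL  # assumed entry point for the VL pipeline

pipeline = PaddleOCRVL()
results = pipeline.predict("sample_document.png")  # hypothetical input page

for res in results:
    res.print()                                # dump recognized elements to stdout
    res.save_to_markdown(save_path="output")   # export the parsed page as Markdown
```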

2. LongCat-Video: Meituan's open-source AI video generation model
LongCat-Video is an open-source AI video generation model with 13.6 billion parameters developed by Meituan's LongCat team. It excels in tasks such as text-to-video, image-to-video, and video continuation, particularly in efficiently generating high-quality long videos. Through multi-reward reinforcement learning optimization (GRPO), the model demonstrates performance comparable to leading open-source video generation models and state-of-the-art commercial solutions in internal and public benchmark tests.
Run online: https://go.hyper.ai/3DWbb

3. Deploying VibeThinker-1.5B using vLLM + OpenWebUI
VibeThinker-1.5B is the first open-source large model released by Weibo AI. Its capabilities do not come from simply piling on parameters, but from the SSP training approach proposed by Weibo's developers, which encourages the model to explore as many solution paths as possible during learning rather than focusing solely on accuracy. Reinforcement learning is then used to optimize the policy efficiently, locking in the correct paths and maximizing model performance. A minimal client-side sketch for the vLLM deployment is shown after the link below.
Run online: https://go.hyper.ai/PAcy1
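Since the tutorial serves the model through vLLM's OpenAI-compatible endpoint, a minimal client-side query might look like the sketch below. The base URL, model ID, and prompt are assumptions, and the server is presumed to have been started separately (for example with `vllm serve`).

```python
# Minimal sketch: query a vLLM server exposing the OpenAI-compatible API.
# Assumes the server is already running locally; the model ID below is an
# assumption, so confirm it via GET /v1/models on your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="VibeThinker-1.5B",  # assumed model ID
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```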

This Week's Paper Recommendations
1. Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of foundational models for high-resolution image and 10-second video synthesis. The framework comprises three core model families: Kandinsky 5.0 Image Lite—a set of image generation models with 6 billion parameters; Kandinsky 5.0 Video Lite—a lightweight and efficient text-to-video and image-to-video generation model with 2 billion parameters; and Kandinsky 5.0 Video Pro—a model with 19 billion parameters capable of achieving exceptional video generation quality.
Paper link: https://go.hyper.ai/cpPY4
2. P1: Mastering Physics Olympiads with Reinforcement Learning
This paper proposes the P1 series of open-source physics inference models, which are trained entirely through reinforcement learning (RL). Among them, P1-235B-A22B is the first open-source model to achieve gold medal-level performance in the 2025 International Physics Olympiad (IPhO 2025), and it won 12 gold medals in 13 international and regional physics competitions in 2024 and 2025.
Paper link: https://go.hyper.ai/434Df
3. VIDEOP2R: Video Understanding from Perception to Reasoning
This paper proposes VideoP2R, a novel, procedural video reinforcement learning fine-tuning framework that enhances video reasoning capabilities by modeling perception and reasoning as two independent processes. Extensive experiments demonstrate that VideoP2R achieves state-of-the-art performance on six out of seven video reasoning and understanding benchmarks.
Paper link: https://go.hyper.ai/0CChs
4. Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
This paper introduces Uni-MoE 2.0, a fully open-source general-purpose omnimodal large model (OLM). This model significantly advances the technological evolution of Uni-MoE in language-centric multimodal understanding, reasoning, and generation capabilities. Extensive evaluations across 85 benchmarks demonstrate that this model achieves or approaches the state-of-the-art (SOTA) performance of current leading OLM models. In over 50 out of 76 benchmarks, it surpasses Qwen2.5-Omni, which has a training dataset of 1.2 trillion tokens.
Paper link: https://go.hyper.ai/wETcQ
5. Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models
This paper proposes Think-at-Hard (TaH), a dynamic latent thinking mechanism that performs deeper iteration only on hard-to-predict tokens. The method introduces a lightweight neural decider that triggers latent iterations only at tokens where the standard forward pass is likely to be wrong. During these latent iterations, a Low-Rank Adaptation (LoRA) module shifts the LLM's objective from general next-token prediction to focused refinement of hard tokens.
Paper link: https://go.hyper.ai/jp3xw
More AI frontier papers: https://go.hyper.ai/iSYSZ
Community Article Interpretations
1. Interdisciplinary innovation far beyond human capabilities? AI scientists propose hypotheses, run experiments, and present at top conferences, ushering in a new paradigm for scientific research.
In August 2024, Sakana AI, founded by Llion Jones, one of the authors of the Transformer paper, launched the world's first "AI scientist," capable of autonomously proposing research questions, designing experiments, and writing papers, causing a stir in the global scientific community. From automated experiments to autonomous discovery, AI is leaping from a research assistant to a "co-researcher." How will the future of science be rewritten when AI enters the laboratory?
View the full report: https://go.hyper.ai/ICpf1
2. Online Tutorial | Object Detection Enters the Era of "Global Awareness": Tsinghua University and Others Release YOLOv13, Achieving Breakthroughs in Both Speed and Accuracy
A joint research team from Tsinghua University, Taiyuan University of Technology, and Xi'an Jiaotong University has proposed a novel object detection model, YOLOv13, which extends correlation modeling from pairwise (binary) relations to genuinely high-order structures. The results show that YOLOv13 achieves significant improvements on MS COCO, from small models (the N series) to large ones, reaching state-of-the-art detection performance with fewer parameters and FLOPs. Specifically, YOLOv13-N improves mAP by 3.0% over YOLOv11-N and by 1.5% over YOLOv12-N.
View the full report: https://go.hyper.ai/W4vib
3. Breakthrough in Image Geolocation! The University of Maine, Google, OpenAI, and others have proposed the LocDiff framework, achieving precise global positioning without the need for grids or reference libraries.
A joint team comprised of the University of Maine, Google, and Harvard University proposed the "Spherical Harmonic Dirac Function (SHDD)" and its integrated framework LocDiff. By constructing an encoding method and diffusion architecture adapted to spherical geometry, it achieves accurate localization without relying on preset grids or external image libraries, providing a groundbreaking technical path for the field.
View the full report: https://go.hyper.ai/Ucsq8
4. From 9,874 papers to 15,000 crystal structures, MOF-ChemUnity reconstructs the panoramic knowledge of MOF, propelling materials discovery into the era of "interpretable AI".
A research team from the University of Toronto and the Clean Energy Innovation Research Centre of the National Research Council of Canada proposed MOF-ChemUnity, a structured, scalable, and extensible knowledge graph. The method uses LLMs to establish a reliable one-to-one mapping between MOF names (and their synonyms) in the literature and the crystal structures registered in the CSD, thereby disambiguating names, synonyms, and structures.
View the full report: https://go.hyper.ai/cRR1o
5. From dry cleaners to the Queen Elizabeth Prize for Engineering, Fei-Fei Li defies the Silicon Valley tech myth, focusing on the risk of AI dehumanization.
In the spring of 2025, Fei-Fei Li was awarded the Queen Elizabeth Prize for Engineering in recognition of her foundational contributions to computer vision and deep learning. As a key figure behind the ImageNet project, she pioneered data-driven visual recognition methods and proposed a "human-centered" AI philosophy, remaining vigilant about AI ethics, social value, and the risk of dehumanization amid Silicon Valley's commercialization wave. At the same time, her minority status places her in a delicate position between scientific achievement and industrial practice, sparking ongoing debate.
View the full report: https://go.hyper.ai/bRu25
Popular Encyclopedia Articles
1. DALL-E
2. HyperNetworks
3. Pareto Front
4. Bidirectional Long Short-Term Memory (Bi-LSTM)
5. Reciprocal Rank Fusion
We have compiled hundreds of AI-related terms here to help you understand "artificial intelligence":
Top conferences with December deadlines

One-stop tracking of top AI academic conferences: https://go.hyper.ai/event
That's all for this week's editor's picks. If you have resources you would like to see featured on the hyper.ai official website, feel free to leave a message or submit an article to let us know!
See you next week!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the data science infrastructure for China and to providing rich, high-quality public resources for domestic developers. So far, we have:
* Provided accelerated download nodes in China for 1,800+ public datasets
* Published 600+ classic and popular online tutorials
* Interpreted 200+ AI4Science paper cases
* Supported search for 600+ related terms
* Hosted China's first complete Chinese documentation for Apache TVM
Visit the official website to start your learning journey: