HyperAI

2.5k Questions! HLE Has Made a Breakthrough in Building a Precise Evaluation System for Large Language Models; Jan-Nano, a Lightweight Large Language Model With 4 Billion Parameters, Is Designed for Deep Research Tasks


In recent years, large language models (LLMs) have made breakthrough progress and can now handle diverse tasks such as answering questions and creating content. Benchmarks are important tools for tracking how LLM capabilities develop, and they serve as a reference for improving those capabilities. However, today's popular benchmarks are not difficult enough: cutting-edge LLMs achieve similarly high scores on many existing evaluations, which limits how accurately LLM capabilities can be measured and obscures the remaining room for improvement.

Based on this, the Center for AI Safety and Scale AI jointly released Humanity's Last Exam (HLE), a multimodal benchmark dataset of difficult human-written problems. It aims to build the ultimate closed-ended evaluation system covering the frontiers of human knowledge. The dataset consists of 2,500 questions from dozens of subject areas and is intended to provide an accurate and effective standard for measuring LLM capabilities, clarify the gap between current LLMs and expert-level academic knowledge, and help LLMs improve more quickly at the frontiers of knowledge.

The "HLE Human Problem Reasoning Benchmark Dataset" is now available on the HyperAI official website. Come and try it~

Dataset download:

https://go.hyper.ai/U4YbM

Updates on the hyper.ai official website from July 14 to July 18:

* High-quality public datasets: 10

* High-quality tutorial selection: 5

* This week's recommended papers: 5

* Community article interpretations: 5

* Popular encyclopedia entries: 5

* Top conferences with deadlines in July: 4

Visit the official website: hyper.ai

Selected public datasets

1. GSM8K Mathematical Reasoning Dataset

GSM8K is a mathematical reasoning dataset released by OpenAI in 2021, which aims to improve the performance of machine learning models in understanding and solving multi-step mathematical problems. The dataset contains 8.5k high-quality, linguistically diverse elementary school math word problems, covering algebra, arithmetic, geometry and other fields. Each problem takes 2 to 8 steps to solve, and a solution mainly consists of a sequence of simple calculations using basic arithmetic operations (addition, subtraction, multiplication, and division) to reach the final answer.

Direct use: https://go.hyper.ai/ZqNLt
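Because a GSM8K solution is a chain of basic arithmetic steps, it can be checked mechanically. The sketch below assumes the commonly distributed solution format, in which each calculation is annotated as `<<expression=result>>` and the final answer follows a `####` marker; the sample text and the function names are illustrative, so verify the format against the copy you download.

```python
import ast
import operator
import re

# Arithmetic operators allowed inside a calculation annotation.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed arithmetic expression without calling eval()."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    raise ValueError("unsupported expression")


def extract_final_answer(solution):
    """Return the text after the last '####' marker, without thousands separators."""
    parts = solution.rsplit("####", 1)
    if len(parts) != 2:
        raise ValueError("no '####' final-answer marker found")
    return parts[1].strip().replace(",", "")


def calculations_are_consistent(solution):
    """Recompute every <<expression=result>> annotation and compare the results."""
    for expr, result in re.findall(r"<<([^=<>]+)=([^<>]+)>>", solution):
        value = _evaluate(ast.parse(expr.strip(), mode="eval"))
        if abs(value - float(result)) > 1e-6:
            return False
    return True


# Illustrative solution text written in the assumed format.
SAMPLE_SOLUTION = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
```

Calling `extract_final_answer(SAMPLE_SOLUTION)` gives the string to compare against a model's prediction, while `calculations_are_consistent` flags solutions whose intermediate arithmetic does not add up.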

2. Crops Disease Crop Disease Dataset

Crops Disease is an agricultural crop disease image dataset designed to help develop computer vision models to automatically detect and classify diseases of different crops. The dataset contains about 1,300 crop disease images, covering common diseases of a variety of crops (such as corn, tomatoes, potatoes, etc.), and each image is annotated with a specific disease category.

Direct use: https://go.hyper.ai/GEDTA


3. OpenScience Multi-domain Synthetic Dataset

OpenScience is a multi-domain synthetic dataset released by NVIDIA in 2023, which aims to improve model accuracy on advanced benchmarks such as GPQA-Diamond and MMLU-Pro through supervised fine-tuning or reinforcement learning. The dataset contains 6 million multiple-choice question-answer pairs with detailed reasoning traces, covering multiple fields such as STEM, law, economics, and the humanities.

Direct use: https://go.hyper.ai/YvAo7

4. Skywork-OR1-RL Math and Coding Reasoning Dataset

Skywork-OR1-RL is a math and coding reasoning dataset designed to train the Skywork-OR1 (Open Reasoner 1) series of math and code reasoning models. The dataset contains 105k math problems and 14k coding problems that are verifiable, challenging, and diverse.

Direct use: https://go.hyper.ai/mxoAv

5. Bird Species Bird Classification Image Dataset

Bird Species is a bird image classification dataset suitable for training computer vision models to identify and classify bird species. The dataset contains 7 different species, each with 1,200 images. The images of each species show that bird's feather pattern, color, and body structure. Some of the images are intentionally blurred, tilted, or contain 2 birds of different species, which mimics real-world complexity and helps make models more robust when classifying birds in natural environments.

Direct use: https://go.hyper.ai/X2X2M


6. NextCoder Code Editing Dataset

NextCoder is a synthetic conversational code-editing dataset released by Microsoft in 2025. It is used for fine-tuning large language models to improve their performance in code repair, refactoring, and optimization, and it is well suited to training AI programming assistants and improving code reading and multi-turn interaction capabilities. The dataset contains about 381k single-turn instruction samples (NextCoderDataset) and 57k multi-turn dialogue samples (Conversational version), covering 8 languages such as Python and Java.

Direct use: https://go.hyper.ai/e4MIs

7. Psych-101 Psychology Knowledge Question Answering Dataset

Psych-101 is a psychological knowledge question-answering dataset designed to help develop natural language processing models for psychological knowledge question-answering tasks and promote psychology-related AI research, especially in psychology education, sentiment analysis, and mental health applications. The dataset contains trial-by-trial data from 160 psychological experiments and 60,092 participants, totaling 10,681,650 choices.

Direct use: https://go.hyper.ai/NUshw

8. Leukemia Leukemia Image Dataset

Leukemia is a leukemia cell image dataset designed for training computer vision models to automatically detect and classify leukemia cells. The dataset contains 6,778 cell images, including normal cells (3,389) and leukemia cells (3,389).

Direct use: https://go.hyper.ai/Lwxwj


9. X-Ray Images for Chest Pneumonia X-Ray Image Dataset

X-Ray Images for Chest Pneumonia is a dataset of chest X-ray images designed to train and evaluate computer vision models to help automated diagnostic systems detect respiratory diseases such as pneumonia. The dataset contains approximately 5,800 chest X-ray images divided into two categories: normal and pneumonia (bacterial and viral).

Direct use: https://go.hyper.ai/Pgra4


10. Soil Moisture Soil Moisture Image Dataset

Soil Moisture is a measurement-based soil moisture dataset that aims to study the impact of soil moisture on crop growth, optimize irrigation systems, and improve agricultural production efficiency. It also has important applications in areas such as climate change and water resource management. The dataset contains 200 soil surface images from the rain-fed agricultural area of Bondowoso, Indonesia.

Direct use: https://go.hyper.ai/TtpgP


Selected Public Tutorials

This week we have compiled 4 categories of quality public tutorials:

* AI for Science tutorials: 2

* Text recognition tutorials: 1

* Multimodal tutorials: 1

* Large model tutorials: 1

AI for Science Tutorial

1. RFdiffusion: a diffusion protein design model

RFdiffusion is a protein structure generation framework: it uses RoseTTAFold as the backbone network and introduces a denoising diffusion probabilistic model (DDPM) to design new protein structures from scratch. The framework is able to design proteins with complex shapes (built from elements such as α-helices and β-sheets) and to accurately scaffold the catalytic sites of enzymes.

Run online: https://go.hyper.ai/q7Ajs

2. Biomni: The first general biomedical agent

Biomni is a general-purpose biomedical AI agent that can autonomously complete complex research tasks across multiple biomedical branches such as genetics, genomics, microbiology, pharmacology and clinical medicine, marking a new stage in the development of AI-driven scientific discovery.

Run online: https://go.hyper.ai/aameS

Text Recognition Tutorial

1. OCRFlux-3B: Intelligent Text Recognition Toolkit

OCRFlux-3B is a toolkit based on a multimodal large language model for converting PDFs and images into clean, readable, plain Markdown text. The tool not only provides page-level text conversion capabilities, but also supports the merging of tables and paragraphs across pages, providing powerful support for processing complex document structures.

Run online: https://go.hyper.ai/BGqmR


Multimodal Tutorial

1. EvoSearch-codes: Evolutionary Algorithm Framework

EvoSearch-codes is an Evolutionary Search method launched by the Hong Kong University of Science and Technology and Kuaishou's Kling team. It significantly improves a model's generation quality by increasing the amount of computation at inference time, supports both image and video generation, and works with state-of-the-art diffusion-based and flow-based models. The method achieves significant gains on a range of tasks without any training or gradient updates, and it shows good scalability, robustness, and generalization.

Run online: https://go.hyper.ai/zjzrE
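The general idea of spending extra inference-time computation on a gradient-free evolutionary search can be sketched as a simple select-and-mutate loop. This is a toy illustration and not the EvoSearch implementation: candidates are plain numbers instead of diffusion latents, and `toy_score`, `keep`, and `mutation_scale` are hypothetical names chosen for the example.

```python
import random


def evolutionary_search(score, population, generations, keep, mutation_scale, rng):
    """Repeatedly keep the best candidates and refill the pool with mutated copies.

    No gradients are used; `score` is treated as a black box, so more generations
    simply means more computation spent at inference time.
    """
    population = list(population)
    size = len(population)
    for _ in range(generations):
        # Selection: survivors are carried over unchanged, so the best score never drops.
        survivors = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: perturb randomly chosen survivors with Gaussian noise.
        children = [
            rng.choice(survivors) + rng.gauss(0.0, mutation_scale)
            for _ in range(size - keep)
        ]
        population = survivors + children
    return max(population, key=score)


def toy_score(x):
    """Hypothetical reward standing in for a quality metric; it peaks at x = 3."""
    return -(x - 3.0) ** 2


rng = random.Random(0)
initial = [rng.uniform(-10.0, 10.0) for _ in range(16)]
best = evolutionary_search(toy_score, initial, generations=30, keep=4,
                           mutation_scale=0.5, rng=rng)
```

In a real generative setting the candidates would be initial noises or intermediate states and the score would come from a reward model, but the loop structure (evaluate, select, mutate, repeat) stays the same.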


Large Model Tutorial

1. Jan-Nano: A compact research-specific language model

Jan-Nano is a lightweight 4-billion-parameter large language model released by the Menlo Research team on July 1, 2025. It is designed for deep research tasks and optimized to work with Model Context Protocol (MCP) servers, enabling efficient integration with a variety of research tools and data sources.

Run online: https://go.hyper.ai/mC8gx


💡 We have also set up a Stable Diffusion tutorial discussion group. Scan the QR code and add the note [SD tutorial] to join the group, discuss technical issues, and share your results~

This week's recommended papers

1. Test-Time Scaling with Reflective Generative Model

This paper introduces MetaStone-S1, the first reflective generative model, which achieves the performance level of OpenAI o3 through a self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next token prediction and process scoring respectively, SPRM successfully integrates the policy model and process reward model (PRM) into a unified interface without the need for additional process annotations, reducing PRM parameters by more than 99%, thus achieving efficient inference.

Paper link: https://go.hyper.ai/zFLhf

2. Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

This paper proposes a two-stage paradigm based on Qwen2.5-VL-7B: large-scale linguistic cold-start fine-tuning first, followed by nearly 1,000 steps of multimodal reinforcement learning (RL), a scale that exceeds all previous open-source attempts. The final model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a range of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision, and 54.6% on MathVerse.

Paper link: https://go.hyper.ai/WucU8

3. Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

The researchers found that although Qwen2.5 performs well in mathematical reasoning, its pre-training on a large-scale web corpus makes it vulnerable to data contamination in popular benchmarks, which in turn undermines the reliability of results obtained on those benchmarks. To address this problem, the researchers introduced a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, resulting in a clean dataset called RandomCalculation. Using this leak-free dataset, they show that only accurate reward signals consistently improve performance, while noisy or erroneous signals do not.

Paper link: https://go.hyper.ai/WZp4V
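To make the idea of a contamination-free arithmetic generator concrete, here is a minimal sketch. It is not the paper's generator: the `Problem` class, the `steps` and `max_operand` parameters, and the exact-match reward are assumptions chosen for illustration, with `steps` standing in for length and `max_operand` for difficulty.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    expression: str  # fully parenthesized arithmetic expression
    answer: int      # ground truth computed while the expression is built


def generate_problem(steps, max_operand, rng):
    """Build a random expression with `steps` operations over integers up to `max_operand`."""
    if steps < 1 or max_operand < 1:
        raise ValueError("steps and max_operand must be positive")
    value = rng.randint(1, max_operand)
    expression = str(value)
    for _ in range(steps):
        op = rng.choice("+-*")
        operand = rng.randint(1, max_operand)
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
        # Parenthesize so the text evaluates in the same left-to-right order.
        expression = f"({expression} {op} {operand})"
    return Problem(expression, value)


def exact_match_reward(problem, predicted):
    """An accurate reward signal: 1.0 only when the prediction equals the ground truth."""
    return 1.0 if predicted == problem.answer else 0.0


rng = random.Random(0)
problem = generate_problem(steps=5, max_operand=100, rng=rng)
```

Because every problem is sampled fresh and its answer is computed during construction, the problems cannot have appeared in a pre-training corpus, and the reward can be verified exactly.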

4. NeuralOS: Towards Simulating Operating Systems via Neural Generative Models

This paper introduces NeuralOS, a neural framework that simulates the graphical user interface (GUI) of an operating system by directly predicting screen frames in response to user input such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN) to track the computer state and a diffusion-based neural renderer to generate screen images. It provides a path to create fully adaptive, generative neural interfaces for future human-computer interaction systems.

Paper link: https://go.hyper.ai/hceCb

5. CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

This paper proposes a neural rendering method that represents scenes as "Compressive Light-Field Tokens (CLiFTs)", which preserve the rich appearance and geometry of the scene. CLiFT achieves compute-efficient rendering by using compressed tokens, and a single trained network can vary the number of tokens used to represent a scene or render novel views.

Paper link: https://go.hyper.ai/aqzHX

More AI frontier papers: https://go.hyper.ai/iSYSZ

Community article interpretations

1. Selected for ICML 2025, Meta/Cambridge/MIT proposed the all-atom diffusion Transformer framework, which realizes the unified generation of periodic and non-periodic atomic systems for the first time

A joint research team from Meta FAIR, Cambridge University and MIT proposed the all-atom diffusion Transformer ADiT, which broke the modeling barriers between periodic and non-periodic systems. Through two major innovations, all-atom unified latent representation and Transformer latent diffusion, it achieved a breakthrough in generating molecules and crystals with a single model.

View the full report: https://go.hyper.ai/Dnw5r

2. Simultaneously process protein main chain and side chain information, Stanford et al. achieve full-atom structure modeling based on message passing neural network

The Stanford University team and the Arc Institute in Palo Alto, California jointly proposed a new protein sequence design method FAMPNN (Full-Atom MPNN), which can explicitly model the sequence identity and side chain structure of each amino acid residue. The model uses a message passing architecture based on graph neural networks, combined with improved MPNN and GVP modules for full-atom encoding, which can simultaneously process the main chain and side chain information of proteins.

View the full report: https://go.hyper.ai/x04Am

3. Online Tutorial | 150 professional tools/59 databases/105 software packages, Biomni surpasses expert-level efficiency in 8 real research tasks

Stanford University, in collaboration with Genentech, Arc Institute, UCSF and other institutions, has developed Biomni, the first general biomedical AI agent, which can autonomously perform a wide range of research tasks across different biomedical subfields. The team also built the first unified agent environment by mining the necessary tools, databases and protocols from tens of thousands of publications in 25 biomedical fields. Systematic benchmarks show that Biomni achieves strong generalization on heterogeneous biomedical tasks without any task-specific prompt tuning.

View the full report: https://go.hyper.ai/VHpMD

4. From architectural features to ecosystem building, Dong Zhaohua of MetaX (Muxi) analyzes in depth the application of TVM on domestic GPUs

On July 5, the 7th Meet AI Compiler Technology Salon hosted by HyperAI came to a successful conclusion. Dong Zhaohua, senior director at MetaX (Muxi) Integrated Circuits, shared in depth how TVM is applied on MetaX GPUs. He introduced the technical characteristics of the company's GPU products, its TVM compiler adaptation solutions, real application cases, and its vision for ecosystem building, and demonstrated the technological breakthroughs and application potential of domestic GPUs in high-performance computing and AI.

View the full report: https://go.hyper.ai/rxxX3

5. NVIDIA achieves breakthrough in atomic-level protein design, generating proteins with up to 800 residues with high precision

NVIDIA's research team, in collaboration with Mila, the Quebec Artificial Intelligence Institute in Canada, proposed La-Proteina, an atomic-level protein design method based on partially latent flow matching. It addresses a key challenge in protein generation, namely the variable dimensionality of explicit side-chain representations, bringing new breakthroughs to the field of protein design.

View the full report: https://go.hyper.ai/0Sw8R

Popular Encyclopedia Articles

1. DALL-E

2. Reciprocal Rank Fusion (RRF)

3. Pareto Front

4. Massive Multitask Language Understanding (MMLU)

5. Contrastive Learning

We have compiled hundreds of AI-related terms to help you understand "artificial intelligence":

https://go.hyper.ai/wiki

Top conferences with deadlines in July

July 11 7:59:59 POPL 2026

July 15 7:59:59 SODA 2026

July 18 7:59:59 SIGMOD 2026

July 19 7:59:59 ICSE 2026

One-stop tracking of top AI academic conferences: https://go.hyper.ai/event

That is all for this week's editor's selection. If you have resources that you would like to see included on the hyper.ai official website, feel free to leave a message or submit an article to let us know!

See you next week!