HyperAI

Cut Training Costs in Half! OmniConsistency Achieves SOTA Results With 2.6k Images; Wan2.1-VACE-14B Unlocks a New Dimension of Video Generation

Featured image

As digital vision technology booms, open-source models have made significant breakthroughs in image stylization, yet a considerable gap remains between them and commercial models in stylization consistency. To break through this bottleneck, Show Lab launched OmniConsistency, a consistency plug-in built on large-scale diffusion transformers that aims to close the performance gap between open-source methods and commercial models.

OmniConsistency adopts a two-stage progressive learning strategy that decouples style learning from consistency learning, effectively alleviating style degradation. It significantly improves visual coherence and aesthetic quality, achieving performance comparable to the state-of-the-art commercial model GPT-4o.

In addition, to support model training and evaluation, the research team also constructed the OmniConsistency stylized image pair dataset. The dataset uses GPT-4o to synthesize stylized versions of input images across 22 different artistic styles and to generate descriptive text annotations for both the source and stylized images, meeting diverse creative needs.

HyperAI has now launched "OmniConsistency: GPT-4o-level character style transfer model" and the "OmniConsistency stylized image pair dataset" on its official website. Come and try them~

OmniConsistency: GPT-4o-level character style transfer model

Online use: https://go.hyper.ai/WU5fY

OmniConsistency stylized image pair dataset

Online use: https://go.hyper.ai/RxZk9

Updates on the hyper.ai official website from June 9 to June 13:

* High-quality public datasets: 10

* High-quality tutorials: 13

* This week's recommended papers: 5

* Community article interpretations: 4

* Popular encyclopedia entries: 5

* Top conferences with deadlines in June and July: 6

Visit the official website: hyper.ai

Selected public datasets

1. OpenThoughts3-1.2M Reasoning Dataset

OpenThoughts3-1.2M is an open-source reasoning dataset containing 850,000 math questions, 250,000 code questions, and 100,000 science questions; the annotations were generated with the QwQ-32B model.

Direct use: https://go.hyper.ai/1u77Q

Dataset Framework

2. OpenThoughts2-1M Reasoning Dataset

The dataset builds on OpenThoughts-114k, adding math and code reasoning data from existing datasets such as OpenR1. It contains 1 million high-quality examples covering math, science, code, and puzzles. The OpenThinker2 model trained on this dataset performs comparably to the DeepSeek-R1-Distill models.

Direct use: https://go.hyper.ai/FK1Z3

Data Structure

3. OmniConsistency stylized image pair dataset

OmniConsistency is a large-scale multi-style image pair dataset that focuses on image stylization and cross-modal consistency learning, aiming to provide standardized resources for image generation, style transfer, and multimodal model training. The dataset covers 22 different art styles such as cartoons, oil paintings, traditional art, pixel art, etc., to meet diverse creative needs.
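
For readers who want to explore the data programmatically, here is a minimal sketch using the Hugging Face `datasets` library; the repository ID and column names are assumptions for illustration and are not confirmed by this post, so check the dataset page for the actual schema.

```python
from datasets import load_dataset

# Illustrative sketch only: the repo ID and column names are assumptions,
# not confirmed details from this post. Streaming avoids a full download.
ds = load_dataset("showlab/OmniConsistency", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # expect paired source/stylized images plus style and caption fields
```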

Direct use: https://go.hyper.ai/RxZk9

4. Nemotron-Personas character dataset

The dataset contains synthetic personas grounded in real-world demographic, geographic, and personality-trait distributions, designed to capture the diversity and richness of the population. It is the first dataset of its kind whose statistics are aligned with attributes such as name, gender, age, background, marital status, education, occupation, and place of residence.

Direct use: https://go.hyper.ai/uwpRH

5. VCBench Mathematical Reasoning Benchmark Dataset

VCBench is a benchmark dataset designed for evaluating multimodal mathematical reasoning with explicit visual dependencies. The dataset contains 1,720 question-answer pairs and a total of 6,697 images.

Direct use: https://go.hyper.ai/4Ck1t

6. AudioTrust audio benchmark dataset

AudioTrust is a large-scale audio-text benchmark and the first multi-dimensional trustworthiness evaluation benchmark tailored for audio large language models (ALLMs), focusing on evaluating their credibility across multiple dimensions.

Direct use: https://go.hyper.ai/WgJSW

7. LEXam Legal Reasoning Benchmark Dataset

The dataset contains 340 real legal exams from different courses and levels (undergraduate and master's) from the Law School of the University of Zurich, Switzerland, covering Swiss, European and international law, as well as legal theory and legal history. The dataset has a total of 4,886 questions, including 2,841 long-answer questions and 2,045 multiple-choice questions.

Direct use: https://go.hyper.ai/qYpoL

8. ReasonMap traffic graph reasoning benchmark dataset

ReasonMap emphasizes spatial relationships and route reasoning in images. It is the first multimodal reasoning benchmark focused on high-resolution transportation maps (mainly subway maps) and is designed to evaluate the ability of large models to understand fine-grained structured spatial information in images.

Direct use: https://go.hyper.ai/5ejzs

9. Chinese-LiPS multimodal speech recognition dataset

As the first Chinese multimodal speech recognition dataset that combines "lip reading information + slide semantic information", Chinese-LiPS covers complex contexts such as Chinese explanations, popular science, teaching, and knowledge dissemination, and is committed to promoting the development of Chinese multimodal speech recognition technology.

Direct use: https://go.hyper.ai/uaDMt

10. Brain Tumor Dataset

This brain tumor MRI segmentation and classification dataset aims to provide high-quality data support for medical imaging analysis of brain tumors and is suitable for both segmentation and classification tasks. It contains about 5,000 MRI slices.

Direct use: https://go.hyper.ai/8qq5w

Selected Public Tutorials

This week we have compiled 5 categories of quality public tutorials:

* Video generation tutorials: 3

* Image processing tutorials: 3

* Speech generation tutorials: 2

* Large model deployment tutorials: 2

* AI for Science tutorials: 2

Video Generation Tutorial

1. ComfyUI HunyuanCustom video generation workflow tutorial

HunyuanCustom is a multimodal custom video generation framework: a multimodal, conditionally controllable generation model built on the Hunyuan Video generation framework, with subject consistency at its core. It supports generating subject-consistent videos conditioned on text, image, audio, and video inputs, and its multimodal capabilities enable many downstream tasks.

This tutorial uses a single RTX 4090 card as the resource, and the video generation takes about 10 minutes. It is recommended to use a GPU with 80GB of memory for better generation quality.

Run online: https://go.hyper.ai/Vw6bJ

Demo Example

2. ComfyUI Wan2.1-VACE-14B Image to Video Workflow Tutorial

The model is trained on the Tongyi Wanxiang V2.1 base and is the industry's first video AI tool to support flexible combinations of multiple tasks, covering the entire pipeline from video generation to fine-grained editing in one place. It supports text-to-video, image-to-video, first-and-last-frame-to-video, and more.

This tutorial uses a single A6000 card; generating a video takes about 30 minutes. We recommend using more computing power if available.

Run online: https://go.hyper.ai/4ULKi

3. Vchitect-2.0 Video Diffusion Model Demo

The model uses an innovative parallel Transformer architecture design with 2 billion parameters and can generate smooth, high-quality video content based on text prompts.

This tutorial uses a single A6000 card as the resource and can be deployed with one click to generate custom videos.

Run online: https://go.hyper.ai/r6OC2

Image Processing Tutorial

1. JoyCaption Beta 1 Captioning Visual Language Model Demo

The model covers a wide range of image styles, content, races, genders, and orientations, with minimal filtering so that it understands all aspects of the world, although illegal content is not supported. Users can choose from a variety of modes and prompts to generate descriptive captions for different application scenarios, such as social media posts and product listings.

This tutorial uses a single RTX 4090 card as the resource. Enter the link to generate captions that closely match your content~

Run online: https://go.hyper.ai/13wrE

2. Describe Anything Model Demo

The model is able to generate detailed descriptions based on user-specified regions (points, boxes, scribbles, or masks). For video content, a complete description can be obtained by simply annotating the region on any frame.

This tutorial uses a single RTX 4090 card as the resource. Deploy it with one click and simply click the region you want described.

Run online: https://go.hyper.ai/aitMs

3. OmniConsistency: GPT-4o-level character style transfer model

OmniConsistency significantly improves visual coherence and aesthetic quality, achieving performance comparable to the state-of-the-art commercial model GPT-4o. It bridges the gap between open-source and commercial models in style consistency, provides a low-cost, highly controllable solution for AI creation, and promotes the democratization of image generation technology. Its compatibility and plug-and-play design also lower the barrier for developers and creators.
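
As a rough illustration of the plug-and-play idea, the sketch below assumes the consistency module ships as LoRA-style weights that can be attached to a FLUX diffusion-transformer pipeline via `diffusers`; the repository IDs and weight filename are illustrative assumptions, and the official tutorial or inference code may differ (for example, it conditions on a reference image through a custom pipeline).

```python
import torch
from diffusers import FluxPipeline

# Hedged sketch: load a FLUX diffusion-transformer pipeline and attach a
# consistency LoRA on top. Repo IDs and the weight filename are assumptions,
# not confirmed by this post.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
    "showlab/OmniConsistency", weight_name="OmniConsistency.safetensors"
)  # consistency plug-in (assumed distribution format)

image = pipe(
    "a portrait of a girl in American cartoon style, same pose and layout as the reference",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("stylized.png")
```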

This tutorial uses a single RTX A6000 card as the compute resource. Enter the link to start your own personalized creations~

Run online: https://go.hyper.ai/WU5fY

Demo Example

Speech Generation Tutorial

1. Stable-audio-open-small: Audio generation model demo

Stable-audio-open-small focuses on efficiently creating high-quality short audio content. Built on advanced diffusion model technology, it lets users quickly generate diverse audio such as music clips, sound effects, and ambient sounds (e.g., drum loops, melodic fragments, or natural soundscapes) from text prompts, making it suitable for music production, game development, film and television scoring, and other scenarios.

This tutorial uses a single A6000 card as the resource; deploy it with one click to make your own music!

Run online: https://go.hyper.ai/jl9Y3

2. Chatterbox TTS: Speech Synthesis Demo

Chatterbox is the first open-source TTS model to support exaggerated emotion control. It is based on the LLaMA architecture with 500 million parameters and was trained on more than 500,000 hours of curated audio data. It supports multi-language and multi-timbre generation, with performance exceeding that of closed-source systems such as ElevenLabs. One of its core features is zero-shot voice cloning, which can generate highly realistic personalized voices from only 5 seconds of reference audio, with no complex training process required.
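
Below is a minimal sketch of the zero-shot cloning flow using the open-source `chatterbox-tts` package; the class names, arguments, and reference file are assumptions based on the released project and may differ from this tutorial's exact environment.

```python
import torchaudio
from chatterbox.tts import ChatterboxTTS  # pip install chatterbox-tts (assumed package/module names)

# Hedged sketch of zero-shot voice cloning from a short reference clip.
model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "Hello! This is a cloned voice speaking.",  # prompts are English-only per the tutorial
    audio_prompt_path="reference_5s.wav",       # ~5 seconds of reference audio (hypothetical file)
)
torchaudio.save("cloned_voice.wav", wav, model.sr)
```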

This tutorial uses a single RTX 4090 card as the compute resource. The model currently supports English prompts only. Come and clone your own voice with one click.

Run online: https://go.hyper.ai/KAF8m

Large Model Deployment Tutorial

1. One-click deployment of DeepSeek-R1-0528-Qwen3-8B

The model has 8 billion parameters. By distilling the complex reasoning capabilities of DeepSeek-R1-0528 into the smaller Qwen3-8B base model, it combines Qwen3's multilingual capabilities with DeepSeek-R1's reasoning optimizations. Its performance is comparable to GPT-4, and it supports efficient single-card deployment, making it an ideal choice for academic and enterprise applications.
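
A minimal single-GPU inference sketch with `transformers` is shown below; the Hugging Face repo ID is inferred from the model name in this post and should be treated as an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: single-card inference with the distilled 8B model.
# The repo ID is an assumption inferred from the model name above.
model_id = "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```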

This tutorial uses a single RTX 4090 card as the compute resource. Enter the link to deploy this reasoning-enhanced model with one click.

Run online: https://go.hyper.ai/UnQEa

2. vLLM+Open WebUI deploys AM-Thinking-v1 dense language model

AM-Thinking-v1 is a 32B dense language model focused on enhancing reasoning capabilities. The model demonstrates strong performance on reasoning benchmarks, comparable to large MoE models such as DeepSeek-R1, Qwen3-235B-A22B, Seed1.5-Thinking, and larger dense models such as Nemotron-Ultra-253B-v1.
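
As a rough sketch of the dual-GPU setup, the snippet below uses vLLM's offline Python API with tensor parallelism across two cards; the Hugging Face repo ID is an assumption inferred from the model name, and the tutorial's actual stack additionally places Open WebUI in front of a vLLM server.

```python
from vllm import LLM, SamplingParams

# Hedged sketch: offline inference with vLLM, sharding the 32B dense model
# across two GPUs. The repo ID is an assumption inferred from the model name.
llm = LLM(model="a-m-team/AM-Thinking-v1", tensor_parallel_size=2)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

outputs = llm.generate(
    ["How many prime numbers are there below 100? Think step by step."], params
)
print(outputs[0].outputs[0].text)
```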

This tutorial uses dual A6000 cards as the resource. Clone it with one click to experience the 32B dense language model!

Run online: https://go.hyper.ai/mbAMu

AI for Science Tutorial

1. VASP machine learning force field fine-tuning

VASP is a computer program for atomic-scale materials modeling from first principles, covering tasks such as electronic structure calculations and quantum-mechanical molecular dynamics. In this tutorial, we generate a series of phonon spectra by continuously varying the machine learning hyperparameters and obtain the corresponding optimal machine learning force field parameter file.

Run online: https://go.hyper.ai/2DmyQ

2. VASP machine learning force field calculates silicon phonon spectrum

Phonopy is a Python toolkit for calculating phonon band structures, thermal properties, group velocities, and other phonon-related quantities at the harmonic and quasi-harmonic levels. In this tutorial, an automated script demonstrates the workflow for computing a phonon spectrum with a machine learning force field.
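
For orientation, here is a minimal phonopy sketch for plotting a phonon band structure once force sets have been produced (for example, from machine-learning-force-field runs in VASP); the file names follow phonopy defaults and are assumptions about this tutorial's directory layout.

```python
import phonopy

# Hedged sketch: load a finished phonopy calculation and plot its phonon band structure.
# File names are phonopy defaults and are assumptions about this tutorial's outputs.
phonon = phonopy.load("phonopy_disp.yaml", force_sets_filename="FORCE_SETS")
phonon.auto_band_structure(plot=True).show()  # automatic band path (requires seekpath)
```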

Run online: https://go.hyper.ai/tmnQ4

This week's paper recommendation

1. MiMo-VL Technical Report

This article introduces two open source models, MiMo-VL-7B-SFT and MiMo-VL-7B-RL, which are powerful visual language models that achieve state-of-the-art performance in general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B in 35 of the 40 tasks evaluated and scores 59.4 on OlympiadBench, surpassing models with up to 78 billion parameters. In addition, the article also contributes a set of comprehensive evaluation tools covering more than 50 tasks to promote reproducibility and advance the field.

Paper link: https://go.hyper.ai/0v2Lr

2. Will It Still Be True Tomorrow? Multilingual Evergreen Question Classification to Improve Trustworthy QA

Large language models (LLMs) often hallucinate in question answering (QA) tasks. A critical but understudied factor is the temporal nature of questions - i.e., whether the question is evergreen (the answer remains stable over time) or mutable (the answer changes over time). This paper introduces EverGreenQA, the first multilingual QA dataset with evergreen labels that supports both evaluation and training. Using EverGreenQA, we benchmark 12 modern large language models to evaluate whether they encode the temporal nature of questions explicitly (via verbal judgments) or implicitly (via uncertainty signals).

Paper link: https://go.hyper.ai/UnDRj

3. MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

This paper proposes MambaNeXt-YOLO, a new object detection framework that strikes a balance between accuracy and efficiency. Its contributions cover three aspects:

* MambaNeXt module: a hybrid design that combines convolutional neural networks (CNNs) with the Mamba state-space structure, effectively extracting local features while modeling long-range dependencies;

* Multi-branch Asymmetric Fusion Pyramid Network (MAFPN): an enhanced feature pyramid structure that improves multi-scale detection of targets of different sizes;

* Efficiency optimization for edge devices: without any pre-training, the method achieves 66.6% mAP at 31.9 FPS on the PASCAL VOC dataset and supports efficient deployment on edge devices such as NVIDIA Jetson Xavier NX and Orin NX.
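
To make the hybrid idea concrete, here is a toy PyTorch sketch of a block that pairs a depthwise-convolution branch (local features) with a Mamba state-space branch (long-range dependencies); it illustrates the general pattern under assumed dimensions and is not the paper's actual MambaNeXt module.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (CUDA required)

class ToyConvMambaBlock(nn.Module):
    """Illustrative hybrid block (not the paper's exact design): a depthwise-conv
    branch captures local features while a Mamba branch models long-range context."""
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # local spatial mixing
            nn.BatchNorm2d(dim),
            nn.SiLU(),
        )
        self.mamba = Mamba(d_model=dim)                # sequence model over flattened spatial tokens
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        global_ctx = self.mamba(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(local + global_ctx)

# Example: a feature map from one backbone stage keeps its shape through the block.
feat = torch.randn(1, 64, 40, 40, device="cuda")
out = ToyConvMambaBlock(64).cuda()(feat)
```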

Paper link: https://go.hyper.ai/FGaro

4. ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development

This paper introduces ComfyUI-Copilot, a large language model-based plugin designed to enhance the usability and efficiency of ComfyUI. At its core, ComfyUI-Copilot adopts a hierarchical multi-agent framework consisting of a central assistant agent responsible for task allocation and multiple specialized worker agents, each handling a different type of task. Results show that it can accurately recommend nodes and accelerate workflow development.

Paper link: https://go.hyper.ai/n0WyZ

5. Prot42: a Novel Family of Protein Language Models for Target-aware Protein Binder Generation

This paper proposes Prot42, a new family of protein language models pre-trained on massive amounts of unlabeled protein sequences. Prot42 uses a decoder-only architecture, draws on the latest advances in natural language processing, and can deeply capture the evolution, structure, and function of proteins, significantly expanding language-based computational protein design capabilities.

Paper link: https://go.hyper.ai/nHOJA

More AI frontier papers: https://go.hyper.ai/iSYSZ

Community article interpretation

1. With 8k-token long-sequence modeling, the protein language model Prot42 generates high-affinity binders from the target protein sequence alone

A joint research team from the Inception AI Institute in Abu Dhabi and Cerebras Systems in Silicon Valley developed Prot42, the first family of protein language models (PLMs) that relies solely on protein sequence information and does not require 3D structure input. It enables long sequence modeling and high-affinity binder generation, bringing disruptive breakthroughs to the field of protein design.

View the full report: https://go.hyper.ai/UMKY8

2. Event Preview | AMD/Muxi Integrated Circuit/ByteDance/Peking University/Shanghai Innovation Technology gathered in Beijing to explore from multiple perspectives from bottom-level compilation to scenario applications

Innovations and practices across the upstream and downstream of the AI compiler ecosystem continue to emerge, and attention to this field keeps growing! To better connect cutting-edge research with application scenarios, HyperAI will hold the 7th 2025 Meet AI Compiler Technology Salon at Garage Coffee in Beijing on July 5.

View the full report: https://go.hyper.ai/QM1xm

3. Selected for ICML 2025, Tsinghua University/Renmin University proposed the unified biomolecular dynamics simulator UniSim

The group of Professor Liu Yang at Tsinghua University and the group of Professor Huang Wenbing at the Gaoling School of Artificial Intelligence, Renmin University of China, jointly proposed UniSim, a unified time-coarsened biomolecular dynamics simulator that for the first time achieves unified time-coarsened dynamics simulation across molecular types (small molecules, peptides, proteins) and chemical environments.

View the full report: https://go.hyper.ai/gQ1ob

4. Based on 86,000 protein structures, a machine learning method combined with quantum mechanics calculations discovered 69 new nitrogen-oxygen-sulfur bonds

A team at Georg-August University developed an innovative computational biology algorithm, SimplifiedBondfinder, which systematically analyzed more than 86,000 high-resolution X-ray protein structures and discovered previously unobserved NOS bonds formed between arginine (Arg) and cysteine, and between glycine (Gly) and cysteine.

View the full report: https://go.hyper.ai/nurdR

Popular Encyclopedia Articles

1. DALL-E

2. Reciprocal Rank Fusion (RRF)

3. Pareto Front

4. Massive Multi-task Language Understanding (MMLU)

5. Contrastive Learning

Here we have compiled hundreds of AI-related terms to help you understand "artificial intelligence":

https://go.hyper.ai/wiki

Top conference deadlines in June and July

June 19 7:59:59 ICDE 2026

July 2 7:59:59 VLDB 2026

July 11 7:59:59 POPL 2026

July 15 7:59:59 SODA 2026

July 18 7:59:59 SIGMOD 2026

July 19 7:59:59 ICSE 2026

One-stop tracking of top AI academic conferences: https://go.hyper.ai/event

That's all for this week's editor's picks. If you have resources you would like to see featured on the hyper.ai official website, feel free to leave a message or submit an article to let us know!

See you next week!