Breaking the 10,000-Word Barrier for Long-Text Output! Tsinghua University Open-Sources the LongWriter-6k Dataset; Deadlines Are Approaching for 7 CCF Class A Top Conferences

Although today's long-context large models can process massive text inputs, they struggle to generate long outputs because their training data contains few long-output examples. To solve this problem, a research team from Tsinghua University built the LongWriter-6k dataset, which can extend the maximum output window of large models to more than 10,000 words!
Models trained on LongWriter-6k can write very long novels with intricate, twisting plots that immerse readers in a grand literary world. In academic research, they can also generate detailed research reports and paper reviews, providing rich reference material for researchers.
The hyper.ai official website has now launched the "LongWriter-6k long context output dataset", which also supports online use. Scroll down to get the link.
Updates on the hyper.ai official website from August 19 to August 23:
* High-quality public datasets: 10
* Selected high-quality tutorials: 2
* Selected community articles: 2
* Popular encyclopedia entries: 5
* Top conferences with deadlines in September: 7
Visit the official website: hyper.ai
Selected public datasets
1. LongWriter-6k long context output dataset
The dataset contains 6,000 supervised fine-tuning (SFT) examples with output lengths ranging from 2k to 32k words, in both English and Chinese. It can be used to train LLMs and extend their maximum output window to more than 10,000 words.
Direct use: https://go.hyper.ai/77byR
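As a rough illustration of what an output-length range like "2k to 32k words" means for SFT data, the sketch below filters a list of records by the word count of their outputs. The record layout (`prompt` and `output` keys), the function names, and the sample records are assumptions made for illustration; they are not the dataset's actual schema. Whitespace splitting is only a proxy for English word counts and does not apply to Chinese text.

```python
from typing import Dict, List


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for English words only)."""
    return len(text.split())


def filter_by_output_length(
    records: List[Dict[str, str]],
    min_words: int = 2_000,
    max_words: int = 32_000,
) -> List[Dict[str, str]]:
    """Keep records whose 'output' field falls inside the word-count range.

    The 'output' key is a hypothetical field name, not the dataset's real schema.
    """
    return [
        r for r in records
        if min_words <= word_count(r["output"]) <= max_words
    ]


# Hypothetical sample records with synthetic outputs of known length.
sample = [
    {"prompt": "Write a short note.", "output": "word " * 500},
    {"prompt": "Write a long report.", "output": "word " * 2_500},
    {"prompt": "Write a very long novel.", "output": "word " * 40_000},
]

kept = filter_by_output_length(sample)
print([r["prompt"] for r in kept])
```

Only the second record falls inside the range here, since the first is too short and the third is too long.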
2. EVOBC Oracle-Bone Script Evolution Dataset
The dataset contains ancient texts from six historical periods that researchers systematically collected from authoritative documents and websites, and consists of 229,170 images representing 13,714 different character categories.
Direct use: https://go.hyper.ai/oe5fU
3. HUST-OBS Oracle Bone Recognition Dataset
The dataset contains over 140,000 images from 3 different sources (books, websites, and existing databases), making it one of the largest datasets to date for oracle bone script (OBS) recognition and decipherment.
Direct use: https://go.hyper.ai/bXxx1
4. Alpaca-Cleaned instruction fine-tuning dataset
The Alpaca-Cleaned dataset is a cleaned version of the original Alpaca dataset released by Stanford University in 2023. It fixes several issues in the original Alpaca, such as hallucinated answers, merged instructions, empty outputs, and inconsistent input fields, thereby improving the quality and consistency of the data.
Direct use: https://go.hyper.ai/yNlAa
5. AI Medical Chatbot Medical Conversation Dataset
This is an experimental dataset designed for building medical chatbots. It contains 256,916 conversations between patients and doctors.
Direct use: https://go.hyper.ai/kaGzv
6. Openstory++ Large-Scale Image Instance Dataset
Openstory++ is designed to address the difficulty that existing image generation models have in maintaining instance consistency across long text contexts. It combines instance-level annotations of images and text, providing a rich resource for training models to generate highly consistent images from long text.
Direct use: https://go.hyper.ai/no3E7
7. MedTrinity-25M Large-Scale Multimodal Medical Dataset
MedTrinity-25M contains more than 25 million medical images covering 10 imaging modalities, with annotations for more than 65 diseases. Beyond rich global and local annotations, the dataset integrates multi-level annotations across modalities such as CT, MRI, and X-ray. It supports multimodal tasks such as medical image processing, report generation, classification, and segmentation, and can help advance the pre-training of medical AI models.
Direct use: https://go.hyper.ai/JCSJP
8. 1920 Raider Waite Tarot Tarot Image Dataset
This dataset contains images and related text descriptions of 78 cards from the original Rider-Waite Tarot Deck, providing researchers and artists with a rich resource for exploring the art and symbolism of tarot cards, and can be used to train models to generate tarot-style images.
Direct use: https://go.hyper.ai/8bd2R
9. Waterloo Exploration Large-Scale Image Quality Assessment Database
The database contains 4,744 original natural images and 94,880 distorted images created from these original images, which can be used to test the generalization ability of image quality assessment models.
Direct use: https://go.hyper.ai/m5mhN
10. SWE-bench Verified Code Generation Evaluation Benchmark Dataset
The benchmark is an improved version (subset) of the existing SWE-bench, designed to more reliably evaluate the ability of AI models to solve real-world software problems.
Direct use: https://go.hyper.ai/oxOBY
For more public datasets, please visit:
Selected Public Tutorials
1. ComfyUI AuraFlow Text-to-Image Workflow Demo
AuraFlow achieves state-of-the-art results on GenEval, with higher processing efficiency and finer detail on text-to-image tasks. This tutorial uses ComfyUI to deploy the AuraFlow text-to-image model. The model and its environment are already configured, so you can clone the tutorial with one click and run inference.
Direct use: https://go.hyper.ai/KpI4B
2. Whisper Web Online Speech Recognition Tool
Whisper is a machine-learning-based speech recognition model that can be accelerated with WebGPU. It accepts online or local audio files as well as live recordings in more than 100 languages. The recognized text can be exported in TXT or JSON format and translated directly into English. This tutorial is based on the open-source GitHub project Whisper Web, which runs Whisper directly in the browser.
Direct use: https://go.hyper.ai/N3iwm
Community Articles
Lv Haiquan, Sun Rong, and Zhang Kai from Shandong University and Mei Qi from Shanxi Medical University, together with Helix Matrix and other research teams, recently used machine learning and mRNA analysis to develop the BCSC signature, a new method for evaluating cancer stem cell characteristics in samples from primary breast cancer patients. This article offers a detailed interpretation of the research paper.
View the full report: https://go.hyper.ai/SPAjK
At the AI for Bioengineering Summer School of Shanghai Jiao Tong University, Dr. Zhou Bingxin of Shanghai Jiao Tong University gave a talk titled "Graph Neural Networks and Protein Structure Representation", covering the definition, advantages, and cutting-edge applications of graph neural networks in protein prediction and generation. This article is a transcript of the highlights of Dr. Zhou's talk.
View the full report: https://go.hyper.ai/GjXi5
A research team at Zhejiang University proposed InstructProtein, which uses knowledge instructions to align protein language with human language, demonstrating that biological sequences can be integrated into large language models. This article offers a detailed interpretation of the research paper.
View the full report: https://go.hyper.ai/GjXi5
Popular Encyclopedia Articles
1. Paired t-Test
2. Reciprocal Rank Fusion (RRF)
3. Pareto Front
4. Variational Autoencoder (VAE)
5. Data Augmentation
We have compiled hundreds of AI-related terms to help you understand "artificial intelligence":

One-stop tracking of top AI academic conferences: https://go.hyper.ai/event
That is all for this week's editor's selection. If you have resources you would like to see included on the hyper.ai official website, you are welcome to leave a message or submit an article to let us know!
See you next week!
About HyperAI
HyperAI (hyper.ai) is the leading artificial intelligence and high-performance computing community in China. We are committed to becoming the infrastructure in the field of data science in China and providing rich and high-quality public resources for domestic developers. So far, we have:
* Provided domestic accelerated download nodes for 1,300+ public datasets
* Included 400+ classic and popular online tutorials
* Interpreted 100+ AI4Science paper cases
* Supported search for 500+ related terms
* Hosted the first complete Apache TVM Chinese documentation in China
Visit the official website to start your learning journey: