Date

2 years ago

Size

56.03 MB

Organization

Paper URL

arxiv.org

Dataset Introduction

This dataset is a collection of 1 billion diverse personas automatically curated from web data, released by Tencent Seattle AI Lab in 2024. These 1 billion personas (about 13% of the world's total population) act as distributed carriers of world knowledge and can tap into almost every perspective encapsulated within an LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in the large-scale synthesis of high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs, and tools (functions), the research team demonstrated that persona-driven data synthesis is versatile, scalable, flexible, and easy to use. The approach has the potential to drive a paradigm shift in synthetic data creation and its practical applications, which may have a profound impact on LLM research and development. The accompanying paper is "Scaling Synthetic Data Creation with 1,000,000,000 Personas".

Dataset Background

Tencent Seattle AI Lab has introduced a novel persona-driven data synthesis approach that draws on the many perspectives within large language models (LLMs) to create diverse synthetic data. The researchers built a collection called "Persona Hub", which contains 1 billion different personas (about 13% of the world's total population) automatically curated from web data. These personas, as distributed carriers of world knowledge, can tap into almost every perspective contained in an LLM, thereby facilitating the creation of diverse synthetic data at scale for a variety of scenarios. The technical report also discusses the broad implications and ethical issues that may arise from the use of Persona Hub, such as training data security, threats to the leading position of existing LLMs, and the possibility of simulating real society in virtual worlds.
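To make the idea of persona-driven synthesis concrete, the following is a minimal sketch of how a persona can be combined with a task instruction to form a data synthesis prompt. The template wording, the function names `build_prompt` and `build_prompts`, and the sample personas are all illustrative assumptions; they are not taken from the Persona Hub paper or its released files. The sketch only builds prompt strings and does not call any LLM.

```python
from typing import Dict, Iterable, List

# Hypothetical task templates, one per use case named in the description
# (math problems, instructions, knowledge-rich texts, game NPCs, tools).
# The wording is an assumption made for illustration only.
TEMPLATES: Dict[str, str] = {
    "math": "Create a challenging math problem that the following persona might pose.",
    "instruction": "Guess a prompt that the following persona might ask an LLM assistant.",
    "knowledge": "Write a knowledge-rich article on a topic the following persona knows well.",
    "npc": "Design a game NPC based on the following persona.",
    "tool": "Define a tool (function) that the following persona would find useful.",
}


def build_prompt(task: str, persona: str) -> str:
    """Combine a task template with one persona description."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    persona = persona.strip()
    if not persona:
        raise ValueError("persona description must not be empty")
    return f"{TEMPLATES[task]}\n\nPersona: {persona}"


def build_prompts(task: str, personas: Iterable[str]) -> List[str]:
    """Build one prompt per persona; different personas yield different prompts."""
    return [build_prompt(task, p) for p in personas]


if __name__ == "__main__":
    # Made-up personas, used only to show the shape of the output.
    sample_personas = [
        "a chemical kinetics researcher",
        "a moving company driver who plans long-distance routes",
    ]
    for prompt in build_prompts("math", sample_personas):
        print(prompt)
        print("-" * 40)
```

Each resulting prompt would then be sent to an LLM of the reader's choice. Because the persona changes while the task template stays fixed, the same instruction can in principle yield a different output for every persona, which is where the diversity and scale of this approach come from.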

PersonaHub.torrent

Seeding: 1 | Downloading: 0 | Completed: 245 | Total Downloads: 311

PersonaHub/
- README.md
  2.42 KB
- README.txt
  4.83 KB

This dataset is contributed by community users and is intended for educational and informational purposes only. If any content involves copyright infringement, please contact us at [email protected] for prompt review and removal.

Related Datasets

ToolACE Complex Tools Learning Dialogue Dataset

3 months ago

Nemotron Personas France (French Synthetic Personas Dataset)

3 months ago

CHIMERA General Inference Synthetic Dataset

8 days ago

Nemotron-Personas-Brazil Brazilian Synthetic Character Dataset

8 days ago

CL-bench Context Learning Evaluation Benchmark Dataset

4 months ago

RoVid-X Robot Video Generation Dataset

8 days ago

Patient Segmentation Dataset

5 months ago

TransPhy3D Transparent Reflection Synthesis Video Dataset

5 months ago

Nemotron-Math-v2 Mathematical Inference Dataset

8 days ago

TxT360-3efforts Multi-Task Inference Dataset

8 days ago

Persona Hub: A Dataset of 1 Billion Different Personas Automatically Curated From Web Data
