NVIDIA Launches Nemotron-Personas-Brazil: Open Synthetic Dataset for Sovereign AI in Brazilian Portuguese

Nemotron-Personas-Brazil is an open dataset of 6 million fully synthetic personas designed to support sovereign AI development in Brazil. Built in collaboration with WideLabs, an NVIDIA Inception member, the dataset is grounded in real demographic, geographic, and occupational distributions from official data published by the Brazilian Institute of Geography and Statistics (IBGE). It is released under the CC BY 4.0 license, which permits both commercial and research use with attribution.

The dataset was created with NVIDIA’s NeMo Data Designer, a compound AI system for synthetic data generation. The personas are generated to match the statistics of Brazil’s diverse population without representing any real individual, which keeps the data statistically faithful while protecting privacy. Each persona is written in natural Brazilian Portuguese and includes detailed attributes such as age, gender, education, occupation, location, cultural background, skills, goals, hobbies, and interests.

To capture Brazil’s socio-demographic diversity, the dataset incorporates data at the state and municipality level, covering the country’s five macro-regions. It includes not only traditional job titles but also skills and career paths relevant to micro-entrepreneurs and regional trades. Life stages such as student status, unemployment, and retirement are also represented to mirror real population dynamics. Cultural traits such as preferences in arts, sports, and travel are included to reflect local social norms and lifestyles.

The dataset is designed with privacy in mind: no personally identifiable information is included. All data is synthetically generated from public statistics, so developers can train AI models on authentic cultural patterns without using any real person’s data.

Nemotron-Personas-Brazil is intended for Brazilian developers, researchers, and organizations building AI systems that serve local populations. It helps bridge the gap left by English-centric training data and supports the creation of AI that is culturally relevant and linguistically accurate. Global developers can also use the dataset to improve model performance in Brazilian Portuguese contexts.

The dataset is part of NVIDIA’s expanding Nemotron-Personas Collection, which includes similar datasets for the USA, Japan, India, and Singapore. An enhanced version will soon be available directly within NeMo Data Designer, allowing developers to generate, customize, and extend Brazilian personas as part of their own data pipelines.

The release promotes data democratization by providing enterprise-grade synthetic data at no cost, lowering barriers for startups, researchers, and developers in underrepresented regions. It is meant to enable AI systems that are more inclusive, accurate, and aligned with real-world populations.

The dataset is available on Hugging Face. After importing load_dataset from the Python datasets library, it can be loaded with: dataset = load_dataset("nvidia/nemotron-personas-brazil") For more information or to participate in future dataset development, interested parties can join the discussion on NVIDIA’s Discord.
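
The load_dataset call quoted above comes from the Hugging Face datasets library. The following is a minimal sketch, not NVIDIA's documented workflow, of how one might stream a small sample and tally a demographic field. The helper names load_personas and count_by are hypothetical, the "train" split is an assumption, and the article does not list the dataset's column names, so print a row's keys before choosing one.

```python
from collections import Counter
from itertools import islice


def load_personas(limit=1000):
    """Stream the first `limit` rows from Hugging Face.

    Requires `pip install datasets` and network access.
    """
    # Imported lazily so count_by() below works without the library installed.
    from datasets import load_dataset

    # Assumption: the repository exposes a "train" split.
    # streaming=True avoids downloading all 6 million personas up front.
    stream = load_dataset(
        "nvidia/nemotron-personas-brazil", split="train", streaming=True
    )
    return list(islice(stream, limit))


def count_by(rows, column):
    """Count how often each value of `column` appears, skipping rows without it."""
    return Counter(row[column] for row in rows if column in row)
```

A typical session would call rows = load_personas(1000), inspect sorted(rows[0].keys()) to see the real column names, and then pass one of them to count_by(rows, ...) to check how the sample is distributed across, say, states or age groups.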
