FPT and NVIDIA’s Vietnamese AI Dataset Breaks into Global Top 15 Trending Rankings

08/06/2026

Just four days after its release, Nemotron-Personas-Vietnam, a dataset jointly developed by FPT Corporation and NVIDIA, entered the Top 15 Trending Datasets on Hugging Face, the world's leading open-source platform for sharing AI models and datasets.

A "Vietnamese Personas" Dataset for AI Development

On Hugging Face, trending rankings reflect the level of community interest in a resource, typically measured through downloads, likes, and user engagement. The appearance of Nemotron-Personas-Vietnam in the Top 15 Trending list points to growing international interest in datasets specifically designed for the Vietnamese language and local context, particularly as many countries accelerate efforts to develop sovereign AI capabilities.

Nemotron-Personas-Vietnam is not a large language model itself but rather a foundational dataset that supports AI development. Simply put, if an AI model is the "brain" responsible for processing and generating language, the dataset serves as part of the "learning material" that helps train, fine-tune, and evaluate the model more effectively.

The dataset is built around Vietnamese-language personas: synthetic "character profiles" designed to reflect the diversity of Vietnamese people across daily life, education, careers, and personal interests. These personas do not represent real individuals. Instead, they are AI-generated synthetic data created using statistical distributions and validation methods intended to closely mirror the realities of Vietnamese society.
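To illustrate what "created using statistical distributions" can mean in practice, the following is a minimal Python sketch that draws persona attributes from weighted distributions. It is not the pipeline FPT and NVIDIA used, which the article does not describe. The attribute names, category values, and weights are all made up for illustration, and the sketch samples each attribute independently, whereas a realistic generator would likely condition attributes on one another.

```python
import random

# Made-up marginal distributions for illustration only.
# These values are NOT taken from Nemotron-Personas-Vietnam.
DISTRIBUTIONS = {
    "age_group": {"18-29": 0.30, "30-44": 0.35, "45-59": 0.20, "60+": 0.15},
    "education_level": {"secondary": 0.50, "vocational": 0.20, "university": 0.30},
    "marital_status": {"single": 0.40, "married": 0.55, "other": 0.05},
}


def sample_profile(rng):
    """Draw one value per attribute according to its weights."""
    profile = {}
    for attribute, distribution in DISTRIBUTIONS.items():
        values = list(distribution)
        weights = list(distribution.values())
        profile[attribute] = rng.choices(values, weights=weights, k=1)[0]
    return profile


# A seeded generator makes the draw reproducible.
rng = random.Random(0)
profiles = [sample_profile(rng) for _ in range(5)]

for profile in profiles:
    print(profile)
```

With enough samples, the share of each value approaches its weight, which is how a synthetic population can be made to track a target demographic breakdown.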

The public release of Nemotron-Personas-Vietnam includes 100,000 records representing approximately 900,000 Vietnamese personas, or about nine personas per record. It totals 118 million tokens, of which 52 million are persona tokens. Tokens can be understood as the smallest units an AI model uses to read and process language. A dataset of 118 million tokens provides a substantial textual foundation for developers seeking to generate training data, fine-tune models, or evaluate Vietnamese-language AI systems.
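For readers who want a sense of scale, the short Python sketch below derives a few averages from the figures quoted above. The inputs are the approximate numbers reported in this article; the variable names are chosen here for illustration, and the results are rough averages rather than properties of any single record.

```python
# Approximate figures as reported in the article.
records = 100_000
personas = 900_000
total_tokens = 118_000_000
persona_tokens = 52_000_000

# Derived averages.
personas_per_record = personas / records
tokens_per_record = total_tokens / records
persona_tokens_per_persona = persona_tokens / personas
persona_share = persona_tokens / total_tokens

print(f"Personas per record:        {personas_per_record:.0f}")
print(f"Tokens per record:          {tokens_per_record:.0f}")
print(f"Persona tokens per persona: {persona_tokens_per_persona:.0f}")
print(f"Persona share of tokens:    {persona_share:.0%}")
```

On these figures, each record averages about 1,180 tokens, and persona text accounts for a little under half of the total.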

Each record contains multiple attributes, including occupation, skills, career goals, interests in sports, arts, travel, and cuisine, as well as age, gender, education level, marital status, region, and locality. This multidimensional structure enables developers to filter, segment, and generate datasets tailored to specific user groups, industries, or application scenarios.
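As a rough illustration of filtering and segmenting by attribute, here is a minimal Python sketch over a handful of invented records. The field names (occupation, age, region, locality) and all values are hypothetical stand-ins based on the attribute list above; the actual column names in the published dataset may differ.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only
# and do not come from the published dataset.
records = [
    {"occupation": "teacher", "age": 34, "region": "north", "locality": "Hanoi"},
    {"occupation": "engineer", "age": 28, "region": "south", "locality": "Ho Chi Minh City"},
    {"occupation": "farmer", "age": 51, "region": "south", "locality": "Can Tho"},
    {"occupation": "teacher", "age": 45, "region": "north", "locality": "Hai Phong"},
]


def filter_records(rows, **criteria):
    """Keep rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


def segment_by(rows, field):
    """Count rows per distinct value of the given field."""
    return Counter(row.get(field) for row in rows)


northern_teachers = filter_records(records, region="north", occupation="teacher")
by_region = segment_by(records, "region")

print(len(northern_teachers), "northern teachers")
print(dict(by_region))
```

The same pattern extends to any combination of attributes, for example selecting personas in one locality and one age band to seed synthetic conversations for a specific service.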

The dataset covers six centrally governed cities and provinces (Hanoi, Ho Chi Minh City, Hai Phong, Da Nang, Can Tho, and Dong Nai), based on Vietnam's updated administrative boundaries following the 2025 reorganization.

Nemotron-Personas-Vietnam is publicly available on Hugging Face under the CC BY 4.0 license, allowing both commercial and non-commercial use with proper attribution. This gives researchers, startups, enterprises, and the Vietnamese AI community access to a foundational resource for experimenting with, training, fine-tuning, and evaluating AI systems.

Advancing Sovereign AI for Vietnam

With Nemotron-Personas-Vietnam, developers gain access to a dataset that more accurately reflects the characteristics of Vietnamese people. This enables the generation of additional synthetic data, helps reduce bias during model training, and improves the diversity and relevance of responses produced by Vietnamese AI systems.

This represents an important step toward building AI that not only "speaks Vietnamese" but also better understands Vietnamese people, Vietnamese society, and Vietnam-specific challenges.

Associate Professor Dr. Ngo Xuan Bach, Director of AI Products at FPT Smart Cloud and Director of the Quantum AI & Cyber Security Institute at FPT Corporation, stated: "FPT believes that sovereign AI must be built from the ground up to reflect local language, culture, and economic realities. The Nemotron-Personas-Vietnam dataset demonstrates our commitment to helping local AI developers gain easier access to the resources needed to build AI solutions tailored for Vietnamese users while remaining scalable to regional markets."

The collaboration between FPT and NVIDIA is driven by a shared goal of providing the AI community with efficient open models, datasets, and development libraries. These resources enable developers to build AI systems that better reflect the language, culture, regulations, data infrastructure, and economic priorities of individual countries rather than relying entirely on global generalized models.

Within this collaboration, NVIDIA contributed its open-model framework, the NVIDIA NeMo Data Designer synthetic data library, and the Nemotron-Personas methodology. This structured approach enables the creation of large-scale synthetic datasets capable of reflecting demographic, geographic, and contextual characteristics specific to individual countries.

FPT contributed local expertise, domain knowledge, data validation capabilities, data infrastructure, and AI research capabilities through FPT Smart Cloud, the Quantum AI & Cyber Security Institute, and FPT DC5.

Globally, persona datasets are becoming an increasingly important approach in AI development, particularly for models that require diverse synthetic data, reduced bias, and better representation of user contexts. Under the Nemotron-Personas initiative, NVIDIA has already developed persona datasets for several countries and regions, including the United States, Japan, India, Singapore, Brazil, and France.

Most widely used AI models today are trained primarily on English-language data and Western contexts. When deployed in Vietnam, these models may not fully understand differences in language, culture, occupations, regional characteristics, communication styles, and the practical needs of Vietnamese users. As a result, responses can sometimes appear unnatural, inaccurate, or insufficiently adapted to local contexts.

The inclusion of Nemotron-Personas-Vietnam among Hugging Face's trending datasets highlights the growing importance of localized data in the global AI race. For Vietnam, it represents a practical step toward expanding resources available to the technology community, supporting businesses and researchers in developing AI systems that better understand Vietnamese users, serve local needs more effectively, and scale across the region.