Japanese CommonsenseQA Data Generator: Nemotron Personas Japan Seed
NeMo Data Designer: Japanese Commonsense Reasoning Dataset Generation
Overview
This notebook generates synthetic datasets for the following tasks using NeMo Data Designer:
- jcommonsenseqa: Japanese commonsense question answering
Seed Data: Uses nvidia/Nemotron-Personas-Japan directly as the dataset
Important: Environment Setup
- Ensure that NeMo Data Designer installation and configuration are completed
- Ensure that the local LLM server is running
Import Required Modules
Initialize NeMo Data Designer Client
Define Model Configuration
Prepare Seed Data
Load persona data from nvidia/Nemotron-Personas-Japan and
pass it to Data Designer as a pandas DataFrame.
Define Target Count and Category Breakdown
Define target of 2000 total seeds with category-specific breakdowns.
- SEED_TARGET: 2000 total seeds
- WeakA: geo(250), tools(100), public(200), other(150) = 700 total
- WeakB (weakness reinforcement): finance(400), safety(350), vocab(350) = 1100 total
- Typical: Remaining 500 seeds
- Bias suppression: Max 10 per occupation, max 12 per prefecture
Handle Missing Values
Fill missing values in required columns with empty strings. Create columns with empty strings if they don't exist.
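A minimal pandas sketch of this step; the required-column list and the toy frame are illustrative stand-ins, not the notebook's actual ones.

```python
import pandas as pd

# Illustrative subset of the persona text columns.
REQUIRED = ["persona", "professional_persona", "travel_persona"]

def ensure_text_columns(df: pd.DataFrame, cols=REQUIRED) -> pd.DataFrame:
    out = df.copy()
    for c in cols:
        if c not in out.columns:
            out[c] = ""                         # create missing columns as empty strings
        out[c] = out[c].fillna("").astype(str)  # fill missing values with empty strings
    return out

df = pd.DataFrame({"persona": ["会社員です", None]})
clean = ensure_text_columns(df)
```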
Text Construction
Combine multiple columns to construct text for classification.
- _all_text: combines all columns
- _core_text: combines core columns only (primary target for keyword matching)
- _core_len: character count of the core text
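These three columns can be sketched in pandas as follows; the column subset used here is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "persona": ["読書が趣味の会社員"],
    "professional_persona": ["経理を担当"],
    "hobbies_and_interests": ["読書"],
})
core_cols = ["persona", "professional_persona"]          # primary keyword-match target
all_cols = core_cols + ["hobbies_and_interests"]

df["_all_text"] = df[all_cols].agg(" ".join, axis=1)     # combine all columns
df["_core_text"] = df[core_cols].agg(" ".join, axis=1)   # combine core columns only
df["_core_len"] = df["_core_text"].str.len()             # character count of core text
```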
Create Duplicate Suppression Key
Create a key (_attr_key) for duplicate detection based on attribute combinations.
This prevents selecting multiple similar personas.
Exclude completely empty keys (all fields empty).
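A sketch of the key construction, assuming the attribute columns listed in the final output (occupation, prefecture, age_band); the toy frame is illustrative.

```python
import pandas as pd

attr_cols = ["occupation", "prefecture", "age_band"]

df = pd.DataFrame({
    "occupation": ["看護師", "看護師", ""],
    "prefecture": ["東京都", "東京都", ""],
    "age_band":   ["40s",    "40s",    ""],
})
# Join attribute values into a single duplicate-suppression key.
df["_attr_key"] = df[attr_cols].astype(str).agg("|".join, axis=1)

df = df[df["_attr_key"].str.strip("|") != ""]  # drop keys where every field is empty
df = df.drop_duplicates(subset="_attr_key")    # keep one persona per attribute combo
```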
Exclude by Negative Keywords
Exclude personas containing inappropriate keywords (extreme expressions, crime-related, etc.) unsuitable for JCommonsenseQA.
Evaluate _core_text and remove matching entries.
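A sketch of the exclusion step; the keyword list here is an illustrative stand-in for the notebook's actual negative-keyword list.

```python
import re
import pandas as pd

NEG_KW = ["犯罪", "暴力", "過激"]  # illustrative negative keywords
pattern = re.compile("|".join(map(re.escape, NEG_KW)))

df = pd.DataFrame({"_core_text": ["平和な日常を送る会社員", "暴力事件を扱う記者"]})
# Keep only rows whose core text matches none of the negative keywords.
df = df[~df["_core_text"].str.contains(pattern)]
```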
Define Category Keyword Dictionaries
Define keywords for transportation/movement, daily life/housework, and tools.
- geo_kw: A_Transportation/Movement (trains, stations, buses, walking, etc.)
- life_kw: F_Daily Life/Housework (cooking, cleaning, shopping, etc.)
- tools_kw: B_Tools/Usage (knives, vacuum cleaners, stationery, etc.)
Define keywords for public facilities/manners, culture/etiquette, and finance.
- public_kw: D_Public Facilities/Manners (lines, order, priority seats, etc.)
- culture_kw: D_Public Facilities/Manners (etiquette, ceremonies, etc.)
- finance_kw: C_Payment/Money (accounting, banking, card payments, etc.)
Calculate Keyword Scores
Score how many keywords from each category are contained in the persona text.
- Primarily calculate scores using _core_text
- For geo/tools only: recalculate by adding supplementary text (travel_persona, hobbies, etc.)
- This prevents depletion of the geo and tools categories and suppresses misclassification into public
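The scoring logic can be sketched as follows; the Japanese keywords are illustrative renderings of the category examples above (trains, stations, buses, walking; knives, vacuum cleaners, stationery).

```python
geo_kw = ["電車", "駅", "バス", "徒歩"]      # A_Transportation/Movement examples
tools_kw = ["包丁", "掃除機", "文房具"]      # B_Tools/Usage examples

def kw_score(text: str, keywords) -> int:
    """Count how many distinct category keywords appear in the text."""
    return sum(1 for k in keywords if k in text)

core = "毎朝、駅まで徒歩で向かい電車で通勤します"
supplement = "旅行ではバスを使います"          # e.g. travel_persona text

score_geo = kw_score(core + " " + supplement, geo_kw)  # geo also scans supplementary text
score_tools = kw_score(core, tools_kw)                 # other categories use core text only
```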
Exclude Abnormal Scores and Estimate Categories
Exclude data with abnormally high scores (containing unnaturally many keywords) and estimate the most suitable category for each persona.
- Select the category with the highest score
- In case of ties, decide by priority (finance > safety > vocab > ...)
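A sketch of the tie-break rule; the priority order beyond finance > safety > vocab is an assumption, since the notebook only states the first three.

```python
# Highest score wins; ties fall back to this fixed priority order.
PRIORITY = ["finance", "safety", "vocab", "public", "geo", "tools", "life", "culture"]

def estimate_category(scores: dict) -> str:
    best = max(scores.values())
    return next(cat for cat in PRIORITY if scores.get(cat, 0) == best)
```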
Determine Neutral Data
Classify personas with few keyword hits and shorter length as Neutral.
Conditions:
- Core text length is 260 characters or less
- Keyword hit count is 0
- Does not contain definition-pattern keywords (such as 'とは', i.e. "what is ...")
Limit Neutral to 50 entries to prevent too many thin seeds.
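The Neutral conditions translate directly into a predicate; in this sketch the definition-pattern check is assumed to be a precomputed boolean.

```python
def is_neutral(core_len: int, kw_hits: int, has_definition: bool) -> bool:
    """Neutral = short core text, zero keyword hits, no definition pattern."""
    return core_len <= 260 and kw_hits == 0 and not has_definition
```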
Create Sampling Pools
Create sampling pools for each category.
- typical_pool: Neutral and thin data (max_score ≤ 2)
- weakB_pool: Reinforcement targets (finance, safety, vocab)
- geo_pool, tools_pool, public_pool, other_pool: Each sub-category of WeakA
Define Sampling Function with Caps
Function that samples while suppressing bias by occupation and prefecture.
Operation:
- First sample while respecting caps
- If insufficient, relax caps to fill the remainder
- Always ensure the specified count is met
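The described operation can be sketched as below; the function name and the exact relaxation strategy are assumptions, while the caps follow the bias-suppression rule stated earlier (max 10 per occupation, max 12 per prefecture).

```python
import pandas as pd

def sample_with_caps(pool: pd.DataFrame, n: int, occ_cap=10, pref_cap=12, seed=42):
    occ, pref, chosen = {}, {}, []
    # First pass: sample in shuffled order while respecting both caps.
    for idx, row in pool.sample(frac=1, random_state=seed).iterrows():
        if len(chosen) >= n:
            break
        o, p = row["occupation"], row["prefecture"]
        if occ.get(o, 0) < occ_cap and pref.get(p, 0) < pref_cap:
            chosen.append(idx)
            occ[o] = occ.get(o, 0) + 1
            pref[p] = pref.get(p, 0) + 1
    # Second pass: relax the caps to guarantee the requested count.
    if len(chosen) < n:
        rest = pool.index.difference(chosen)
        chosen += list(rest[: n - len(chosen)])
    return pool.loc[chosen]
```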
Sample WeakB Categories
Sample weakness reinforcement targets (WeakB).
- finance: 400 entries
- safety: 350 entries
- vocab: 350 entries
Sample a total of 1100 entries, excluding already selected data from subsequent sampling.
WeakA - Sample Geo/Tools
Sample transportation/movement and tools categories from WeakA.
- geo (Transportation/Movement): 250 entries
- tools (Tools): 100 entries
WeakA - Sample Public/Other
Sample the remainder of WeakA.
- public (Public Facilities/Manners): 200 entries
- other (culture/life): 150 entries
- For Other, prioritize those with public facility-related keywords
- Suppress those with religion-related keywords (penalty)
- This prevents category D from being biased toward religion
Sample Typical and Final Adjustments
Fill the remaining slots (approx. 500 entries) from the Typical category.
Process:
- Sample remaining count from Typical pool
- Combine all parts
- If insufficient, add from unused data
- If excess, adjust to 2000 entries
- Always ensure exactly 2000 entries
Assign Themes and Check Distribution
Map categories to JCommonsenseQA themes (A-F, N) and check the final distribution.
Themes:
- A: Transportation/Movement
- B: Tools/Usage
- C: Payment/Money
- D: Public Facilities/Manners
- E: Safety/Danger
- F: Daily Life/Housework
- N: Neutral
[seed_jc] size: 2000
jc_category
finance    407
vocab      364
safety     352
geo        320
public     208
culture    145
tools      102
life       102
Name: count, dtype: int64
jc_theme
B_Tools/Usage                  466
C_Payment/Money                407
D_Public Facilities/Manners    353
E_Safety/Danger                352
A_Transportation/Movement      320
F_Daily Life/Housework         102
Name: count, dtype: int64
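The category-to-theme mapping implied by this distribution (vocab counts fold into theme B, culture into theme D) can be written as a plain dict; the code below is a reconstruction, not the notebook's own.

```python
import pandas as pd

CATEGORY_TO_THEME = {
    "geo":     "A_Transportation/Movement",
    "tools":   "B_Tools/Usage",
    "vocab":   "B_Tools/Usage",
    "finance": "C_Payment/Money",
    "public":  "D_Public Facilities/Manners",
    "culture": "D_Public Facilities/Manners",
    "safety":  "E_Safety/Danger",
    "life":    "F_Daily Life/Housework",
}

df = pd.DataFrame({"jc_category": ["finance", "vocab", "culture"]})
df["jc_theme"] = df["jc_category"].map(CATEGORY_TO_THEME)
```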
Create and Save Final Output Data
Select columns needed for prompt generation and create the final seed data.
Output Columns:
- uuid, occupation, prefecture, region, marital_status
- age_band, skills_and_expertise_list
- jc_theme, jc_category, _attr_key
Theme-Based Topic Category Assignment
This code probabilistically assigns topic_category to each seed persona based on JCommonsenseQA themes (A-F, N).
Main Features
- Define weights for each theme
  - Assignment probabilities for topic categories (transportation, public places, daily life, etc.) are defined per theme (transportation/movement, tools/usage, etc.)
  - Example: A_Transportation/Movement → Transportation 60%, Public places 25%, Daily life 15%
- Deterministic random number generation
  - stable_u01(): always derives the same 0-1 value from the UUID/attribute key (via an MD5 hash)
  - Guarantees full reproducibility, since the same input always yields the same result
- Weighted probability selection
  - pick_weighted(): selects a category from the weights using the cumulative probability method
  - Maps the 0-1 random value to the appropriate topic category
- Automatic key selection
  - Uses the uuid column if it exists, otherwise _attr_key
  - Serves as the identifier assigning a unique, stable random number to each row
Process Flow
Row data → Get jc_theme → Get weights for theme → Generate hash value from UUID (0-1) → Weighted selection → Assign topic_category
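The two helpers can be sketched as follows, using the MD5-based hashing and cumulative-probability selection described above; the example weights are those given for theme A.

```python
import hashlib

def stable_u01(key: str) -> float:
    """Deterministic value in [0, 1) derived from an MD5 hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def pick_weighted(u: float, weights: dict) -> str:
    """Map u in [0, 1) to a category via cumulative probabilities."""
    total = sum(weights.values())
    acc = 0.0
    for cat, w in weights.items():
        acc += w / total
        if u < acc:
            return cat
    return cat  # guard against floating-point rounding at u close to 1.0

# Example weights for theme A, as stated above.
weights_a = {"transportation": 0.60, "public places": 0.25, "daily life": 0.15}
topic = pick_weighted(stable_u01("749db6e7c2e245b2ae3b46aa12c4f1e0"), weights_a)
```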
Execution Result
A topic_category column is added to each seed persona, assigning topic categories (transportation, public places, daily life, shopping, school, meals, workplace) according to the theme.
CSV Text Normalization and Unterminated Quote Detection
Clean and normalize dataframe text in advance to prevent issues with unterminated quotes and control characters during CSV file output.
Overview
This code combines 4 normalization processes to clean up data:
- Unify newline codes - Convert CRLF/CR to LF for cross-platform compatibility
- Remove control characters - Delete invisible characters that hinder CSV parsing (preserve tabs and newlines)
- Unicode safety - Handle corrupted characters and isolated surrogates
- Quote escaping - Double quotes as needed (usually handled automatically by csv.writer)
Apply the clean_cell() function to all cells in all columns to convert the entire dataframe to a safe state.
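A sketch of clean_cell() combining the first three normalizations (quote doubling is left to csv.writer, as noted above); the exact control-character set is an assumption, with tabs and newlines preserved.

```python
import re

# Control characters to strip; \t (0x09) and \n (0x0a) are deliberately kept.
CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_cell(value) -> str:
    s = "" if value is None else str(value)
    s = s.replace("\r\n", "\n").replace("\r", "\n")   # unify CRLF/CR newlines to LF
    s = s.encode("utf-8", "replace").decode("utf-8")  # neutralize isolated surrogates
    return CTRL.sub("", s)                            # strip invisible control chars
```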
Unterminated Quote Detection
The is_potential_unterminated_quote() function warns when the number of double quote occurrences in each row is odd. While not a complete detection, it functions as an inexpensive primary screening before CSV writing and can detect potential syntax errors early.
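A minimal version of this screening might look like:

```python
def is_potential_unterminated_quote(row_values) -> bool:
    """Flag rows whose total double-quote count is odd (cheap primary screening)."""
    total_quotes = sum(str(v).count('"') for v in row_values)
    return total_quotes % 2 == 1
```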
Filter Unterminated Quotes and Output CSV
Detect and exclude problematic rows with potential unterminated quotes, outputting only clean seed data to CSV file. Apply the is_potential_unterminated_quote() function to all rows to identify suspicious rows (odd number of double quotes), and extract only safe rows by logical inversion of the mask. Then remove judgment flag columns used in intermediate processing (is_neutral, _has_definition, has_neg_jc) to clean up the data, and save it in CSV format as the final seed data. This prevents CSV syntax errors and parsing failures, ensuring safety in downstream processing.
Note: The variable name is suspects, but it actually contains clean data after excluding suspicious rows.
Define Data Structures
Data Structure for jcommonsenseqa
Configuration 1: Using Seed Data
Uses the persona dataset directly as seed, and Data Designer automatically samples columns from the dataset.
[15:03:51] [INFO] Uploading seed dataset to datastore
Upload 0 LFS files: 0it [00:00, ?it/s]
Add jcommonsenseqa Generated Columns
Seed data columns can be directly referenced in the format {{ column_name }}.
Adding columns for jcommonsenseqa generation.
Quality Evaluation Setup
Evaluate data quality using LLM-as-a-Judge
Adding quality-evaluation columns.
Generate Preview
First check quality with a small amount of data
[15:03:56] [INFO] Validation passed
[15:03:56] [INFO] Starting preview generation
[15:03:56] [INFO] Sorting column configs into a Directed Acyclic Graph
[15:03:56] [INFO] Running health checks for models...
Generating a preview from the seed data... ==========
[15:03:57] [INFO] |-- Checking 'openai/gpt-oss-120b' in provider named 'nvidiabuild' for model alias 'gpt-oss-120b'...
[15:03:57] [INFO] |-- Passed!
[15:03:58] [INFO] |-- Checking 'openai/gpt-oss-120b' in provider named 'nvidiabuild' for model alias 'quality-judge'...
[15:03:58] [INFO] |-- Passed!
[15:03:58] [INFO] Processing batch 1 of 1
[15:03:58] [INFO] Sampling 1 records from seed dataset
[15:03:58] [INFO] |-- seed dataset size: 2000 records
[15:03:58] [INFO] |-- sampling strategy: ordered
[15:03:58] [INFO] Preparing llm-structured column generation
[15:03:58] [INFO] |-- column name: 'jcqa_data'
[15:03:58] [INFO] |-- model config:
{
"alias": "gpt-oss-120b",
"model": "openai/gpt-oss-120b",
"inference_parameters": {
"temperature": 0.9,
"top_p": 0.95,
"max_tokens": 2048,
"max_parallel_requests": 8,
"timeout": 1200,
"extra_body": null
},
"provider": "nvidiabuild"
}
[15:04:03] [INFO] Processing llm-structured column 'jcqa_data' with 8 concurrent workers
[15:04:05] [INFO] Preparing llm-judge column generation
[15:04:05] [INFO] |-- column name: 'quality_metrics'
[15:04:05] [INFO] |-- model config:
{
"alias": "quality-judge",
"model": "openai/gpt-oss-120b",
"inference_parameters": {
"temperature": 0.3,
"top_p": 0.9,
"max_tokens": 1024,
"max_parallel_requests": 4,
"timeout": 1500,
"extra_body": null
},
"provider": "nvidiabuild"
}
[15:04:10] [INFO] Processing llm-judge column 'quality_metrics' with 4 concurrent workers
[15:04:10] [INFO] Generating column `clarity_score` from expression
[15:04:10] [INFO] Generating column `difficulty` from expression
[15:04:10] [INFO] Model usage summary:
{
"openai/gpt-oss-120b": {
"token_usage": {
"prompt_tokens": 1653,
"completion_tokens": 563,
"total_tokens": 2216
},
"request_usage": {
"successful_requests": 1,
"failed_requests": 0,
"total_requests": 1
},
"tokens_per_second": 187,
"requests_per_minute": 5
}
}
[15:04:10] [INFO] Measuring dataset column statistics:
[15:04:10] [INFO] |-- column: 'uuid'
[15:04:10] [INFO] |-- column: 'professional_persona'
[15:04:10] [INFO] |-- column: 'sports_persona'
[15:04:10] [INFO] |-- column: 'arts_persona'
[15:04:10] [INFO] |-- column: 'travel_persona'
[15:04:10] [INFO] |-- column: 'culinary_persona'
[15:04:10] [INFO] |-- column: 'persona'
[15:04:10] [INFO] |-- column: 'cultural_background'
[15:04:10] [INFO] |-- column: 'skills_and_expertise'
[15:04:10] [INFO] |-- column: 'skills_and_expertise_list'
[15:04:10] [INFO] |-- column: 'hobbies_and_interests'
[15:04:10] [INFO] |-- column: 'hobbies_and_interests_list'
[15:04:10] [INFO] |-- column: 'career_goals_and_ambitions'
[15:04:10] [INFO] |-- column: 'sex'
[15:04:10] [INFO] |-- column: 'age'
[15:04:10] [INFO] |-- column: 'marital_status'
[15:04:10] [INFO] |-- column: 'education_level'
[15:04:10] [INFO] |-- column: 'occupation'
[15:04:10] [INFO] |-- column: 'region'
[15:04:10] [INFO] |-- column: 'area'
[15:04:10] [INFO] |-- column: 'prefecture'
[15:04:10] [INFO] |-- column: 'country'
[15:04:10] [INFO] |-- column: 'age_band'
[15:04:10] [INFO] |-- column: '_all_text'
[15:04:10] [INFO] |-- column: '_core_text'
[15:04:10] [INFO] |-- column: '_core_len'
[15:04:10] [INFO] |-- column: '_attr_key'
[15:04:10] [INFO] |-- column: 'score_finance'
[15:04:10] [INFO] |-- column: 'score_safety'
[15:04:10] [INFO] |-- column: 'score_vocab'
[15:04:10] [INFO] |-- column: 'score_public'
[15:04:10] [INFO] |-- column: 'score_tools'
[15:04:10] [INFO] |-- column: 'score_life'
[15:04:10] [INFO] |-- column: 'score_geo'
[15:04:10] [INFO] |-- column: 'score_culture'
[15:04:10] [INFO] |-- column: '_geo_text'
[15:04:10] [INFO] |-- column: '_tools_text'
[15:04:10] [INFO] |-- column: '_kw_hits'
[15:04:10] [INFO] |-- column: 'jc_category'
[15:04:10] [INFO] |-- column: 'max_score_any'
[15:04:10] [INFO] |-- column: '_public_bonus'
[15:04:10] [INFO] |-- column: '_religion_pen'
[15:04:10] [INFO] |-- column: 'jc_theme'
[15:04:10] [INFO] |-- column: 'topic_category'
[15:04:10] [INFO] |-- column: 'jcqa_data'
[15:04:10] [INFO] |-- column: 'quality_metrics'
[15:04:10] [INFO] |-- column: 'clarity_score'
[15:04:10] [INFO] |-- column: 'difficulty'
[15:04:10] [INFO] Preview complete!
Preview generation complete!
Analysis of preview data:
First few records of preview data:
uuid                                749db6e7c2e245b2ae3b46aa12c4f1e0
professional_persona                (Japanese persona text, truncated)
sports_persona                      (Japanese persona text, truncated)
arts_persona                        (Japanese persona text, truncated)
travel_persona                      (Japanese persona text, truncated)
culinary_persona                    (Japanese persona text, truncated)
persona                             (Japanese persona text, truncated)
cultural_background                 (Japanese persona text, truncated)
skills_and_expertise                (Japanese persona text, truncated)
skills_and_expertise_list           (list of Japanese skill strings, truncated)
...
_public_bonus                       None
_religion_pen                       None
jc_theme                            C_Payment/Money
topic_category                      Public places
jcqa_data                           {'answer_index': 0, 'choice0': '現金で支払う', 'choi...
jcqa_data__reasoning_trace          We need to output JSON with fields: question, ...
quality_metrics                     {'difficulty': {'reasoning': '日本の駅窓口での支払い方法は広く...
quality_metrics__reasoning_trace    We need to evaluate the generated data's quali...
clarity_score                       明確 (clear)
difficulty                          易しい (easy)

[1 rows x 50 columns]
Generate Production Data
If there are no issues in the preview, generate the large-scale dataset
Analyze Results
Quality Comparison
Compare quality with and without seed data
Save jcommonsenseqa Data
Summary
What We Did
- Configured nvidia/Nemotron-Personas-Japan as the seed data
- Generated data by directly referencing seed-data columns
- Generated synthetic data for jcommonsenseqa
- Created the dataset with seed data
- Evaluated quality using LLM-as-a-Judge
- Generated a quality comparison report