HuggingFace Course Gemma3 (1B) GRPO


To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on GitHub

In this Hugging Face and Unsloth notebook, you will learn to transform Gemma3 (1B) into a reasoning model using GRPO.

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to prepare the data, train the model, run it, and save it.

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. 3x faster, 500K Context

New in Reinforcement Learning: FP8 RL, Vision RL, Standby, gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]
[ ]

Unsloth

Load up Gemma 3 1B Instruct, and set parameters

[3]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-19 15:51:40 [__init__.py:256] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.17: Fast Gemma3 patching. Transformers: 4.50.0.dev0. vLLM: 0.8.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
model.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]
generation_config.json:   0%|          | 0.00/215 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]
added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update a small number of parameters!

[4]
Unsloth: Making `model.base_model.model.model` require gradients

Data Prep

We're using OpenAI's famous GSM8K dataset!

[5]
README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]
train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]
test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]
Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]
Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]
Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

Let's look at the first row:

[6]
'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?'
[7]
'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'

We notice that all answers, like the one above, end with a #### marker followed by the final result, so we extract it:

[8]
'72'
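The extraction step can be sketched as follows. This is a minimal illustration, assuming the final result is whatever follows the first #### marker; the function name extract_hash_answer is chosen here for illustration.

```python
def extract_hash_answer(text):
    """Return the text after the '####' marker, or None if the marker is absent."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
print(extract_hash_answer(sample))  # 72
```

Returning None for rows without a marker keeps malformed rows easy to filter out before training.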

We now create a system prompt, which can be customized. We add four marker tags: two that delimit the working-out (reasoning) section and two that delimit the final answer:

[9]
'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>'
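One way to build this prompt is to keep the four markers in variables so the same strings can be reused later by the regex and the reward functions. The variable names below are assumptions made for this sketch; the resulting string matches the output shown above.

```python
# The four marker tags; the variable names are illustrative.
reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    f"Place it between {reasoning_start} and {reasoning_end}.\n"
    f"Then, provide your solution between {solution_start}{solution_end}"
)
print(system_prompt)
```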

Let's map the dataset and look at the first row:

[10]
Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>',
   'role': 'system'},
  {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
   'role': 'user'}]}
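The mapping that produces a row like the one above can be sketched on a plain Python dict. The helper name to_chat_row is hypothetical; with the Hugging Face datasets library, a function of this shape could be passed to Dataset.map, which is presumably what the hidden cell does.

```python
SYSTEM_PROMPT = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    "Place it between <start_working_out> and <end_working_out>.\n"
    "Then, provide your solution between <SOLUTION></SOLUTION>"
)


def extract_hash_answer(text):
    """Return the text after '####', or None if the marker is absent."""
    return text.split("####")[1].strip() if "####" in text else None


def to_chat_row(row):
    """Build the chat-style prompt and the cleaned answer for one GSM8K row."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
        ],
        "answer": extract_hash_answer(row["answer"]),
    }


raw = {
    "question": "Natalia sold clips to 48 of her friends in April, and then "
                "she sold half as many clips in May. How many clips did "
                "Natalia sell altogether in April and May?",
    "answer": "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72",
}
row = {**raw, **to_chat_row(raw)}  # new keys overwrite old ones, as in a map
print(row["answer"], [m["role"] for m in row["prompt"]])
```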

We create a regex format to match the reasoning sections and answers:

[11]

We verify it works:

[12]
<re.Match object; span=(0, 71), match='<start_working_out>Let me think!<end_working_out>>
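A regex consistent with the match shown above might look like the sketch below. It assumes the pattern requires the reasoning block first, then the solution block, with only optional whitespace around them; the test string is chosen so that the span comes out as (0, 71).

```python
import re

reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

# Reasoning section, then a solution section whose content is captured.
match_format = re.compile(
    r"^\s*"
    + re.escape(reasoning_start) + r".+?" + re.escape(reasoning_end)
    + r".*?"
    + re.escape(solution_start) + r"(.+?)" + re.escape(solution_end)
    + r"\s*$",
    flags=re.MULTILINE | re.DOTALL,
)

m = match_format.search(
    "<start_working_out>Let me think!<end_working_out>"
    "<SOLUTION>2</SOLUTION>"
)
print(m.span(), m.group(1))  # (0, 71) 2
```

re.DOTALL lets the reasoning span several lines, and the capture group gives direct access to the proposed answer.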

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

[13]
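A reward function of this kind could be sketched as follows. It assumes the trainer passes completions as a list in which each item is a list of chat messages, so the generated text is completion[0]["content"]; that shape and the function name are assumptions of this sketch.

```python
import re

match_format = re.compile(
    r"^\s*<start_working_out>.+?<end_working_out>.*?"
    r"<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL,
)


def match_format_exactly(completions, **kwargs):
    """Give 3 points to every completion that follows the full format."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]  # assumed message layout
        scores.append(3.0 if match_format.search(response) else 0.0)
    return scores


good = [{"role": "assistant",
         "content": "<start_working_out>1+1<end_working_out><SOLUTION>2</SOLUTION>"}]
bad = [{"role": "assistant", "content": "The answer is 2."}]
print(match_format_exactly([good, bad]))  # [3.0, 0.0]
```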

If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

[14]
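Partial credit can be sketched by counting how often each marker appears. The values of +0.5 for a marker that appears exactly once and -0.5 otherwise are assumptions of this sketch, as is the message layout completion[0]["content"].

```python
MARKERS = (
    "<start_working_out>",
    "<end_working_out>",
    "<SOLUTION>",
    "</SOLUTION>",
)


def match_format_approximately(completions, **kwargs):
    """Score each marker separately: reward exactly one occurrence."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]  # assumed message layout
        score = sum(0.5 if response.count(m) == 1 else -0.5 for m in MARKERS)
        scores.append(score)
    return scores


full = [{"content": "<start_working_out>x<end_working_out><SOLUTION>2</SOLUTION>"}]
half = [{"content": "<start_working_out>x<end_working_out>Janelle has 72 marbles."}]
none = [{"content": "72"}]
print(match_format_approximately([full, half, none]))  # [2.0, 0.0, -2.0]
```

Penalizing repeated markers discourages the model from spamming tags to collect points.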

Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

[15]
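The answer check can be sketched as below. All point values and ratio bands (exact match 3.0, match after stripping whitespace 1.5, within 10% 0.5, within 20% 0.25, otherwise a penalty) are assumptions made for illustration, as are the function names and the (prompts, completions, answer) signature.

```python
import re

match_format = re.compile(
    r"^\s*<start_working_out>.+?<end_working_out>.*?"
    r"<SOLUTION>(.+?)</SOLUTION>\s*$",
    flags=re.MULTILINE | re.DOTALL,
)


def score_answer(guess, true_answer):
    """Score one extracted answer against the reference (values are assumed)."""
    if guess is None:
        return 0.0  # nothing could be extracted
    if guess == true_answer:
        return 3.0
    if guess.strip() == true_answer.strip():
        return 1.5
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -0.5  # not a number, or the reference is zero
    if 0.9 <= ratio <= 1.1:
        return 0.5
    if 0.8 <= ratio <= 1.2:
        return 0.25
    return -1.0  # confidently wrong


def check_answer(prompts, completions, answer, **kwargs):
    """Extract the text inside the solution tags and score it."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        m = match_format.search(completion[0]["content"])
        scores.append(score_answer(m.group(1) if m else None, true_answer))
    return scores


print(score_answer("82", "84"), score_answer("1050", "75"))  # 0.5 -1.0
```

The ratio bands give a learning signal for near misses, such as 82 against a true answer of 84, instead of treating every wrong answer alike.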

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20". In that case we extract the number, 20.

[16]
['0.34']
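Number extraction can be sketched with a second regex that grabs the first run of digits and dots after the opening solution tag. The exact pattern below is an assumption, chosen to agree with the ['0.34'] output above; extract_number is a hypothetical helper.

```python
import re

# First run of digits/dots that follows the opening <SOLUTION> tag.
match_numbers = re.compile(
    r"<SOLUTION>.*?([\d\.]+)",
    flags=re.MULTILINE | re.DOTALL,
)


def extract_number(response):
    """Return the first number-like token after <SOLUTION>, or None."""
    m = match_numbers.search(response)
    return m.group(1) if m else None


print(match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>"))  # ['0.34']
print(extract_number("<SOLUTION>The solution is $20</SOLUTION>"))  # 20
print(extract_number("Janelle has 72 marbles now."))  # None
```

A pattern like this also captures a trailing period (for example "36."), so converting the result with float() before comparing is a sensible follow-up step.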
[17]

Train the model

Now set up GRPO Trainer and all configurations!

[18]
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4
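The batch size is tied to num_generations because GRPO samples several completions for the same prompt and compares their rewards within that group. The sketch below illustrates the idea with a simple group normalization; the epsilon value and the use of the population standard deviation are assumptions, and the trainer's actual computation may differ in detail.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt (num_generations = 4), with made-up rewards.
adv = group_advantages([3.0, 0.0, 0.0, 1.0])
print([round(a, 2) for a in adv])

# If every completion gets the same reward, there is nothing to learn from.
print(group_advantages([0.125, 0.125, 0.125, 0.125]))  # [0.0, 0.0, 0.0, 0.0]
```

Completions that beat their group's average get a positive advantage and are reinforced; the second call shows why a group with zero reward spread contributes no learning signal.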

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step  Training Loss  reward     reward_std  completion_length  kl
1     0.000000       0.125000   0.000000    200.000000         0.000000
2     0.000000       0.072375   0.248112    200.000000         0.000000
3     0.000000       -0.079000  0.163776    182.500000         0.000005
[19]
Unsloth: Switching to float32 training since model cannot work with float16
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 6,522,880/1,006,408,832 (0.65% trained)
******************** Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<start_working_out>
Let $C$ be the cost of a ticket, which is $C = 40$.
Let $N$ be the number of tickets Mr. Benson bought, which is $N = 12$.
The total cost of the tickets without discount is $N \times C = 12 \times 40 = 480$.
Mr. Benson received a 5% discount for every ticket bought that exceeds 10.
The number of tickets that exceed 10 is $12 - 10 = 2$.
The discount for each of these tickets is $5\%$ of $10 = 0.05 \times 10 = 0.5$.
The discount amount for the 2 tickets is $2 \times 0.5 = 1$.
The discounted price for each of the 2 tickets is $10 - 0.5 = 9.5$.
The total cost for the 2 tickets is $2 \times 9.5 = 19$.
The total cost of the tickets is the original cost minus the discount amount: $480 - 19 = 461$.
However, this is incorrect because we need to consider the discounts for the tickets that exceed 10.
Let $x$ be the number of tickets that exceed 10. We have $x = 12 - 10 = 2$.
The discount is 5% for each ticket that exceeds 10.
So the discounted price for each ticket is $40 \times (1 - 0.05) = 40 \times 0.95 = 38$.
The total cost for the 2 tickets is $2 \times 38 = 76$.
The total cost of the 12 tickets is $12 \times 40 = 480$.
The total discount is $2 \times 0.05 \times 12 = 0.1 \times 12 = 1.2$.
The total cost after discount is $480 - 1.2 = 478.8$.
However, we are given that Mr. Benson received a 5% discount for every ticket bought that exceeds 10.
So, we need to calculate the number of tickets that exceed 10.
Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10.
The number of tickets that exceed 10 is $12 - 10 = 2$.
The discount on each of these 2 tickets is $5\%$ of $10$, which is $0.05 \times 10 = 0.5$.
The discount amount for each of the 2 tickets is $0.5 \times 40 = 20$.
The discounted price for each of the 2 tickets is $40 - 20 = 20$.
The total cost for the 2 tickets is $2 \times 20 = 40$.
The total cost for the 12 tickets is $12 \times 40 = 480$.
The total discount is $2 \times 0.5 = 1$.
The total cost after discount is $480 - 1 = 479$.
But this is not correct.

Let $N = 12$. The cost per ticket is $40$.
The discount is 5% for 
Extracted:
None
Unsloth: Will smartly offload gradients to save VRAM!
******************** Question:
Jane is trying to decide whether to buy a house or a trailer. A house costs $480,000 and a trailer costs $120,000. Each loan will be paid in monthly installments over 20 years. How much more is the monthly payment on the house compared to the trailer? 
Answer:
1500 
Response:
<start_working_out>
We need to calculate the monthly payment on the house and the trailer.

House cost: $480,000
Loan amount: $480,000
Interest rate: We need to assume an interest rate for this problem. Let's assume an annual interest rate of 6% (this is a common rate).
Loan term: 20 years, so 20 * 12 = 240 months
We will use the loan payment formula: M = P [ i(1 + i)^n ] / [ (1 + i)^n – 1]
where M is the monthly payment, P is the loan amount, i is the monthly interest rate, and n is the number of months.

Monthly interest rate (i) = Annual interest rate / 12 = 0.06 / 12 = 0.005
Number of months (n) = 240

M = 480000 [ 0.005(1 + 0.005)^240 ] / [ (1 + 0.005)^240 – 1]
M = 480000 [ 0.005(1.005)^240 ] / [ (1.005)^240 – 1]
M = 480000 [ 0.005 * 3.310853] / [ 3.310853 – 1]
M = 480000 [ 0.016554265] / [2.310853]
M = 480000 * 0.00703658
M = $331.54

Trailer cost: $120,000
Loan amount: $120,000
Interest rate: We still assume an annual interest rate of 6% (this is a common rate).
Loan term: 20 years, so 20 * 12 = 240 months
We will use the loan payment formula: M = P [ i(1 + i)^n ] / [ (1 + i)^n – 1]
where M is the monthly payment, P is the loan amount, i is the monthly interest rate, and n is the number of months.

Monthly interest rate (i) = Annual interest rate / 12 = 0.06 / 12 = 0.005
Number of months (n) = 240

M = 120000 [ 0.005(1 + 0.005)^240 ] / [ (1 + 0.005)^240 – 1]
M = 120000 [ 0.005(1.005)^240 ] / [ (1.005)^240 – 1]
M = 120000 [ 0.005 * 3.310853 ] / [ 3.310853 – 1]
M = 120000 [ 0.016554265] / [2.310853]
M = 1200 
Extracted:
None
******************** Question:
Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? 
Answer:
1040 
Response:
<start_working_out>
Let $C$ be the amount Janet spends on clarinet lessons per week, and $P$ be the amount Janet spends on piano lessons per week.
Janet pays $40/hour for 3 hours per week of clarinet lessons, so $C = 40 \times 3 = 120$.
Janet pays $28/hour for 5 hours a week of piano lessons, so $P = 28 \times 5 = 140$.
Janet's weekly spending on clarinet lessons is $120/week$, and her weekly spending on piano lessons is $140/week$.
There are 52 weeks in a year.
The amount she spends on clarinet lessons in a year is $120 \times 52 = 6240$.
The amount she spends on piano lessons in a year is $140 \times 52 = 7280$.
The difference between the amount she spends on piano lessons and clarinet lessons in a year is $7280 - 6240 = 1040$.

<SOLUTION>
The amount she spends on clarinet lessons in a year is $120 \times 52 = 6240$.
The amount she spends on piano lessons in a year is $140 \times 52 = 7280$.
The difference between the amount she spends on piano lessons and clarinet lessons in a year is $7280 - 6240 = 1040$.

Final Answer: The final answer is $\boxed{1040}$ 
Extracted:
120
******************** Question:
Sabrina is collecting herbs to make a poultice for her grandmother. She needs twice as many basil leaves as sage leaves and 5 fewer sage leaves than verbena leaves. If she needs 12 basil leaves, how many leaves total does she need? 
Answer:
29 
Response:
<start_working_out>
Let $b$ be the number of basil leaves, $s$ be the number of sage leaves, and $v$ be the number of verbena leaves.
We are given that Sabrina needs twice as many basil leaves as sage leaves, so $b = 2s$.
We are also given that she needs 5 fewer sage leaves than verbena leaves, so $s = v - 5$.
We are told that she needs 12 basil leaves, so $b = 12$.
We can use the equation $b = 2s$ to solve for $s$:
$12 = 2s$
$s = \frac{12}{2}$
$s = 6$
Now we can use the equation $s = v - 5$ to solve for $v$:
$6 = v - 5$
$v = 6 + 5$
$v = 11$
The total number of leaves needed is $b + s + v = 12 + 6 + 11$.
$12 + 6 + 11 = 18 + 11 = 29$
Therefore, the total number of leaves Sabrina needs is 29.

<SOLUTION>29 
Extracted:
29
******************** Question:
Over the past five years, on July 4th, the high temperature for Washington, DC has been: 90 degrees in 2020, 90 degrees in 2019, 90 degrees in 2018, 79 degrees in 2017 and 71 degrees in 2016. What is the average temperature for July 4th in Washington, DC over the past 5 years? 
Answer:
84 
Response:
<start_working_out>
To find the average temperature, we need to sum the temperatures and divide by the number of years (5).

The temperatures are: 90, 90, 90, 79, 71.

Sum of temperatures = 90 + 90 + 90 + 79 + 71 = 410

Number of years = 5

Average temperature = 410 / 5 = 82

<SOLUTION>82
 
Extracted:
82
******************** Question:
Rene can finish reading 30 pages in 60 minutes. Lulu can read 27 pages in 60 minutes and Cherry can read 25 pages in 60 minutes. If they have been reading for 240 minutes now, how many pages have they finished reading in total? 
Answer:
328 
Response:
<start_working_out>
Let R be the number of pages Rene finishes in 60 minutes.
Let L be the number of pages Lulu finishes in 60 minutes.
Let C be the number of pages Cherry finishes in 60 minutes.
We are given that R = 30, L = 27, and C = 25.
We are also given that they have been reading for 240 minutes.
Rene finishes $\frac{30}{60} = \frac{1}{2}$ of a page per minute.
Lulu finishes $\frac{27}{60} = \frac{9}{20}$ of a page per minute.
Cherry finishes $\frac{25}{60} = \frac{5}{12}$ of a page per minute.
The total number of pages they finish in 240 minutes is:
$T = \frac{1}{2}R + \frac{9}{20}L + \frac{5}{12}C$
$T = \frac{1}{2}(30) + \frac{9}{20}(27) + \frac{5}{12}(25)$
$T = 15 + \frac{243}{20} + \frac{125}{12}$
To add these terms, we need to find a common denominator for 20 and 12. The least common multiple of 20 and 12 is 120.
$T = 15 + \frac{243 \cdot 6}{20 \cdot 6} + \frac{125 \cdot 10}{12 \cdot 10}$
$T = 15 + \frac{1458}{120} + \frac{1250}{120}$
$T = 15 + \frac{1458 + 1250}{120}$
$T = 15 + \frac{2708}{120}$
$T = 15 + \frac{677}{30}$
$T = 15 + 22.5666...$
However, we are given that they have been reading for 240 minutes. We need to find the total number of pages finished.
Let $x$ be the total number of pages they have finished.
Rene finishes $R = 30$ pages in 60 minutes.
Lulu finishes $L = 27$ pages in 60 minutes.
Cherry finishes $C = 25$ pages in 60 minutes.
The total pages finished in 60 minutes is $30 + 27 + 25 = 82$ pages.
They have been reading for 240 minutes.
The rate at which they finish pages is:
Rene: $\frac{30}{60} = \frac{1}{2}$ pages per minute
Lulu: $\frac{27}{60} = \frac{9}{20}$ pages per minute
Cherry: $\frac{25}{60} = \frac{5}{12}$ pages per minute
Total pages finished in 240 minutes:
$T = \frac{1}{2}R + \frac{9}{20}L + \frac{5}{12}C = \frac{1}{2}( 
Extracted:
None
******************** Question:
Martin rings the small bell 4 times more than 1/3 as often as the big bell. If he rings both of them a combined total of 52 times, how many times does he ring the big bell? 
Answer:
36 
Response:
<start_working_out>
Let $m$ be the number of times Martin rings the small bell, and $b$ be the number of times he rings the big bell.
We are given that Martin rings the small bell 4 times more than 1/3 as often as the big bell. This can be written as
$$m = \frac{1}{3}b + 4$$
We are also given that he rings both bells a combined total of 52 times. So,
$$m + b = 52$$
Now we have a system of two linear equations with two variables:
$$m = \frac{1}{3}b + 4$$
$$m + b = 52$$
Substitute the first equation into the second equation:
$$\left(\frac{1}{3}b + 4\right) + b = 52$$
$$\frac{1}{3}b + b = 52 - 4$$
$$\frac{4}{3}b = 48$$
$$b = \frac{3}{4} \cdot 48$$
$$b = 3 \cdot 12$$
$$b = 36$$
Now, substitute $b = 36$ into the equation $m = \frac{1}{3}b + 4$:
$$m = \frac{1}{3}(36) + 4$$
$$m = 12 + 4$$
$$m = 16$$
So, Martin rings the small bell 16 times and the big bell 36 times. We want to find the number of times he rings the big bell, which is $b$.
The number of times the big bell is $b = 36$.

<SOLUTION>
The number of times Martin rings the big bell is 36.
Final Answer: The final answer is $\boxed{36}$ 
Extracted:
36.
******************** Question:
Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average? 
Answer:
75 
Response:
<start_working_out>
Let $w$ be the number of words in each crossword puzzle on average.
Bert uses up a pencil to fill out the puzzles every two weeks.
We are given that it takes him 1050 words to use up a pencil.
Since he uses up a pencil every two weeks, the number of pencils he uses is $\frac{2}{2} = 1$ pencil every two weeks.
The total number of words used is $1 \times w = w$ words.
We are given that he uses up a pencil to fill out the puzzles every two weeks, and it takes him 1050 words to use up a pencil.
Therefore, $w = 1050$.
So, the number of words in each crossword puzzle on average is 1050.

<SOLUTION>1050 
Extracted:
1050
******************** Question:
Matt can make a batch of a dozen cookies using 2 pounds of flour.  He uses 4 bags of flour each weighing 5 pounds.  If Jim eats 15 cookies how many cookies are left? 
Answer:
105 
Response:
<start_working_out>
Let $C$ be the number of cookies Matt can make.
Matt can make a dozen cookies, which means he can make 12 cookies.
He uses 2 pounds of flour to make a dozen cookies.
He uses 4 bags of flour each weighing 5 pounds.
So, the total weight of flour he uses is $4 \times 5 = 20$ pounds.
Since he makes a dozen cookies, the amount of flour he uses is 2 pounds.
The total weight of flour he uses is 2 pounds.
The number of cookies he can make is $\frac{2}{5}$ dozen.
A dozen is 12 cookies, so $\frac{2}{5}$ dozen is $\frac{2}{5} \times 12 = \frac{24}{5} = 4.8$ cookies.
Since he can only make a whole number of cookies, we can assume he makes 4 cookies.
He uses 2 pounds of flour.
If he makes 4 cookies, he uses $\frac{4}{12} \times 2 = \frac{1}{3} \times 2 = \frac{2}{3}$ pounds of flour.
Jim eats 15 cookies.
The number of cookies remaining is $15 - 4 = 11$ cookies.
However, we don't need to calculate the value of the fraction of flour used.
Let $x$ be the number of cookies Matt can make with 2 pounds of flour.
$x = \frac{2}{5} \times 12 = \frac{24}{5} = 4.8$ cookies.
Since he can only make a whole number of cookies, we are given that Matt can make a dozen cookies, which is 12 cookies.
He uses 2 pounds of flour for a dozen cookies.
He uses 4 bags of flour each weighing 5 pounds. So he uses $4 \times 5 = 20$ pounds of flour.
Since he makes a dozen cookies, he uses $\frac{2}{12} \times 20 = \frac{1}{6} \times 20 = \frac{20}{6} = \frac{10}{3} = 3.33...$ pounds of flour.
Jim eats 15 cookies.
The number of cookies left is $12 - 15 = -3$, which is not possible.
Let $n$ be the number of cookies Matt can make. He uses 2 pounds of flour.
He uses 4 bags of flour, each weighing 5 pounds. So he uses $4 \times 5 = 20$ pounds of flour.
We have $n = \frac{2}{5} \times 12 = \frac{24}{5} = 4.8$ cookies. Since he can only make whole cookies, he makes 4 cookies.
Jim eats 15 cookies.
The number of cookies remaining is $4 - 15 = -11$, which is not possible.
Let $x$ be the number of cookies Matt makes. He uses 2 pounds of flour.
He uses 4 bags of flour, each weighing 5 pounds. So he uses $4 \times 5 = 20$ pounds of flour.
We are given that he makes a dozen cookies, so $x = 12$.
The amount of flour used is 2 pounds.
The number of cookies Matt can make is $\frac{ 
Extracted:
None
******************** Question:
James decides to build a tin house by collecting 500 tins in a week. On the first day, he collects 50 tins. On the second day, he manages to collect 3 times that number. On the third day, he collects 50 tins fewer than the number he collected on the second day. If he collects an equal number of tins on the remaining days of the week, what's the number of tins he collected each day for the rest of the week? 
Answer:
50 
Response:
<start_working_out>
Let $x$ be the number of tins James collected on the third day.
On the first day, he collected 50 tins.
On the second day, he collected 3 times that number, so he collected $3 \times 50 = 150$ tins.
On the third day, he collected 50 tins fewer than the number he collected on the second day, so he collected $150 - 50 = 100$ tins.
The total number of tins collected in the first three days is $50 + 150 + 100 = 300$ tins.
He has 500 tins to collect in total, so the number of tins collected on the remaining days is $500 - 300 = 200$ tins.
There are 7 days in a week. He collects an equal number of tins on the remaining 7 days.
So, the number of tins collected on each of the remaining 7 days is $\frac{200}{7} \approx 28.57$. Since the number of tins must be an integer, we can't have a fraction. However, the problem states that he collects an equal number of tins on the remaining days of the week.

Let $n$ be the number of tins collected on each of the remaining 7 days. Then $7n = 200$.
$n = \frac{200}{7} \approx 28.57$. Since the number of tins collected must be an integer, we need to re-evaluate.

Let $x$ be the number of tins collected on the first day.
Let $y$ be the number of tins collected on the second day.
Let $z$ be the number of tins collected on the third day.
Let $w$ be the number of tins collected on the fourth day.
Let $v$ be the number of tins collected on the fifth day.
Let $u$ be the number of tins collected on the sixth day.
Let $t$ be the number of tins collected on the seventh day.

We are given:
$x = 50$
$y = 3x = 3(50) = 150$
$z = y - 50 = 150 - 50 = 100$
$w = x - 50 = 50 - 50 = 0$
$v = w + z = 0 + 100 = 100$
$u = v + t = 100 + t$
$t = 500 - (x + y + z + w + v + u) = 500 - (50 + 150 + 100 + 0 + 100 + (100+t)) = 500 - (50 + 150 + 100 + 0 + 100 + 100 + t) = 500 - (400 + 300 + t) = 500 - (700 + t) = -200 - t$
Since $t$ must be non-negative, we have $-200 - t \geq 0$, so $t \leq -200$. This is not possible.

Let's re-interpret the problem.
James collects 50 tins on 
Extracted:
None
******************** Question:
A jar of jellybeans has 14 blue jellybeans, 26 purple jellybeans and 40 orange jellybeans. If there are 200 jellybeans in the jar, how many are there of the red color? 
Answer:
120 
Response:
<start_working_out>
Let $B$ be the number of blue jellybeans, $P$ be the number of purple jellybeans, and $O$ be the number of orange jellybeans.
We are given that $B = 14$, $P = 26$, and $O = 40$.
The total number of jellybeans is $B + P + O = 14 + 26 + 40 = 80$.
We are told that there are 200 jellybeans in the jar.
Let $R$ be the number of red jellybeans.
The total number of jellybeans is the sum of the number of blue, purple, orange, and red jellybeans.
So, $B + P + O + R = 200$.
Substituting the given values, we have $14 + 26 + 40 + R = 200$.
$80 + R = 200$.
$R = 200 - 80$.
$R = 120$.
Therefore, there are 120 red jellybeans in the jar.

<SOLUTION>120 
Extracted:
120
******************** Question:
Five adults and two children go to see a movie and buy $12 worth of concessions. The total cost of their trip is $76. If each child's ticket is $7, how much, in dollars, are the adult tickets? 
Answer:
10 
Response:
<start_working_out>
Let $a$ be the number of adult tickets and $c$ be the number of children's tickets.
We are given that there are five adults and two children, so $a+c = 5+2 = 7$.
We are also given that the total cost of the trip is $12 worth of concessions, and the total cost is $76. So, $7a + 7c = 76$.
We have a system of two linear equations with two variables:
\begin{align*} \label{eq:1} a+c &= 7 \\ 7a+7c &= 76\end{align*}
From the first equation, we can write $c = 7-a$. Substitute this into the second equation:
$$7a + 7(7-a) = 76$$
$$7a + 49 - 7a = 76$$
$$49 = 76$$
This equation is inconsistent, so we must have made a mistake.
Let's try another approach. Let $x$ be the price of an adult ticket and $y$ be the price of a child's ticket.
We are given that there are five adults and two children, so $a$ is the number of adult tickets and $c$ is the number of children's tickets.
We are given that there are 5 adults and 2 children, so $a+c = 5+2 = 7$.
We are given that the total cost is $12 worth of concessions.
The cost of the adult tickets is $ax$ and the cost of the children's tickets is $cy$.
The total cost is $ax + cy = 12$.
We are given that each child's ticket is $7, so $y = 7$.
Thus, $ax + 7c = 12$.
Since $a+c = 7$, we can write $c = 7-a$. Substitute this into the equation:
$$ax + 7(7-a) = 12$$
$$ax + 49 - 7a = 12$$
$$ax - 7a = 12 - 49$$
$$a(x-7) = -37$$
$$a = \frac{-37}{x-7}$$
Since $a$ and $c$ must be integers, $x-7$ must be a factor of 37. Since 37 is a prime number, $x-7$ can be $1$ or $37$.
If $x-7 = 1$, then $x = 8$. Then $a = \frac{-37}{1} = -37$, which is not possible.
If $x-7 = 37$, then $x = 44$. Then $a = \frac{-37}{37} = -1$, which is not possible.
Let's try a different approach.
The total cost of the trip is $12. The number of adults is $a$ and the number of children is $c$.
$$ax + 7c = 12$$
We are given that there are 5 adults and 2 children, so $a+c = 5+2 = 7$.
Substituting $c = 7-a$ into the equation, we get
$$ax + 7(7-a) = 12$$
$$ax + 49 
Extracted:
None
******************** Question:
Janelle had 26 green marbles. Then she bought 6 bags of blue marbles. There were 10 marbles in each bag.  She created a gift of 6 green marbles and 8 blue marbles and gave it to a friend. How many marbles does Janelle have now? 
Answer:
72 
Response:
<start_working_out>
Janelle starts with 26 green marbles.
She buys 6 bags of blue marbles, with 10 marbles in each bag. So she buys 6 * 10 = 60 marbles of blue.
Now she has 26 + 60 = 86 marbles.
She creates a gift of 6 green marbles and 8 blue marbles.
So she has 6 green marbles + 8 blue marbles = 14 marbles.
After creating the gift, she has 86 - 14 = 72 marbles.
<end_working_out>
Janelle has 72 marbles now.
 
Extracted:
None

Inference

Now let's try the model we just trained!

[22]
<start_working_out>
The square root of 101 is approximately 10.0498756.
<SOLUTION>
10.0498756<end_of_turn>

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

[23]
('gemma-3/tokenizer_config.json',
 'gemma-3/special_tokens_map.json',
 'gemma-3/tokenizer.model',
 'gemma-3/added_tokens.json',
 'gemma-3/tokenizer.json')

Saving to float16 for vLLM

We also support saving to float16 directly for deployment! The model is saved in the folder gemma-3-finetune. Change if False to if True in the cell below to run it.

[24]

If you want to push the model to your Hugging Face account, change if False to if True and add your Hugging Face token and upload location!

[25]

GGUF / llama.cpp Conversion

To save to GGUF / llama.cpp, we support it natively now for all models! For now, you can convert easily to Q8_0, F16 or BF16 precision. Q4_K_M for 4bit will come later!

[ ]

Likewise, if you want to push the GGUF to your Hugging Face account instead, change if False to if True and add your Hugging Face token and upload location!

[ ]

Now, use the gemma-3-finetune.gguf file in llama.cpp, or the gemma-3-finetune-Q4_K_M.gguf file once Q4_K_M export is available.

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM developments, or want to join projects, feel free to join our Discord!

Some other resources:

  1. Train your own reasoning model - Llama GRPO notebook Free Colab
  2. Saving finetunes to Ollama. Free notebook
  3. Llama 3.2 Vision finetuning - Radiography use case. Free Colab
  4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!


This notebook and all Unsloth notebooks are licensed LGPL-3.0.