Gpt Oss (20B) GRPO

Open In Colab

To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to do data prep, how to train, how to run the model, and how to save it.

News

Unsloth's Docker image is here! Start training with no setup & environment issues. Read our Guide.

gpt-oss RL is now supported with the fastest inference & lowest VRAM. Try our new notebook which creates kernels!

Introducing Vision and Standby for RL! Train Qwen, Gemma etc. VLMs with GSPO - even faster with less VRAM.

Unsloth now supports Text-to-Speech (TTS) models. Read our guide here.

Visit our docs for all our model uploads and notebooks.

Installation

[1]

Unsloth

Goal: Make faster kernels with Reinforcement Learning

Our goal is to make a faster matrix multiplication kernel by doing RL on GPT-OSS 20B with Unsloth.

You will learn how to:

  1. Counteract reward hacking such as cheating, caching, and laziness.
  2. Time kernels for correctness and enforce time limits.
  3. Design good reward functions.
  4. Seriously do RL to produce optimized CUDA kernels.
[2]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.
model.safetensors.index.json: 0.00B [00:00, ?B/s]
model-00001-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]
model-00002-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]
model-00003-of-00004.safetensors:   0%|          | 0.00/3.37G [00:00<?, ?B/s]
model-00004-of-00004.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]
Unsloth: Offloading embeddings to RAM to save 1.08 GB.
tokenizer_config.json: 0.00B [00:00, ?B/s]
tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]
chat_template.jinja: 0.00B [00:00, ?B/s]

We now add a small number of LoRA weights to GPT-OSS so we only need to train those, instead of training the full model.

[ ]
Unsloth: Making `model.base_model.model.model` require gradients

Optimized matrix multiplication

NumPy has optimized matrix multiplication kernels for CPUs via BLAS routines. For GPUs, one can use CUDA-accelerated cuBLAS kernels, which PyTorch calls under the hood.

To generate some random matrices to do matrix multiplication, we can do the below:

[ ]
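The code cell above is empty in this export; a minimal sketch of what it might contain, assuming NumPy and an arbitrary uniform range of [-10, 10]:

```python
import numpy as np

# Generate two random matrices whose inner dimensions match: (4x5) @ (5x3) -> (4x3).
A = np.random.uniform(-10, 10, size=(4, 5))
B = np.random.uniform(-10, 10, size=(5, 3))

# NumPy dispatches @ to an optimized BLAS routine under the hood.
C = A @ B
print(A.shape, B.shape, C.shape)
```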

We shall generate a small matrix, and see the matrix multiplied output

[ ]
[[-2.8313286   4.54613909 -7.95265309  6.53459836  2.87235103]
 [ 7.0739631   3.76278879  9.31565599 -8.52884711  9.96832952]
 [ 8.41214082  6.51136046 -3.79347975 -2.46773693 -2.32292989]
 [ 3.91302932  4.98335304 -5.33855089  5.71057634 -2.79871647]]
[[ 0.39218774 -9.6181377  -3.49736707]
 [-0.33354865 -1.05626139  3.87231208]
 [ 0.49494174  5.91863954 -6.83183693]
 [ 5.1465162  -7.51648113  1.00445384]
 [ 9.63213377 -4.92327556  3.323014  ]]
[[  54.73441488  -87.89725072   97.94605887]
 [  58.25238906   -1.8467447   -49.25453031]
 [ -35.82528794  -80.25394462   11.51225408]
 [  -0.33785799 -103.64132345   38.51974367]]

We can call an LLM to generate a simple matrix multiply kernel in pure Python, and we can calculate the differences between the actual result and the kernel's result.

[ ]
[ ]
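The cells above are empty in this export; a sketch of the idea, using a naive pure-Python kernel (of the kind an LLM might generate) and comparing it against NumPy via the maximum absolute error and the mean squared error:

```python
import numpy as np

def matmul(A, B):
    """Naive pure-Python matrix multiply over lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for t in range(k):
            a = A[i][t]
            for j in range(m):
                C[i][j] += a * B[t][j]
    return C

A = np.random.uniform(-10, 10, size=(4, 5))
B = np.random.uniform(-10, 10, size=(5, 3))
true_C = A @ B
pred_C = np.array(matmul(A.tolist(), B.tolist()))

# Maximum absolute error and mean squared error vs the NumPy reference
amax_error = float(np.amax(np.abs(pred_C - true_C)))
mse_error = float(np.mean((pred_C - true_C) ** 2))
print(amax_error, mse_error)
```

Both errors should be on the order of machine epsilon, as in the output below.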

We see the error below is very small, so that's good!

[ ]
(7.105427357601002e-15, 4.6783406255758477e-29)

Countering Reward Hacking

The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).

But RL can cheat! When the RL algorithm learns a trick or exploits something to increase the reward, without actually doing the task at hand, this is called "reward hacking".

Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking

For matrix multiplication kernels, we might see the following issues:

  • Laziness: RL learns to call NumPy, Torch, or other libraries, which use optimized CUDA kernels.
  • Caching: RL learns to cache the result of the output.
  • Cheating: RL learns to find the actual output by inspecting Python global variables.
  • Timing manipulation: RL learns to edit the timing function so it reports zero elapsed time.

And possibly more. We shall try to address each!

Countering Reward Hacking 1: Stop laziness

We can stop the RL algorithm from calling optimized code by inspecting whether the generated code imports any non-standard Python libraries. We used GPT-5 to help generate this check, check_only_stdlib_imports:

[ ]

For example, let's call check_only_stdlib_imports on a random piece of matrix multiplication code generated by GPT-5:

[ ]
Only stdlib imports? False
{'stdlib': [], 'non_stdlib': ['numpy', 'torch'], 'relative_imports': 0}

Countering Reward Hacking 2: Stop cheating

We can stop the RL algorithm from using global or cached variables by restricting its locals and globals.

We are also going to use exec to create the function, so we have to save the output to an empty dict.

We also disallow global variable access.

[ ]
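The cell is empty in this export; a sketch of the exec-into-an-empty-dict pattern described above (the source string here is an illustrative kernel, not the notebook's exact one):

```python
# Build the function from its source with exec, capturing the definition in
# an empty dict so nothing leaks into (or out of) our own namespace.
code = '''
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]
'''
scope = {}
exec(code, scope, scope)  # the definition lands in `scope`, not in globals()
matmul = scope["matmul"]
print(matmul)
```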
<function matmul(A, B)>

We also disallow global variable access via types.FunctionType(f.__code__, {})

[ ]
Success
name 'np' is not defined
[ ]

Countering Reward Hacking 3: Stop caching

We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully, with multiple loops and trials.

We also add a timeout so the algorithm cannot run in an endless loop.

[ ]
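The benchmarking cell is empty in this export; a minimal sketch that follows the recipe above (the 64 MB wiper size and the returned statistics keys are illustrative choices matching the output shown below):

```python
import statistics
import time

def benchmark(fn, args, trials=3, timeout_s=10.0):
    """Time fn(*args) over several trials, wiping the cache between runs."""
    times, exceptions, timeouts = [], [], 0
    deadline = time.perf_counter() + timeout_s
    for _ in range(trials):
        if time.perf_counter() > deadline:
            timeouts += 1  # ran out of the time budget
            continue
        wiper = bytearray(64 * 1024 * 1024)  # touch 64 MB to evict cached results
        del wiper
        start = time.perf_counter_ns()
        try:
            fn(*args)
        except Exception as e:
            exceptions.append(repr(e))
            continue
        times.append(time.perf_counter_ns() - start)
    return {
        "median_ns": int(statistics.median(times)) if times else None,
        "mean_ns": int(statistics.mean(times)) if times else None,
        "stdev_ns": int(statistics.stdev(times)) if len(times) > 1 else 0,
        "exceptions": exceptions,
        "timeouts": timeouts,
    }
```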

For example, we benchmark the matmul kernel we had, with a 10-second timeout:

[ ]
{'median_ns': 64112906,
 'mean_ns': 64112906,
 'stdev_ns': 0,
 'exceptions': [],
 'timeouts': 0}

Data & RL task setup

We now have to create a prompt giving the model its task. For our matrix multiplication example, we use the below:

[ ]
Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```

First, let's prompt GPT-OSS without RL and see how it goes:

[ ]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-21

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```<|end|><|start|>assistant<|channel|>analysis<|message|>We need to provide a quick naive matrix multiplication? "New fast" but only native Python. Provide code. Use list comprehension or loops. Optimize maybe by transposing B. Provide code.<|end|><|start|>assistant<|channel|>final<|message|>Here is a simple, fast implementation of matrix multiplication that uses only native Python constructs and avoids any external dependencies. It works with any two 2‑D lists of numbers (ints, floats, etc.), assuming their dimensions are compatible:

```python
def matmul(A, B):
    """
    Multiplies two matrices A and B.

    Parameters
    ----------
    A : list of lists
        First matrix. Must have shape (m, n).
    B : list of lists
        Second matrix. Must have shape (n, p).

    Returns
    -------
    list of lists
        The resulting matrix of shape (m, p).
    """
    # Check basic compatibility
    if not A or not B or len(A[0]) != len(B):
        raise ValueError("Incompatible dimensions for matrix multiplication.")

    # Transpose B once to speed up access
    B_t = [list(col) for col in zip(*B)]  # B transpose, O(n*p)

    m, n = len(A), len(A[0])
    p = len(B_t)
    result = [[0.0] * p for _ in range(m)]

    # Standard triple-loop but with B transposed for cache friendliness
    for i in range(m):
        a_row = A[i]
        res_row = result[i]
        for j in range(p):
            res_row[j] = sum(a_row[k] * B_t[j][k] for k in range(n))

    return result
```

**How it works**

1. **Input validation**: It checks that the number of columns in `A` matches the number of rows in `B`.  
2. **Transposition of `B`**: By transposing `B` (`B_t`), we turn repeated index lookups into simple list accesses, which is much faster in pure Python than accessing nested lists repeatedly.  
3. **Main loop**: For each row `i` of `A` and each row `j` of `B_t` (i.e., each column of `B`), the inner generator expression computes the dot product.  
4. **Result**: The function returns a list of lists representing the product matrix.

This implementation is concise,

Reward functions

We now design the extract_function helper, which simply extracts the function wrapped in triple backticks.

And 4 reward functions:

  1. function_works rewards the model if the generated code is a valid Python function.
  2. no_cheating checks whether the function imported other modules; if it did, we penalize it.
  3. correctness_check checks whether the kernel is correct or wrong - it shouldn't generate gibberish!
  4. speed_check checks performance relative to NumPy's matmul directly.
[ ]
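The extract_function cell is empty in this export; a minimal regex-based sketch that pulls the first fenced code block out of a reply:

```python
import re

def extract_function(text):
    """Return the code inside the first ```python ... ``` (or bare ```) block,
    or None if no fenced block is found."""
    match = re.search(r"```(?:python)?\s*\n(.*?)```", text, re.DOTALL)
    return match.group(1).rstrip() if match else None
```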
def matmul(A, B):
    return ...

Below is our function_works reward function which uses Python's exec but guarded by not allowing leakage of local and global variables. We can also use check_only_stdlib_imports first to check if there are errors before even executing the function:

[ ]
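The function_works cell is empty in this export; a sketch of the idea, executing the candidate in an isolated scope and scoring it (the reward and penalty magnitudes here are illustrative choices, not the notebook's exact values):

```python
def function_works(code):
    """Reward the model only if its code defines a callable `matmul`."""
    if code is None:
        return -2.0  # nothing to execute
    scope = {}
    try:
        exec(code, scope, scope)
    except Exception:
        return -2.0  # syntax error or crash at definition time
    fn = scope.get("matmul")
    return 1.0 if callable(fn) else -1.0
```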
(False,
, {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
,  'stdlib': [],
,  'non_stdlib': [],
,  'relative_imports': 0})
[ ]

no_cheating checks if the function cheated since it might have imported Numpy or Torch optimized code.

[ ]

Next, correctness_check checks if the kernel was correct. We want to penalize it if the absolute error is larger than 1, or if the mean squared error is somewhat larger than machine epsilon.

We have to execute the code now!

[ ]
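The correctness_check cell is empty in this export; a sketch that compares the candidate kernel against NumPy on random inputs, with the error thresholds and penalty values being illustrative choices:

```python
import numpy as np

def correctness_check(fn, n_tests=3):
    """Score a candidate matmul against NumPy on random inputs."""
    for _ in range(n_tests):
        A = np.random.uniform(-10, 10, size=(4, 5))
        B = np.random.uniform(-10, 10, size=(5, 3))
        try:
            pred = np.array(fn(A.tolist(), B.tolist()), dtype=np.float64)
        except Exception:
            return -3.0  # crashed on valid input
        true = A @ B
        if pred.shape != true.shape:
            return -3.0  # wrong output shape
        amax = np.amax(np.abs(pred - true))
        mse = np.mean((pred - true) ** 2)
        # Penalize if wildly wrong, or merely well above machine epsilon.
        if amax > 1.0 or mse > 1e-10:
            return -2.0
    return 1.0
```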
np.float64(2.220446049250313e-16)
[ ]

Finally our benchmarking function for speed_check! We shall limit the timer to 10 seconds and do 3 trials.

[ ]
{'median_ns': 195725,
 'mean_ns': 211578,
 'stdev_ns': 30687,
 'exceptions': [],
 'timeouts': 0}
[ ]
{'median_ns': 70811,
 'mean_ns': 69910,
 'stdev_ns': 2926,
 'exceptions': [],
 'timeouts': 0}

We can take the ratio of the timings and apply a negative sign for slower kernels. If the ratio is less than 1 (i.e. faster), we shall invert it!

[ ]
0.02764047958650492
[ ]
3.333333333333333
[ ]
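One way to sketch the scoring rule described above (the exact scaling the notebook uses may differ; this version just makes faster-than-baseline kernels score positively and slower ones negatively):

```python
def speed_reward(kernel_ns, baseline_ns):
    """Turn a timing ratio into a signed reward."""
    ratio = kernel_ns / baseline_ns
    if ratio < 1.0:
        return 1.0 / ratio  # faster than baseline: invert, bigger is better
    return -ratio           # slower than baseline: negative score
```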

We create the dataset, which consists of replicas of our prompt. Remember to set the reasoning effort to low!

[ ]
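The dataset cell is empty in this export; a sketch as a plain list of dicts matching the row shown below (the notebook likely wraps this in a Hugging Face `Dataset`, and the row count of 49 matches the printed length):

```python
prompt = (
    "Create a new fast matrix multiplication function using only native Python code.\n"
    "You are given a list of list of numbers.\n"
    "Output your new function in backticks using the format below:\n"
    "```python\ndef matmul(A, B):\n    return ...\n```"
)

# Replicate the single prompt; `answer` is unused here, and reasoning effort is low.
dataset = [
    {"prompt": [{"role": "user", "content": prompt}],
     "answer": 0,
     "reasoning_effort": "low"}
    for _ in range(49)
]
print(len(dataset))
```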
49
{'prompt': [{'content': 'Create a new fast matrix multiplication function using only native Python code.\nYou are given a list of list of numbers.\nOutput your new function in backticks using the format below:\n```python\ndef matmul(A, B):\n    return ...\n```',
   'role': 'user'}],
 'answer': 0,
 'reasoning_effort': 'low'}

Train the model

Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!

[ ]
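The trainer cell is empty in this export. A configuration-level sketch of what it could look like, following TRL's GRPOConfig/GRPOTrainer API; the hyperparameter values here are illustrative (only `num_generations=2` and 100 steps are suggested by the logs below), and `model`, `tokenizer`, `dataset`, and the reward functions are assumed to come from the earlier cells:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-5,
    num_generations=2,             # completions sampled per prompt
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_prompt_length=256,
    max_completion_length=1024,
    max_steps=100,
)

trainer = GRPOTrainer(
    model=model,                   # the LoRA-wrapped gpt-oss model from above
    processing_class=tokenizer,
    reward_funcs=[function_works, no_cheating, correctness_check, speed_check],
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```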
Unsloth: We now expect `per_device_train_batch_size` * `gradient_accumulation_steps` * `world_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step | Training Loss | reward    | reward_std | completion_length | kl
-----|---------------|-----------|------------|-------------------|---------
1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000
2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000
3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005
[ ]
Unsloth: Switching to float32 training since model cannot work with float16

And let's train the model!

NOTE: A free T4 GPU might sadly take 5 minutes for one generation since it's an old GPU - an A100 or H100 will be much faster!

[ ]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.
def matmul(A, B):
    n=len(A); m=len(B[0]); p=len(B)
    res=[[0]*m for _ in range(n)]
    for i in range(n):
        Ai=A[i]
        for k in range(p):
            aik=Ai[k]
            if aik:
                Bk=B[k]
                for j in range(m):
                    res[i][j] += aik*Bk[j]
    return res
def matmul(A, B):
    ...
def matmul(A, B):
    m = len(A)
    k = len(A[0])
    n = len(B[0])
    # adjust check
    if len(B) != k: raise ValueError
    # initialize result
    result = [[0]*n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            sum_val = 0
            for p in range(k):
                sum_val += A[i][p]*B[p][j]
            result[i][j] = sum_val
def matmul(A, B):
    ...
Unsloth: Will smartly offload gradients to save VRAM!
def matmul(A, B):
    # A: m x n, B: n x p
    m, n = len(A), len(A[0]) # error if A empty
    # verify shape of B
    if len(B) != n: raise ValueError("Incompatible dimensions")
    p = len(B[0])
    # compute result matrix
    result = [[0]*p for _ in range(m)]
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            if aik:
                for j in range(p):
                    result[i][j] += aik*B[k][j]
    return result
def matmul(A, B):
    """
    Multiply two matrices A and B, where A and B are lists of lists.
    The function performs a standard matrix multiplication using plain
    Python loops and integer/float arithmetic without any external libraries.

    Parameters:
        A (list[list[Union[int, float]]]): Left‑hand operand.
        B (list[list[Union[int, float]]]): Right‑hand operand.

    Returns:
        list[list[Union[int, float]]]: Product matrix C = A * B.

    Raises:
        ValueError: If matrix dimensions are incompatible.
    """
    # Validate inputs
    if not A or not B:
        raise ValueError("Input matrices must not be empty.")
    if any(len(row) == 0 for row in A) or any(len(row) == 0 for row in B):
        raise ValueError("Matrix rows must be non‑empty.")
    if len(A[0]) != len(B):
        raise ValueError(
            f"Incompatible dimensions: A is {len(A)}x{len(A[0])} "
            f"but B is {len(B)}x{len(B[0])}."
        )

    m = len(A)           # Rows in A
    n = len(B[0])        # Columns in B
    p = len(B)           # Columns in A = rows in B

    # Allocate result matrix
    C = [[0.0 for _ in range(n)] for _ in range(m)]

    # Compute matrix product
    for i in range(m):
        for j in range(n):
            sum_val = 0.0
            for k in range(p):
                sum_val += A[i][k] * B[k][j]
            C[i][j] = sum_val

    return C
def matmul(A, B):
    return ...
def matmul(A, B):
    if not A or not B:
        return []
    n = len(A)
    m = len(B[0])
    p = len(B)
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        for k in range(p):
            aik = A[i][k]
            if aik:
                for j in range(m):
                    result[i][j] += aik*B[k][j]
    return result
def matmul(A, B):
    # A: m x n, B: n x p
    m = len(A)
    n = len(A[0])  # also len(B)
    p = len(B[0])
    # initialize result matrix with zeros
    result = [[0]*p for _ in range(m)]
    for i in range(m):
        rowA = A[i]
        res_row = result[i]
        for k in range(n):
            aik = rowA[k]
            if aik:
                # then we add aik * B[k][j] to each column j
                rowBk = B[k]
                for j in range(p):
                    res_row[j] += aik * rowBk[j]
    return result
def matmul(A, B):
    m = len(A)
    k = len(A[0])
    n = len(B[0])
    C = [[0]*n for _ in range(m)]
    for i in range(m):
        a_row = A[i]
        Ci = C[i]
        for j in range(n):
            s = 0
            for t in range(k):
                s += a_row[t] * B[t][j]
            Ci[j] = s
    return C
def matmul(A, B):
    n = len(B[0])  # columns in B
    result = []
    for i in range(len(A)):
        row = []
        for j in range(n):
            val = 0
            for k in range(len(A[0])):
                val += A[i][k] * B[k][j]
            row.append(val)
        result.append(row)
    return result
def matmul(A, B):
    # Implementation (Strassen) ...
def matmul(A, B):
    assert len(A[0]) == len(B)
    m = len(A)
    n = len(A[0])
    p = len(B[0])
    result = [[0]*p for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        for k in range(n):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                for j in range(p):
                    result[i][j] += aik * Bk[j]
    return result
def matmul(A, B):
    ...
def matmul(A, B):
    # selects optimal tiling size
    ...
    # uses bit manipulation to accelerate computations
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    assert p == len(A[0])
    # build transposed B for faster row dot
    Bt = [list(col) for col in zip(*B)]
def matmul(A, B):
    return ...
def matmul(A, B):
    ...
def matmul(A, B):
    """
    Multiply two matrices A and B, where A and B are lists of lists.
    A should be of shape (m, n) and B of shape (n, p); the result is an
    (m, p) matrix.

    Example
    -------
        A = [[1, 2], [3, 4]]
        B = [[5, 6], [7, 8]]
        >>> matmul(A, B)
        [[19, 22], [43, 50]]
    """
    # Ensure input is rectangular
    if not A or not B or not A[0] or not B[0]:
        return []

    n_rows_A = len(A)
    n_cols_A = len(A[0])   # number of columns in A
    n_rows_B = len(B)
    n_cols_B = len(B[0])   # number of columns in B

    if n_cols_A != n_rows_B:
        raise ValueError("Number of columns of A must equal number of rows of B")

    # Pre‑allocate the result matrix
    result = [[0] * n_cols_B for _ in range(n_rows_A)]

    # For better cache performance we iterate over columns of B only once
    for i in range(n_rows_A):
        ai = A[i]  # local reference
        for k in range(n_cols_A):
            aik = ai[k]
            if aik == 0:
                continue   # skip zero entries for a tiny speed bump
            bk_row = B[k]
            for j in range(n_cols_B):
                result[i][j] += aik * bk_row[j]
    return result
def matmul(A, B):
    # get dims
    ra = len(A); ca= len(A[0]) if A else 0
    rb = len(B); cb= len(B[0]) if B else 0
    # check dims
    if ca != rb: raise ValueError
    # transpose B
    B_T = [[B[i][j] for i in range(rb)] for j in range(cb)]
    # compute result
    return [[sum(a*b for a,b in zip(A[i], B_T[j])) for j in range(cb)] for i in range(ra)]
def matmul(A, B):
    m = len(A)
    n = len(B[0])
    p = len(B)
    # B transpose to improve locality
    B_T = list(zip(*B))
    out = [[0]*n for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        for j in range(n):
            sum = 0
            Bj = B_T[j]
            for k in range(p):
                sum += Ai[k]*Bj[k]
            out[i][j] = sum
    return out
def matmul(A, B): return ...
def matmul(A, B):
    m = len(A)
    n = len(B[0])
    p = len(A[0])
    assert p == len(B)
    result = [[0]*n for _ in range(m)]
    for i in range(m):
        for k in range(p):
            aik = A[i][k]
            for j in range(n):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    # A: list of lists (m x n), B: list of lists (n x p)
    # returns list of lists (m x p)
    m, n = len(A), len(A[0])
    p = len(B[0])
    # verify dimensions
    # compute product
    result = [[0] * p for _ in range(m)]
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            for j in range(p):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    # validate
    m = len(A)
    n = len(B[0])  # columns of result
    # B must be of dimension (len(A[0]) x len(B[0]))
    BT = list(zip(*B))  # Transposed B
    result = [[sum(x*y for x, y in zip(row, col)) for col in BT] for row in A]
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    C = [[0]*m for _ in range(n)]
    for i in range(n):
        ai = A[i]
        ci = C[i]
        for k in range(p):
            aik = ai[k]
            if aik:
                bk = B[k]
                for j in range(m):
                    ci[j] += aik * bk[j]
    return C
def matmul(A, B):
    return ...
def matmul(A, B):
    """
    Multiply two matrices represented as lists of lists using only native Python code.

    Parameters
    ----------
    A : list of list of numbers
        The left matrix (m x n).
    B : list of list of numbers
        The right matrix (n x p).

    Returns
    -------
    list of list of numbers
        The product matrix (m x p).
    """
    # Sanity check: A must be compatible with B.
    if not A or not B:
        return []

    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    if n != n2:
        raise ValueError("Inner dimensions of A and B must match")

    # Pre‑allocate the result matrix with zeros.
    C = [[0] * p for _ in range(m)]

    # Standard triple‑loop algorithm
    for i in range(m):
        ai = A[i]
        ci = C[i]
        for k in range(n):
            aik = ai[k]
            if aik == 0:
                continue
            bk = B[k]         # Row of B affected
            for j in range(p):
                ci[j] += aik * bk[j]

    return C
def matmul(A, B):
    # Check dimensions
    if not A: return []
    m, n = len(A), len(A[0])  # number of rows in A and columns in A
    if not B or len(B) != n: raise ValueError("Size mismatch")
    p = len(B[0])  # number of columns in B
    # Precompute columns of B
    B_cols = [[B[row][col] for row in range(n)] for col in range(p)]
    result = [[sum(a_elem * b_elem for a_elem, b_elem in zip(row, col)) for col in B_cols] for row in A]
    return result
def matmul(A, B):
    ...
def matmul(A, B):
    n = len(A)
    m = len(B[0]) # number of columns of B
    common = len(B)
    # initialize result matrix
    res = [[0]*m for _ in range(n)]
    for i in range(n):
        for k in range(common):
            aik = A[i][k]
            for j in range(m):
                res[i][j] += aik * B[k][j]
    return res
def matmul(A, B):
    # A is m x n, B is n x p,
    # output shape m x p
    m = len(A)
    n = len(A[0]) if A else 0
    # check B dims
    if n == 0:
        return []
    p = len(B[0])

    # preallocate result
    C = [[0]*p for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(n):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                # if aik != 0 multiply
                for j in range(p):
                    Ci[j] += aik * Bk[j]
    return C
def matmul(A, B):
    """
    Multiply two matrices A and B.

    Parameters
    ----------
    A : list[list[int | float]]
        The left matrix (m × k) to be multiplied.
    B : list[list[int | float]]
        The right matrix (k × n) to be multiplied.

    Returns
    -------
    C : list[list[int | float]]
        Result of the product A @ B  (m × n matrix).
    """
    # Basic checks
    if not A or not B or not B[0]:
        return []

    m, k1 = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    if k1 != k2:
        raise ValueError("Inner dimensions of matrices must agree")

    # Initialize the result matrix with zeros
    C = [[0.0] * n for _ in range(m)]

    # A naive but well‑structured implementation that is reasonably fast
    for i in range(m):
        ai = A[i]
        ci = C[i]
        for j in range(n):
            s = 0.0
            for p in range(k1):
                s += ai[p] * B[p][j]
            ci[j] = s

    return C
def matmul(A, B):
    """
    Multiply two matrices A and B using only plain Python code (no external libraries).
    `A` and `B` are expected to be lists of lists, where each inner list represents a row.
    This implementation uses a simple, straightforward algorithm with small optimisations
    such as caching dimensions and avoiding repeated attribute lookups inside loops.

    Note: This function expects that the number of columns in `A` matches the number of rows in `B`.
    """
    # Validate dimensions
    n_rows_A, n_cols_A = len(A), len(A[0])
    n_rows_B, n_cols_B = len(B), len(B[0])
    if n_cols_A != n_rows_B:
        raise ValueError("Incompatible dimensions for matrix multiplication")

    # Pre‑allocate result matrix
    result = [[0] * n_cols_B for _ in range(n_rows_A)]

    # Transpose B to improve cache locality
    B_transposed = [[B[row][col] for row in range(n_rows_B)] for col in range(n_cols_B)]

    # Perform multiplication (standard algorithm)
    for i in range(n_rows_A):
        row_A = A[i]
        for j in range(n_cols_B):
            sum_val = 0
            row_B = B_transposed[j]
            for k in range(n_cols_A):
                sum_val += row_A[k] * row_B[k]
            result[i][j] = sum_val

    return result
def matmul(A, B):
    # Makes sure the matrices can be multiplied
    if len(A[0]) != len(B):
        raise ValueError("Number of columns in A must equal number of rows in B")
    
    # Initialise result matrix with zeros
    rows_A, cols_B, cols_A = len(A), len(B[0]), len(A[0])
    C = [[0] * cols_B for _ in range(rows_A)]
    
    # Standard O(n³) matrix multiplication
    for i in range(rows_A):
        for k in range(cols_A):
            aik = A[i][k]
            # Skip if a[i][k] is zero to save some work
            if aik == 0:
                continue
            for j in range(cols_B):
                C[i][j] += aik * B[k][j]
    
    return C
def matmul(A, B):
    try:
        _lenA = len(A)
        _lenB = len(B)
        if _lenA == 0 or _lenB == 0:
            return []
        m = len(A[0])
        if any(len(row)!=m for row in A):
            raise ValueError
        n = len(B[0])
        if any(len(row)!=n for row in B):
            raise ValueError
        if m != len(B):
            raise ValueError("Incompatible dimensions")
    except Exception:
        raise
    B_T = [tuple(col) for col in zip(*B)]  # transpose
    res = [ [0]*n for _ in range(_lenA) ]
    for i in range(_lenA):
        rowA = A[i]
        result_row = res[i]
        for j in range(n):
            colB = B_T[j]
            s = 0
            for k in range(m):
                s += rowA[k] * colB[k]
            result_row[j] = s
    return res
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    res = [[0]*m for _ in range(n)]
    # Transpose B for faster column access
    B_T = [list(col) for col in zip(*B)]
    for i in range(n):
        Ai = A[i]
        for j in range(m):
            res[i][j] = sum(a*b for a,b in zip(Ai, B_T[j]))
    return res
def matmul(A, B):
    # ensure dimensions compatible
    n = len(A)
    m = len(A[0])
    p = len(B[0])
    # a quick check for compatibility
    if len(B) != m:
        raise ValueError("Incompatible matrix dimensions.")
    # initialize result matrix
    result = [[0]*p for _ in range(n)]
    # naive multiplication
    for i in range(n):
        for j in range(p):
            sum_val = 0
            for k in range(m):
                sum_val += A[i][k] * B[k][j]
            result[i][j] = sum_val
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    common = len(B)
    # compute product
    res = [[0]*m for _ in range(n)]
    for i in range(n):
        ai = A[i]
        row_res = res[i]
        for k in range(common):
            aik = ai[k]
            bk = B[k]
            for j in range(m):
                row_res[j] += aik * bk[j]
    return res
def matmul(A, B):
    # Check input
    nA = len(A)
    mA = len(A[0]) if A else 0
    nB = len(B)
    mB = len(B[0]) if B else 0
    # dims
    if mA != nB:
        raise ValueError("Incompatible dimensions.")
    # result dims
    result = [[0]*mB for _ in range(nA)]
    for i in range(nA):
        for j in range(mB):
            s = 0
            for k in range(mA):
                s += A[i][k] * B[k][j]
            result[i][j] = s
    return result
def matmul(A, B):
    """
    Multiply two matrices A and B (given as lists of lists).
    
    Parameters
    ----------
    A : list[list[float]]
        The first matrix (m x n).
    B : list[list[float]]
        The second matrix (n x p).
    
    Returns
    -------
    C : list[list[float]]
        The product matrix (m x p).
    
    Raises
    -------
    ValueError
        If the inner dimensions don't match.
    """
    # Basic sanity check on dimensions
    if not A or not B or not A[0] or not B[0]:
        raise ValueError("Matrices must have non‑empty dimensions.")
    n = len(A[0])           # number of columns in A
    if any(len(row) != n for row in A):
        raise ValueError("All rows in A must have the same length.")
    if len(B) != n:
        raise ValueError("Number of columns in A must equal number of rows in B.")
    
    m = len(A)              # number of rows in A
    p = len(B[0])           # number of columns in B
    
    # Pre‑compute columns of B for quicker access
    B_t = [tuple(col) for col in zip(*B)]   # transpose: each column is a tuple

    # Compute each entry of the product, using Python's max‑speed loops
    C = [[sum(a * b for a, b in zip(A[i], col)) for col in B_t] for i in range(m)]
    return C
def matmul(A, B):
    """
    Multiply two matrices A and B.
    A and B are lists of lists of numbers (i.e. 2D arrays).
    Returns the result as a new list of lists.
    """
    # Ensure matrix dimensions are compatible
    if len(A[0]) != len(B):
        raise ValueError("Incompatible matrix dimensions for multiplication.")
    
    # Initialize the result matrix with zeros
    result = [[0]*len(B[0]) for _ in range(len(A))]
    
    # Perform multiplication
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(A[0])):
                result[i][j] += A[i][k] * B[k][j]
    
    return result
def matmul(A, B):
    # Verify that matrices have compatible dimensions
    if not A or not B or len(A[0]) != len(B):
        raise ValueError("Number of columns in A must equal number of rows in B.")
    
    # Initialize the result matrix (size: rows of A × columns of B)
    n_rows_a = len(A)
    n_cols_b = len(B[0])
    result = [[0] * n_cols_b for _ in range(n_rows_a)]
    
    # Iterate through rows of A and columns of B, accumulating the dot products
    for i in range(n_rows_a):
        for j in range(n_cols_b):
            sum_val = 0
            for k in range(len(A[0])):
                sum_val += A[i][k] * B[k][j]
            result[i][j] = sum_val
    
    return result
def matmul(A, B):
    # assume dims: A: n x m, B: m x p
    n = len(A)
    m = len(A[0]) if A else 0
    p = len(B[0]) if B else 0
    # initialize result matrix
    C = [[0]*p for _ in range(n)]
    for i in range(n):
        rowA = A[i]
        rowC = C[i]
        for k in range(m):
            a = rowA[k]
            if a != 0:  # optional optimization
                colB = B[k]
                for j in range(p):
                    rowC[j] += a * colB[j]
    return C
def matmul(A, B):
    n = len(A)
    Bt = [[B[k][j] for k in range(n)] for j in range(n)]
    res = [[0] * n for _ in range(n)]
    for i in range(n):
        row = A[i]
        for j in range(n):
            res[i][j] = sum(row[k] * Bt[j][k] for k in range(n))
    return res
def matmul(A, B):
    return ...
def matmul(A, B): return ...
def matmul(A, B):
    # Number of rows in A, number of columns in A (and rows in B), number of columns in B
    n, m, p = len(A), len(A[0]), len(B[0])

    # Prepare the result matrix with zeros
    C = [[0] * p for _ in range(n)]

    # Perform the standard O(n*m*p) multiplication
    for i in range(n):
        for k in range(m):
            aik = A[i][k]          # element in A at row i, column k
            for j in range(p):
                C[i][j] += aik * B[k][j]

    return C
def matmul(A, B):
    """
    Multiply two matrices A and B where A is an m×n matrix and B is an n×p matrix.
    Returns the resulting m×p matrix.
    """
    # Number of rows in A
    rows_a = len(A)
    # Number of columns in A (required to multiply with B)
    cols_a = len(A[0]) if A else 0
    # Number of columns in B
    cols_b = len(B[0]) if B else 0
    
    # Quick check a few edge cases
    if rows_a == 0 or cols_a == 0 or cols_b == 0:
        return []

    # Ensure that dimension compatibility holds
    if len(B) != cols_a:
        raise ValueError("Number of columns in A must equal number of rows in B")

    # Prepare the result matrix             
    result = [[0.0 for _ in range(cols_b)] for _ in range(rows_a)]

    # Matrix multiplication
    for i in range(rows_a):
        for k in range(cols_a):
            aik = A[i][k]
            for j in range(cols_b):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B): return ...
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    # assert len(A[0]) == p
    C = [[0]*m for _ in range(n)]
    # iterate
    for i in range(n):
        Ci = C[i]
        Ai = A[i]
        for k in range(p):
            a = Ai[k]
            if a != 0:
                Bk = B[k]
                for j in range(m):
                    Ci[j] += a * Bk[j]
    return C
def matmul(A, B):
    # A: list of list, columns: n x m
    # B: list of list, dimensions: m x p
    # returns C: n x p
    n = len(A)
    m = len(A[0])
    p = len(B[0])
    # initialize result matrix
    C = [[0]*p for _ in range(n)]
    for i in range(n):
        ai = A[i]
        for k in range(m):
            aik = ai[k]
            if aik:
                bk = B[k]
                for j in range(p):
                    C[i][j] += aik * bk[j]
    return C
def matmul(A, B):
    """
    Multiply two matrices A and B using only native Python code.
    
    Parameters:
    - A: List of lists where each sublist represents a row of matrix A.
    - B: List of lists where each sublist represents a row of matrix B.
    
    Returns:
    - Resulting matrix as a list of lists.
    """
    if not A or not B:
        return []

    n_rows_A = len(A)
    n_cols_A = len(A[0]) if n_rows_A > 0 else 0
    n_rows_B = len(B)
    n_cols_B = len(B[0]) if n_rows_B > 0 else 0

    # Ensure the matrices can be multiplied
    if n_cols_A != n_rows_B:
        raise ValueError("Number of columns in A must equal number of rows in B")

    # Initialize result matrix with zeros
    result = [[0] * n_cols_B for _ in range(n_rows_A)]

    # Perform matrix multiplication
    for i in range(n_rows_A):
        for k in range(n_cols_A):
            aik = A[i][k]
            if aik == 0:
                continue
            for j in range(n_cols_B):
                result[i][j] += aik * B[k][j]

    return result
def matmul(A, B):
    """
    Multiply two matrices A and B using vanilla Python.

    Parameters
    ----------
    A : list of lists of numbers, shape (m, k)
        Left matrix.
    B : list of lists of numbers, shape (k, n)
        Right matrix.

    Returns
    -------
    C : list of lists of numbers, shape (m, n)
        The product A @ B.
    """
    m = len(A)              # rows of A
    k = len(A[0])           # columns of A / rows of B
    n = len(B[0])           # columns of B

    # Initialize the residual matrix with zeros.
    C = [[0]*n for _ in range(m)]

    # Perform the multiplication using the standard triple nested loop.
    for i in range(m):
        Ai = A[i]           # row of A
        Ci = C[i]           # row of C we will fill
        for l in range(k):     # over columns of A and rows of B
            a = Ai[l]
            if a != 0:          # skip zero terms to reduce work
                Bl = B[l]
                for j in range(n):
                    Ci[j] += a * Bl[j]
    return C
def matmul(A, B):
    """
    Multiply two matrices A and B (given as lists of lists) without using external libraries.
    """
    # Basic dimension checks
    if not A or not B:
        return []
    m, p = len(A), len(A[0])         # A is m×p
    assert len(B) == p                # B must be p×n
    n = len(B[0])                     # n columns in B

    # Transpose B to improve cache locality
    B_T = list(zip(*B))               # B^T is n×p, each element is a tuple

    result = [[0]*n for _ in range(m)]

    for i in range(m):
        row_A = A[i]
        row_res = result[i]
        for k in range(p):          # iterate over inner dimension
            aik = row_A[k]
            if aik:
                col_B = B_T[k]       # k-th column of B
                for j in range(n):
                    row_res[j] += aik * col_B[j]
    return result
def matmul(A, B):
    """Multiply two matrices A and B.

    The function assumes that A and B are compatible for matrix multiplication,
    i.e. if A is MxK then B must be KxN. The result is an MxN matrix.

    Parameters
    ----------
    A : list[list[float]]
        MxK matrix.
    B : list[list[float]]
        KxN matrix.

    Returns
    -------
    list[list[float]]
        Product matrix of shape MxN.
    """
    # Grab sizes locally for speed
    m = len(A)           # number of rows of A
    k = len(A[0]) if A else 0  # number of columns in A (inner dimension)
    n = len(B[0]) if B else 0  # number of columns in B

    # Prepare the result matrix.
    # Use a list of lists pre‑filled with zeros.
    result = [[0.0] * n for _ in range(m)]

    # Basic algorithm – triple nested loop.
    for i in range(m):
        Ai = A[i]
        Ri = result[i]
        # Pre‑localize B for a bit of speed.
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += Ai[p] * B[p][j]
            Ri[j] = s

    return result
def matmul(A, B):
    """
    Multiply two matrices A and B where A is mxk and B is kxn
    using simple Python list-of-lists and a few optimizations.
    Arguments:
      A: list of lists, where A[i][j] is the entry of row i and column j
      B: list of lists, where B[i][j] is the entry of row i and column j
    Returns:
      C: the resulting matrix (mxn)
    """
    m, k = len(A), len(A[0])  # size of A: m rows, k columns
    k2, n = len(B), len(B[0])  # size of B: k' rows, n columns
    if k != k2:
        raise ValueError("A and B have incompatible dimensions")

    # To speed up innermost loops we transpose B.
    B_T = [[B[row][col] for row in range(k)] for col in range(n)]

    # Initialize output matrix
    C = [[0] * n for _ in range(m)]

    for i in range(m):
        a_row = A[i]
        for j in range(n):
            # compute dot product of A[i] and B_T[j] as one row
            acc = 0
            for t in range(k):
                acc += a_row[t] * B_T[j][t]
            C[i][j] = acc

    return C
def matmul(A, B):
    """Fast matrix multiplication using only native Python code.

    The function expects A (m x n) and B (n x p) to be lists of lists.
    It uses a single loop to compute the result efficiently.

    Time complexity: O(m*n*p) in the worst case, but many Python
    implementations can handle small matrices quickly.
    """
    m, n = len(A), len(A[0])
    nB, p = len(B), len(B[0])
    if n != nB:
        raise ValueError("A's column count must equal B's row count")

    # Initialize a zero matrix for the result
    C = [[0.0] * p for _ in range(m)]
    
    # Main multiplication loop
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            if aik == 0:
                continue
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C
def matmul(A, B):
    m, n = len(A), len(A[0])  # n must equal len(B)
    p = len(B[0])
    return [[sum(a*b for a,b in zip(row, col)) for col in zip(*B)] for row in A]
def matmul(A, B):
    # check dimensions
    n = len(A)
    m = len(B[0])
    # Transpose B
    B_T = list(zip(*B))
    result = [ [0]*m for _ in range(n) ]
    for i, row in enumerate(A):
        for j, col in enumerate(B_T):
            result[i][j] = sum(a*b for a,b in zip(row, col))
    return result
def matmul(A, B):
    n = len(A)
    m = len(A[0])
    p = len(B[0])
    result = [[0] * p for _ in range(n)]
    for i in range(n):
        ai = A[i]
        for k in range(m):
            aik = ai[k]
            if aik:
                bj = B[k]
                for j in range(p):
                    result[i][j] += aik * bj[j]
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0]) # columns of B
    p = len(B) # rows of B
    # ensure A's columns equal B's rows
    assert len(A[0]) == len(B), "Incompatible dimensions"
    result = [[0] * m for _ in range(n)]
    for i in range(n):
        for k in range(len(B)):
            aik = A[i][k]
            # Multiply aik with each value in row k of B
            for j in range(m):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    # Validate shapes
    n = len(A)
    m = len(A[0]) # width of A
    p = len(B[0])
    # maybe check that all rows are same length
    # Also, ensure len(B) == m
    if len(B) != m:
        raise ValueError('Incompatible matrix shapes for multiplication.')
    # Use maybe list comprehension for each element
    result = [[sum(A[i][k]*B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]
    return result
def matmul(A, B): return ...
def matmul(A, B):
    """
    Multiply two matrices A and B provided as lists of lists.
    A and B should be compatible for matrix multiplication
    (number of columns of A equals number of rows of B).

    Args:
        A: List of rows, where each row is an iterable of numbers and
           all rows have the same length.
        B: Same format.

    Returns:
        A new list of lists containing the product A * B.
    """
    # Validate dimensions
    if not A or not B:
        raise ValueError("Input matrices cannot be empty")
    n_rows_a = len(A)
    n_cols_a = len(A[0])
    n_rows_b = len(B)
    n_cols_b = len(B[0])

    if n_cols_a != n_rows_b:
        raise ValueError("Incompatible dimensions for matrix multiplication")

    # Transpose B once to improve cache locality
    B_T = [[B[row][col] for row in range(n_rows_b)] for col in range(n_cols_b)]

    # Allocate result matrix
    result = [[0] * n_cols_b for _ in range(n_rows_a)]

    # Perform multiplication
    for i in range(n_rows_a):
        row_a = A[i]
        row_res = result[i]
        for j in range(n_cols_b):
            col_b = B_T[j]
            s = 0
            # dot product of row_a and col_b
            for k in range(n_cols_a):
                s += row_a[k] * col_b[k]
            row_res[j] = s

    return result
def matmul(A, B):
    n, m = len(A), len(B[0])
    # assume A rows by k, B columns by k
    # B's columns: B_col = [list of column values]
    B_T = list(zip(*B))
    C = []
    for row in A:
        newrow = []
        for col in B_T:
            newrow.append(sum(a*b for a,b in zip(row, col)))
        C.append(newrow)
    return C
def matmul(A, B):
    # Check dimensions
    m, n = len(A), len(A[0]); p, q = len(B), len(B[0])
    if n != p:
        raise ValueError("Incompatible dimensions.")
    result = [[0]*q for _ in range(m)]
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            for j in range(q):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    """Multiply two matrices A (n x m) and B (m x p) using only native Python."""
    n = len(A)
    m = len(A[0]) if A else 0
    p = len(B[0]) if B else 0

    # Verify dimensions
    if not A or not B or len(B) != m:
        raise ValueError("Incompatible matrix dimensions.")

    # Pre-allocate result matrix
    result = [[0] * p for _ in range(n)]

    for i in range(n):
        Ai = A[i]
        res_row = result[i]
        for k in range(m):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                for j in range(p):
                    res_row[j] += aik * Bk[j]
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    C = [[0.0]*m for _ in range(n)]
    for i in range(n):
        a_row = A[i]
        for k in range(p):
            aik = a_row[k]
            if aik!=0:
                B_col = B[k]
                for j in range(m):
                    C[i][j] += aik * B_col[j]
    return C
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    # B must have dimension matching
    # Let's compute transpose of B for cache-friendly.
    BT = list(map(list, zip(*B)))  # transpose B
    result = [[sum(a*b for a,b in zip(rowA, colB)) for colB in BT] 
               for rowA in A]
    return result
def matmul(A, B):
    if not A or not B: return []
    m, n = len(A), len(A[0])
    p = len(B[0])
    # check B's rows equal to A's columns
    if len(B) != n: raise ValueError("...")

    result = [[0.0]*p for _ in range(m)]
    for i in range(m):
        ai = A[i]
        for k, a in enumerate(ai):
            if a:
                bk = B[k]
                for j, b in enumerate(bk):
                    result[i][j] += a*b
    return result
def matmul(A, B):
    n = len(A)
    m = len(A[0])
    p = len(B[0])
def matmul(A, B): return ...
def matmul(A, B):    return ...
def matmul(A, B):
    n = len(A)
    C = [[0]*n for _ in range(n)]
    # naive loop
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
def matmul(A, B):
    m=len(A)
    n=len(A[0])
    p=len(B[0])
    C=[[0]*p for _ in range(m)]
    for i in range(m):
        for k in range(n):
            for j in range(p):
                C[i][j]+=A[i][k]*B[k][j]
    return C
def matmul(A, B):
    # A: m x k, B: k x n -> result m x n
    m = len(A)
    k = len(A[0]) if A else 0
    n = len(B[0]) if B else 0
    # Precompute B's column representation
    Bt = list(zip(*B))  # columns of B as tuples
    return [[sum(a*b for a,b in zip(row, col)) for col in Bt] for row in A]
def matmul(A, B):
    import numpy as np
    A_arr = np.array(A)
    B_arr = np.array(B)
    return (A_arr @ B_arr).tolist()
def matmul(A, B):
    n=len(A); m=len(B[0]); p=len(B) # that is dimension of B's rows
    # use zeros list for result
    result=[[0]*m for _ in range(n)]
    for i in range(n):
        a_row=A[i]
        for j in range(m):
           s=0
           for k in range(p):
                s+=a_row[k]*B[k][j]
           result[i][j]=s
    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    """
    Multiply two matrices A and B.
    A is (m × n) and B is (n × p), both represented as lists of lists.
    Returns the product matrix of dimension (m × p).
    This routine is written purely in Python and uses a little bit of
    pre‑processing to keep memory accesses cache‑friendly.
    """
    # dimensions
    m, n = len(A), len(A[0])
    nB, p = len(B), len(B[0])
    assert n == nB, "Inner dimensions must agree"

    # transpose B to keep the inner loop cache‑friendly
    B_T = [[B[k][j] for k in range(n)] for j in range(p)]

    # prepare result matrix
    C = [[0] * p for _ in range(m)]

    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(n):
            aik = Ai[k]
            if aik == 0:
                continue
            Bk = B_T[k]
            for j in range(p):
                Ci[j] += aik * Bk[j]

    return C
def matmul(A, B):
    """
    Multiply two matrices A and B using only native Python code.

    Parameters:
        A: list of lists, shape (m, n)
        B: list of lists, shape (n, p)

    Returns:
        C: list of lists, shape (m, p)
    """
    # check dimensions
    m, n = len(A), len(A[0])
    if len(B) != n:
        raise ValueError("Inner matrix dimensions do not match")
    p = len(B[0])

    # pre‑allocate the result matrix
    C = [[0] * p for _ in range(m)]

    # multiply
    for i in range(m):
        ai = A[i]
        for j in range(p):
            s = 0
            # iterate over the shared dimension
            for k in range(n):
                s += ai[k] * B[k][j]
            C[i][j] = s
    return C
def matmul(A, B):
    m, n_A = len(A), len(A[0])
    n_B, p = len(B), len(B[0])
    assert n_A == n_B  # The matrices must have compatible dimensions
    
    # Preallocate the output matrix
    result = [[0.0] * p for _ in range(m)]
    
    for i in range(m):
        for j in range(p):
            # Compute the dot product of row i and column j
            sum_val = 0.0
            for k in range(n_A):  # or n_B
                sum_val += A[i][k] * B[k][j]
            result[i][j] = sum_val
    
    return result
def matmul(A, B):
    n = len(A)
    # Assume A and B are both n x n
    # If n=1 return element-wise product
    # else compute.
def matmul(A, B):
    """
    Multiplies two matrices A and B.

    Parameters
    ----------
    A : list of lists
        First matrix, with dimensions m x n.
    B : list of lists
        Second matrix, with dimensions n x p.

    Returns
    -------
    list of lists
        Resulting matrix of dimensions m x p.
    """
    # Number of rows in A and columns in B
    m = len(A)
    n = len(A[0])   # Shared dimension
    p = len(B[0])   # Columns in B

    # Prepare result matrix filled with zeros
    C = [[0 for _ in range(p)] for _ in range(m)]

    # Standard triple‑loop multiplication
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C
def matmul(A, B):
    n_rows_A = len(A)
    n_cols_A = len(A[0]) if A else 0
    n_rows_B = len(B)
    n_cols_B = len(B[0]) if B else 0
    if n_cols_A != n_rows_B:
        raise ValueError("Incompatible dimensions")
    C = [[0]*n_cols_B for _ in range(n_rows_A)]
    for i in range(n_rows_A):
        Ai = A[i]
        Ci = C[i]
        for k in range(n_cols_A):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                for j in range(n_cols_B):
                    Ci[j] += aik * Bk[j]
    return C
def matmul(A, B):
    m, n = len(A), len(B[0]) 
    # etc...
    # compute product
    result = [[0]*n for _ in range(m)]
    for i in range(m):
        for k in range(len(A[0])):  # iterate columns of A
            aik = A[i][k]
            for j in range(n):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    import math
    n = len(A)
    m = len(A[0])
    # Validate B shape: m==len(B)
    assert m == len(B), "Incompatible mat shape"
    # optionally convert to square by padding with zeros for Strassen
def matmul(A, B):
    n = len(A)
    m = len(A[0])  # number of columns in A
    p = len(B[0])
    # Pre-allocate result matrix
    C = [[0]*p for _ in range(n)]
    for i in range(n):
        row = A[i]
        for j in range(p):
            s = 0
            for k in range(m):
                s += row[k]*B[k][j]
            C[i][j] = s
    return C
def matmul(A, B):
    n = len(A)
    m = len(B[0])  # columns of B
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        for k in range(len(A[i])):
            aik = A[i][k]
            if aik != 0:
                for j in range(m):
                    result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    n, m = len(A), len(A[0])
    o, p = len(B), len(B[0])
    assert m == o
    # Optionally use comprehension:
    result = [[sum(A[i][k]*B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]
    return result
def matmul(A, B):
    """
    Multiply two matrices A and B.

    Parameters
    ----------
    A: list of lists (rows of A)
    B: list of lists (rows of B)

    Returns
    -------
    Resulting matrix as a list of lists.
    """
    # Determine dimensions
    m = len(A)           # number of rows in A
    k = len(A[0]) if A else 0    # number of columns in A
    n = len(B[0])     # number of columns in B

    # Initialize result matrix
    C = [[0] * n for _ in range(m)]

    # Standard triple-loop matrix multiplication
    for i in range(m):
        for j in range(n):
            s = 0
            for l in range(k):
                s += A[i][l] * B[l][j]
            C[i][j] = s
    return C
def matmul(A, B):
    # sizes: A is n×p , B is p×m
    n, p = len(A), len(A[0])          # number of rows of A, columns of A
    mp = len(B[0])                    # number of columns of B
    # prepare result matrix n×m and fill it
    R = [[0] * mp for _ in range(n)]
    for i in range(n):
        for j in range(mp):
            s = 0
            for k in range(p):
                s += A[i][k] * B[k][j]
            R[i][j] = s
    return R
def matmul(A, B):
    return ...
def matmul(A, B):
    n, m = len(A), len(B[0])
    ...
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0
            for k in range(p):
                s += A[i][k] * B[k][j]
            result[i][j] = s
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        for k in range(p):
            aik = A[i][k]
            if aik:
                row = result[i]
                for j in range(m):
                    row[j] += aik * B[k][j]
    return result
def matmul(A, B):
    ...
def matmul(A, B):
    if not A or not B: return []
    m, n1 = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    assert n1 == n2, "Dimensions mismatch."
    res = [[0]*p for _ in range(m)]
    for i in range(m):
        for k in range(n1):
            a = A[i][k]
            if a:
                for j in range(p):
                    res[i][j] += a*B[k][j]
    return res
def matmul(A, B):
    """
    Multiply two matrices A and B using a fast algorithm (divide‑and‑conquer).
    The algorithm recursively multiplies submatrices. This implementation
    does not use any external libraries (no numpy etc.) and only relies on
    pure Python data structures.

    Parameters
    ----------
    A : list of list of numbers
        The left matrix.
    B : list of list of numbers
        The right matrix.

    Returns
    -------
    list of list of numbers
        The product matrix A * B.

    The function assumes that A's columns equal B's rows.
    """
    # Determine matrix dimensions
    n = len(A)                # rows of A
    m = len(B[0])             # columns of B
    k = len(B)                # columns of A / rows of B

    # A and B must be compatible
    if not all(len(row) == k for row in A):
        raise ValueError("Incompatible dimensions for matrix multiplication")

    # Base case: if matrices are small, compute directly
    if n <= 1 or k <= 1 or m <= 1:
        # Direct quadratic multiplication
        result = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                for s in range(k):
                    result[i][j] += A[i][s] * B[s][j]
        return result

    # Divide matrices into quadrants
    # Helper function to split matrix into four sub‑matrices
    def split(mat):
        half_rows = len(mat) // 2
        half_cols = len(mat[0]) // 2
        top_left = [row[:half_cols] for row in mat[:half_rows]]
        top_right = [row[half_cols:] for row in mat[:half_rows]]
        bottom_left = [row[:half_cols] for row in mat[half_rows:]]
        bottom_right = [row[half_cols:] for row in mat[half_rows:]]
        return top_left, top_right, bottom_left, bottom_right

    # Split A
    A11, A12, A21, A22 = split(A)
    # Split B
    B11, B12, B21, B22 = split(B)

    # Recursive multiplication for each submatrix product
    C11 = matmul(A11, B11)
    C12 = matmul(A12, B12)
    C21 = matmul(A21, B21)
    C22 = matmul(A22, B22)

    # Combine the submatrices into a single result
    result = []
    for i in range(len(C11)):
        result.append(C11[i] + C12[i])   # merge rows from left and right halves
    for i in range(len(C21)):
        result.append(C21[i] + C22[i])

    return result
def matmul(A, B):
    if not A or not B or not B[0]:
        return []
    m, p = len(A), len(A[0])
    p2, n = len(B), len(B[0])
    if p != p2:
        raise ValueError("A's columns must equal B's rows")
    # initialize result matrix
    result = [[0]*n for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        for k in range(p):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                for j in range(n):
                    result[i][j] += aik * Bk[j]
    return result
def matmul(A, B):
    # A is m x n, B is n x p
    if not A or not B: return []
    n = len(A[0])
    assert all(len(row)==n for row in A)
    assert all(len(row)==len(A[0]) for row in B)
    m=len(A); p=len(B[0])
    result=[[0]*p for _ in range(m)]
    for i in range(m):
        for k in range(n):
            a=A[i][k]
            if a:
                # inner multiplication contributed to each column j
                for j in range(p):
                    result[i][j] += a*B[k][j]
    return result
def matmul(A, B):
    ...
def matmul(A, B):
    ...
def matmul(A, B):
    # Sanity check for proper dimensions
    n = len(A)          # Rows in A
    m = len(A[0]) if A else 0   # Columns in A (and rows in B)
    p = len(B[0]) if B else 0   # Columns in B

    # We need A's columns to equal B's rows
    if m != len(B):
        raise ValueError("Dimension mismatch: A's columns must equal B's rows")

    # Initialize the result matrix
    result = [[0] * p for _ in range(n)]

    # Perform the multiplication
    for i in range(n):
        for k in range(m):
            aik = A[i][k]       # Cache the value to avoid repeated indexing
            for j in range(p):
                result[i][j] += aik * B[k][j]

    return result
def matmul(A, B):
    # asserts
def matmul(A, B):
    ... # code
TrainOutput(global_step=100, training_loss=3.8481087469079966e-05, metrics={'train_runtime': 46507.4754, 'train_samples_per_second': 0.004, 'train_steps_per_second': 0.002, 'total_flos': 0.0, 'train_loss': 3.8481087469079966e-05})

Inference

Now let's try the model we just trained!

[ ]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```<|end|><|start|>assistant<|channel|>analysis<|message|>We need to output a function that multiplies two matrices, using only native Python. Maybe naive algorithm. Provide code.<|end|><|start|>assistant<|channel|>final<|message|>```python
def matmul(A, B):
    """
    Multiply two matrices A and B (as lists of lists) using the standard O(n^3) algorithm.
    Assumes that A and B are compatible for multiplication (i.e., len(A[0]) == len(B)).
    """
    # Dimensions
    m = len(A)          # number of rows in A
    n = len(B[0])       # number of columns in B
    p = len(B)          # number of rows in B

    # Initialize result matrix with zeros
    result = [[0 for _ in range(n)] for _ in range(m)]

    # Perform multiplication
    for i in range(m):
        for k in range(p):
            aik = A[i][k]
            for j in range(n):
                result[i][j] += aik * B[k][j]

    return result
```<|return|>
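As a quick sanity check (not part of the notebook's reward pipeline), we can run the function the model generated above on a small example and compare it against a hand-computed product:

```python
# The matmul function generated by the trained model, copied from the output above.
def matmul(A, B):
    m = len(A)          # number of rows in A
    n = len(B[0])       # number of columns in B
    p = len(B)          # number of rows in B

    # Initialize result matrix with zeros
    result = [[0 for _ in range(n)] for _ in range(m)]

    # Accumulate A[i][k] * B[k][j] into result[i][j]
    for i in range(m):
        for k in range(p):
            aik = A[i][k]
            for j in range(n):
                result[i][j] += aik * B[k][j]

    return result

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

The output matches the expected 2x2 product, so the generated kernel is at least correct on this case; the reward functions in this notebook additionally time it and test it on larger random matrices.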

Saving to float16 or MXFP4 for vLLM

We also support saving to float16 directly. Select merged_16bit for float16, or mxfp4 for MXFP4 (OpenAI's native GPT-OSS precision). Saving only the LoRA adapters is also supported as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal token at https://huggingface.co/settings/tokens.
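A minimal sketch of what those calls look like, assuming the `model` and `tokenizer` objects from the training cells above; the directory name, Hub repo id, and token are placeholders:

```python
# Sketch only: requires the trained `model` and `tokenizer` from earlier cells.

# Merge the LoRA weights into the base model and save locally in float16:
model.save_pretrained_merged("gpt_oss_finetune", tokenizer, save_method="merged_16bit")

# Or push the merged model to the Hugging Face Hub
# (token from https://huggingface.co/settings/tokens):
model.push_to_hub_merged("your_username/gpt_oss_finetune", tokenizer,
                         save_method="merged_16bit", token="hf_...")

# Fallback: save just the LoRA adapters (small, but needs the base model to run):
model.save_pretrained("gpt_oss_lora")
tokenizer.save_pretrained("gpt_oss_lora")
```

Swap `save_method="merged_16bit"` for `"mxfp4"` to save in MXFP4 as described above.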

[ ]

And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs, want to stay up to date with the latest LLM news, need help, or want to join projects, feel free to join our Discord!

Some other resources:

  1. Train your own reasoning model - Llama GRPO notebook Free Colab
  2. Saving finetunes to Ollama. Free notebook
  3. Llama 3.2 Vision finetuning - Radiography use case. Free Colab
  4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!

Join Discord if you need help + ⭐️ Star us on Github ⭐️

This notebook and all Unsloth notebooks are licensed LGPL-3.0.