Notebooks
U
Unsloth
HuggingFace Course Gpt Oss (20B) GRPO BF16

HuggingFace Course Gpt Oss (20B) GRPO BF16

unsloth-notebooksunslothnb

To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github

In this Hugging Face and Unsloth notebook, you will learn to transform gpt oss (20B) GRPO BF16 into a Reasoning model using GRPO.

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to do data prep, how to train, how to run the model, & how to save it

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. 3x faster500K Context

New in Reinforcement Learning: FP8 RLVision RLStandbygpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]

Unsloth

Goal: Make faster kernels with Reinforcement Learning

Our goal is to make a faster matrix multiplication kernel by doing RL on GPT-OSS 20B with Unsloth.

You will learn how to:

  1. Counteract reward hacking like cheating, caching, laziness.
  2. Timing and correctness of kernels and time limits.
  3. Making good reward functions
  4. How to seriously do RL to make optimized kernels
[ ]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.4: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
model.safetensors.index.json: 0.00B [00:00, ?B/s]
Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]
model-00006-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00008-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00004-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00002-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00005-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00007-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00001-of-00009.safetensors:   0%|          | 0.00/4.50G [00:00<?, ?B/s]
model-00003-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]
model-00009-of-00009.safetensors:   0%|          | 0.00/2.75G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]
tokenizer_config.json: 0.00B [00:00, ?B/s]
tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/440 [00:00<?, ?B/s]
chat_template.jinja: 0.00B [00:00, ?B/s]

We now add some small amount of LoRA weights to GPT-OSS so we only need to train those, instead of training on the full model.

[ ]
Unsloth: Making `model.base_model.model.model` require gradients

Optimized matrix multiplication

Numpy has optimized matrix multiplication kernels for CPUs via BLAS optimized operations. For GPUs, one can use CUDA accelerated cuBLAS kernels which PyTorch calls under the hood.

To generate some random matrices to do matrix multiplication, we can do the below:

[ ]

We shall generate a small matrix, and see the matrix multiplied output

[ ]
[[-2.8313286   4.54613909 -7.95265309  6.53459836  2.87235103]
 [ 7.0739631   3.76278879  9.31565599 -8.52884711  9.96832952]
 [ 8.41214082  6.51136046 -3.79347975 -2.46773693 -2.32292989]
 [ 3.91302932  4.98335304 -5.33855089  5.71057634 -2.79871647]]
[[ 0.39218774 -9.6181377  -3.49736707]
 [-0.33354865 -1.05626139  3.87231208]
 [ 0.49494174  5.91863954 -6.83183693]
 [ 5.1465162  -7.51648113  1.00445384]
 [ 9.63213377 -4.92327556  3.323014  ]]
[[  54.73441488  -87.89725072   97.94605887]
 [  58.25238906   -1.8467447   -49.25453031]
 [ -35.82528794  -80.25394462   11.51225408]
 [  -0.33785799 -103.64132345   38.51974367]]

We can call a LLM to generate a simple matrix multiply kernel in Python only, and we can calculate the differences between the actual result and the kernel's result

[ ]
[ ]

We see the error below is very small, so that's good!

[ ]
(7.105427357601002e-15, 4.6783406255758477e-29)

Countering Reward Hacking

The ultimate goal of RL is to maximize some reward (say speed, revenue, some metric).

But RL can cheat When the RL algorithm learns a trick or exploits something to increase the reward, without actually doing the task at end, this is called "Reward Hacking".

Some good examples are in https://en.wikipedia.org/wiki/Reward_hacking

For matrix multiplication kernels, we might see the following issues:

  • Laziness: RL learns to use Numpy, Torch, other libraries, which calls optimized kernels.
  • Caching: RL learns to cache the result of the output
  • Cheating: RL learns to find the actual output by inspecting Python global variables
  • RL learns to edit the timing function to make it output 0 time as passed.

And possibly more. We shall try to address each!

Countering Reward Hacking 1: Stop laziness

We can stop the RL algorithm from calling optimized code by inspecting if the generated code imports other non standard Python libraries. We used GPT-5 to help generate this check check_only_stdlib_imports:

[ ]

For example, let's call check_only_stdlib_imports on a random piece of matrix multiplication code generated by GPT-5:

[ ]
Only stdlib imports? False
{'stdlib': [], 'non_stdlib': ['numpy', 'torch'], 'relative_imports': 0}

Countering Reward Hacking 2: Stop cheating

We can stop the RL algorithm from using global or cached variables by restricting it's locals and globals.

We are also going to use exec to create the function, so we have to save the output to an empty dict.

We also disallow global variable access.

[ ]
<function matmul(A, B)>

We also disallow global variable access via types.FunctionType(f.__code__, {})

[ ]
Success
name 'np' is not defined
[ ]

Countering Reward Hacking 3: Stop caching

We can stop the RL algorithm from using cached data by wiping the cache with a large fake matrix. We also have to benchmark carefully with multiple loops and turns.

We also add a timer to not make the algorithm go in an endless loop.

[ ]

For example we use our matmul kernel we had, and benchmark it with a 10 second delay:

[ ]
{'median_ns': 70895404,
, 'mean_ns': 70895404,
, 'stdev_ns': 0,
, 'exceptions': [],
, 'timeouts': 0}

Data & RL task setup

We now have to create a prompt to the model for which it will do some task. For our matrix multiply example, we use the below:

[ ]
Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```

First, let's prompt GPT-OSS without RL and see how it goes:

[ ]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```<|end|><|start|>assistant<|channel|>analysis<|message|>Need to implement efficient naive multiplication maybe using list comprehensions? Provide function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def matmul(A, B):
    """
    Multiply two matrices A and B using only native Python code.

    Parameters
    ----------
    A : list[list[float]]
        Left matrix (m x n).
    B : list[list[float]]
        Right matrix (n x p).

    Returns
    -------
    list[list[float]]
        Product matrix (m x p).

    Notes
    -----
    The function assumes that the inner dimensions match (len(A[0]) == len(B)).
    It uses a straightforward triple loop but is written to maximize performance
    with Python's list comprehension and built‑in sum.
    """
    if not A or not B or not A[0] or not B[0]:
        raise ValueError("Matrices cannot be empty")

    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    if n != n2:
        raise ValueError(f"Inner dimensions must agree: {n} vs {n2}")

    # Transpose B to improve cache locality
    Bt = list(zip(*B))   # (p x n) tuples

    # Compute the product
    return [
        [sum(a * b for a, b in zip(row, col)) for col in Bt]
        for row in A
    ]
```<|return|>

Reward functions

We now design the extract_function function which simply extracts the function wrapped in 3 backticks.

And 4 reward functions:

  1. function_works which rewards the model if the strategy is a valid Python function.
  2. no_cheating which checks if the function imported other modules, and if it did, we penalize it.
  3. correctness_check which checks if the kernel was correct or wrong - it shouldn't generate gibberish!
  4. speed_check checks the performance relative to Numpy matmul directly.
[ ]
def matmul(A, B):
    return ...

Below is our function_works reward function which uses Python's exec but guarded by not allowing leakage of local and global variables. We can also use check_only_stdlib_imports first to check if there are errors before even executing the function:

[ ]
(False,
, {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
,  'stdlib': [],
,  'non_stdlib': [],
,  'relative_imports': 0})
[ ]

no_cheating checks if the function cheated since it might have imported Numpy or Torch optimized code.

[ ]

Next correctness_check checks if the kernel was correct. We want to penalize if the absolute error is larger than 1, and if the mean squared error is somewhat bigger then machine epsilon.

We have to execute the code now!

[ ]
np.float64(2.220446049250313e-16)
[ ]

Finally our benchmarking function for speed_check! We shall limit the timer to 10 seconds and do 3 trials.

[ ]
{'median_ns': 205566,
, 'mean_ns': 231173,
, 'stdev_ns': 39247,
, 'exceptions': [],
, 'timeouts': 0}
[ ]
{'median_ns': 84237,
, 'mean_ns': 87442,
, 'stdev_ns': 4538,
, 'exceptions': [],
, 'timeouts': 0}

We can take the difference and do a negative sign for slower ones. If the ratio is less than 1 (ie faster, we shall invert it!)

[ ]
0.02440329071548132
[ ]
3.333333333333333
[ ]

We create the dataset which includes a replica of our prompt. Remember to add reasoning effort of low!

[ ]
49
{'prompt': [{'content': 'Create a new fast matrix multiplication function using only native Python code.\nYou are given a list of list of numbers.\nOutput your new function in backticks using the format below:\n```python\ndef matmul(A, B):\n    return ...\n```',
,   'role': 'user'}],
, 'answer': 0,
, 'reasoning_effort': 'low'}

Train the model

Now set up GRPO Trainer and all configurations! We also support GSDP, GAPO, Dr GRPO and more! Go to our docs https://unsloth.ai/docs/ for more info!

[ ]
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

StepTraining Lossrewardreward_stdcompletion_lengthkl
10.0000000.1250000.000000200.0000000.000000
20.0000000.0723750.248112200.0000000.000000
30.000000-0.0790000.163776182.5000000.000005
[ ]

And let's train the model!

NOTE This training run might take quite long, so please wait!

[ ]
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.
def matmul(A, B):
    """
    Fast matrix multiplication using only native Python code.
    
    Parameters
    ----------
    A : list of list of numbers
        Left matrix of dimensions (m x p).
    B : list of list of numbers
        Right matrix of dimensions (p x n).
    
    Returns
    -------
    C : list of list of numbers
        Resulting matrix of dimensions (m x n) such that C = A × B.
    """
    # Transpose B to allow column access as rows.
    Bt = list(zip(*B))
    # Compute the dot product of each row from A with each column from B
    return [[sum(a * b for a, b in zip(row, col))
             for col in Bt]
            for row in A]
def matmul(A, B):
    return ...
Unsloth: Will smartly offload gradients to save VRAM!
def matmul(A, B):
    # ensure dimensions
    m = len(A)
    n = len(A[0])
    p = len(B[0]) if B else 0
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
None
def matmul(A, B):
    # A: m x k
    # B: k x n
    # returns m x n
    ...
def matmul(A, B):
    # A: r x p, B: p x c
    r, p = len(A), len(A[0]) if A else 0
    p2, c = len(B), len(B[0]) if B else 0
    if r == 0 or c == 0 or p != p2:
        raise ValueError("Incompatible dimensions for multiplication")
    # transpose B to improve locality
    B_T = list(zip(*B))  # c x p
    result = [[0] * c for _ in range(r)]
    for i in range(r):
        Ai = A[i]
        Ri = result[i]
        for j, Bj in enumerate(B_T):
            s = 0
            for k in range(p):
                s += Ai[k] * Bj[k]
            Ri[j] = s
    return result
def matmul(A, B):
    """
    Multiply two matrices A and B.

    Parameters
    ----------
    A : list[list[float]]
        Left‑hand matrix with shape m x k.
    B : list[list[float]]
        Right‑hand matrix with shape k x n.

    Returns
    -------
    list[list[float]]
        Resulting matrix with shape m x n.

    Raises
    ------
    ValueError
        If dimensions are incompatible.
    """
    if not A or not B:
        return []

    # Dimensions
    rows_a, cols_a = len(A), len(A[0])
    rows_b, cols_b = len(B), len(B[0])

    if cols_a != rows_b:
        raise ValueError("Incompatible dimensions for multiplication")

    # Convert B into columns for faster row‑by‑row multiplication
    B_t = [list(col) for col in zip(*B)]

    # Compute each row of the result
    result = [
        [sum(a * b for a, b in zip(row_a, col_b)) for col_b in B_t]
        for row_a in A
    ]

    return result
None
def matmul(A, B):
    """Compute the matrix product of A and B using pure Python."""
    # Quick sanity checks
    m, n = len(A), len(A[0])
    p, q = len(B), len(B[0])
    assert n == p, "Number of columns of A must equal number of rows of B"
    # Prepare the result matrix with zeros
    result = [[0] * q for _ in range(m)]
    # This variant loops over the outermost index that is likely to be cache-friendly
    # and prefetches the row of B so that we access its elements in order.
    for i in range(m):
        row_a = A[i]
        for k in range(n):
            aik = row_a[k]
            if aik:                          # skip the zero case to save work
                row_b = B[k]
                for j in range(q):
                    result[i][j] += aik * row_b[j]
    return result
def matmul(A, B):
    m = len(A)
    if m==0: return []
    n = len(A[0]) or 0
    p = len(B[0]) if B else 0
    # ensure B has n rows
    # Use list comprehension summing product over k
    # Compute B transposed for column-wise access: BT = list(zip(*B))
    BT = list(zip(*B))
    return [[sum(a*b for a,b in zip(row,col)) for col in BT] for row in A]
def matmul(A, B):
    """
    Multiply two matrices where each matrix is represented as a list of lists
    and the elements are integers or floats.

    Parameters
    ----------
    A : list[list[int|float]]
        Left‑hand matrix of size (m × p).
    B : list[list[int|float]]
        Right‑hand matrix of size (p × n).

    Returns
    -------
    list[list[int|float]]
        The product matrix C = A @ B of size (m × n).
    """

    # Dimensions: A: m×p, B: p×n
    m, p = len(A), len(A[0])
    p2, n = len(B), len(B[0])
    if p != p2:
        raise ValueError("Inner dimensions must agree for matrix multiplication")

    # Pre‑transpose B so that column access is contiguous
    # This reduces random memory access during the dot product
    B_T = [list(col) for col in zip(*B)]

    result = []
    for i, row in enumerate(A):
        # Compute each entry of the i‑th row of the result
        result_row = []
        for j, col in enumerate(B_T):
            # use a local variable for speed
            dot = 0
            for a, b in zip(row, col):
                dot += a * b
            result_row.append(dot)
        result.append(result_row)

    return result
None
def matmul(A, B):
    """
    Multiply two matrices A and B using plain Python.
    A: list of lists (m × n), B: list of lists (n × p).
    Returns the product (m × p) as a new list of lists.
    """
    if not A or not B:
        return []

    m = len(A)
    n = len(A[0])
    p = len(B[0])

    # Quick compatibility check
    assert len(B) == n, "Incompatible matrix dimensions"

    # Allocate result matrix
    result = [[0] * p for _ in range(m)]

    # Standard triple‑loop multiplication, with a small speed‑up:
    # pull outer indices, cache row and column values locally,
    # and skip inner loop when the coefficient is zero.
    for i in range(m):
        Ai = A[i]
        Ri = result[i]
        for k in range(n):
            aik = Ai[k]
            if aik:                       # skip zero entries
                Bk = B[k]
                for j in range(p):
                    Ri[j] += aik * Bk[j]
    return result
def matmul(A, B):
    # number of rows in A
    m = len(A)
    # number of columns in B
    p = len(B[0]) if B else 0
    # number of columns in A (used as number of rows in B)
    k = len(A[0]) if A else 0

    # If shapes are incompatible, raise an error
    if len(B) != k:
        raise ValueError("Incompatible matrices: A.shape[1] != B.shape[0]")

    # Result matrix initialised with zeros
    result = [[0] * p for _ in range(m)]

    # Triple‑loop multiplication
    for i in range(m):
        for j in range(p):
            # Compute the dot product of row i of A and column j of B
            s = 0
            for t in range(k):
                s += A[i][t] * B[t][j]
            result[i][j] = s
    return result
None
def matmul(A, B):
    # Determine dimensions
    m, n = len(A), len(A[0])
    nB, p = len(B), len(B[0])
    assert n == nB, "Incompatible matrices"
    result = [[0]*p for _ in range(m)]
    # Multiply
    for i in range(m):
        Ai = A[i]
        for k in range(n):
            aik = Ai[k]
            if aik:
                Bk = B[k]
                for j in range(p):
                    result[i][j] += aik * Bk[j]
    return result
def matmul(A, B):
    """Fast matrix product using only native Python.

    Args:
        A (List[List[Number]]): left matrix (n × m)
        B (List[List[Number]]): right matrix (m × p)

    Returns:
        List[List[Number]]: the product A * B (n × p)

    Raises:
        ValueError: if the matrices cannot be multiplied
    """
    # Basic shape checks
    if not A or not A[0] or not B or not B[0]:
        raise ValueError("Input matrices must be non‑empty")
    n, m = len(A), len(A[0])
    if m != len(B):
        raise ValueError("Number of columns of A must equal number of rows of B")

    # Transpose B once for better locality
    B_T = list(zip(*B))  # now each element of B_T is a tuple representing a column

    # Compute the product
    result = []
    for row in A:
        # dot(row, col) for each column
        result.append([sum(a * b for a, b in zip(row, col)) for col in B_T])

    return result
None
def matmul(A, B):
    return ...
def matmul(A, B):
    ...
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    k = len(B)
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0
            for l in range(k):
                s += A[i][l] * B[l][j]
            result[i][j] = s
    return result
None
None
def matmul(A, B):
    m = len(A)
    n = len(B[0])
    p = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(n)] for i in range(m)]
None
def matmul(A, B):
    return ...
def matmul(A, B):
    import math
    # Assume square matrices of same size and power of 2
    n = len(A)
    if n == 0:
        return []
    def add(X, Y):
        return [[X[i][j] + Y[i][j] for j in range(n)] for i in range(n)]
    def sub(X, Y):
        return [[X[i][j] - Y[i][j] for j in range(n)] for i in range(n)]
    def split(M):
        k = n // 2
        return [ [row[:k] for row in M[:k]],
                 [row[:k] for row in M[k:]],
                 [row[k:] for row in M[:k]],
                 [row[k:] for row in M[k:]] ]
    def combine(A11, A12, A21, A22):
        k = len(A11)
        result = [ [0]* (k*2) for _ in range(k*2) ]
        for i in range(k):
            result[i][:k]   = A11[i]
            result[i][k:]   = A12[i]
            result[i+k][:k] = A21[i]
            result[i+k][k:] = A22[i]
        return result
    def strassen(X, Y):
        if n == 1:
            return [[X[0][0]*Y[0][0]]]
        a, b, c, d = split(X)
        e, f, g, h = split(Y)
        p1 = strassen(a, sub(f, h))
        p2 = strassen(add(a, b), h)
        p3 = strassen(add(c, d), e)
        p4 = strassen(d, sub(g, e))
        p5 = strassen(add(a, d), add(e, h))
        p6 = strassen(sub(b, d), add(g, h))
        p7 = strassen(sub(a, c), add(e, f))
        c11 = add(sub(add(p5, p4), p2), p6)
        c12 = add(p1, p2)
        c21 = add(p3, p4)
        c22 = sub(sub(add(p1, p5), p3), p7)
        return combine(c11, c12, c21, c22)
    return strassen(A, B)
def matmul(A, B):
    n = len(A)
    m = len(A[0])
    p = len(B[0])
    assert len(B) == m
    BT = list(zip(*B)) # transposed as tuples
    C = [[sum(Ai[k] * BTj[k] for k in range(m)) for BTj in BT] for Ai in A]
    return C
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    # transpose B
    B_T = list(map(list, zip(*B)))  # list of columns
    return [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
None
def matmul(A, B):
    return ...
def matmul(A, B):
    ...
def matmul(A, B):
    """Matrix multiplication using only native Python (no external libraries).

    Works for arbitrary sized matrices with compatible dimensions.
    The algorithm transposes matrix B to enhance cache locality,
    then uses a list‑comprehension to calculate the dot–product of
    corresponding rows and columns.

    Parameters
    ----------
    A : list[list[Number]]
        Left matrix of shape (m, n).
    B : list[list[Number]]
        Right matrix of shape (n, p).

    Returns
    -------
    list[list[Number]]
        Resulting product matrix of shape (m, p).
    """
    # Transpose B for efficient column access
    B_transposed = list(zip(*B))  # tuples, one per column of B
    return [
        [
            # dot product of row from A and column from B
            sum(a * b for a, b in zip(row_a, col_b))
            for col_b in B_transposed
        ]
        for row_a in A
    ]
None
None
def matmul(A, B):
    return ...
def matmul(A, B):
    return ...
def matmul(A, B):
    # check dimension matches
    nrows_a = len(A)
    ncols_a = len(A[0]) if A else 0
    nrows_b = len(B)
    ncols_b = len(B[0]) if B else 0
    if ncols_a != nrows_b:
        raise ValueError("Incompatible dimensions")
    # transpose B for cache locality
    BT = list(zip(*B))  # tuple of tuples used as rows
    result = [[0]*ncols_b for _ in range(nrows_a)]
    for i in range(nrows_a):
        ai = A[i]
        ri = result[i]
        for j in range(ncols_b):
            s = 0
            bj = BT[j]
            for k in range(ncols_a):
                s += ai[k] * bj[k]
            ri[j] = s
    return result
def matmul(A, B):
    B_T = list(zip(*B))
    res = [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
    return res
def matmul(A, B):
    """
    Multiply two square matrices A and B.

    Parameters
    ----------
    A : list[list[float]]
        First n × n matrix.
    B : list[list[float]]
        Second n × n matrix.

    Returns
    -------
    list[list[float]]
        The product matrix C = A @ B.
    """
    # Transpose B once for O(1) column access
    B_T = list(zip(*B))

    # Compute C[i][j] = dot(A[i], B_T[j])
    return [[sum(a * b for a, b in zip(row, col))
             for col in B_T]
            for row in A]
def matmul(A, B): return ...
def matmul(A, B):
    """
    Multiply two matrices A and B using native Python lists.
    `A` and `B` must be rectangular (i.e. all rows the same length).

    Returns a new matrix containing the product.
    Raises ValueError if the inner dimensions do not match.
    """
    if not A or not B:
        return []

    n_rows_A, n_cols_A = len(A), len(A[0])
    n_rows_B, n_cols_B = len(B), len(B[0])

    if n_cols_A != n_rows_B:
        raise ValueError("cannot multiply: inner dimensions do not match")

    # preallocate result matrix
    result = [[0] * n_cols_B for _ in range(n_rows_A)]

    for i in range(n_rows_A):
        # Local references for speed
        row_a = A[i]
        for k in range(n_cols_A):
            aik = row_a[k]
            if aik == 0:
                continue     # skip zero multiplications
            row_b = B[k]
            for j in range(n_cols_B):
                result[i][j] += aik * row_b[j]

    return result
None
def matmul(A, B):
    return ...
None
None
None
def matmul(A, B):
    return ...
None
None
None
None
def matmul(A, B):
    """Return the matrix product of A and B.

    A must be an `n×m` matrix, B an `m×p` matrix.
    Matrices are represented as nested lists of numbers.
    """
    if not A or not B:
        return []

    # Transpose B once so we can iterate over rows efficiently
    B_T = list(zip(*B))

    # Compute each entry using a dot‑product of corresponding rows
    return [
        [sum(a * b for a, b in zip(row_a, col_b)) for col_b in B_T]
        for row_a in A
    ]
None
def matmul(A, B):
    # A is m x p, B is p x n
    m = len(A)
    p = len(A[0])  # find p
    n = len(B[0])
    # create result m x n
    result = [... for i in ...]
def matmul(A, B):
    m, p = len(A), len(A[0]) if A else 0
    p2, n = len(B), len(B[0]) if B else 0
    if p != p2: raise ValueError("Incompatible dimensions")
    # Precompute transpose of B for cache-friendly access
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]
def matmul(A, B): return ...
def matmul(A, B):
    return ...
None
def matmul(A, B):
    return ...
None
def matmul(A, B): return ...
None
def matmul(A, B):
    """
    Multiply two matrices A (m×n) and B (n×p) given as lists of lists.
    Uses only native Python code and is tuned for speed by transposing B.
    """
    if not A:
        return []

    n_rows_A, n_cols_A = len(A), len(A[0])
    # Basic consistency check – assume all rows have equal length
    # and B has compatible dimensions.
    n_rows_B, n_cols_B = len(B), len(B[0]) if B else 0

    # Transpose B to access its columns as tuples (faster indexing)
    B_T = list(zip(*B))          # shape: (p × n)

    result = []
    for row_A in A:              # iterate over rows of A
        # Compute the dot product of row_A with each column of B
        res_row = [sum(a * b for a, b in zip(row_A, col_B)) for col_B in B_T]
        result.append(res_row)

    return result
None
None
def matmul(A, B):
    """
    Multiply two matrices A and B (lists of lists) using native Python.
    The matrices are assumed to be square and of compatible dimensions.
    """
    n = len(A)
    # Transpose B to improve cache locality for the inner sum
    B_T = list(zip(*B))                       # each element is a tuple
    return [[sum(a * b for a, b in zip(row, col))
             for col in B_T]
            for row in A]
def matmul(A, B):
    """
    Multiply two matrices A and B using only native Python code.

    Parameters
    ----------
    A : list of lists (m x k)
    B : list of lists (k x n)

    Returns
    -------
    C : list of lists (m x n)
        Product matrix such that C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))

    Raises
    ------
    ValueError
        If the number of columns in A does not equal the number of rows in B.
    """
    if not A or not B:
        raise ValueError("Both matrices must be non‑empty.")
    m, k1 = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    if k1 != k2:
        raise ValueError(f"Incompatible shapes: {m}x{k1} multiplied by {k2}x{n}")
    # Pre‑compute columns of B for faster access
    B_cols = list(zip(*B))  # n tuples each of length k
    # Compute the product
    return [[sum(a * b for a, b in zip(row, col))
             for col in B_cols] for row in A]
def matmul(A, B):
    """
    Fast matrix multiplication for plain Python objects.
    A : list of m rows, each a list of n numbers
    B : list of n rows, each a list of p numbers
    Returns a new matrix of shape m × p.
    """
    m = len(A)
    if m == 0:
        return []
    n = len(A[0])
    # Verify dimension compatibility
    if len(B) != n or any(len(row) != n for row in A):
        raise ValueError("Inner dimensions must agree.")
    p = len(B[0])

    # Allocate result matrix
    result = [[0] * p for _ in range(m)]

    # Standard triple-loop, optimized for speed in pure Python
    for i in range(m):
        row_a = A[i]
        row_res = result[i]
        for k in range(n):
            a_val = row_a[k]
            row_b = B[k]
            # Use local variables for performance
            for j in range(p):
                row_res[j] += a_val * row_b[j]

    return result
def matmul(A, B):
    """Multiply two matrices A and B.

    A: list of m rows, each containing n elements.
    B: list of n rows, each containing p elements.
    Returns a new list of list representing the product matrix of shape (m, p).
    """
    # Basic sanity checks
    if not A or not B:
        raise ValueError("Input matrices must be non‑empty.")
    m, n = len(A), len(A[0])
    nB, p = len(B), len(B[0])
    if n != nB:
        raise ValueError("Number of columns of A must equal number of rows of B.")
    for row in A:
        if len(row) != n:
            raise ValueError("All rows of A must have the same length.")
    for row in B:
        if len(row) != p:
            raise ValueError("All rows of B must have the same length.")

    # Allocate the result matrix (m x p) initialized to 0
    result = [[0] * p for _ in range(m)]

    # Perform multiplication
    for i in range(m):
        rowA = A[i]
        rowR = result[i]
        for k in range(n):
            aik = rowA[k]
            if aik:  # Skip work for zero multiplication
                rowB = B[k]
                for j in range(p):
                    rowR[j] += aik * rowB[j]
    return result
None
None
def matmul(A, B):
    # Validate dimensions
    if not A or not B:
        return []
    if len(A[0]) != len(B):
        raise ValueError("Matrix dimensions do not match for multiplication.")
    B_cols = list(zip(*B))  # transpose B
    result = [[sum(a*b for a,b in zip(row, col)) for col in B_cols] for row in A]
    return result
None
def matmul(A, B):
    """Multiply matrices A × B (list‑of‑list format) using an optimized pure‑Python routine."""
    n, p = len(A), len(A[0])           # rows of A, columns of B (must match)
    m = len(B[0])                     # columns of B
    # Result matrix initialized with zeros
    result = [[0] * m for _ in range(n)]
    for i in range(n):
        rowA = A[i]
        rowR = result[i]
        for k in range(p):
            aik = rowA[k]
            if aik:                    # skip zero entries for a small extra speedup
                rowBk = B[k]
                for j in range(m):
                    rowR[j] += aik * rowBk[j]
    return result
def matmul(A, B):
    """Fast matrix multiplication using only native Python code.

    Parameters
    ----------
    A : list[list[float]]
        Left hand matrix of shape (n, m).
    B : list[list[float]]
        Right hand matrix of shape (m, q).

    Returns
    -------
    list[list[float]]
        The product matrix of shape (n, q).

    Notes
    -----
    The routine pre-allocates the result matrix and uses local variable
    bindings to reduce attribute look‑ups inside the innermost loop,
    which gives a noticeable speed boost for large matrices.
    """
    # Basic dimensions checking
    n, m = len(A), len(A[0])
    m2, q = len(B), len(B[0])
    if m != m2:
        raise ValueError("Number of columns in A must equal number of rows in B.")

    # Pre‑allocate result matrix with zeros
    result = [[0.0] * q for _ in range(n)]

    # Perform multiplication
    for i in range(n):
        rowA = A[i]
        rowR = result[i]
        for k in range(m):
            aik = rowA[k]
            rowB = B[k]
            for j in range(q):
                rowR[j] += aik * rowB[j]
    return result
def matmul(A, B):
    B_T = list(zip(*B))  # transpose B for inner product
    return [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
def matmul(A, B):
    # A: m x n, B: n x p
    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    assert n==n2
    # initialize result matrix
    C = [[0]*p for _ in range(m)]
    # transpose B for better locality
    B_T = list(zip(*B))
    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(n):
            aik = Ai[k]
            Bk = B_T[k]
            for j in range(p):
                Ci[j] += aik * Bk[j]
    return C
def matmul(A, B):
    """Fast matrix multiplication using pure Python.

    Arguments:
        A: List of lists, the left matrix of size m x n.
        B: List of lists, the right matrix of size n x p.

    Returns a new matrix C of size m x p where C[i][j] = sum(A[i][k] * B[k][j] for k in range(n)).

    This implementation transposes B once to enable efficient column access
    and uses nested list comprehensions together with the built‑in `sum`
    function, which is implemented in C.
    """
    # Verify dimensions
    if not A or not B:
        raise ValueError("Matrices cannot be empty")
    n = len(A[0])
    if any(len(row) != n for row in A):
        raise ValueError("All rows in A must have the same length")
    if len(B) != n:
        raise ValueError("Number of columns in A must equal number of rows in B")
    p = len(B[0])
    if any(len(row) != p for row in B):
        raise ValueError("All rows in B must have the same length")

    # Transpose B to get columns as rows
    B_cols = list(zip(*B))

    # Compute product
    return [[sum(a * b for a, b in zip(row, col)) for col in B_cols] for row in A]
def matmul(A, B):
    BT = list(zip(*B))
    return [[sum(a*b for a,b in zip(row, col)) for col in BT] for row in A]
def matmul(A, B):
    """Multiplies matrix A by matrix B using pure native Python.

    Parameters
    ----------
    A : list[list[float]]
        The first matrix of shape (l, m).
    B : list[list[float]]
        The second matrix of shape (m, n).

    Returns
    -------
    list[list[float]]
        The product matrix of shape (l, n).
    """
    # Pre‑get dimensions for speed
    l = len(A)           # Number of rows in A
    m = len(A[0])        # Number of columns in A / rows in B
    n = len(B[0])        # Number of columns in B

    # The result will have shape (l, n)
    result = [[0.0] * n for _ in range(l)]

    for i in range(l):
        Ai = A[i]
        for k in range(m):
            aik = Ai[k]
            Bk = B[k]
            # Unroll column loop to reduce attribute lookups
            for j in range(n):
                result[i][j] += aik * Bk[j]

    return result
def matmul(A, B):
    """
    Multiply two matrices A (n x m) and B (m x p) using only native Python.
    Returns the resulting matrix as a list of lists.
    """
    n = len(A)
    m = len(A[0])
    if len(B) != m:
        raise ValueError("Number of columns in A must equal number of rows in B")
    p = len(B[0])

    # Pre‑allocate result matrix with zeros
    result = [[0] * p for _ in range(n)]

    for i in range(n):
        row_a = A[i]
        for k in range(m):
            aik = row_a[k]          # A[i][k]
            row_b = B[k]            # The k-th row of B
            for j in range(p):
                result[i][j] += aik * row_b[j]  # C[i][j] += A[i][k] * B[k][j]
    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    return ...
None
def matmul(A, B):
    # assume A dims m x n, B n x p
def matmul(A, B):
    return ...
def matmul(A, B):
    # Should handle maybe rectangular matrices
    ...
def matmul(A, B): return ...
def matmul(A, B): return ...
def matmul(A, B):
    result = []
    for i in range(len(A)):
        res_row = []
        for j in range(len(B[0])):
            sum_val = 0
            for k in range(len(B)):
                sum_val += A[i][k] * B[k][j]
            res_row.append(sum_val)
        result.append(res_row)
    return result
def matmul(A, B): return ...
def matmul(A, B):
    """
    Multiply two matrices represented as lists of lists using only
    standard Python code. Raises a ValueError if the matrices cannot
    be multiplied.
    
    Parameters
    ----------
    A : List[List[Number]]
        The left-hand-side matrix of shape (m, n).
    B : List[List[Number]]
        The right-hand-side matrix of shape (n, p).

    Returns
    -------
    C : List[List[Number]]
        The product matrix of shape (m, p).
    """
    if not A or not B:
        raise ValueError("Empty matrices cannot be multiplied")

    # Verify inner dimensions match
    n = len(A[0])
    for row in A:
        if len(row) != n:
            raise ValueError("All rows of A must have the same length")
    if len(B) != n:
        raise ValueError("Number of columns in A must equal number of rows in B")

    p = len(B[0])
    for row in B:
        if len(row) != p:
            raise ValueError("All rows of B must have the same length")

    # Transpose B so that columns can be accessed as tuples
    B_T = list(zip(*B))        # Each element is a tuple of length n

    # Compute the product
    C = []
    for a_row in A:
        c_row = []
        for b_col in B_T:
            dot_product = sum(x * y for x, y in zip(a_row, b_col))
            c_row.append(dot_product)
        C.append(c_row)

    return C
def matmul(A, B):
    """
    Multiply two matrices A and B represented as nested lists.
    
    Parameters:
        A (list[list[float]]): Matrix of size (m x n).
        B (list[list[float]]): Matrix of size (n x p).
    
    Returns:
        list[list[float]]: Resultant matrix of size (m x p).
    """
    # Pre‑compute the columns of B
    B_cols = list(zip(*B))
    
    # Compute the product row by row
    return [
        [
            sum(a * b for a, b in zip(row, col))
            for col in B_cols
        ]
        for row in A
    ]
def matmul(A, B):
    # Basic dimension check
    if not A or not B:
        return []
    n_rows_a = len(A)
    n_cols_a = len(A[0])
    n_rows_b = len(B)
    n_cols_b = len(B[0])
    if n_cols_a != n_rows_b:
        raise ValueError("Incompatible dimensions for matrix multiplication")
    result = [[0] * n_cols_b for _ in range(n_rows_a)]
    for i in range(n_rows_a):
        for k in range(n_cols_a):
            aik = A[i][k]
            if aik == 0:
                continue
            for j in range(n_cols_b):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B): return ...
def matmul(A, B):
    return ...
def matmul(A, B):
    # A: m x n, B: n x p
    # returns m x p
    # check sizes
    m, n = len(A), len(A[0])
    assert len(B) == n
    p = len(B[0])
    # Precompute transpose of B
    Bt = list(zip(*B))
    res = [[sum(a*b for a,b in zip(row, col)) for col in Bt] for row in A]
    return res
def matmul(A, B):
    return ...
def matmul(A, B):
    """Multiply two matrices A and B (list of lists) purely in plain Python."""
    m, n = len(A), len(A[0])
    nB, p = len(B), len(B[0])
    if n != nB:
        raise ValueError("Inner matrix dimensions must agree.")
    # Pre‑allocate result matrix
    C = [[0] * p for _ in range(m)]
    # Perform multiplication in the ctr order: i, k, j
    for i in range(m):
        rowA = A[i]
        rowC = C[i]
        for k in range(n):
            aik = rowA[k]
            if aik:  # skip zero multiplications
                rowB = B[k]
                for j in range(p):
                    rowC[j] += aik * rowB[j]
    return C
def matmul(A, B): return ...
def matmul(A, B):
    return ...
def matmul(A, B):
    B_T = list(zip(*B))
    return [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
def matmul(A, B):
    return ...
None
def matmul(A, B):
    ...
def matmul(A, B):
    ...
def matmul(A, B):
    ...
None
def matmul(A, B):
    # assume A rows, B columns
    m, k1 = len(A), len(A[0]) if A else 0
    k2, n = len(B), len(B[0]) if B else 0
    assert k1 == k2, "Inner dimensions must match"
    B_T = list(zip(*B))  # transposed B
    result = [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
    return result
None
def matmul(A, B):
    """
    Multiply two matrices A and B (both given as lists of lists) using
    only native Python constructs.

    The function first validates that the inner dimensions match (number of
    columns in A must equal the number of rows in B).  Then it computes the
    product using a straightforward triple‑loop but expressed in a compact
    list‑comprehension.  The columns of B are accessed by transposing B
    once via ``zip(*B)``, which avoids explicit indexing and gives good
    cache locality in Python.

    Parameters
    ----------
    A : List[List[Number]]
        Left‑hand matrix of size m × n.

    B : List[List[Number]]
        Right‑hand matrix of size n × p.

    Returns
    -------
    List[List[Number]]
        The matrix product of A and B, which has shape m × p.

    Raises
    ------
    ValueError
        If the matrices cannot be multiplied due to incompatible dimensions.
    """
    # Validate dimensions
    if not A or not B:
        raise ValueError("Matrices cannot be empty")
    n_cols_A = len(A[0])
    n_rows_B = len(B)
    if n_cols_A != n_rows_B:
        raise ValueError("Inner dimensions must match: "
                         f"{n_cols_A} != {n_rows_B}")

    # Transpose B once to make column access efficient
    B_cols = list(zip(*B))

    # Compute product using list comprehensions
    return [
        [sum(a * b for a, b in zip(row, col)) for col in B_cols]
        for row in A
    ]
def matmul(A, B):
    if not A or not B:
        return []
    m, p = len(A), len(A[0])
    n = len(B[0])
    # ensure inner dimension matches
    result = [[0]*n for _ in range(m)]
    for i in range(m):
        for k in range(p):
            aik = A[i][k]
            for j in range(n):
                result[i][j] += aik * B[k][j]
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(p)) for j in range(m)] for i in range(n)]
def matmul(A, B):
    # Basic check dims
    n = len(A)
    assert n > 0
    m = len(A[0])
    # B has size m x p
    assert len(B) == m
    p = len(B[0])
    # Pre-allocate result
    C = [[0]*p for _ in range(n)]
    # Compute transpose of B for better locality
    B_T = list(map(list, zip(*B)))
    for i in range(n):
        Ai = A[i]
        Ci = C[i]
        for j in range(p):
            Bj = B_T[j]
            s = 0
            for k in range(m):
                s += Ai[k] * Bj[k]
            Ci[j] = s
    return C
def matmul(A, B):
    """Return the product of two matrices in Theta(m^3/mn)*n^2 time,
    optimizing cache usage if --precache-multiple was selected.
    """
    _check_input(A, B)
    m = len(A); n = len(A[0]); k = len(B[0]);   # (m x n) * (n x k) => (m x k)
    _use_a = (m / n <= 100.0 * n / k)
    M_local_cache_avail_prefs = sympy.cache.get('M_local_cache_avail_prefs', 0)
    M_local_cache_avail_nonp = sympy.cache.get('M_local_cache_avail_nonp', 0)

    if sympy.cache and M_local_cache_avail_nonp and _use_a:
        local_cache_key_A = (m,n,1)
        local_cache_key_B = (n,k,1)
        if local_cache_key_A not in sympy.cache:
            sympy.cache[local_cache_key_A] = [(r,c) for r in range(m) for c in range(n)]
        if local_cache_key_B not in sympy.cache:
            sympy.cache[local_cache_key_B] = [(r,c) for r in range(n) for c in range(k)]
        A_local, B_local = sympy.cache[local_cache_key_A], sympy.cache[local_cache_key_B]
    else:
        A_local, B_local = None, None

    def fast_matmul(A_local, B_local, _):
        result = [[sum(A_local[i][j] * B_local[j][l] for j in range(n))
                   for l in range(k)]
                  for i in range(m)]
        return result

    if A_local and B_local:
        return fast_matmul(A_local, B_local, k)
    else:
        return fast_matmul(A, B, k)
def matmul(A, B):
    n, m = len(A), len(B)
    assert all(len(row)==m for row in A) and all(len(row)==len(B[0]) for row in B)
    # maybe compute B transposed
    BT = list(zip(*B))
    result = [[sum(a*b for a,b in zip(row, col)) for col in BT] for row in A]
    return result
def matmul(A, B):
    m = len(A)
    n = len(A[0])
    p = len(B[0])
    # compute result matrix C(m x p)
    result = [[sum(A[i][k]*B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
    return result
None
None
None
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        Ai=A[i]
        Ri=result[i]
        for k in range(p):
            aik=Ai[k]
            Bk=B[k]
            for j in range(m):
                Ri[j]+=aik*Bk[j]
    return result
def matmul(A, B):
    """
    Multiply two matrices A (n x m) and B (m x p) without using external libraries.
    """
    n=len(A)
    m=len(A[0])
    p=len(B[0])
    # Check compatibility
    if len(B)!=m:
        raise ValueError("Incompatible dimensions")
    # Transpose B for cache-friendly access
    B_T=[list(col) for col in zip(*B)]
    # Prepare result matrix
    result=[[0]*p for _ in range(n)]
    for i in range(n):
        row=A[i]
        # local assignments
        res_row=result[i]
        for j in range(p):
            col=B_T[j]
            s=0
            for a, b in zip(row, col):
                s += a*b
            res_row[j]=s
    return result
def matmul(A, B):
    """Multiply two matrices A (m x n) and B (n x p) using pure Python.

    Args:
        A: list of m lists, each of length n.
        B: list of n lists, each of length p.

    Returns:
        C: list of m lists, each of length p, the product.
    """
    m = len(A); n = len(A[0]); p = len(B[0])
    # Preallocate result
    C = [[0]*p for _ in range(m)]
    # transpose B for better cache (although Python's memory model)
    B_T = [list(col) for col in zip(*B)]
    # iterate over rows of A and rows of B_T
    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(p):
            s = 0
            Bk = B_T[k]
            for j in range(n):
                s += Ai[j] * Bk[j]
            Ci[k] = s
    return C
def matmul(A, B):
    # A: m x n, B: n x p
    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    assert n == n2
    # Precompute columns of B
    Bcols = list(zip(*B))
    return [[sum(a*b for a,b in zip(row,c)) for c in Bcols] for row in A]
None
def matmul(A, B):
    """
    Multiply two matrices using pure Python.

    Parameters
    ----------
    A : List[List[Number]]
        The first matrix (m × n).
    B : List[List[Number]]
        The second matrix (n × p).

    Returns
    -------
    List[List[Number]]
        The product matrix (m × p).

    Raises
    ------
    ValueError
        If the inner dimensions of A and B do not match.
    """
    if not A or not B or not B[0]:
        return []

    m = len(A)
    n = len(A[0])
    if len(B) != n:
        raise ValueError("Inner dimensions must agree: A: {}, B: {}.".format(n, len(B)))

    p = len(B[0])
    # column count of B must be p
    # Use list comprehensions for a compact, Python‑native implementation
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)] for i in range(m)]
def matmul(A, B):
    ...
None
def matmul(A, B):
    return ...
def matmul(A, B):
    return ...
def matmul(A, B): return ...
def matmul(A, B):
    # matrix multiplication using only native Python
    ...
    return ...
def matmul(A, B):
    n_rows = len(A)
    n_cols = len(B[0])
    k = len(B)  # check match
    if any(len(row)!=k for row in A):
        raise ValueError("A dimensions don't match B")
    # transpose B for better cache
    B_T = list(zip(*B))  # returns tuples but we can keep
    result = [[0]*n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        Ai = A[i]
        Ri = result[i]
        for j, Bj in enumerate(B_T):
            s = 0
            for a,b in zip(Ai, Bj):
                s += a*b
            Ri[j] = s
    return result
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    # assert p == len(A[0])? we can compute
    result = [[0]*m for _ in range(n)]
    for i in range(n):
        ai = A[i]
        for k in range(p):
            aik = ai[k]
            if aik:
                bk = B[k]
                for j in range(m):
                    result[i][j] += aik * bk[j]
    return result
None
None
def matmul(A, B):
    # Ensure both are lists of lists, etc. 
    return [[sum(a*b for a,b in zip(row,col)) for col in zip(*B)] for row in A]
None
None
def matmul(A, B): return ...
None
def matmul(A, B): return ...
None
def matmul(A, B):
    """
    Multiplies two matrices A and B using pure Python lists.
    Expects A to be m x n and B to be n x p.
    Returns the resulting m x p matrix.
    """
    return [[sum(a * b for a, b in zip(rowA, colB)) for colB in zip(*B)] for rowA in A]
def matmul(A, B):
    """
    Multiply two matrices A and B using only native Python constructs.
    
    Parameters
    ----------
    A : list[list[Number]]
        Left‑hand matrix.
    B : list[list[Number]]
        Right‑hand matrix.
    
    Returns
    -------
    list[list[Number]]
        The product matrix A * B.
    """
    # Transpose B to ease column access
    Bt = list(zip(*B))
    # Compute each entry as the dot product of a row of A and a column of B
    return [[sum(a * b for a, b in zip(row, col))
             for col in Bt] for row in A]
None
def matmul(A, B):
    # assume A: m x n, B: n x p
    n = len(A)
    m = len(A[0])
    p = len(B[0])
    # precompute columns of B via zip
    columns_B = list(zip(*B))
    result = [[sum(a*b for a,b in zip(row, col)) for col in columns_B] for row in A]
    return result
None
def matmul(A, B):
    """
    Multiply two matrices A and B using plain Python.
    A      : list of m rows, each a list of p numbers
    B      : list of p rows, each a list of n numbers
    Returns a list of m rows, each a list of n numbers
    """
    m, p = len(A), len(A[0])           # dimensions of A
    assert p == len(B), "Incompatible matrix dimensions"
    n = len(B[0])                      # number of columns of B

    # initialise result matrix with zeros
    C = [[0] * n for _ in range(m)]

    # iterate over the shared dimension first
    for k in range(p):
        B_row = B[k]
        for i in range(m):
            aik = A[i][k]
            if aik:                   # skip multiplies by zero
                C_row = C[i]
                # the inner loop that does the real work
                for j in range(n):
                    C_row[j] += aik * B_row[j]
    return C
def matmul(A, B):
    m = len(A); n = len(A[0]); p = len(B[0])
    # validate B has proper shape
    result = [[0]*p for _ in range(m)]
    for i in range(m):
        rowA = A[i]
        row_res = result[i]
        for k in range(n):
            aik = rowA[k]
            if aik:
                rowBk = B[k]
                for j in range(p):
                    row_res[j] += aik * rowBk[j]
    return result
None
None
def matmul(A, B):
    if not A or not B:
        return []
    n, m = len(A), len(A[0])
    p, q = len(B), len(B[0])
    if m != p:
        raise ValueError("Inner matrix dimensions must agree.")
    # compute C with zeros
    C = [[0]*q for _ in range(n)]
    for i in range(n):
        Ai = A[i]
        Ci = C[i]
        for k in range(m):
            aik = Ai[k]
            Bk = B[k]
            for j in range(q):
                Ci[j] += aik * Bk[j]
    return C
def matmul(A, B):
    """
    Multiply two matrices A (n × p) and B (p × m) using a cache‑friendly
    block‑style algorithm that reduces Python overhead compared to a
    straightforward triple loop.
    """
    n = len(A)            # rows of A
    p = len(B)            # shared dimension
    m = len(B[0])         # columns of B

    # Result matrix initialised with zeros
    C = [[0] * m for _ in range(n)]

    # Optimised loop ordering: i → k → j
    for i in range(n):
        Ai = A[i]
        Ci = C[i]
        for k in range(p):
            aik = Ai[k]
            if aik:                     # skip zero multiplications
                Bk = B[k]
                for j in range(m):
                    Ci[j] += aik * Bk[j]
    return C
def matmul(A, B):
    # ensure convertible shapes
    m, n = len(A), len(A[0])
    p = len(B[0])
    # Precompute transpose of B to improve cache locality
    BT = list(zip(*B))
    result = [[0]*p for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        Ri = result[i]
        for j, Bj in enumerate(BT):
            s = 0
            for k in range(n):
                s += Ai[k] * Bj[k]
            Ri[j] = s
    return result
None
def matmul(A, B): return ...
None
None
None
None
None
None
def matmul(A, B):
    # ensure shape
    if not A: return []
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    assert k == k2
    # transpose B
    B_T = list(zip(*B))
    # compute
    return [[sum(a*b for a,b in zip(row, col)) for col in B_T] for row in A]
None
def matmul(A, B):
    # A is an m x n matrix, B is an n x p matrix
    m = len(A)
    n = len(A[0]) if A else 0
    p = len(B[0]) if B else 0

    # In case the dimensions are incompatible, raise an error
    if n != len(B):
        raise ValueError("Incompatible matrix dimensions: A has %d cols but B has %d rows." % (n, len(B)))

    # Prepare the result matrix filled with zeros
    C = [[0] * p for _ in range(m)]

    # Perform the multiplication using a cache-friendly triple loop
    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(n):
            aik = Ai[k]
            Bk = B[k]
            for j in range(p):
                Ci[j] += aik * Bk[j]
    return C
None
def matmul(A, B):
    """
    Multiply two matrices A and B using plain Python lists.

    Parameters
    ----------
    A : list[list[float]]
        Left matrix of shape (m, n).
    B : list[list[float]]
        Right matrix of shape (n, p).

    Returns
    -------
    list[list[float]]
        The product matrix of shape (m, p).

    Raises
    ------
    ValueError
        If the inner dimensions do not agree.
    """
    # Validate input
    if not A or not B:
        return []

    m, n_A = len(A), len(A[0])
    n_B, p = len(B), len(B[0])

    if n_A != n_B:
        raise ValueError("Inner matrix dimensions must agree: "
                         f"{n_A} != {n_B}")

    # Pre‑allocate result matrix with zeros
    C = [[0.0 for _ in range(p)] for _ in range(m)]

    # Classic triple‑loop multiplication
    for i in range(m):
        Ai = A[i]
        Ci = C[i]
        for k in range(n_A):
            aik = Ai[k]
            Bk = B[k]
            # Unroll the inner loop over p
            for j in range(p):
                Ci[j] += aik * Bk[j]

    return C
None
None
def matmul(A, B):
    """
    Multiplies two matrices A and B represented as lists of lists using
    plain Python code. Handles generic rectangular matrices.
    """
    # Basic size validation
    if not A or not B or not A[0] or not B[0]:
        raise ValueError("Matrices cannot be empty")
    rows_a = len(A)
    cols_a = len(A[0])
    rows_b = len(B)
    cols_b = len(B[0])

    if cols_a != rows_b:
        raise ValueError("Inner dimensions must match for multiplication")

    # Compute the product using a straightforward triple loop
    result = [
        [sum(A[i][k] * B[k][j] for k in range(cols_a)) for j in range(cols_b)]
        for i in range(rows_a)
    ]

    return result
def matmul(A, B):
    return ...
def matmul(A, B):
    """
    Multiply two matrices using a naïve algorithm implemented in pure Python.
    """
    # Get dimensions
    na = len(A)
    ma = len(A[0]) if A else 0
    nb = len(B)
    mb = len(B[0]) if B else 0
    
    if ma != nb:
        raise ValueError("Inner dimensions must match for multiplication.")
    
    # Create result matrix
    result = [[0 for _ in range(mb)] for _ in range(na)]
    
    # Standard triple‑loop multiplication
    for i in range(na):
        Ai = A[i]
        for j in range(mb):
            sum = 0
            for k in range(ma):
                sum += Ai[k] * B[k][j]
            result[i][j] = sum
    return result
def matmul(A, B):
    """
    Multiply two matrices A and B and return the product.

    Parameters
    ----------
    A : list[list[Number]]
        Left matrix of shape (m, n).
    B : list[list[Number]]
        Right matrix of shape (n, p).

    Returns
    -------
    C : list[list[Number]]
        Matrix product of shape (m, p).

    Raises
    ------
    ValueError
        If the input matrices cannot be multiplied due to incompatible shapes.
    """
    # Validate inputs
    if not A or not B:
        return []

    # Dimensions
    m, n = len(A), len(A[0]) if A else 0
    n_b, p = len(B), len(B[0]) if B else 0

    if n != n_b:
        raise ValueError(f"Incompatible dimensions: A is {m}x{n}, B is {n_b}x{p}")

    # Pre-allocate result matrix with zeros
    C = [[0] * p for _ in range(m)]

    # Main multiplication loop
    for i in range(m):
        a_row = A[i]
        c_row = C[i]
        for k in range(n):
            aik = a_row[k]
            if aik == 0:
                continue  # Skip multiplication by zero
            b_row_k = B[k]
            # Unroll the inner loop for potential speed
            for j in range(p):
                c_row[j] += aik * b_row_k[j]
    return C
def matmul(A, B):
    return ...
def matmul(A, B):
    return ...
def matmul(A, B):
    """Multiply two matrices using only native Python lists."""
    # Basic dimension checks
    if not A or not B or not A[0] or not B[0]:
        raise ValueError("Input matrices must be non-empty")
    if len(A[0]) != len(B):
        raise ValueError("Number of columns in A must equal number of rows in B")

    # Transpose B to get its columns efficiently
    BT = list(zip(*B))  # each is a tuple: column of B

    # Compute the product
    result = [[sum(a * b for a, b in zip(row, col)) for col in BT] for row in A]
    return result
def matmul(A, B):
    m = len(A)
    p = len(A[0])
    if p==0:
        return [[] for _ in range(m)]
    n = len(B[0])
    # transpose B for cache
    Bt = [[B[i][j] for i in range(p)] for j in range(n)]
    res = [[0]*n for _ in range(m)]
    for i in range(m):
        Ai = A[i]
        Ri = res[i]
        for k in range(p):
            aik = Ai[k]
            Bk = Bt  # not right; we need to use Bt[j][k].
def matmul(A, B): return ...
def matmul(A, B):
    ...
None
def matmul(A, B):
    # A: m x n, B: n x p
    m, n = len(A), len(A[0])
    assert n == len(B), "Incompatible matrices"
    p = len(B[0])
    # pre-transpose B to speed up
    B_T = list(map(list, zip(*B)))  # p x n
    return [[sum(a*b for a, b in zip(row, col)) for col in B_T] for row in A]
None
def matmul(A, B): return ...
def matmul(A, B):
    """Return `A * B` for two 2‑D lists `A` and `B` using pure Python."""
    # Ensure dimensions are compatible
    if not A or not B or not B[0]:
        return []

    # Transpose B once to obtain column access in O(1)
    B_t = list(zip(*B))

    # Compute the matrix product using list comprehensions
    return [
        [sum(a * b for a, b in zip(row, col))
         for col in B_t]  # compute dot product of row with each column of B
        for row in A
    ]
def matmul(A, B):
    """
    Compute the matrix product C = A @ B with pure Python.

    Parameters
    ----------
    A : List[List[float]]
        Left matrix of size m × p.
    B : List[List[float]]
        Right matrix of size p × n.

    Returns
    -------
    List[List[float]]
        Resulting matrix of size m × n.
    """
    m, pA = len(A), len(A[0])
    pB, n = len(B), len(B[0])

    if pA != pB:
        raise ValueError("Inner matrix dimensions must agree (got %d×%d and %d×%d) "
                         % (m, pA, pB, n))

    # Allocate result matrix
    C = [[0.0] * n for _ in range(m)]

    # Blocked matrix multiplication for better cache locality.
    # Choose a block size that works well on most systems
    block = 64

    for i0 in range(0, m, block):
        i_max = min(i0 + block, m)
        for k0 in range(0, pA, block):
            k_max = min(k0 + block, pA)
            for j0 in range(0, n, block):
                j_max = min(j0 + block, n)

                for i in range(i0, i_max):
                    ai = A[i]
                    ci = C[i]
                    for k in range(k0, k_max):
                        aik = ai[k]
                        bk = B[k]
                        for j in range(j0, j_max):
                            ci[j] += aik * bk[j]
    return C
None
def matmul(A, B):
    if not A or not B:
        return []
    m = len(A)
    n = len(A[0])
    p = len(B[0])
    # check compatibility: len(B)==n
    # compute result matrix with nested loops
    res = [[0]*p for _ in range(m)]
    for i in range(m):
         Ai = A[i]
         for j in range(p):
             s = 0
             for k in range(n):
                  s += Ai[k]*B[k][j]
             res[i][j] = s
    return res
def matmul(A, B):
    # check dimensions
    ...
    # compute product
def matmul(A, B):
    # Extract dimensions
    if not A or not B or not A[0] or not B[0]:
        return []  # or raise ValueError...
    # Determine dimensions
    p = len(A[0])  # columns of A, rows of B
    if p != len(B):
        raise ValueError("Incompatible matrix sizes.")
    # Transpose B for efficient column access
    BT = list(zip(*B))
    # Perform multiplication
    result = []
    for row in A:
        new_row = []
        for col in BT:
            # Dot product
            sum_val = 0
            for a, b in zip(row, col):
                sum_val += a * b
            new_row.append(sum_val)
        result.append(new_row)
    return result
def matmul(A, B):
    # optimize: transpose B
    BT = list(zip(*B))  # tuple of tuples (row of B^T)
    return [[sum(a*b for a, b in zip(rowA, colB)) for colB in BT] for rowA in A]
def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)
    return [[sum(A[i][k]*B[k][j] for k in range(p)) for j in range(m)] for i in range(n)]
def matmul(A, B):
    """
    Multiply two matrices A and B using pure Python.

    Parameters
    ----------
    A : list[list[float | int | complex]]
        The first matrix of dimensions (m, p).
    B : list[list[float | int | complex]]
        The second matrix of dimensions (p, n).

    Returns
    -------
    list[list[float | int | complex]]
        The product matrix of dimensions (m, n).

    Raises
    ------
    ValueError
        If the inner dimensions do not match.
    """
    # Basic sanity checks for empty/corrupt inputs
    if not A or not B:
        raise ValueError("Input matrices cannot be empty")

    p = len(A[0])
    for row in A:
        if len(row) != p:
            raise ValueError("All rows in A must have the same length")
    q = len(B)
    for row in B:
        if len(row) != len(B[0]):
            raise ValueError("All rows in B must have the same length")

    # Ensure the inner dimensions match
    if p != q:
        raise ValueError("Inner matrix dimensions must agree: len(A[0]) != len(B)")

    # Transpose B to avoid repeated look‑ups in the inner loop
    B_T = list(zip(*B))  # Each element is a tuple representing a column

    # Perform matrix multiplication
    return [[sum(a * b for a, b in zip(row, col)) for col in B_T] for row in A]
def matmul(A, B):
    """Return the matrix product of A and B using native Python."""
    # Transpose B once to avoid repeated column lookups.
    B_T = list(zip(*B))
    # Compute each element of the product using a generator expression.
    return [[sum(a * b for a, b in zip(row, col)) for col in B_T] for row in A]
None
None
None
None
def matmul(A, B):
    if not A or not B: 
        return []
    nrowA = len(A)
    ncolA = len(A[0])
    nrowB = len(B)
    ncolB = len(B[0])
    if ncolA != nrowB:
        raise ValueError("Incompatible dimensions for matrix multiplication")
    # Precompute columns of B
    colsB = list(zip(*B))
    return [
        [sum(a*b for a, b in zip(row, col)) for col in colsB]
        for row in A
    ]
TrainOutput(global_step=100, training_loss=3.209321222243488e-05, metrics={'train_runtime': 5865.3354, 'train_samples_per_second': 0.034, 'train_steps_per_second': 0.017, 'total_flos': 0.0, 'train_loss': 3.209321222243488e-05})

Inference

Now let's try the model we just trained!

[ ]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new fast matrix multiplication function using only native Python code.
You are given a list of list of numbers.
Output your new function in backticks using the format below:
```python
def matmul(A, B):
    return ...
```<|end|><|start|>assistant<|channel|>analysis<|message|>We need to implement fast matmul in pure Python. Could use Strassen or naive? Since only native Python, maybe implement Strassen. Output function code.<|end|><|start|>assistant<|channel|>final<|message|>```python
def matmul(A, B):
    """Fast matrix multiplication using Strassen's algorithm (naive fallback for small sizes)."""
    # Helper to pad matrices to the next power of two
    def _next_pow2(n):
        return 1 << (n - 1).bit_length()

    def _pad(mat, new_size):
        padded = [[0] * new_size for _ in range(new_size)]
        for i, row in enumerate(mat):
            padded[i][:len(row)] = row
        return padded

    def _unpad(mat, rows, cols):
        return [row[:cols] for row in mat[:rows]]

    # Base case: small matrices use simple multiplication
    def _simple_mul(X, Y):
        n, m, p = len(X), len(Y[0]), len(Y)
        return [[sum(X[i][k] * Y[k][j] for k in range(p)) for j in range(m)] for i in range(n)]

    # Strassen recursion
    def _strassen(X, Y):
        n = len(X)
        if n <= 64:  # threshold for switching to naive multiplication
            return _simple_mul(X, Y)

        k = n // 2

        # Divide matrices into quadrants
        A11 = [row[:k] for row in X[:k]]
        A12 = [row[k:] for row in X[:k]]
        A21 = [row[:k] for row in X[k:]]
        A22 = [row[k:] for row in X[k:]]

        B11 = [row[:k] for row in Y[:k]]
        B12 = [row[k:] for row in Y[:k]]
        B21 = [row[:k] for row in Y[k:]]
        B22 = [row[k:] for row in Y[k:]]

        # Compute the 7 products, recursively
        M1 = _strassen(_add(A11, A22), _add(B11, B22))
        M2 = _strassen(_add(A21, A22), B11)
        M3 = _strassen(A11, _sub(B12, B22))
        M4 = _strassen(A22, _sub(B21, B11))
        M5 = _strassen(_add(A11, A12), B22)
        M6 = _strassen(_sub(A21, A11), _add(B11, B12))
        M7 = _strassen(_sub(A12, A22), _add(B21, B22))

        # Combine intermediate products into final quadrants
        C11 = _add(_sub(_add(M1, M4), M5), M7)
        C12 = _add(M3, M5)
        C21 = _add(M2, M4)
        C22 = _add(_sub(_add(M1, M3), M2), M6)

        # Merge quadrants
        C = []
        for i in range(k):
            C.append(C11[i] + C12[i])
        for i in range(k):
            C.append(C21[i] + C22[i])
        return C

    # Basic matrix operations
    def _add(X, Y):
        return [[X[i][j] + Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]

    def _sub(X, Y):
        return [[X[i][j] - Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]

    # Prepare matrices
    n, m = len(A), len(A[0])
    p = len(B[0])
    if len(B) != m:
        raise ValueError("Inner dimensions must agree")
    size = _next_pow2(max(n, m, p))
    A_pad = _pad(A, size)
    B_pad = _pad(B, size)
    # Compute
    C_pad = _strassen(A_pad, B_pad)
    # Trim to original size
    return _unpad(C_pad, n, p)
```<|return|>

Saving to float16 or MXFP4 for VLLM

We also support saving to float16 directly. Select merged_16bit for float16 or mxfp4 for MXFP4 (OpenAI's GPT-OSS native precision). We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.

[ ]

And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other resources:

  1. Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
  2. Learn how to do Reinforcement Learning with our RL Guide and notebooks.
  3. Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
  4. Explore our LLM Tutorials Directory to find dedicated guides for each model.
  5. Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.

Join Discord if you need help + ⭐️ Star us on Github ⭐️

This notebook and all Unsloth notebooks are licensed LGPL-3.0