Bnb


Bits and Bytes

Bits-and-bytes (BnB) is a very fast and straightforward approach to quantization: the model is quantized while it is being loaded. However, the speed and quality of the resulting model are not optimal, so BnB is most useful for quickly quantizing and loading models on the fly.

Quantizing with transformers

Let's do a short demo and quantize Mistral 7B!

First, we install transformers and all required dependencies.
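The install cell itself is not shown here. As a minimal sketch, the check below reports which packages are still missing before you run pip. The package list is an assumption: `transformers`, `peft` and `accelerate` appear in the build log, while `bitsandbytes` is inferred from the topic of this page.

```python
import importlib.util

# Assumed package list: the first three names appear in the install log,
# bitsandbytes is inferred from the subject of this notebook.
REQUIRED = ["transformers", "accelerate", "peft", "bitsandbytes"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        # Print the command rather than running it, so nothing is installed silently.
        print("Run: pip install -U " + " ".join(missing))
    else:
        print("All required packages are importable.")
```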

[1]
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 137.5/137.5 MB 15.8 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for peft (pyproject.toml) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for accelerate (pyproject.toml) ... done

Once we're done, we can download the model we want to quantize. First, let's log in with a read access token so we have access to the models.

Note: You first need to accept the terms in the model's repo.
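One possible way to log in is sketched below. It assumes the read token is stored in an environment variable named `HF_TOKEN` (the name mentioned in the Colab warning further down); `read_token` and `hub_login` are hypothetical helper names, while `login` is the function provided by `huggingface_hub`.

```python
import os


def read_token(env_var: str = "HF_TOKEN") -> str:
    """Fetch the access token from the environment (the variable name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face read access token first.")
    return token


def hub_login() -> None:
    """Log in to the Hugging Face Hub with the token from the environment."""
    # Imported lazily so this file can be loaded without huggingface_hub installed.
    from huggingface_hub import login

    login(token=read_token())
```

Calling `hub_login()` once per session should be enough; in a notebook, `huggingface_hub.notebook_login()` is an interactive alternative.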

[2]
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful

Now everything is ready, so we can load the model and quantize it! Here, we will quantize the model to 8-bit!
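A minimal sketch of the 8-bit setup, assuming the `BitsAndBytesConfig` class from transformers is used to request quantization. The helper names `bnb_kwargs` and `make_quantization_config` are hypothetical and only serve the illustration.

```python
def bnb_kwargs(bits: int = 8) -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig at the given width."""
    if bits == 8:
        return {"load_in_8bit": True}
    if bits == 4:
        return {"load_in_4bit": True}
    raise ValueError("This sketch only covers 8-bit and 4-bit loading.")


def make_quantization_config(bits: int = 8):
    """Build the quantization config that is later passed to from_pretrained."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**bnb_kwargs(bits))
```

`make_quantization_config(8)` matches the 8-bit choice made here; the 4-bit branch is included only to show where the setting would change.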

[3]

Unlike other methods, BnB is pretty fast and efficient. We do not necessarily need to quantize the model beforehand; we can do it on the fly!
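Loading with on-the-fly quantization could look like the sketch below. The model id is an assumption, since the exact checkpoint is not named here; `load_quantized` is a hypothetical helper, and `quantization_config` is expected to be a `transformers.BitsAndBytesConfig` such as `BitsAndBytesConfig(load_in_8bit=True)`. Running it needs a CUDA GPU and the `bitsandbytes` package.

```python
# Assumed checkpoint: the page only says "Mistral 7B".
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"


def load_quantized(quantization_config, model_id: str = MODEL_ID):
    """Download the checkpoint and quantize its weights while they are loaded."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,  # triggers quantization at load time
        device_map="auto",  # let accelerate place the layers on the available device(s)
    )
    return model, tokenizer
```

No separate quantization step or calibration data is involved: the original 16-bit shards are downloaded as usual and converted as each one is loaded.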

[4]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
config.json:   0%|          | 0.00/601 [00:00<?, ?B/s]
`low_cpu_mem_usage` was None, now set to True since model is quantized.
model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]
Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]
model-00001-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]
model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]
model-00003-of-00003.safetensors:   0%|          | 0.00/4.55G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/141k [00:00<?, ?B/s]
tokenizer.model:   0%|          | 0.00/587k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]
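The shard sizes in the log above give a rough idea of what 8-bit loading saves. The arithmetic below is a back-of-the-envelope sketch that assumes the downloaded shards store 16-bit weights and that every weight is quantized; in practice some layers stay in higher precision and quantization adds a little metadata, so real memory use will be somewhat higher.

```python
# Shard sizes in GB, copied from the download log.
SHARD_SIZES_GB = [4.95, 5.00, 4.55]
# Assumption: the checkpoint stores each weight in 16 bits (2 bytes).
STORED_BITS = 16


def estimated_size_gb(bits: int) -> float:
    """Scale the checkpoint size linearly with the number of bits per weight."""
    return sum(SHARD_SIZES_GB) * bits / STORED_BITS


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{estimated_size_gb(bits):.2f} GB")
```

Under these assumptions the checkpoint is about 14.5 GB as downloaded and about 7.25 GB once loaded in 8-bit.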

Once it is ready, you can use the model as follows:
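A minimal generation sketch, assuming `model` and `tokenizer` are the objects returned by `from_pretrained`. The prompt layout copies the `[INST] ... [/INST]` wrapper visible in the output below; `build_prompt`, `generate_reply` and the `max_new_tokens` default are hypothetical choices for illustration.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the instruction tags seen in the sample output."""
    return f"[INST] {user_message}[/INST]"


def generate_reply(model, tokenizer, user_message: str, max_new_tokens: int = 128) -> str:
    """Tokenize the prompt, generate a continuation and decode it back to text."""
    # Imported lazily so build_prompt can be used without torch installed.
    import torch

    inputs = tokenizer(build_prompt(user_message), return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Special tokens are kept, so the text includes markers such as <s> and </s>.
    return tokenizer.decode(output_ids[0])
```

For example, `print(generate_reply(model, tokenizer, "Tell me a joke."))` sends the same request as the one shown below. The tokenizer's `apply_chat_template` method is an alternative way to build the prompt.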

[5]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
<s>[INST] Tell me a joke.[/INST]
<s>[INST] Tell me a joke.[/INST] Sure, here's a classic one for you:

Why don't scientists trust atoms?

Because they make up everything!

I hope that made you smile! If you'd like, I can tell you another one. Just let me know!</s>