
Document Comprehension with Any Model via Tool Usage and OCR


Optical Character Recognition (OCR) converts text-based documents and images into plain text or markdown output. By leveraging this capability, you can enable any Large Language Model (LLM) to understand documents reliably, efficiently, and cost-effectively.

In this guide, we will demonstrate how to use OCR with our models to discuss any text-based document, whether it's a PDF, photo, or screenshot, provided via a URL.


Method

We will leverage Tool Usage to open, on demand, any URL the user provides.

Other Methods

We also have a built-in feature for document understanding that leverages our OCR model. To learn more about it, visit our Document Understanding docs.

Tool Usage

To achieve this, we will first send our question, which may or may not include URLs pointing to documents we want to run OCR on. Mistral Small will then decide, using the open_urls tool (extracting the URLs directly from the message), whether it needs to perform OCR on any URL or whether it can answer the question directly.


Setup

First, let's install mistralai.

Collecting mistralai
  Downloading mistralai-1.5.0-py3-none-any.whl.metadata (29 kB)
Collecting eval-type-backport>=0.2.0 (from mistralai)
  Downloading eval_type_backport-0.2.2-py3-none-any.whl.metadata (2.2 kB)
Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (0.28.1)
Collecting jsonpath-python>=1.0.6 (from mistralai)
  Downloading jsonpath_python-1.0.6-py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: pydantic>=2.9.0 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.10.6)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from mistralai) (2.8.2)
Collecting typing-inspect>=0.9.0 (from mistralai)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Requirement already satisfied: anyio in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.7.1)
Requirement already satisfied: certifi in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (2025.1.31)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (1.0.7)
Requirement already satisfied: idna in /usr/local/lib/python3.11/dist-packages (from httpx>=0.27.0->mistralai) (3.10)
Requirement already satisfied: h11<0.15,>=0.13 in /usr/local/lib/python3.11/dist-packages (from httpcore==1.*->httpx>=0.27.0->mistralai) (0.14.0)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (0.7.0)
Requirement already satisfied: pydantic-core==2.27.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (2.27.2)
Requirement already satisfied: typing-extensions>=4.12.2 in /usr/local/lib/python3.11/dist-packages (from pydantic>=2.9.0->mistralai) (4.12.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->mistralai) (1.17.0)
Collecting mypy-extensions>=0.3.0 (from typing-inspect>=0.9.0->mistralai)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)
Requirement already satisfied: sniffio>=1.1 in /usr/local/lib/python3.11/dist-packages (from anyio->httpx>=0.27.0->mistralai) (1.3.1)
Downloading mistralai-1.5.0-py3-none-any.whl (271 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 271.6/271.6 kB 14.4 MB/s eta 0:00:00
Downloading eval_type_backport-0.2.2-py3-none-any.whl (5.8 kB)
Downloading jsonpath_python-1.0.6-py3-none-any.whl (7.6 kB)
Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensions, jsonpath-python, eval-type-backport, typing-inspect, mistralai
Successfully installed eval-type-backport-0.2.2 jsonpath-python-1.0.6 mistralai-1.5.0 mypy-extensions-1.0.0 typing-inspect-0.9.0

We can now set up our client. You can create an API key on our La Plateforme.


System and Tool

For the model to be aware of its purpose and what it can do, it's important to provide a clear system prompt with instructions and explanations of any tools it may have access to.

Let's define a system prompt and the tools it will have access to, in this case, open_urls.

Note: open_urls can easily be customized with other resources and models (for summarization, for example) and extended with many other features. In this demo, we are going for a simpler approach.


We also have to define the Tool Schema that will be provided to our API and model.

By following the documentation, we can create something like this:
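A schema along these lines would describe `open_urls` to the API (the field layout follows the standard function-calling format; the description text is our own):

```python
# JSON-schema style description of the open_urls tool, passed to the chat API.
tools = [
    {
        "type": "function",
        "function": {
            "name": "open_urls",
            "description": (
                "Open the given URLs and return their text contents, "
                "performing OCR on PDFs and images."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "urls": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "URLs extracted from the user's message.",
                    }
                },
                "required": ["urls"],
            },
        },
    }
]
```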


Test

Everything is ready; we can quickly create a while loop to chat with our model directly in the console.

The model will call open_urls whenever URLs are mentioned. If they point to PDFs or photos, the tool performs OCR and returns the raw text contents to the model, which then uses them to answer the user.
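One way to sketch that loop, assuming the `client`, `SYSTEM_PROMPT`, `open_urls`, and `tools` objects from the previous cells (those exact names are our assumptions) and a tool-capable model such as mistral-small-latest:

```python
import json

def run_tool(name: str, arguments: str) -> str:
    """Dispatch a tool call to the matching Python function."""
    args = json.loads(arguments)
    if name == "open_urls":
        return open_urls(args["urls"])  # OCR helper from the earlier cell
    return f"Unknown tool: {name}"

def chat(model: str = "mistral-small-latest") -> None:
    """Console chat loop; type 'quit' to exit."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    while True:
        user_input = input("User > ")
        if user_input.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": user_input})
        # Keep querying the model until it stops requesting tool calls.
        while True:
            response = client.chat.complete(
                model=model, messages=messages, tools=tools
            )
            message = response.choices[0].message
            messages.append(message)
            if not message.tool_calls:
                print("Assistant >", message.content)
                break
            for call in message.tool_calls:
                messages.append({
                    "role": "tool",
                    "name": call.function.name,
                    "content": run_tool(call.function.name, call.function.arguments),
                    "tool_call_id": call.id,
                })
```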

Example Prompts (PDF & Image)

Assistant > The research paper titled "Pixtral 12B" introduces a 12-billion-parameter multimodal language model designed to understand both natural images and documents. The model is trained on a large-scale dataset of interleaved image and text documents, enabling it to perform multi-turn, multi-image conversations. Pixtral 12B is built on a transformer architecture and includes a new vision encoder, PixtralViT, which allows it to process images at their native resolution and aspect ratio. This flexibility is achieved through a novel RoPE-2D implementation, which supports variable image sizes and aspect ratios without the need for interpolation.

The model's performance is evaluated on various multimodal benchmarks, where it outperforms other open-source models of similar sizes, such as Qwen-2-VL 7B and Llama-3.2 11B. Pixtral 12B also matches or exceeds the performance of much larger models like Llama-3.2 90B and closed-source models like Claude-3 Haiku and Gemini-1.5 Flash 8B. The paper introduces a new benchmark, MM-MT-Bench, designed to evaluate multimodal models in practical scenarios, and provides detailed analysis and code for standardized evaluation protocols.

The architecture of Pixtral 12B consists of a multimodal decoder and a vision encoder. The vision encoder, PixtralViT, is trained from scratch and includes several key features such as break tokens, gating in the feedforward layer, sequence packing, and RoPE-2D for relative position encoding. The model is evaluated under various prompts and metrics, demonstrating its robustness and flexibility in handling different types of multimodal tasks.

The paper also discusses the importance of standardized evaluation protocols and the impact of prompt design on model performance. It highlights that Pixtral 12B performs well under both 'Explicit' and 'Naive' prompts, with only minor regressions on specific benchmarks. The model's performance is further analyzed under flexible parsing constraints, showing that it benefits very little from relaxed metrics and continues to lead even when flexible parsing is accounted for.

In summary, Pixtral 12B is a state-of-the-art multimodal model that excels in both text-only and multimodal tasks. Its novel architecture, flexibility in processing images, and strong performance across various benchmarks make it a versatile tool for complex multimodal applications. The model is released under the Apache 2.0 license, making it accessible for further research and development.
Assistant > The text written on the image is:

"This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox."
Assistant > You're welcome! If you have any more questions or need further assistance, feel free to ask.