Hybrid Search Legal
Copyright 2025 Google LLC.
Using Gemini API with Qdrant vector search for hybrid retrieval in legal AI
This notebook was contributed by Jenny (Jenny's LinkedIn). Have a cool Gemini example? Feel free to share it too!
⚠️ This notebook requires paid tier rate limits to run properly.
Overview

In the legal domain, accuracy and factual correctness are critical.
A Legal AI startup that collaborated with Qdrant has outlined an approach to securing both in Legal AI applications (for example, Retrieval Augmented Generation (RAG)-based or agentic):
“Turn everything into a retrieval problem where you're retrieving ground truth. If you frame it that way, you don't have to worry about hallucinations, as everything given to the user is grounded in some part of a valid document.”
Truly, many Legal AI businesses require high-quality retrieval in their applications. To get there, you need:
- Knowledge of the right tools and techniques that increase search relevance;
- A well-suited embedding model;
- A readiness to experiment! :)
This notebook
In this notebook, you’ll learn how to combine gemini-embedding-001 with the tools provided by the Qdrant vector search engine to build a legal QA retrieval pipeline.
You'll learn how to:
- Set up a hybrid search (dense + keyword) in Qdrant;
- Use Matryoshka Representations of Gemini embeddings to trade off quality vs. cost.
Setup
Install SDK
- google-genai for gemini-embedding-001 embeddings;
- qdrant-client[fastembed], Qdrant's Python client;
- HuggingFace datasets, to load open-source legal Q&A datasets.
Set up your API keys:
- GOOGLE_API_KEY, required for using gemini-embedding-001 embeddings (see the Gemini API documentation for how to generate one);
- QDRANT_API_KEY and QDRANT_URL from a free-forever Qdrant Cloud cluster (you'll be guided on how to get both in the Qdrant Cloud UI).
To run the following cell, your API keys must be stored in the Colab Secrets tab.
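If you run the pipeline outside Colab, the same three values can come from environment variables instead. The helper below is a minimal sketch under that assumption; `load_secrets` and `REQUIRED_KEYS` are hypothetical names, not part of the notebook. In Colab, the Secrets tab is typically read through `google.colab`'s `userdata` instead.

```python
import os
from typing import Mapping

# The three settings named in the setup instructions above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "QDRANT_API_KEY", "QDRANT_URL")


def load_secrets(source: Mapping[str, str] = os.environ) -> dict:
    """Collect the required settings, failing early if any is missing or empty."""
    missing = [name for name in REQUIRED_KEYS if not source.get(name)]
    if missing:
        raise KeyError(f"Missing secrets: {', '.join(missing)}")
    return {name: source[name] for name in REQUIRED_KEYS}


# Usage (reads os.environ by default):
#   secrets = load_secrets()
```

Failing early with the list of missing names is usually easier to debug than an authentication error raised later by one of the clients.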
Step 1: Download the Dataset
You'll use one of the Hugging Face datasets from Isaacus, a legal artificial intelligence research company.
A common use case in legal AI is a Retrieval-Augmented Generation (RAG) chatbot. To evaluate retrieval performance for such applications, you need a Question-Answer (QA) dataset.
Choosing a Dataset
- Open Australian Legal QA looks interesting. However, all its LLM-generated questions mention the exact name of the legal case, which also appears in the answer. The dataset maps each question to one answer (1:1), making it trivial to build a perfect retriever, which is not even close to real-life scenarios :)
- Instead, let's consider LegalQAEval. Its questions look more like what a user might ask a RAG-based legal chatbot. For example:
  - "How are pharmacists regulated in most jurisdictions?"
  - "what is ncts"
LegalQAEval
This dataset contains ~2400 QA pairs and includes:
- id: a unique string identifier;
- question: a natural language question;
- text: a chunk of text that may contain the answer;
- answers: a list of answers (and their positions within the text), or null if the text does not have the answer.
Load the legal QA corpus; you'll use all available splits.
Text chunks deduplication
Several questions in the dataset can refer to the same text chunk, so first deduplicate the text field to avoid storing identical information several times.
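One way to deduplicate is to key a dictionary on the chunk text and collect the ids of the questions that point to it. This is a minimal sketch, not necessarily how the notebook does it. `deduplicate_texts` is a hypothetical helper, and the rows are made-up stand-ins that only reuse the dataset's field names.

```python
def deduplicate_texts(rows):
    """Map each distinct text chunk to the ids of the questions that refer to it."""
    text_to_question_ids = {}
    for row in rows:
        text_to_question_ids.setdefault(row["text"], []).append(row["id"])
    return text_to_question_ids


# Made-up rows for illustration; real rows come from the loaded dataset.
rows = [
    {"id": "q1", "question": "what is ncts", "text": "chunk A"},
    {"id": "q2", "question": "a second question about chunk A", "text": "chunk A"},
    {"id": "q3", "question": "a question about chunk B", "text": "chunk B"},
]
text_to_question_ids = deduplicate_texts(rows)
unique_texts = list(text_to_question_ids)  # insertion order is preserved
```

Keeping the question ids next to each unique chunk also makes it easy to look up the expected chunk for a question during evaluation.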
Step 2: Define the use case configuration
In a typical legal chatbot scenario, users ask a question, and an LLM generates an answer based on a relevant text chunk.
To imitate this, you'll store numerical representations (embeddings) of the text chunks in Qdrant.
During retrieval, a question will be converted into a numerical representation in the same embedding space. Then, (approximately) the nearest text chunk will be found in the vector index.
The Gemini embedding model supports RAG-style Q&A retrieval (task type QUESTION_ANSWERING).
Now, to fully define our storage configuration, let's consider several factors relevant to a common RAG use case in the legal AI domain.
Cost versus accuracy: matryoshka representations
Gemini gemini-embedding-001 embeddings are 3072-dimensional.
In a RAG setup with ~1 million chunks, storing such embeddings in RAM (for fast retrieval) would require about 12 GB (3072 dimensions × 4 bytes per float32 value × 1 million vectors).
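The 12 GB estimate can be reproduced with a few lines of arithmetic. The sketch assumes vectors are stored as 4-byte float32 values and counts only the raw vectors, not index structures or payloads. `embedding_memory_gb` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4


def embedding_memory_gb(num_vectors: int, dim: int,
                        bytes_per_value: int = BYTES_PER_FLOAT32) -> float:
    """Raw vector storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e9


full_gb = embedding_memory_gb(1_000_000, 3072)      # 12.288
truncated_gb = embedding_memory_gb(1_000_000, 768)  # 3.072
print(f"3072 dims: {full_gb:.2f} GB, 768 dims: {truncated_gb:.2f} GB")
```

Under these assumptions, truncating to 768 dimensions cuts the raw vector footprint to a quarter.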
The Gemini embedding model supports an approach to balance accuracy & cost of retrieval. It is trained using Matryoshka Representation Learning (MRL), meaning that the most important information about the encoded text is stored in the first dimensions of the embedding.
So, you can, for example:
- Use only the first 768 dimensions of the Gemini embedding for faster retrieval;
- And then rerank the retrieved results using the full 3072-dimensional embeddings for higher precision.
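The two stages above can be sketched in plain Python on toy vectors. This is an illustration of the idea only, not Qdrant's implementation: the 4-dimensional vectors stand in for 3072-dimensional embeddings, a prefix of 2 dimensions stands in for 768, and `retrieve_then_rerank` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query, docs, short_dim, candidates, top_k):
    """Stage 1: rank by the first `short_dim` dimensions and keep `candidates`.
    Stage 2: rerank those candidates with the full vectors and keep `top_k`."""
    q_short = query[:short_dim]
    stage1 = sorted(
        docs, key=lambda d: cosine(q_short, docs[d][:short_dim]), reverse=True
    )[:candidates]
    stage2 = sorted(stage1, key=lambda d: cosine(query, docs[d]), reverse=True)
    return stage2[:top_k]


# Toy data: "a" looks best on the truncated prefix, "b" is best on full vectors.
query = [1.0, 0.0, 1.0, 0.0]
docs = {
    "a": [1.0, 0.0, 0.0, 1.0],
    "b": [0.9, 0.1, 1.0, 0.0],
    "c": [0.0, 1.0, 1.0, 0.0],
}
truncated_only = retrieve_then_rerank(query, docs, short_dim=2, candidates=1, top_k=1)
with_rerank = retrieve_then_rerank(query, docs, short_dim=2, candidates=2, top_k=1)
```

Here the truncated prefix alone puts "a" first, while reranking the two best candidates with the full vectors promotes "b". Reranking can only fix the order of what stage 1 retrieved, so the candidate count matters.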
Accuracy from the best of both worlds: hybrid search
In legal use cases, it is often beneficial to combine the strengths of:
- Keyword-based search (lexical) for more direct control over matches;
- Embedding-based search (semantic) for handling questions phrased in a conversational way.
In Qdrant, both approaches can be combined in hybrid & multi-stage queries.
For the keyword-based part, Qdrant supports multiple options, from traditional BM25 to sparse neural retrievers like SPLADE. Among the options, there's our custom improvement of BM25 called miniCOIL, which you will use in this notebook.
In Qdrant, keyword-based retrieval is achieved using sparse vectors.
Collection configuration
Configure a Qdrant collection for the legal QA retrieval pipeline.
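A collection for this setup needs two dense named vectors (truncated and full-size) and one sparse vector for the keyword part. The sketch below is one possible configuration, not necessarily the notebook's. The vector names, the collection name, and `create_legal_collection` are assumptions made for illustration. The IDF modifier on the sparse vector follows Qdrant's guidance for BM25-style models such as miniCOIL.

```python
# Hypothetical names, chosen for this sketch only.
DENSE_SHORT = "gemini_768"
DENSE_FULL = "gemini_3072"
SPARSE = "minicoil"


def create_legal_collection(client, collection_name="legal_qa"):
    """Create a collection with truncated + full dense vectors and one sparse vector.

    `client` is expected to be a qdrant_client.QdrantClient instance.
    """
    # Imported here so the sketch can be loaded without the package installed.
    from qdrant_client import models

    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            # First 768 dimensions, used for fast first-stage retrieval.
            DENSE_SHORT: models.VectorParams(size=768, distance=models.Distance.COSINE),
            # Full embedding, used for reranking.
            DENSE_FULL: models.VectorParams(size=3072, distance=models.Distance.COSINE),
        },
        sparse_vectors_config={
            # Keyword-based part of hybrid search.
            SPARSE: models.SparseVectorParams(modifier=models.Modifier.IDF),
        },
    )
```

Since the full-size vectors are only needed for reranking a handful of candidates, their storage and indexing settings are a natural place to tune for cost.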
Step 3: embed texts & index data to Qdrant
To speed up the process of converting the data, you'll:
- Embed all text chunks with Gemini in batches using the get_embeddings_batch function.
- Upload the results to Qdrant in batches.
The Qdrant Python client provides the functions upload_collection and upload_points. These handle batching, retries, and parallelization. They take generators as input, so you'll create a generator function qdrant_points_stream for this purpose.
Note: Qdrant automatically normalizes uploaded embeddings if the distance function in your collection is set to COSINE (cosine similarity). This means you don't need to pre-normalize truncated Gemini Matryoshka embeddings yourself, even though the Gemini documentation recommends normalizing them.
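The batching-plus-generator pattern can be sketched without any external service. `points_stream` below is a simplified stand-in for the notebook's `qdrant_points_stream`, whose real body is not reproduced here. `fake_embed` replaces the Gemini call and returns made-up vectors, and the yielded tuples would be wrapped in Qdrant point objects in the real pipeline.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def points_stream(texts, embed_batch, batch_size):
    """Lazily yield (id, vector, payload) tuples, embedding one batch at a time."""
    next_id = 0
    for batch in batched(texts, batch_size):
        for text, vector in zip(batch, embed_batch(batch)):
            yield next_id, vector, {"text": text}
            next_id += 1


def fake_embed(batch):
    """Placeholder for an embedding call; returns one toy vector per text."""
    return [[float(len(text))] for text in batch]


points = list(points_stream(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

Because the stream is a generator, only one batch of embeddings is held in memory at a time, and the uploader pulls new batches as it goes.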
Now you'll embed the data and upload the embeddings.
Try experimenting with different batch sizes when generating embeddings and uploading them to Qdrant.
The fastest setup usually depends on your network speed and RAM/CPU/GPU. Keep in mind that embedding inference is not a very fast process.
The representations used in Qdrant for the keyword-based retrieval part of hybrid search are produced by Qdrant.
In Colab, Qdrant will download the required models the first time you use them (in our case, Qdrant/minicoil-v1), as they're needed for converting text chunks to sparse representations.
Step 4: experiment & evaluate
What’s important for every retrieval task is experimenting with different instruments & running evaluations based on a sensible metric.
Metric
In RAG, the goal is usually to get the correct result within the top-N retrieved results for a very small N. Those results are what the LLM uses to generate a grounded answer, and a small N saves context window space and reduces token costs.
You'll use the metric hit@1: the share of questions for which the top-ranked text chunk is the one that contains the answer.
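As a minimal sketch, hit@1 can be computed from the ranked chunk ids returned for each question and the id of the chunk that holds the answer. `hit_at_1` is a hypothetical helper, and the ids below are made up.

```python
def hit_at_1(ranked_results, expected):
    """Fraction of questions whose top-ranked chunk id equals the expected one.

    ranked_results: question id -> list of chunk ids, best first.
    expected:       question id -> id of the chunk that contains the answer.
    """
    if not ranked_results:
        return 0.0
    hits = sum(
        1
        for question_id, ranked in ranked_results.items()
        if ranked and ranked[0] == expected[question_id]
    )
    return hits / len(ranked_results)


# Made-up retrieval output for four questions.
ranked_results = {
    "q1": ["t1", "t2"],  # hit
    "q2": ["t3", "t2"],  # miss: the right chunk is ranked second
    "q3": ["t5"],        # hit
    "q4": [],            # miss: nothing retrieved
}
expected = {"q1": "t1", "q2": "t2", "q3": "t5", "q4": "t7"}
score = hit_at_1(ranked_results, expected)
```

Two of the four questions have the right chunk in first place, so the score is 0.5.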
Eval set
For experiments, you should only use questions where the answers field is not null, since this guarantees that the paired text chunk contains the answer to the question.
Precompute Gemini embeddings for all the questions, so you can experiment freely without spending extra time or money.
Then randomly select a test subset.
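Filtering and sampling can be sketched as follows. `build_eval_set` is a hypothetical helper, the rows are made up, and the seed is fixed only so that repeated runs compare the same questions.

```python
import random


def build_eval_set(rows, sample_size, seed=42):
    """Keep rows that have answers, then draw a reproducible random subset."""
    # `row.get("answers")` is falsy for both None and an empty list.
    answerable = [row for row in rows if row.get("answers")]
    rng = random.Random(seed)
    return rng.sample(answerable, min(sample_size, len(answerable)))


# Made-up rows; only the field names follow the dataset description.
rows = [
    {"id": "q1", "answers": [{"text": "answer A"}]},
    {"id": "q2", "answers": None},
    {"id": "q3", "answers": [{"text": "answer C"}]},
    {"id": "q4", "answers": [{"text": "answer D"}]},
]
eval_subset = build_eval_set(rows, sample_size=2, seed=0)
```

Using a dedicated `random.Random` instance keeps the sampling independent of any other random state in the notebook.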
Experiment
There are many ways to improve search results. For example, reranking alone can be done with high-dimensional embeddings like Gemini, multivectors like ColBERT or cross-encoders.
For simplicity, you'll focus on three simple experiments that are a good starting point for domains that demand high precision, like legal:
Experiment 1: Vanilla Retrieval
Use truncated Gemini embeddings for vanilla retrieval.
This gives you a simple reference point to compare improvements against.
Experiment 2: Reranking
Rerank the retrieved subset with full-sized Gemini embeddings.
Larger embeddings capture finer semantic details that the retriever may miss.
Experiment 3: Hybrid Search
Combine semantic (captures meaning) and keyword-based (ensures exact matches) retrieval in Hybrid Search.
For this toy dataset, keyword matching may not add much, as all the questions are conversational in style and contain few keywords, but in real-life legal AI retrieval it makes a difference.
The setup is the following:
- Run two searches with the same query.
- Merge the results into a single list with a fusion algorithm. Here you'll use Reciprocal Rank Fusion (RRF), a simple well-known zero-shot method of fusion.
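RRF itself is small enough to write out. Each document's fused score is the sum of 1 / (k + rank) over the result lists it appears in. The sketch below uses k = 60, a commonly used default, which is an assumption here. The constant Qdrant applies internally may differ, and `reciprocal_rank_fusion` is a hypothetical helper rather than Qdrant's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up results of the two searches for the same query.
dense_results = ["a", "b", "c"]
sparse_results = ["b", "c", "a"]
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

In this toy case "b" wins because it is ranked near the top by both searches, even though the dense search alone preferred "a". RRF only uses ranks, so the two searches' raw scores never need to be on the same scale.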
Compare the results of all three experiments on our evaluation set using the chosen metric.
Next steps
In this notebook, you set up a retrieval pipeline behind a typical legal RAG chatbot with Qdrant Vector Search Engine and Gemini Embeddings.
You tried several approaches to retrieval, making use of Gemini's capability to generate Matryoshka representations & Qdrant's tooling for retrieval with reranking and hybrid search.
Of course, legal applications require much more than a plain zero-shot pipeline. Retrieval quality always depends on the dataset and use case, so there’s no silver bullet besides experimenting & iterating.
Where to go from zero-shot:
- Analyze missed queries (questions where retrieval failed);
- Tune vector index parameters (for example, ef for search at scale in Qdrant);
- Experiment with different fusion strategies & parameters in hybrid search;
- Try query expansion (or filter extraction);
- ...
Use this notebook as a baseline to build on and experiment to find what works best for you!
