
Using BGE M3-Embedding Model with Milvus

As deep neural networks continue to advance rapidly, it is increasingly common to employ them for information representation and retrieval. Such networks, referred to as embedding models, encode information into dense or sparse vector representations within a multi-dimensional space.

On January 30, 2024, a new member of the BGE model series called BGE-M3 was released. The "M3" stands for its three capabilities: multi-linguality (support for over 100 languages), multi-granularity (input lengths of up to 8192 tokens), and multi-functionality (unifying dense, lexical, and multi-vector/ColBERT retrieval in a single system). BGE-M3 is the first embedding model to support all three retrieval methods, achieving state-of-the-art performance on the multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks.

Milvus, the world's first open-source vector database, plays a vital role in semantic search by providing efficient storage and retrieval of vector embeddings. Its scalability and advanced features, such as metadata filtering, further contribute to its significance in this field.

This tutorial shows how to use the BGE M3 embedding model with Milvus for semantic similarity search.

Preparations

We will demonstrate with the BAAI/bge-m3 model and Milvus in standalone mode. The text to be searched comes from the M3 paper itself: we use the BAAI/bge-m3 model to convert each sentence in the paper into a 1024-dimensional dense vector embedding and store each embedding in Milvus.

We then convert a query text into a vector embedding with the same model and perform an approximate nearest neighbor (ANN) search to find the text strings closest in semantic meaning.

To run this demo, make sure you have started a Milvus instance and installed the Python packages pymilvus (the Milvus client library) and FlagEmbedding (the library for BGE models).

[ ]

Import packages.

[1]

Set up the options for Milvus and specify the model name as BAAI/bge-m3.

[2]
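A sketch of what this configuration cell might contain. The host, port, and collection name are illustrative defaults, not values from the original notebook; 19530 is the default port of a standalone Milvus deployment.

```python
# Connection and model options. Host, port, and collection name are
# illustrative assumptions, not values from the original notebook.
MILVUS_HOST = "localhost"
MILVUS_PORT = "19530"       # default port of a standalone Milvus instance
MODEL_NAME = "BAAI/bge-m3"  # embedding model used throughout this demo
COLLECTION_NAME = "bge_m3_demo"
```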

Let's try the BGE M3 embedding model on a text string, print the resulting vector embedding, and get the dimensionality of the model.

[3]
loading existing colbert_linear and sparse_linear---------
----------using 4*GPUs----------
[-0.03415   -0.04712   -0.0009007 -0.04697    0.04025   -0.07654
 -0.001877   0.007637  -0.01312   -0.007435  -0.0712     0.0526
  0.02162   -0.04178    0.000628  -0.05307    0.00796   -0.0431
  0.01224   -0.006145 ] ...

Dimensions of `BAAI/bge-m3` embedding model is: 1024

Load vectors to Milvus

We need to create a collection in Milvus and build an index so that we can search vectors efficiently. For more information on how to use Milvus, check out the documentation.

[4]
Status(code=0, message=)

Here we have prepared a dataset of text strings from the M3 paper, named m3_paper.txt. It stores one sentence per line. We convert each line of the document into a dense vector embedding with BAAI/bge-m3 and then insert these embeddings into the Milvus collection.

[5]
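A sketch of the ingestion cell: read one sentence per line from m3_paper.txt, embed each line, and insert the text together with its vector into Milvus. The `collection` object and the `embed` function (mapping a list of texts to a list of dense vectors) are assumed to come from the earlier cells.

```python
# Hypothetical ingestion helper; `collection` and `embed` are assumed
# to be provided by earlier cells.
def insert_sentences(collection, embed, path="m3_paper.txt"):
    with open(path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    vectors = embed(sentences)
    # Column order matches the non-auto-id schema fields: text, embedding.
    collection.insert([sentences, list(vectors)])
    collection.flush()  # persist the inserted data
    collection.load()   # load the collection into memory for searching
    return len(sentences)
```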

Query

Here we will build a semantic_search function, which retrieves the top-K most semantically similar documents from a Milvus collection.

[6]
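A sketch of the semantic_search function, assuming the pymilvus search API and the schema above. `embed` again maps a list of texts to a list of dense vectors; the nprobe value is an illustrative search parameter.

```python
# Hypothetical search helper; search parameters are assumptions.
def semantic_search(collection, embed, query, top_k=3):
    # Embed the query with the same model used at indexing time.
    query_vectors = [list(v) for v in embed([query])]
    results = collection.search(
        data=query_vectors,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["text"],
    )
    hits = []
    for hit in results[0]:
        print(f"distance = {hit.distance:.2f}")
        print(hit.entity.get("text"))
        print()
        hits.append((hit.distance, hit.entity.get("text")))
    return hits
```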

Here we ask a question about the embedding models.

[13]
distance = 0.46
Particularly, M3-Embedding is proficient in multilinguality, which is able to support more than 100 world languages.

distance = 0.53
1) We present M3-Embedding, which is the first model which supports multi-linguality, multifunctionality, and multi-granularity.

distance = 0.63
In this paper, we present M3-Embedding, which achieves notable versatility in supporting multilingual retrieval, handling input of diverse granularities, and unifying different retrieval functionalities.

The smaller the distance, the closer the vectors, and thus the more semantically similar the texts. We can see that the top-1 result, "M3-Embedding...more than 100 world languages...", directly answers the question.

Let's try another question.

[23]
distance = 0.61
The three data sources complement to each other, which are applied to different stages of the training process.

distance = 0.69
Our dataset consists of three sources: 1) the extraction of unsupervised data from massive multi-lingual corpora, 2) the integration of closely related supervised data, 3) the synthesization of scarce training data.

distance = 0.74
In this   Table 1: Specification of training data.

In this example, the top-2 results contain enough information to answer the question. By selecting the top-K results, semantic search with an embedding model and vector retrieval can capture the meaning of a query and return the most semantically similar documents. Combining this approach with a Large Language Model (a pattern known as Retrieval-Augmented Generation, or RAG) can produce a more human-readable answer.

We can delete this collection to save resources.

[24]

In this notebook, we showed how to generate dense vectors with the BGE M3 embedding model and use Milvus to perform semantic search. In upcoming releases, Milvus will support hybrid search with dense and sparse vectors, both of which the BGE M3 model can produce at the same time.

Milvus integrates with all major model providers, including OpenAI, HuggingFace, and many more. You can learn more about Milvus at https://milvus.io/docs.