Zilliz Bento RAG
In this tutorial, we show how to use an open-source embedding model and LLM on BentoCloud with a vector database on Zilliz Cloud to build a RAG (Retrieval Augmented Generation) application. Specifically, you will do the following:
- Generating vector embeddings with an open-source model served through BentoML or BentoCloud
- Inserting your data into a vector database for RAG
- Creating your Zilliz Cloud vector database
- Parsing and embedding your data for insertion
- Setting up your RAG application with an open-source LLM such as Llama 3 or Mistral on BentoCloud
- Composing prompt for LLM with context retrieved from the Zilliz Cloud vector database
- Generating a final answer
BentoCloud is an AI Inference Platform for fast-moving AI teams, offering fully managed infrastructure tailored for model inference. It works in conjunction with BentoML, an open-source model serving framework, to facilitate the easy creation and deployment of high-performance model services. Zilliz Cloud is a fully managed service for Milvus, the open-source vector database, with flexible pricing and easy management. You can sign up for free on both BentoCloud and Zilliz Cloud; later we will use the API keys from both services to complete the demo.
We can interact with deployed BentoML Services under Deployments, where the corresponding END_POINT and API key are located in Playground -> Python. For the Zilliz Cloud vector database, the URI and API key can be found in Cluster Details.
Access BentoML and its corresponding END_POINT and API key:
Access Zilliz Cloud and its corresponding URI and API key:

Serving Embeddings with BentoML/BentoCloud
With BentoCloud, it's easy to spin up an embedding service by choosing one from the Explore Models page. For example, Sentence Transformers is a popular family of embedding models, and we will use one of them in our demo. Simply follow the screenshot above to get the API endpoint and token from the UI.
Alternatively, if you prefer running the embedding model locally, you can serve the same model through BentoML using its Sentence Transformers Embeddings repository. Running the service.py file spins up a local server with an API endpoint, e.g. http://localhost:3000. The server loads the all-MiniLM-L6-v2 Sentence Transformer model from Hugging Face and uses it to create embeddings.
To use this endpoint, the idea is the same: import bentoml and set up an HTTP client with SyncHTTPClient, specifying the endpoint and, optionally, the token (only needed if you turn on Endpoint Authorization on BentoCloud).
Once the embedding_client is connected, we can create a function that turns a list of strings into a list of embeddings. Model inference is usually more efficient in batches, which is why we group 25 text strings into each embedding request.
After splitting the string list into batches, we call the embedding_client created above to encode the sentences into vectors (a process typically called "embedding"). The BentoML client returns a list of vectors, effectively a list of float arrays. We then flatten them into a single list of vectors to prepare for insertion.
If the list contains no more than 25 strings, we simply call the client's encode method on the passed-in list directly.
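The batching logic can be sketched as follows. The client is passed in as a parameter here to keep the sketch self-contained, and client.encode is assumed to be the endpoint name exposed by the Sentence Transformers Embeddings service:

```python
def get_embeddings(texts: list[str], client) -> list[list[float]]:
    """Embed a list of strings, 25 at a time.

    `client` is the embedding client from the previous step; its `encode`
    endpoint (an assumption based on the Sentence Transformers Embeddings
    service) takes a list of sentences and returns one vector per sentence.
    """
    if len(texts) > 25:
        embeddings = []
        # split the input into batches of 25 and flatten the results
        for i in range(0, len(texts), 25):
            embeddings += list(client.encode(sentences=texts[i : i + 25]))
        return embeddings
    return list(client.encode(sentences=texts))
```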
Inserting Your Data into a Vector Database for Retrieval
With our embedding function prepared, we can insert the vectors together with metadata into Zilliz Cloud for vector search later. The first step in this section is to start a client by connecting to Zilliz Cloud.
For this part, we import the MilvusClient module and initialize a Zilliz Cloud client that connects to your vector database. If you have a self-hosted Milvus instance, you can reuse the code by simply replacing the URI and token with your Milvus credentials, since Zilliz Cloud and Milvus share the same API. In the following sections, we use Zilliz Cloud for the remaining vector database operations. The code block below also defines two constants: a collection name and the dimension. You can pick any collection name you like. The dimension comes from the embedding model's output size; for example, the Sentence Transformer model all-MiniLM-L6-v2 produces 384-dimensional vectors. You can find the dimension in the model's description on resources such as Hugging Face.
Creating Your Zilliz Cloud Collection
Creating a collection on Zilliz Cloud involves two steps: first, defining the schema, and second, defining the index. For this section, we need one module: DataType, which tells us what type of data a field holds. We also use two functions: create_schema() creates a collection schema, and add_field() adds a field to that schema.
We can define the entire schema for the collection here. Or, we can simply define the two necessary pieces: id and embedding. Then, when it comes time to define the schema, we pass a parameter, enable_dynamic_field, that lets us insert whatever fields we want as long as we also have the id and embedding fields. This lets us treat inserting data into Zilliz Cloud the same way we would treat a NoSQL database like MongoDB.
{'auto_id': True, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 384}}], 'enable_dynamic_field': True}
Now that we have created our schema and defined the data fields, we need to define the index. In terms of search, an "index" defines how we map our data out for retrieval. We use the default choice, AUTOINDEX, to index our data for this project. We also need to define how we measure vector distance; in this example, we use COSINE.
If you have an advanced use case that requires a specific index type, Zilliz Cloud offers many, such as IVF and HNSW.
Once the index is defined, we create the index on the vector field — in this case, embedding.
Next, we simply create the collection with the previously given name, schema and index.
Parsing and Embedding Your Data for Insertion
With Zilliz Cloud ready and the connection made, we can insert data into our vector database. But, we have to get the data ready to insert first. For this example, we have a bunch of txt files available in the data folder of the repo. We split this data into chunks, embed it and store it in Zilliz Cloud.
The texts are usually too long for an embedding model to take as input, so let's start by creating a function that chunks the text. There are many ways to do chunking, but we'll do it naively by splitting at every new line. The function returns the newly created list of strings.
Next, we process each of the files we have. We get a list of all the file names and create an empty list to hold the chunked information. Then, we loop through all the files and run the above function on each to get a naive chunking of each file.
Before we store the chunks, we need to clean them. If you look at how an individual file is chunked, you’ll see many empty lines, and we don’t want empty lines. Some lines are just tabs or other special characters. To avoid those, we create an empty list and store only the chunks above a certain length. For simplicity, we can use seven characters.
Once we have a cleaned list of chunks from each document, we can store our data. We create a dictionary that maps each list of chunks to the document's name, in this case the city name. Then, we append all of these to the empty list we made above. In other words, we end up with a list of dictionaries, each mapping a city to its corresponding cleaned text chunks.
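The chunking and cleaning steps above can be sketched as a few small functions; the one-txt-file-per-city layout and the helper names are assumptions about the repo:

```python
import os

def chunk_text(text: str) -> list[str]:
    """Naive chunking: split on newlines and strip surrounding whitespace."""
    return [line.strip() for line in text.split("\n")]

def clean_chunks(chunks: list[str], min_len: int = 7) -> list[str]:
    """Drop empty lines, stray tabs, and other too-short fragments."""
    return [chunk for chunk in chunks if len(chunk) > min_len]

def load_city_chunks(data_dir: str = "data") -> list[dict]:
    """Assumes one .txt file per city in the data folder; returns a list
    of {'city_name': ..., 'chunks': [...]} dictionaries."""
    city_chunks = []
    for filename in os.listdir(data_dir):
        with open(os.path.join(data_dir, filename)) as f:
            chunks = clean_chunks(chunk_text(f.read()))
        city_chunks.append({"city_name": filename.split(".")[0], "chunks": chunks})
    return city_chunks
```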
With a set of chunked texts for each city ready to go, it’s time to get some embeddings. Zilliz Cloud can take a list of dictionaries to insert into a collection so we can start with another empty list. For each of the dictionaries we created above, we need to get a list of embeddings to match the list of sentences.
We do this by directly calling the get_embeddings function we created in the BentoML section on each list of chunks. Now we need to match them up. Since the list of embeddings and the list of sentences match by index, we can enumerate through either list to pair them up.
We match them up by creating a dictionary representing a single entry into Zilliz Cloud. Each entry includes the embedding, the related sentence and the city. It’s optional to include the city, but let’s include it so we can use it. Notice there’s no need to include an id in this entry. That’s because we chose to auto-increment the id when we made the schema above.
We add each of these entries to the list as we loop through them. At the end, we have a list of dictionaries with each dictionary representing a single-row entry to Zilliz Cloud. We can then simply insert these entries into our Zilliz Cloud collection.
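Putting the last few steps together, here is a sketch of building the entry dictionaries; the embedding function is passed in as a parameter to keep the sketch self-contained, and no id field is included because primary keys are auto-generated by the schema:

```python
def build_entries(city_chunks: list[dict], embed) -> list[dict]:
    """Turn each city's cleaned chunks into Zilliz Cloud rows.

    `embed` is the batched embedding function from earlier. Each row holds
    the embedding, the related sentence, and (optionally) the city name.
    """
    entries = []
    for item in city_chunks:
        embeddings = embed(item["chunks"])
        # embeddings and chunks match by index, so enumerate to pair them
        for i, embedding in enumerate(embeddings):
            entries.append({
                "embedding": embedding,
                "sentence": item["chunks"][i],
                "city": item["city_name"],
            })
    return entries

# With the connected client from earlier:
# milvus_client.insert(
#     collection_name=COLLECTION_NAME,
#     data=build_entries(city_chunks, get_embeddings),
# )
```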
Setting up Your LLM for RAG
To build a RAG app, we need to deploy an LLM on BentoCloud. Let's use the latest Llama 3 LLM. Once it is up and running, simply copy the endpoint and token of this model service and set up a client for it.

Giving the LLM Instructions
There are two things the LLM needs to know to do RAG: the question and the context. We can pass both of these at once by creating a function that takes two strings: the question and the context.
Using this function, we use the client's chat completion to call the LLM. For this example, we use the Llama 3 model we deployed through BentoCloud.
We give this model two “messages” that indicate how it should behave. First, we give a message to the LLM to tell it that it is answering a question from the user based solely on the given context. Next, we tell it that there will be a user, and we simply pass in the question.
The other parameters are for tuning the model behavior. We can control the maximum number of tokens the model can produce.
The function then returns the output from the client in a string format.
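A sketch of such a function, assuming the BentoCloud LLM deployment exposes an OpenAI-compatible /v1 route (common for the vLLM-based LLM Bentos) and that the model name below matches your deployment:

```python
# `client` is assumed to be an OpenAI-compatible client pointed at the
# BentoCloud deployment's /v1 route, e.g.:
#   from openai import OpenAI
#   llm_client = OpenAI(base_url=f"{BENTO_LLM_END_POINT}/v1", api_key=BENTO_API_TOKEN)

def dorag(question: str, context: str, client,
          model: str = "meta-llama/Meta-Llama-3-8B-Instruct") -> str:
    """Compose the two messages and return the model's answer as a string."""
    response = client.chat.completions.create(
        model=model,  # assumed model id; check your deployment
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. The user will ask a question. "
                    "Answer it based solely on the following context:\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
        max_tokens=1024,  # cap on the length of the generated answer
    )
    return response.choices[0].message.content
```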
A RAG Example
Now we’re ready. It’s time to ask a question. We can probably do this without creating a function, but making a function makes it nice and repeatable. This function simply intakes a question and then does RAG to answer it.
We start by embedding the question using the same embedding model we used to embed the documents. Next, we execute a search on Zilliz Cloud.
Notice that we pass the question into the get_embeddings function in list format, and then pass the outputted list directly into the data section of our Zilliz Cloud search. This is because of the way that the function signatures are set up; it’s easier to reuse them than rewrite multiple functions.
Inside our search call, we also need to provide a few more parameters. anns_field tells Zilliz Cloud which field to do an approximate nearest neighbor search (ANNS) on.
Next, we also pass a limit parameter which tells us how many results to get back from Zilliz Cloud. For this example, we can just go with five.
The last search parameter defines which fields we want back from our search. For this example, we can just get the sentence, which is the field we used to store our chunk of text.
Once we have our search results back, we need to process them. Zilliz Cloud returns an entity with hits in it, so we grab the "sentence" from all five hits and join them with a period so they form a single paragraph of context.
Then, we pass the question that the user asked along with the context into the dorag function we created above and return the response.
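The retrieval-plus-generation flow described above can be sketched as follows; the milvus_client, embedding function, dorag function, and collection name from earlier steps are passed in as parameters so the sketch stands on its own:

```python
def ask_a_question(question: str, milvus_client, embed, rag,
                   collection_name: str) -> str:
    """Embed the question, retrieve the top matches, and answer with RAG."""
    results = milvus_client.search(
        collection_name=collection_name,
        data=embed([question]),      # search expects a list of query vectors
        anns_field="embedding",      # field to run the ANN search on
        limit=5,                     # number of hits to retrieve
        output_fields=["sentence"],  # fields to return with each hit
    )
    # results[0] holds the hits for our single query vector; join the
    # retrieved sentences into one paragraph of context
    context = ". ".join(hit["entity"]["sentence"] for hit in results[0])
    return rag(question, context)
```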
{'id': 448985292834734942, 'distance': 0.8144542574882507, 'entity': {'sentence': 'Cambridge is located in eastern Massachusetts, bordered by:'}}
{'id': 448985292834734969, 'distance': 0.7613034248352051, 'entity': {'sentence': 'Areas of Cambridge'}}
{'id': 448985292834735059, 'distance': 0.7230358123779297, 'entity': {'sentence': 'Cambridge College is named for Cambridge and was based in Cambridge until 2017, when it consolidated to a new headquarters in neighboring Boston.'}}
{'id': 448985292834735065, 'distance': 0.6981460452079773, 'entity': {'sentence': 'Cambridgeport School'}}
{'id': 448985292834735141, 'distance': 0.6944277882575989, 'entity': {'sentence': 'Cambridge, Massachusetts at Curlie'}}
Cambridge is located in eastern Massachusetts, bordered by:. Areas of Cambridge. Cambridge College is named for Cambridge and was based in Cambridge until 2017, when it consolidated to a new headquarters in neighboring Boston.. Cambridgeport School. Cambridge, Massachusetts at Curlie
Hello! I'm here to help you with your question. Based on the context provided, Cambridge is located in the state of Massachusetts. Specifically, it is situated in eastern Massachusetts, bordered by: * Boston to the south * Somerville to the west * Arlington to the north * Lexington to the northwest * Belmont to the west Cambridge is home to Cambridge College, which was named for Cambridge and was based in Cambridge until 2017, when it consolidated to a new headquarters in neighboring Boston. Additionally, Cambridgeport School is located in Cambridge. I hope this information helps answer your question! If you have any further queries, please feel free to ask.
For the example question asking which state Cambridge is in, we can print the entire response from BentoML. If we take the time to parse through it, the output reads more cleanly, and it tells us that Cambridge is located in Massachusetts.
Summary: BentoML and Zilliz Cloud for RAG
This example covered how you can do RAG without OpenAI or a RAG framework. This time, our stack was BentoML and Zilliz Cloud. We used BentoML's serving capabilities to host an embedding model endpoint and an LLM endpoint for open-source models, and Zilliz Cloud as our vector database.
There are many ways to structure the order in which we use these different puzzle pieces. For this example, we built this RAG project on the cloud services from BentoML and Zilliz Cloud, but we could also run them locally.
We used a simple method to chunk up our data, which was scraped from Wikipedia. Then, we took those chunks and passed them to our embedding model, hosted on BentoML, to get the vector embeddings to put into Zilliz Cloud. With all of the vector embeddings in Zilliz Cloud, we were fully set to do RAG.
The LLM we chose this time was the Llama-3-8B-Instruct model, one of many open source models available on BentoML. We created two functions to enable RAG: one function that passed in the question and context to the LLM; and another function that embedded the user question, searched Zilliz Cloud and then passed in the search results along with the question to the original RAG function. At the end, we tested our RAG with a simple question as a sanity check.