Langchain Basic RAG
RAG Example Using NVIDIA API Catalog and LangChain
This notebook introduces how to use LangChain to interact with NVIDIA hosted NIM microservices like chat, embedding, and reranking models to build a simple retrieval-augmented generation (RAG) application.
Terminology
RAG
- RAG is a technique for augmenting LLM knowledge with additional data.
- LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data up to a specific point in time that they were trained on.
- If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs.
- The process of bringing the appropriate information and inserting it into the model prompt is known as retrieval augmented generation (RAG).
The preceding summary of RAG is adapted from the Build a RAG App tutorial in the LangChain v0.2 documentation.
NIM
- NIM microservices are containerized microservices that simplify the deployment of generative AI models like LLMs and are optimized to run on NVIDIA GPUs.
- NIM microservices support models across domains like chat, embedding, reranking, and more from both the community and NVIDIA.
NVIDIA API Catalog
- NVIDIA API Catalog is a hosted platform for accessing a wide range of microservices online.
- You can test models on the catalog and then export them with an NVIDIA AI Enterprise license for on-premises or cloud deployment.
langchain-nvidia-ai-endpoints
- The langchain-nvidia-ai-endpoints Python package contains LangChain integrations for building applications that communicate with NVIDIA NIM microservices.
Installation and Requirements
Create a Python environment (preferably with Conda) using Python version 3.10.14. To install Jupyter Lab, refer to the installation page.
Getting Started!
To get started you need an NVIDIA_API_KEY to use the NVIDIA API Catalog:
- Create a free account with NVIDIA.
- Click on your model of choice.
- Under Input select the Python tab, and click Get API Key and then click Generate Key.
- Copy and save the generated key as NVIDIA_API_KEY. From there, you should have access to the endpoints.
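The key setup above can be wired into the notebook with a small configuration cell; this is a sketch that assumes the key is exported through the NVIDIA_API_KEY environment variable:

```python
import getpass
import os

# Prompt for the key only if it is not already present in the environment.
if not os.environ.get("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("Enter your NVIDIA API key: ")
```

API Catalog keys start with the prefix `nvapi-`; checking for that prefix after pasting is a quick way to catch copy errors.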
Enter your NVIDIA API key: ········
RAG Example using LLM & Embedding
1) Initialize the LLM
The ChatNVIDIA class is part of LangChain's integration (langchain_nvidia_ai_endpoints) with NVIDIA NIM microservices. It allows access to NVIDIA NIM for chat applications, connecting to hosted or locally-deployed microservices.
Here we will use mixtral-8x7b-instruct-v0.1
2) Initialize the embedding model
NVIDIAEmbeddings is a client for NVIDIA embedding models that provides access to an NVIDIA NIM for embedding. It can connect to a hosted NIM or to a locally deployed NIM through a base URL.
We selected NV-Embed-QA as the embedding model.
3) Obtain some toy text dataset
Here we load a toy dataset from a text document; in a real application, data can be loaded from many different sources. Read here for loading data from different sources.
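A loading sketch under the assumption that the toy dataset is a directory of plain-text files; the toy_data/ path is hypothetical, so point it at wherever your files actually live:

```python
from pathlib import Path

TOY_DATA_DIR = Path("toy_data")  # hypothetical location of the toy .txt files

# Read every text file in the directory into a list of raw strings.
raw_texts = []
for txt_file in sorted(TOY_DATA_DIR.glob("*.txt")):
    raw_texts.append(txt_file.read_text(encoding="utf-8"))

print(f"Loaded {len(raw_texts)} document(s)")
```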
(400, 230, 'Sweden, formally the Kingdom of Sweden, is a Nordic country located on the Scandinavian Peninsula in Northern Europe. It borders Norway to the west and north, Finland to the east, and is connected to Denmark in the southwest by a bridge–tunnel across the Öresund. At 447,425 square kilometres (172,752 sq mi), Sweden is the largest Nordic country, the third-largest country in the European Union, and the fifth-largest country in Europe. The capital and largest city is Stockholm. Sweden has a total population of 10.5 million, and a low population density of 25.5 inhabitants per square kilometre (66/sq mi), with around 87% of Swedes residing in urban areas, which cover 1.5% of the entire land area, in the central and southern half of the country.\n')
4) Process the documents into vectorstore and save it to disk
Real-world documents can be very long, which makes them hard to fit into the context window of many models. Even models that can fit the full document in their context window can struggle to find information in very long inputs.
To handle this, we split the document into chunks for embedding and vector storage. More on text splitting here.
To enable runtime search, we index text chunks by embedding each document split and storing these embeddings in a vector database. Later to search, we embed the query and perform a similarity search to find the stored splits with embeddings most similar to the query.
5) Read the previously processed and saved vector store back from disk
6) Wrap the restored vector store in a retriever and ask our question
" Sweden is the 55th-largest country in the world, the fifth-largest country in Europe, and the largest country in Northern Europe, with a total area of 449,964 km2 (173,732 sq mi). In terms of elevation, the lowest point in Sweden is in the bay of Lake Hammarsjön, near Kristianstad, at -2.41 m (-7.91 ft) below sea level, while the highest point is Kebnekaise, which is 2,111 m (6,926 ft) above sea level.\n\nSweden has a Nordic social welfare system that provides universal health care and tertiary education for its citizens. The country has a high standard of living and ranks very highly in various international metrics, including quality of life, health, education, protection of civil liberties, economic competitiveness, income equality, gender equality, and prosperity. Sweden's GDP per capita is the world's 14th highest.\n\nHistorically, Sweden has been both a kingdom and an empire. Currently, it is a constitutional monarchy and a parliamentary democracy, with a popularly elected parliament and a monarch who serves a ceremonial role. Sweden is a member of the European Union but has opted to remain outside the Eurozone."
RAG Example with LLM, Embedding & Reranking
" The documents provided do not include information about Gustav's grandson ascending the throne. Gustav had several grandchildren, and the documents do not specify which one you are referring to. Moreover, the documents do not provide enough information about the timeline of Gustav's grandson's ascension to the throne. Therefore, it is not possible to answer this question without additional context."
Enhancing accuracy for single data sources
This example demonstrates how a reranking model can be used to refine retrieval results and improve accuracy when retrieving documents.
Typically, reranking is a critical piece of high-accuracy, efficient retrieval pipelines. Generally, there are two important use cases:
- Combining results from multiple data sources
- Enhancing accuracy for single data sources
Here, we focus on demonstrating only the second use case. If you want to know more, check here
" Gustav's grandson, Sigismund, ascended the throne in 1592."
Note:
- In this notebook, we have used NVIDIA NIM microservices from the NVIDIA API Catalog.
- The above APIs, ChatNVIDIA, NVIDIAEmbeddings, and NVIDIARerank, also support self-hosted NIM microservices.
- To use a self-hosted microservice, change the base_url to your deployed NIM URL.
- Example: llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")
- NIM can be hosted locally using Docker, following the NVIDIA NIM for LLMs documentation.