Build a RAG System with Tavily Crawl
In this tutorial, you'll learn how to turn any website into a searchable knowledge base. We'll use Tavily's crawl API to extract information from websites, convert the content into a searchable vector index with OpenAI embeddings and an in-memory Chroma vector store, and create a RAG question-answering system. Beyond the two API keys, this tutorial is self-contained and requires no additional setup.
Getting Started
Follow these steps to set up:
- Sign up for Tavily at app.tavily.com to get your API key.
- Sign up for OpenAI to get your API key. Feel free to substitute any other LLM provider.
- Copy your API keys from your Tavily and OpenAI account dashboards.
- Paste your API keys into the cell below and execute the cell.
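The key setup cell can be sketched as below. The environment-variable names and the placeholder values are assumptions; paste your real keys in their place.

```python
import os

# Placeholder keys (assumed variable names); replace with your real keys.
os.environ["TAVILY_API_KEY"] = "tvly-..."  # from app.tavily.com
os.environ["OPENAI_API_KEY"] = "sk-..."    # from the OpenAI dashboard
```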
Install dependencies in the cell below.
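A minimal install cell might look like this; the exact package list is an assumption based on the libraries used later in the tutorial.

```shell
pip install -qU tavily-python langchain langchain-openai langchain-chroma langchain-text-splitters
```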
Setting Up Your Tavily API Client
The code below will instantiate the Tavily client with your API key.
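A minimal sketch, assuming the `tavily-python` package is installed and the `TAVILY_API_KEY` environment variable holds the key you pasted above:

```python
import os

from tavily import TavilyClient

# Instantiate the Tavily client with the API key from your environment.
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
```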
Step 1: Define the Target Website
Now let's use Tavily to crawl a website and retrieve all of its nested links. Web crawling involves automatically traversing websites by following hyperlinks to uncover various web pages and URLs. Tavily's crawl feature is AI-native, offering rapid responses via parallelized, graph-based processing.
For this example, we're using www.tavily.com.
When crawling web pages, we can specify the output format as either "text" (clean text) or "markdown". For this tutorial, we'll use "text" format since it's better suited for creating embeddings later.
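A sketch of the crawl call, assuming the client instantiated earlier is named `tavily_client`; the parameter names (`url`, `format`) follow the `tavily-python` crawl API as I understand it and should be checked against the current API reference.

```python
# Crawl the site and return clean text for each discovered page.
crawl_results = tavily_client.crawl(
    url="https://www.tavily.com",
    format="text",  # "text" for clean text, "markdown" for markdown
)
```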
Now let's examine all the nested URLs.
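To illustrate what this looks like, here's a sketch over a mocked response; the shape (a `results` list of dicts with `url` and `raw_content` keys) is an assumption to verify against Tavily's crawl documentation.

```python
# Mocked crawl response (assumed shape); the real one comes from tavily_client.crawl().
crawl_results = {
    "results": [
        {"url": "https://www.tavily.com/", "raw_content": "Tavily home page text..."},
        {"url": "https://docs.tavily.com/", "raw_content": "Developer docs text..."},
    ]
}

# List every nested URL the crawl discovered.
urls = [page["url"] for page in crawl_results["results"]]
for url in urls:
    print(url)
```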
Let's run a second crawl with natural language instructions to specifically target developer documentation pages. This demonstrates how we can focus the crawler on specific types of content.
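A sketch of the instructed crawl; `instructions` is the natural-language targeting parameter, and the wording below is just an example.

```python
# Second crawl, steered toward developer documentation pages.
docs_crawl_results = tavily_client.crawl(
    url="https://www.tavily.com",
    instructions="Find all developer documentation pages",
    format="text",
)
```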
Now, the results will only include developer docs from the Tavily webpage.
Step 2: Preview the Raw Content
Let's examine a sample of the raw content from one of the crawled pages to understand the webpage data we're working with:
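A minimal preview cell, assuming the crawl results are stored in `crawl_results` with the `results`/`url`/`raw_content` shape described above:

```python
# Peek at the first crawled page: its URL and the start of its raw content.
sample = crawl_results["results"][0]
print(sample["url"])
print(sample["raw_content"][:500])  # first 500 characters
```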
Step 3: Process Content into Documents
We'll convert the crawled content into LangChain Document objects, which will allow us to:
- Maintain important metadata (source URL, page name)
- Prepare the text for chunking
- Make the content ready for vectorization
Let's run this on the generic crawl results and the developer-specific crawl results.
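The conversion can be sketched as follows, assuming `langchain-core` is installed and the crawl results are named `crawl_results` and `docs_crawl_results`; the `page_name` derivation is a hypothetical helper.

```python
from langchain_core.documents import Document

def to_documents(crawl_results):
    """Wrap each crawled page in a LangChain Document with its metadata."""
    docs = []
    for page in crawl_results["results"]:
        url = page["url"]
        docs.append(
            Document(
                page_content=page["raw_content"],
                metadata={
                    "source": url,
                    # Derive a simple page name from the URL path (hypothetical logic).
                    "page_name": url.rstrip("/").rsplit("/", 1)[-1] or "home",
                },
            )
        )
    return docs

generic_docs = to_documents(crawl_results)
developer_docs = to_documents(docs_crawl_results)
```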
Step 4: Split Documents into Chunks
We'll split the documents into smaller, more manageable chunks using the RecursiveCharacterTextSplitter and preview the result.
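A sketch of the splitting step; the chunk size and overlap below are assumptions to tune for your content, and `generic_docs` is the assumed name of the documents created above.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes are assumptions; tune them for your content.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(generic_docs)

print(f"{len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the first chunk
```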
Step 5: Create Vector Embeddings
Now we'll create vector embeddings for our document chunks using OpenAI's embedding model and store them in a Chroma vector database. This allows us to perform semantic search on our document collection.
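A minimal sketch, assuming the `langchain-openai` and `langchain-chroma` packages; the embedding model name and the retriever's `k` value are assumptions.

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embedding model name is an assumption; any OpenAI embedding model works.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build an in-memory Chroma index over the chunks.
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```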
Step 6: Build the Question-Answering System
Finally, we'll create a retrieval-based question-answering system using gpt-4.1-mini. We use the "stuff" chain type, which combines all relevant retrieved documents into a single context for the model.
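One way to wire this up, assuming LangChain's legacy `RetrievalQA` helper and the `retriever` built in the previous step:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

# "stuff" packs all retrieved chunks into a single prompt for the model.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
```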
Step 7: Test the System
Let's test our RAG system by asking a question about Tavily's documentation.
First, let's ask a generic question about Tavily.
For the developer-specific index, let's ask a detailed question.
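The two queries above can be sketched like this; both questions are made-up examples, and `qa_docs_chain` is a hypothetical second chain built the same way over the developer-docs index.

```python
# Generic question against the chain built over the full-site index.
print(qa_chain.invoke({"query": "What is Tavily?"})["result"])

# Developer-focused question against the hypothetical developer-docs chain.
print(qa_docs_chain.invoke(
    {"query": "How do I authenticate requests to the Tavily API?"}
)["result"])
```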
Conclusion
We've successfully built a complete RAG system that can:
- Crawl web content from a specific domain
- Process and structure the content
- Create vector embeddings for semantic search
- Answer questions based on the crawled information
This approach can be extended to create knowledge bases from any website, documentation, or content repository, making it a powerful tool for building domain-specific assistants and search systems.
For a more advanced implementation of this concept:
- Try out our hosted demo application
- View the complete source code
For more information, read the crawl API reference and best practices guide.