Build a RAG System with Tavily Crawl
In this tutorial, you'll learn how to turn any website into a searchable knowledge base. We'll use Tavily's crawl API to extract information from websites, convert the content into a searchable vector index with OpenAI embeddings and an in-memory Chroma vector store, and create a RAG question-answering system. Beyond the two API keys, this tutorial is self-contained and requires no additional setup.
Getting Started
Follow these steps to set up:
- Sign up for Tavily at app.tavily.com to get your API key.
- Sign up for OpenAI to get your API key. Feel free to substitute any other LLM provider.
- Copy your API keys from your Tavily and OpenAI account dashboards.
- Paste your API keys into the cell below and execute the cell.
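The key setup cell can be sketched as below. The environment-variable names and the placeholder values are assumptions; paste your real keys in their place.

```python
import os

# Placeholder keys (assumed variable names); replace with your real keys.
os.environ["TAVILY_API_KEY"] = "tvly-..."  # from app.tavily.com
os.environ["OPENAI_API_KEY"] = "sk-..."    # from the OpenAI dashboard
```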
Install dependencies in the cell below.
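A minimal install cell might look like this; the exact package list is an assumption based on the libraries used later in the tutorial.

```shell
pip install -qU tavily-python langchain langchain-openai langchain-chroma langchain-text-splitters
```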
Setting Up Your Tavily API Client
The code below will instantiate the Tavily client with your API key.
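A minimal sketch, assuming the `tavily-python` package is installed and the `TAVILY_API_KEY` environment variable holds the key you pasted above:

```python
import os

from tavily import TavilyClient

# Instantiate the Tavily client with the API key from your environment.
tavily_client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
```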
Step 1: Define the Target Website
Now let's use Tavily to crawl a website and retrieve all of its nested links. Web crawling involves automatically traversing websites by following hyperlinks to uncover various web pages and URLs. Tavily's crawl feature is AI-native, offering rapid responses via parallelized, graph-based processing.
For this example, we're using www.tavily.com.
When crawling web pages, we can specify the output format as either "text" (clean text) or "markdown". For this tutorial, we'll use "text" format since it's better suited for creating embeddings later.
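A sketch of the crawl call, assuming the client instantiated earlier is named `tavily_client`; the parameter names (`url`, `format`) follow the `tavily-python` crawl API as I understand it and should be checked against the current API reference.

```python
# Crawl the site and return clean text for each discovered page.
crawl_results = tavily_client.crawl(
    url="https://www.tavily.com",
    format="text",  # "text" for clean text, "markdown" for markdown
)
```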
Now let's examine all the nested URLs.
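To illustrate what this looks like, here's a sketch over a mocked response; the shape (a `results` list of dicts with `url` and `raw_content` keys) is an assumption to verify against Tavily's crawl documentation.

```python
# Mocked crawl response (assumed shape); the real one comes from tavily_client.crawl().
crawl_results = {
    "results": [
        {"url": "https://www.tavily.com/", "raw_content": "Tavily home page text..."},
        {"url": "https://docs.tavily.com/", "raw_content": "Developer docs text..."},
    ]
}

# List every nested URL the crawl discovered.
urls = [page["url"] for page in crawl_results["results"]]
for url in urls:
    print(url)
```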
Let's run a second crawl with natural language instructions to specifically target developer documentation pages. This demonstrates how we can focus the crawler on specific types of content.
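A sketch of the instructed crawl; `instructions` is the natural-language targeting parameter, and the wording below is just an example.

```python
# Second crawl, steered toward developer documentation pages.
docs_crawl_results = tavily_client.crawl(
    url="https://www.tavily.com",
    instructions="Find all developer documentation pages",
    format="text",
)
```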
Now, the results will only include developer docs from the Tavily webpage.
Step 2: Preview the Raw Content
Let's examine a sample of the raw content from one of the crawled pages to understand the webpage data we're working with:
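A minimal preview cell, assuming the crawl results are stored in `crawl_results` with the `results`/`url`/`raw_content` shape described above:

```python
# Peek at the first crawled page: its URL and the start of its raw content.
sample = crawl_results["results"][0]
print(sample["url"])
print(sample["raw_content"][:500])  # first 500 characters
```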
Step 3: Process Content into Documents
We'll convert the crawled content into LangChain Document objects, which will allow us to:
- Maintain important metadata (source URL, page name)
- Prepare the text for chunking
- Make the content ready for vectorization
Let's run this on the generic crawl results and the developer-specific crawl results.
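The conversion can be sketched as follows, assuming `langchain-core` is installed and the crawl results are named `crawl_results` and `docs_crawl_results`; the `page_name` derivation is a hypothetical helper.

```python
from langchain_core.documents import Document

def to_documents(crawl_results):
    """Wrap each crawled page in a LangChain Document with its metadata."""
    docs = []
    for page in crawl_results["results"]:
        url = page["url"]
        docs.append(
            Document(
                page_content=page["raw_content"],
                metadata={
                    "source": url,
                    # Derive a simple page name from the URL path (hypothetical logic).
                    "page_name": url.rstrip("/").rsplit("/", 1)[-1] or "home",
                },
            )
        )
    return docs

generic_docs = to_documents(crawl_results)
developer_docs = to_documents(docs_crawl_results)
```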
Step 4: Split Documents into Chunks
We'll split the documents into smaller, more manageable chunks using the RecursiveCharacterTextSplitter and preview the result.
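A sketch of the splitting step; the chunk size and overlap below are assumptions to tune for your content, and `generic_docs` is the assumed name of the documents created above.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunk sizes are assumptions; tune them for your content.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(generic_docs)

print(f"{len(chunks)} chunks")
print(chunks[0].page_content[:200])  # preview the first chunk
```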
Step 5: Create Vector Embeddings
Now we'll create vector embeddings for our document chunks using OpenAI's embedding model and store them in a Chroma vector database. This allows us to perform semantic search on our document collection.
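A minimal sketch, assuming the `langchain-openai` and `langchain-chroma` packages; the embedding model name and the retriever's `k` value are assumptions.

```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Embedding model name is an assumption; any OpenAI embedding model works.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build an in-memory Chroma index over the chunks.
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```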
Step 6: Build the Question-Answering System
Finally, we'll create a retrieval-based question-answering system using gpt-4.1-mini. We use the "stuff" chain type, which combines all relevant retrieved documents into a single context for the model.
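One way to wire this up, assuming LangChain's legacy `RetrievalQA` helper and the `retriever` built in the previous step:

```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

# "stuff" packs all retrieved chunks into a single prompt for the model.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
```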
Step 7: Test the System
Let's test our RAG system by asking a question about Tavily's documentation.
First, let's ask a generic question about Tavily.
For the developer-specific index, let's ask a detailed question.
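The two queries above can be sketched like this; both questions are made-up examples, and `qa_docs_chain` is a hypothetical second chain built the same way over the developer-docs index.

```python
# Generic question against the chain built over the full-site index.
print(qa_chain.invoke({"query": "What is Tavily?"})["result"])

# Developer-focused question against the hypothetical developer-docs chain.
print(qa_docs_chain.invoke(
    {"query": "How do I authenticate requests to the Tavily API?"}
)["result"])
```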
Conclusion
We've successfully built a complete RAG system that can:
- Crawl web content from a specific domain
- Process and structure the content
- Create vector embeddings for semantic search
- Answer questions based on the crawled information
This approach can be extended to create knowledge bases from any website, documentation, or content repository, making it a powerful tool for building domain-specific assistants and search systems.
For a more advanced implementation of this concept:
- Try out our hosted demo application
- View the complete source code
For more information, read the crawl API reference and best practices guide.