Apify Haystack RAG Web Browser
Search and browse the web with Apify and Haystack
Want to give any of your LLM applications the power to search and browse the web? In this cookbook, we'll show you how to use the RAG Web Browser Actor to search Google and extract content from web pages, then analyze the results using a large language model - all within the Haystack ecosystem using the apify-haystack integration.
This cookbook also demonstrates how to leverage the RAG Web Browser Actor with Haystack to create powerful web-aware applications. We'll explore multiple use cases showing how easy it is to:
- Search interesting topics
- Analyze the results with OpenAIChatGenerator
- Use the Haystack Pipeline for web search and analysis
We'll start by using the RAG Web Browser Actor to perform web searches, then use the OpenAIChatGenerator to analyze and summarize the web content.
Install dependencies
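The two libraries this cookbook relies on are published on PyPI as apify-haystack and haystack-ai (for example, `pip install apify-haystack haystack-ai`). The sketch below only checks whether they are importable and prints the install command if not; the helper name `missing_packages` is hypothetical and not part of either library.

```python
import importlib.util

# Import name -> PyPI package name for the two libraries this cookbook uses.
REQUIRED = {"apify_haystack": "apify-haystack", "haystack": "haystack-ai"}


def missing_packages(required: dict) -> list:
    """Return the PyPI names of packages whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All dependencies are installed.")
```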
Set up the API keys
You need to have an Apify account and obtain APIFY_API_TOKEN.
You also need an OpenAI account and an OPENAI_API_KEY.
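One way to provide both keys without hard-coding them is to prompt for any that are missing and store them in environment variables. This is a minimal sketch; `ensure_env` is a hypothetical helper, and the injectable `prompt` argument exists only so the function can be exercised without a terminal.

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt=getpass) -> str:
    """Return environment variable `name`, prompting only if it is unset."""
    if not os.environ.get(name):
        # getpass hides the typed value, so the key is not echoed in the notebook.
        os.environ[name] = prompt(f"Enter YOUR {name}")
    return os.environ[name]
```

In a notebook, calling `ensure_env("APIFY_API_TOKEN")` and `ensure_env("OPENAI_API_KEY")` before creating any components should be enough.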
Search interesting topics
The RAG Web Browser Actor is designed to enhance AI and Large Language Model (LLM) applications by providing up-to-date web content. It operates by accepting a search phrase or URL, performing a Google Search, crawling web pages from the top search results, cleaning the HTML, and converting the content into text or Markdown.
Output Format
The output from the RAG Web Browser Actor is a JSON array, where each object contains:
- crawl: Details about the crawling process, including HTTP status code and load time.
- searchResult: Information from the search result, such as the title, description, and URL.
- metadata: Additional metadata like the page title, description, language code, and URL.
- markdown: The main content of the page, converted into Markdown format.
For example, the query "rag web browser" returns:
[
  {
    "crawl": {
      "httpStatusCode": 200,
      "httpStatusMessage": "OK",
      "loadedAt": "2024-11-25T21:23:58.336Z",
      "uniqueKey": "eM0RDxDQ3q",
      "requestStatus": "handled"
    },
    "searchResult": {
      "title": "apify/rag-web-browser",
      "description": "Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "metadata": {
      "title": "GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...",
      "description": "RAG Web Browser is an Apify Actor to feed your LLM applications ...",
      "languageCode": "en",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
  }
]
We will convert this JSON to a Haystack Document using the dataset_mapping_function as follows:
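A mapping function along these lines would do the conversion, using the field names from the JSON example above. This is a sketch rather than the only possible mapping: `extract_document_fields` is a hypothetical helper split out so the field selection can be checked on its own, and the choice of which metadata to keep is an assumption.

```python
def extract_document_fields(dataset_item: dict) -> dict:
    """Pick the page text and a few metadata fields out of one Actor result."""
    metadata = dataset_item.get("metadata") or {}
    return {
        "content": dataset_item.get("markdown"),
        "meta": {
            "title": metadata.get("title"),
            "url": metadata.get("url"),
            "language": metadata.get("languageCode"),
        },
    }


def dataset_mapping_function(dataset_item: dict):
    """Convert one Actor result into a Haystack Document."""
    from haystack import Document  # imported lazily; requires haystack-ai

    return Document(**extract_document_fields(dataset_item))
```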
Now set up the ApifyDatasetFromActorCall component:
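A sketch of that setup follows. The input values (two results, Markdown output, a 30-second request timeout) match the input shown in the run log later in this cookbook; `RUN_INPUT` and `build_document_loader` are hypothetical names, and the import is deferred so the snippet can be read and checked without the package installed.

```python
# Actor input shared by every search; the query itself is supplied at run time.
RUN_INPUT = {
    "maxResults": 2,
    "outputFormats": ["markdown"],
    "requestTimeoutSecs": 30,
}


def build_document_loader(mapping_function, apify_api_token=None):
    """Create the component that runs the Actor and maps its results to Documents."""
    from apify_haystack import ApifyDatasetFromActorCall  # requires apify-haystack

    return ApifyDatasetFromActorCall(
        actor_id="apify/rag-web-browser",
        run_input=dict(RUN_INPUT),
        dataset_mapping_function=mapping_function,
        # None falls back to the APIFY_API_TOKEN environment variable.
        apify_api_token=apify_api_token,
    )
```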
See the RAG Web Browser repository on GitHub (https://github.com/apify/rag-web-browser) for the other run_input parameters.
Note that you can also pass your API token manually through the apify_api_token constructor parameter if it is not set as an environment variable.
Run the Actor and fetch results
Let's run the Actor with a sample query and fetch the results. The run may take tens of seconds, depending on the number of websites requested.
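The run itself can be sketched as below. It assumes `loader` is an ApifyDatasetFromActorCall whose `run()` accepts a `run_input` override and returns a dict with a `documents` list; `format_preview` and `search_and_preview` are hypothetical helper names.

```python
def format_preview(title, content, limit=100):
    """Return a short, printable preview of one fetched page."""
    return f"Title: {title}\nTruncated content: {(content or '')[:limit]} ...\n---"


def search_and_preview(loader, base_input, query, max_results=2):
    """Run the Actor for `query` and print a preview of each returned Document."""
    # Merge the per-call values into the shared input instead of mutating it.
    run_input = {**base_input, "query": query, "maxResults": max_results}
    result = loader.run(run_input=run_input)
    documents = result.get("documents", [])
    for doc in documents:
        print(format_preview(doc.meta.get("title"), doc.content))
    return documents
```

For the run logged in this cookbook, the query was "Artificial intelligence latest developments" with two results requested.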
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message:
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.032Z ACTOR: Pulling Docker image of build mYEmhSzwMdjILx279 from registry.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.034Z ACTOR: Creating Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.096Z ACTOR: Starting Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.014Z INFO System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.0","osType":"Linux","nodeVersion":"v22.9.0"}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.165Z INFO Actor is running in the NORMAL mode.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.525Z INFO Loaded input: {"query":"Artificial intelligence latest developments","maxResults":2,"outputFormats":["markdown"],"requestTimeoutSecs":30, ...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.527Z cheerioCrawlerOptions: {...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.529Z contentCrawlerOptions: {...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.531Z contentScraperSettings {...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.533Z
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.535Z INFO Creating new cheerio crawler with key {...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.547Z INFO Number of crawlers 1
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.549Z INFO Creating new cheerio crawler with key {...}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.551Z INFO Number of crawlers 2
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.553Z INFO Added request to cheerio-google-search-crawler: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.554Z INFO Running Google Search crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm", ... [line-too-long]
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.629Z INFO CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.454Z INFO Search-crawler requestHandler: Processing URL: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.474Z INFO Extracted 2 results:
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.478Z https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.481Z https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.482Z INFO Added request to the cheerio-content-crawler: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.485Z INFO Added request to the cheerio-content-crawler: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.486Z INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.764Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3821,"requestsFinishedPerMinute":14,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3821,"requestsTotal":1,"crawlerRuntimeMillis":4229}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.766Z INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.807Z INFO Running target page crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm", ... [line-too-long]
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.899Z INFO CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:05.708Z INFO Processing URL: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.075Z INFO Adding result to the Apify dataset, url: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.141Z INFO Processing URL: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.286Z INFO Adding result to the Apify dataset, url: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.374Z INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.159Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":2,"requestsFailed":0,"retryHistogram":[2],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1400,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2799,"requestsTotal":2,"crawlerRuntimeMillis":6623}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Finished! Total 2 requests: 2 succeeded, 0 failed.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.161Z INFO CheerioCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}
Title: Latest AI Breakthroughs and News: May, June, July 2025 | News
Truncated content: Latest AI Breakthroughs and News: May, June, July 2025 | News July 7, 2025 # Latest AI Breakthroug ...
---
Title: AI News | Latest AI News, Analysis & Events
Truncated content: AI News | Latest AI News, Analysis & Events [Skip to content](#content) AI News is part of the Tech ...
---
Analyze the results with OpenAIChatGenerator
Use the OpenAIChatGenerator to analyze and summarize the web content.
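A minimal sketch of this step is shown below. The prompt wording, the `max_chars` truncation, and the model name are assumptions rather than values from this cookbook, and `build_summary_prompt` and `summarize_documents` are hypothetical helpers. The reply is read through `ChatMessage.text`, which is how recent Haystack 2.x releases expose it; older releases may use a different attribute.

```python
def build_summary_prompt(content, max_chars=8000):
    """Build the user prompt, truncating long pages to keep the request small."""
    instruction = "Summarize the key points of the following web page:\n\n"
    return instruction + (content or "")[:max_chars]


def summarize_documents(documents, generator=None):
    """Print one summary per Document using a chat generator."""
    from haystack.dataclasses import ChatMessage  # requires haystack-ai

    if generator is None:
        from haystack.components.generators.chat import OpenAIChatGenerator

        # The model name is an assumption; the key is read from OPENAI_API_KEY.
        generator = OpenAIChatGenerator(model="gpt-4o-mini")

    for doc in documents:
        messages = [ChatMessage.from_user(build_summary_prompt(doc.content))]
        summary = generator.run(messages=messages)["replies"][0].text
        title, url = doc.meta.get("title"), doc.meta.get("url")
        print(f"Summary for {title} available from {url}:\n{summary}\n---")
```

Giving the model an explicit instruction, as the prompt here does, makes it less likely to reply by asking what to do with the page.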
Summary for Latest AI Breakthroughs and News: May, June, July 2025 | News available from https://www.crescendo.ai/news/latest-ai-news-and-updates:
The article you provided details significant advancements and updates in the AI landscape during May, June, and July of 2025. Here’s a summary of the notable points:

### Key AI Breakthroughs and News:
1. **Materials Science in Singapore**: The A*STAR research agency in Singapore is using AI to expedite breakthroughs in materials science, significantly reducing the time needed for sustainable and high-performance compound discovery.
2. **Capgemini Acquires WNS**: Capgemini's acquisition of WNS for $3.3 billion aims to enhance its enterprise AI capabilities, particularly in sectors like financial services and healthcare.
3. **Research on AI Safety**: A study indicated that under survival threats, some AI models may resort to deceitful tactics like blackmail, prompting discussions on AI ethics and safety.
4. **Isomorphic Labs**: This AI drug discovery company began human trials for drugs designed using AI, signifying a new age in pharmaceutical research.
5. **AI Job Displacement**: The rise of AI technologies is linked to increased unemployment rates among recent graduates, particularly in entry-level roles.
6. **Texas AI Regulation**: Texas passed comprehensive legislation governing the utilization of AI within both public and private sectors, establishing rules for transparency and bias mitigation.
7. **AI in Education**: A pledge by Donald Trump to incorporate AI education in K-12 schools gained support from numerous organizations, though critics expressed concerns over political influences.
8. **AI-Assisted Healthcare Innovations**: New AI models have shown promise in early disease detection, including a model with over 90% accuracy for cancer diagnoses.
9. **Defense and AI Collaboration**: A strategic partnership between HII and C3.ai aims to enhance U.S. Navy shipbuilding efficiency through AI applications.
10. **Regulatory Developments**: The BRICS nations have advocated for UN-led global governance on AI to ensure equitable access and ethical practices in technology.

### Major Players and Developments:
- **OpenAI's Future**: The upcoming GPT-5 model aims to integrate the strengths of various AI models, expected to launch later in 2025.
- **Samsung and AI Chips**: Anticipating a profit drop due to sluggish AI chip demand, emphasizing market volatility.
- **Meta's AI Investments**: Meta's significant investment indicates its dedication to AI infrastructure, though concerns about market saturation grow.
- **AI's Role in Content Creation**: AI tools are transforming industries like publishing and video generation, reflecting a shift in how content is created and managed.

These highlights reflect a rapidly evolving AI landscape, showcasing both opportunities for innovation and challenges regarding ethics, safety, and employment. The ongoing discourse in these areas will likely shape the future of AI applications across various sectors.
---
Summary for AI News | Latest AI News, Analysis & Events available from https://www.artificialintelligence-news.com/:
It seems you provided a large segment of a webpage related to AI news, including various articles and categories in the realm of artificial intelligence. If you're looking for specific information, summarization, or analysis of any section, please specify your request!
---
Use the Haystack Pipeline for web search and analysis
Now let's create a more sophisticated pipeline that can handle different types of content and generate specialized analyses. We'll create a pipeline that:
- Searches the web using RAG Web Browser
- Cleans and filters the documents
- Routes them based on content type
- Generates customized summaries for different types of content
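The search, cleaning, and summary steps above could be wired together roughly as follows. This is a sketch under assumptions, not the cookbook's exact pipeline: the component names (`loader`, `cleaner`, `prompt_builder`, `llm`), the prompt template, and the model name are all illustrative, and the routing step is left out for brevity (a router component could be inserted between the cleaner and the prompt builder).

```python
# Jinja template rendered by the prompt builder; the wording is an assumption.
PROMPT_TEMPLATE = """Analyze the following web pages and summarize their key points.
{% for doc in documents %}
Source: {{ doc.meta["url"] }}
{{ doc.content }}
{% endfor %}"""


def build_pipeline(document_loader, model="gpt-4o-mini"):
    """Wire loader -> cleaner -> prompt builder -> chat generator into a Pipeline."""
    from haystack import Pipeline  # requires haystack-ai
    from haystack.components.builders import ChatPromptBuilder
    from haystack.components.generators.chat import OpenAIChatGenerator
    from haystack.components.preprocessors import DocumentCleaner
    from haystack.dataclasses import ChatMessage

    pipe = Pipeline()
    pipe.add_component("loader", document_loader)
    pipe.add_component(
        "cleaner",
        DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True),
    )
    pipe.add_component(
        "prompt_builder",
        ChatPromptBuilder(template=[ChatMessage.from_user(PROMPT_TEMPLATE)]),
    )
    pipe.add_component("llm", OpenAIChatGenerator(model=model))

    # Documents flow from the Actor through the cleaner into the prompt.
    pipe.connect("loader.documents", "cleaner.documents")
    pipe.connect("cleaner.documents", "prompt_builder.documents")
    pipe.connect("prompt_builder.prompt", "llm.messages")
    return pipe
```

Running it would then look something like `pipe.run({"loader": {"run_input": {...}}})`, with the generated summary available under `result["llm"]["replies"]`.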
You can customize the pipeline further by:
- Adding more sophisticated routing logic
- Implementing additional preprocessing steps
- Creating specialized generators for different content types
- Adding error handling and retries
- Implementing caching for improved performance
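As an illustration of the last two items, the sketch below wraps any search function with retries and an in-memory cache keyed by query. Both `with_retries` and `CachedSearch` are hypothetical helpers written for this example; they are not part of Haystack or apify-haystack, and a real application would likely catch narrower exceptions and bound the cache size.

```python
import time


def with_retries(func, attempts=3, delay_secs=1.0):
    """Call `func()` up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:  # narrow this to the errors you expect
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(delay_secs)


class CachedSearch:
    """Remember results per query so repeated searches skip the Actor run."""

    def __init__(self, search_fn, attempts=3, delay_secs=1.0):
        self._search_fn = search_fn
        self._attempts = attempts
        self._delay_secs = delay_secs
        self._cache = {}

    def __call__(self, query):
        if query not in self._cache:
            self._cache[query] = with_retries(
                lambda: self._search_fn(query), self._attempts, self._delay_secs
            )
        return self._cache[query]
```

Wrapping the function that triggers the Actor run means a repeated query costs nothing, and a transient failure is retried before it reaches the rest of the pipeline.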
This completes our exploration of using Apify's RAG Web Browser with Haystack for web-aware AI applications. The combination of web search capabilities with sophisticated content processing and analysis creates powerful possibilities for research, analysis and many other tasks.