What problem does this tool solve?

This tool overcomes an LLM's knowledge-cutoff limitation by fetching live web search results and feeding them to the model via Retrieval-Augmented Generation (RAG), allowing it to answer questions about recent events or information beyond its training cutoff.

What are the three main steps in this workflow?

The tool has three steps: (1) set up a search engine using Google's Custom Search API with query expansion and domain filtering, (2) build a search dictionary by scraping result URLs, filtering out scripts/ads, and summarizing content with an LLM, and (3) generate a RAG response by passing the structured search data and original query to the LLM.

What credentials and packages do I need to run this?

You need Python 3.12+, the `requests`, `beautifulsoup4`, and `openai` packages, a Google Custom Search API key and Custom Search Engine ID (CSE ID), and an `OPENAI_API_KEY` environment variable.

When should I use this tool versus a hosted search solution?

Use this when you need current information synthesized into an LLM response with citations and want direct control over the search, scrape, and summarize pipeline rather than relying on a hosted search tool.

Prompt Chain

Augment LLM Knowledge with Real-Time Web Search

Name: Web Search With Google Api Bring Your Own Browser Tool
Availability: OnlineOnly
Author: OpenAI Cookbook

OpenAI cookbook building a Bring Your Own Browser (BYOB) tool with Google Custom Search API and RAG to beat knowledge cutoffs.

Copy chain

Works with google

OpenAI Cookbook

Maintainer?

Spark score

out of 100

Updated 20 days ago

Version 1.0.0

Models

gpt 4o gemini 2 0

Add to Favorites

Why it matters

Overcome LLM knowledge cutoffs by integrating real-time web search. This asset retrieves current information, processes it, and feeds it to an LLM for up-to-date, relevant responses.

Outcomes

What it gets done

Configure Google Custom Search API for targeted web queries.

Scrape and summarize content from relevant web pages.

Implement Retrieval-Augmented Generation (RAG) for LLM responses.

Structure search results into a usable dictionary for LLM input.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/oai-websearchwithgoogleapibringyourownbrowsertool | bash

Steps

Steps in the chain

Step 1: Set Up a Search Engine to Provide Web Search Results

Step 2: Build a Search Dictionary with Titles, URLs, and Summaries

Step 3: Pass Information to Model to Generate RAG Response

Overview

Web Search With Google Api Bring Your Own Browser Tool

OpenAI cookbook building a Bring Your Own Browser (BYOB) tool that pairs Google's Custom Search API with Retrieval-Augmented Generation so an LLM can answer with current, post-knowledge-cutoff information, illustrated via recent OpenAI product launches. Use when an application needs LLM answers grounded in current, post-cutoff web information and you want to build the search-plus-RAG pipeline yourself.

What it does

This OpenAI cookbook builds a Bring Your Own Browser (BYOB) tool in Python that overcomes an LLM's knowledge-cutoff limitation by giving it live web-search access, illustrated with GPT-4o not knowing about the o1-preview model launched in September 2024. It uses Google's Custom Search API (rather than any other public search API) to retrieve relevant web pages, then passes the retrieved information to the LLM through Retrieval-Augmented Generation (RAG) to produce an answer grounded in current information. The tool has three steps. Step 1 sets up the search engine: using a Google API key and Custom Search Engine (CSE) ID from the Google Developers Console, a search function takes a search term plus a result count and supports a site_filter parameter (used here to restrict results to the openai.com domain); before calling it, the guide applies query expansion - broadening a natural-language user question into a more specific, succinct search phrase, since search engines match specific terms better than full natural-language prompts. Step 2 builds a structured search dictionary: for each result URL, a retrieve_content function scrapes the page and filters out non-relevant material like scripts and ads, and a summarize_content function uses an LLM to condense that scraped text into a concise, query-focused summary, with title, link, and summary organized into a dictionary or DataFrame. Step 3 passes this structured (JSON) search data along with the original user query to the LLM, which generates a final RAG response that includes information beyond the model's original knowledge cutoff, with appropriate citations back to the source pages.

When to use - and when NOT to

Use this pattern when an application needs answers grounded in current, post-cutoff information - recent product launches, news, or any time-sensitive query - and you want to build the web-search-plus-summarization pipeline yourself rather than rely on a hosted browsing tool. It is not intended for production scraping at scale (the cookbook explicitly restricts its example searches to the openai.com domain and is educational only) and does not cover legal/compliance handling of web scraping and search API terms of service, which the guide explicitly tells developers to check themselves.

Inputs and outputs

Input is a natural-language user query needing current information. Output is a RAG-generated answer that incorporates post-cutoff facts, backed by a structured dictionary of retrieved web pages (title, URL, summary) used as grounding context and citation sources.

Integrations

Built on Google's Custom Search API (API key plus Custom Search Engine ID) for web search, the OpenAI API for both content summarization and final RAG response generation, and Python 3.12+ with the requests, beautifulsoup4, and openai packages, reading OPENAI_API_KEY from the environment.

Who it's for

Developers building applications that need LLM answers grounded in current web information beyond the model's training cutoff, who want a from-scratch, educational reference for wiring a search API and RAG pipeline together rather than a managed browsing tool.

Source README

Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

Disclaimer: This cookbook is for educational purposes only. Ensure that you comply with all applicable laws and service terms when using web search and scraping technologies. This cookbook will restrict the search to openai.com domain to retrieve the public information to illustrate the concepts.

Large Language Models (LLMs) such as GPT-4o have a knowledge cutoff date, which means they lack information about events that occurred after that point. In scenarios where the most recent data is essential, it's necessary to provide LLMs with access to current web information to ensure accurate and relevant responses.

In this guide, we will build a Bring Your Own Browser (BYOB) tool using Python to overcome this limitation. Our goal is to create a system that provides up-to-date answers in your application, including the most recent developments such as the latest product launches by OpenAI. By integrating web search capabilities with an LLM, we'll enable the model to generate responses based on the latest information available online.

While you can use any publicly available search APIs, we'll utilize Google's Custom Search API to perform web searches. The retrieved information from the search results will be processed and passed to the LLM to generate the final response through Retrieval-Augmented Generation (RAG).

Bring Your Own Browser (BYOB) tools allow users to perform web browsing tasks programmatically. In this notebook, we'll create a BYOB tool that:

#1. Set Up a Search Engine: Use a public search API, such as Google's Custom Search API, to perform web searches and obtain a list of relevant search results.

#2. Build a Search Dictionary: Collect the title, URL, and a summary of each web page from the search results to create a structured dictionary of information.

#3. Generate a RAG Response: Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates a final response to the user's query.

Use Case

In this cookbook, we'll take the example of a user who wants to list recent product launches by OpenAI in chronological order. Because the current GPT-4o model has a knowledge cutoff date, it is not expected that the model will know about recent product launches such as the o1-preview model launched in September 2024.

Given the knowledge cutoff, as expected the model does not know about the recent product launches by OpenAI.

Setting up a BYOB tool

To provide the model with recent events information, we'll follow these steps:

Step 1: Set Up a Search Engine to Provide Web Search Results

Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages

Step 3: Pass the information to the model to generate a RAG Response to the User Query

Before we begin, ensure you have the following: Python 3.12 or later installed on your machine. You will also need a Google Custom Search API key and Custom Search Engine ID (CSE ID). Necessary Python packages installed: requests, beautifulsoup4, openai. And ensure the OPENAI_API_KEY is set up as an environment variable.

Step 1: Set Up a Search Engine to Provide Web Search Results

You can use any publicly available web search APIs to perform this task. We will configure a custom search engine using Google's Custom Search API. This engine will fetch a list of relevant web pages based on the user's query, focusing on obtaining the most recent and pertinent results.

a. Configure Search API key and Function: Acquire a Google API key and a Custom Search Engine ID (CSE ID) from the Google Developers Console. You can navigate to this Programmable Search Engine Link to set up an API key as well as Custom Search Engine ID (CSE ID).

The search function below sets up the search based on search term, the API and CSE ID keys, as well as number of search results to return. We'll introduce a parameter site_filter to restrict the output to only openai.com

b. Identify the search terms for search engine: Before we can retrieve specific results from a 3rd Party API, we may need to use Query Expansion to identify specific terms our browser search API should retrieve. Query expansion is a process where we broaden the original user query by adding related terms, synonyms, or variations. This technique is essential because search engines, like Google's Custom Search API, are often better at matching a range of related terms rather than just the natural language prompt used by a user.

For example, searching with only the raw query "List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years" may return fewer and less relevant results than a more specific and direct search on a succinct phrase such as "Latest OpenAI product launches". In the code below, we will use the user's original search_query to produce a more specific search term to use with the Google API to retrieve the results.

c. Invoke the search function: Now that we have the search term, we will invoke the search function to retrieve the results from Google search API. The results only have the link of the web page and a snippet at this point. In the next step, we will retrieve more information from the webpage and summarize it in a dictionary to pass to the model.

Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages

After obtaining the search results, we'll extract and organize the relevant information, so it can be passed to the LLM for final output.

a. Scrape Web Page Content: For each URL in the search results, retrieve the web page to extract textual content while filtering out non-relevant data like scripts and advertisements as demonstrated in function retrieve_content.

b. Summarize Content: Use an LLM to generate concise summaries of the scraped content, focusing on information pertinent to the user's query. Model can be provided the original search text, so it can focus on summarizing the content for the search intent as outlined in function summarize_content.

c. Create a Structured Dictionary: Organize the data into a dictionary or a DataFrame containing the title, link, and summary for each web page. This structure can be passed on to the LLM to generate the summary with the appropriate citations.

We retrieved the most recent results. (Note these will vary depending on when you execute this script.)

Step 3: Pass the information to the model to generate a RAG Response to the User Query

With the search data organized in a JSON data structure, we will pass this information to the LLM with the original user query to generate the final response. Now, the LLM response includes information beyond its original knowledge cutoff, providing current insights.

Conclusion

Large Language Models (LLMs) have a knowledge cutoff and may not be aware of recent events. To provide them with the latest information, you can build a Bring Your Own Browser (BYOB) tool using Python. This tool retrieves current web data and feeds it to the LLM, enabling up-to-date responses.

The process involves three main steps:

#1 Set Up a Search Engine: Use a public search API, like Google's Custom Search API, to perform web searches and obtain a list of relevant search results.

#2 Build a Search Dictionary: Collect the title, URL, and a summary of each web page from the search results to create a structured dictionary of information.

#3. Generate a RAG Response: Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates a final response to the user's query.

By following these steps, you enhance the LLMs ability to provide up-to-date answers in your application that include the most recent developments, such as the latest product launches by OpenAI.

FAQ

Common questions

Discussion

Augment LLM Knowledge with Real-Time Web Search

What it gets done

Add it to your toolbox

Steps in the chain

Web Search With Google Api Bring Your Own Browser Tool

What it does

When to use - and when NOT to

Inputs and outputs

Integrations

Who it's for

Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

Use Case

Setting up a BYOB tool

Step 1: Set Up a Search Engine to Provide Web Search Results

Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages

Step 3: Pass the information to the model to generate a RAG Response to the User Query

Step 1: Set Up a Search Engine to Provide Web Search Results

Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages

Step 3: Pass the information to the model to generate a RAG Response to the User Query

Conclusion

Common questions

Questions & comments · 0