Copyright 2025 Google LLC.
This notebook introduces context caching with the Gemini API and provides examples of interacting with the Apollo 11 transcript using the Python SDK. For a more comprehensive look, check out the caching guide.
Install dependencies
Configure your API key
To run the following cell, your API key must be stored in a Colab Secret named GOOGLE_API_KEY. If you don't already have an API key, or you're not sure how to create a Colab Secret, see Authentication for an example.
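As a minimal sketch of this step, the helper below reads the key from the Colab Secret and builds a client with the google-genai SDK. The fallback to an environment variable of the same name is an assumption added for running outside Colab, and the helper names get_api_key and make_client are illustrative only.

```python
import os


def get_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from a Colab Secret, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Assumption: outside Colab, the key is exported as an environment variable.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


def make_client():
    """Create a Gemini API client (requires the google-genai package)."""
    # Imported lazily so get_api_key() can be used without the SDK installed.
    from google import genai

    return genai.Client(api_key=get_api_key())
```

Calling make_client() once and reusing the returned client for the rest of the notebook keeps the key handling in one place.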
Upload a file
A common pattern with the Gemini API is to ask a number of questions of the same document. Context caching is designed to assist with this case, and can be more efficient by avoiding the need to pass the same tokens through the model for each new request.
This example will be based on the transcript from the Apollo 11 mission.
Start by downloading that transcript.
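The download can be sketched with the standard library alone. The URL below is an assumption about where the transcript is hosted and is not stated in this notebook, so substitute your own copy if it differs. The helper names download and preview are illustrative.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed location of the transcript; replace it if your copy lives elsewhere.
TRANSCRIPT_URL = "https://storage.googleapis.com/generativeai-downloads/data/a11.txt"


def download(url: str, dest: str = "a11.txt") -> Path:
    """Fetch url into dest and return the local path."""
    path, _headers = urlretrieve(url, dest)
    return Path(path)


def preview(path: Path, n_chars: int = 300) -> str:
    """Return the first n_chars characters of a text file."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read(n_chars)
```

For example, print(preview(download(TRANSCRIPT_URL))) would show the opening of the file, which is what the output that follows displays.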
INTRODUCTION This is the transcription of the Technical Air-to-Ground Voice Transmission (GOSS NET 1) from the Apollo 11 mission. Communicators in the text may be identified according to the following list. Spacecraft: CDR Commander Neil A. Armstrong CMP Command module pilot Michael Collins LMP Lunar module pilot Edwin E. ALdrin, Jr.
Now upload the transcript using the File API.
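A minimal sketch of the upload, assuming client is a genai.Client from the google-genai SDK and that the transcript was saved locally as a11.txt (both names are assumptions):

```python
def upload_transcript(client, path: str = "a11.txt"):
    """Upload a local file through the File API and return the file handle."""
    # client.files.upload returns a File object that can later be passed
    # as part of a prompt or of cached contents.
    document = client.files.upload(file=path)
    return document
```

The returned handle is what gets referenced when building the cache, so keep it in a variable such as document.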
Cache the prompt
Next, create a CachedContent object specifying the prompt you want to use, including the file and other fields you wish to cache. In this example the system_instruction has been set, and the document was provided in the prompt.
Note that caches are model-specific. You cannot use a cache with a model other than the one it was created for, as tokenization might differ slightly between models.
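Cache creation might look like the sketch below, assuming the google-genai SDK, a client object, and a document handle returned by the File API. The model name matches the one shown in the output that follows. The system instruction text is a placeholder and not necessarily the one used to produce that output.

```python
def create_transcript_cache(client, document, model: str = "gemini-2.5-flash"):
    """Create a CachedContent holding the document and a system instruction."""
    return client.caches.create(
        model=model,  # the cache can only be used with this same model
        config={
            "contents": [document],
            # Placeholder instruction; adapt it to your use case.
            "system_instruction": "You are an expert at analyzing transcripts.",
        },
    )
```

Since no ttl is passed here, the cache keeps the default lifetime.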
CachedContent(
  create_time=datetime.datetime(2025, 8, 6, 13, 48, 36, 419118, tzinfo=TzInfo(UTC)),
  display_name='',
  expire_time=datetime.datetime(2025, 8, 6, 14, 48, 36, 38936, tzinfo=TzInfo(UTC)),
  model='models/gemini-2.5-flash',
  name='cachedContents/0c5j38gpopx49ok6x7kedvbpy65d1bzkq8i5vldr',
  update_time=datetime.datetime(2025, 8, 6, 13, 48, 36, 419118, tzinfo=TzInfo(UTC)),
  usage_metadata=CachedContentUsageMetadata(
    total_token_count=322698
  )
)
Manage the cache expiry
Once you have a CachedContent object, you can update the expiry time to keep it alive while you need it.
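A sketch of extending the lifetime, assuming the google-genai SDK and a client object. The helper name extend_cache is illustrative, and the default of 7200 seconds is chosen to match the two-hour expiry visible in the output that follows.

```python
def extend_cache(client, cache_name: str, ttl_seconds: int = 7200):
    """Reset the cache's time-to-live, counted from the moment of the update."""
    return client.caches.update(
        name=cache_name,
        config={"ttl": f"{ttl_seconds}s"},  # durations are passed as e.g. "7200s"
    )
```

The returned CachedContent carries the new expire_time, so printing it is a quick way to confirm the update took effect.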
CachedContent(
  create_time=datetime.datetime(2025, 8, 6, 13, 48, 36, 419118, tzinfo=TzInfo(UTC)),
  display_name='',
  expire_time=datetime.datetime(2025, 8, 6, 15, 48, 36, 651814, tzinfo=TzInfo(UTC)),
  model='models/gemini-2.5-flash',
  name='cachedContents/0c5j38gpopx49ok6x7kedvbpy65d1bzkq8i5vldr',
  update_time=datetime.datetime(2025, 8, 6, 13, 48, 36, 691886, tzinfo=TzInfo(UTC)),
  usage_metadata=CachedContentUsageMetadata(
    total_token_count=322698
  )
)
Use the cache for generation
As the CachedContent object refers to a specific model and parameters, you must reference it explicitly when generating: pass the cache's name in the cached_content field of the generation config, and use the same model the cache was created with. Then, generate content as you would without a cache.
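As a sketch of this step with the google-genai SDK, where a cache is referenced by name through the request config: the helper name ask_with_cache is illustrative, and the model default assumes the cache was created with gemini-2.5-flash.

```python
def ask_with_cache(client, cache_name: str, question: str,
                   model: str = "gemini-2.5-flash"):
    """Send one question, reusing the cached prompt identified by cache_name."""
    response = client.models.generate_content(
        model=model,  # must be the model the cache was created for
        contents=question,
        config={"cached_content": cache_name},
    )
    return response
```

The answer is available as response.text, and the token accounting as response.usage_metadata.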
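You can inspect token usage through usage_metadata. Note that the cached prompt tokens are reported separately in cached_content_token_count and are also included in prompt_token_count. In the output below, total_token_count is the sum of the prompt, candidate, and thought token counts, so it includes the cached tokens as well.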
GenerateContentResponseUsageMetadata(
  cache_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=322698
    ),
  ],
  cached_content_token_count=322698,
  candidates_token_count=282,
  prompt_token_count=322707,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=322707
    ),
  ],
  thoughts_token_count=4049,
  total_token_count=327038
)
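To make the relationship between these counters concrete, the sketch below redoes the arithmetic on the figures printed above. The Usage dataclass is a hypothetical stand-in for the SDK's usage metadata object, holding only the fields needed here.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Subset of the usage_metadata fields shown in the output above."""
    prompt_token_count: int
    cached_content_token_count: int
    candidates_token_count: int
    thoughts_token_count: int
    total_token_count: int


def uncached_prompt_tokens(u: Usage) -> int:
    """Prompt tokens sent fresh with the request rather than read from the cache."""
    return u.prompt_token_count - u.cached_content_token_count


def total_matches_parts(u: Usage) -> bool:
    """Check that total = prompt + candidates + thoughts."""
    parts = u.prompt_token_count + u.candidates_token_count + u.thoughts_token_count
    return u.total_token_count == parts


first = Usage(322707, 322698, 282, 4049, 327038)
print(uncached_prompt_tokens(first))  # 9
print(total_matches_parts(first))     # True
```

Only 9 of the 322,707 prompt tokens were new in this request; the rest came from the cache.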
You can ask new questions of the model, and the cache is reused.
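One way to ask several follow-up questions is a chat session bound to the cache. This is a sketch assuming the google-genai SDK; the helper name ask_many is illustrative, and the model default assumes the cache was created with gemini-2.5-flash.

```python
def ask_many(client, cache_name: str, questions, model: str = "gemini-2.5-flash"):
    """Ask several questions in one chat session that reuses the cached prompt.

    Returns a list of (answer_text, cached_token_count) pairs.
    """
    chat = client.chats.create(model=model, config={"cached_content": cache_name})
    results = []
    for question in questions:
        response = chat.send_message(question)
        cached = response.usage_metadata.cached_content_token_count
        results.append((response.text, cached))
    return results
```

If the cache is being reused, the second element of each pair should stay constant across questions while the prompt token count grows with the conversation.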
GenerateContentResponseUsageMetadata(
  cache_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=322698
    ),
  ],
  cached_content_token_count=322698,
  candidates_token_count=239,
  prompt_token_count=322795,
  prompt_tokens_details=[
    ModalityTokenCount(
      modality=<MediaModality.TEXT: 'TEXT'>,
      token_count=322795
    ),
  ],
  thoughts_token_count=902,
  total_token_count=323936
)
Since cached tokens are cheaper than normal input tokens, this prompt was much cheaper than it would have been without caching. Check the pricing here for the up-to-date discount on cached tokens.
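A rough way to reason about the saving is to express the prompt cost with caching as a fraction of the cost without it. The sketch below does that for a given cached-token rate. The rate is a parameter you must fill in from the pricing page, and the calculation ignores output tokens and the cache storage cost, so treat it as an approximation only.

```python
def relative_prompt_cost(prompt_tokens: int, cached_tokens: int,
                         cached_rate: float) -> float:
    """Prompt cost with caching, as a fraction of the cost without caching.

    cached_rate is the price of one cached token relative to one normal
    input token (for example 0.5 would mean half price). It is a
    placeholder here and must come from the current pricing.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    fresh = prompt_tokens - cached_tokens
    return (fresh + cached_tokens * cached_rate) / prompt_tokens
```

With the counts from the second response above, relative_prompt_cost(322795, 322698, cached_rate) stays very close to cached_rate itself, because almost the entire prompt came from the cache.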
Delete the cache
The cache has a small recurring storage cost (cf. pricing), so by default it is only saved for an hour. In this case you extended its lifetime to 2 hours (using "ttl").
Still, if you don't need your cache anymore, it is good practice to delete it proactively.
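Deletion can be sketched as follows, assuming the google-genai SDK and a client object. The helper name delete_cache is illustrative.

```python
def delete_cache(client, cache_name: str):
    """Print the cache name, then delete the cache and return the API response."""
    print(cache_name)
    return client.caches.delete(name=cache_name)
```

After this call the cache name is no longer valid, so later requests that reference it in cached_content should be expected to fail.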
cachedContents/0c5j38gpopx49ok6x7kedvbpy65d1bzkq8i5vldr
DeleteCachedContentResponse()
Next Steps
Useful API references:
If you want to know more about the caching API, you can check the full API specifications and the caching documentation.
Continue your discovery of the Gemini API
Check the File API notebook to learn more about that API. The vision capabilities of the Gemini API are a good reason to use the File API and caching. The Gemini API also has configurable safety settings that you might have to customize when dealing with big files.