Google Finds a Way to Slash AI Model Costs by 75%
Implicit caching is automatic and enabled by default

Google is rolling out a new feature called "implicit caching" in its Gemini API, aimed at reducing the cost of using its AI models. The feature, available for Gemini 2.5 Pro and 2.5 Flash, promises up to 75% savings on repetitive context shared across API requests—offering significant relief to developers burdened by high model usage costs.
Unlike the earlier explicit caching feature, which required developers to manually define reusable prompts, implicit caching is automatic and enabled by default. If a new request shares a common prefix with a past one, the system applies the discount automatically, simplifying workflows and reducing unexpected billing.
"Implicit caching directly passes cache cost savings to developers without the need to create an explicit cache. Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount," Google said in a blog post.
The move follows developer backlash over costly and inconsistent behavior from Gemini 2.5’s explicit caching. In response, Google’s Gemini team issued an apology and committed to fixes.
While the feature looks promising, developers are advised to structure prompts carefully—placing static content first—to maximize the chance of cache hits and cost savings.
Google recommends that developers place consistent content at the beginning of prompts and move variable elements—such as user queries or changing context—to the end. This increases the likelihood of triggering a cache hit. According to the company, this best practice helps optimize the effectiveness of its new implicit caching feature.
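To make that best practice concrete, here is a minimal sketch of prompt construction with the static content up front. The helper name and example strings are hypothetical illustrations, not part of Google's API:

```python
import os

# Static instructions and reference material go first and never change
# between requests (hypothetical context for illustration).
STATIC_PREFIX = (
    "You are a support assistant for ExampleCo.\n"
    "Answer questions using only the product manual below.\n"
    "Product manual: ...\n"  # a large, unchanging document would sit here
)

def build_prompt(user_query: str) -> str:
    # Variable content goes last, so consecutive requests share a prefix.
    return STATIC_PREFIX + "\nUser question: " + user_query

p1 = build_prompt("How do I reset my password?")
p2 = build_prompt("What is the refund policy?")

# Both prompts begin with the identical static block; that shared prefix
# is what makes the second request eligible for an implicit cache hit.
shared_prefix = os.path.commonprefix([p1, p2])
```

If the user query were placed first instead, the prompts would diverge at the very first characters and no usable prefix would be shared.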
"If you want to guarantee cost savings, you can continue to use the explicit caching API we shipped last May. Also, make sure to keep the initial content of the requests the same if you want them to hit the cache. More details on the launch here: https://t.co/5fubUe8CB6"
— Logan Kilpatrick (@OfficialLoganK), May 8, 2025
To further improve cache eligibility, Google has also lowered the minimum request size requirements: 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for Gemini 2.5 Pro. Additional guidance is available in the Gemini API documentation.
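As a rough illustration of those thresholds, an eligibility check might be sketched as below. The function and lookup table are hypothetical; in practice the token count would come from the API's token-counting endpoint rather than being passed in directly:

```python
# Minimum prompt sizes for implicit caching, per Google's announcement.
MIN_TOKENS = {
    "gemini-2.5-flash": 1024,
    "gemini-2.5-pro": 2048,
}

def cache_eligible(model: str, prompt_tokens: int) -> bool:
    # Unknown models default to an unreachable minimum (never eligible).
    return prompt_tokens >= MIN_TOKENS.get(model, float("inf"))
```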
"In cases where you want to guarantee cost savings, you can still use our explicit caching API, which supports our Gemini 2.5 and 2.0 models. If you are using Gemini 2.5 models right now, you will start to see cached_content_token_count in the usage metadata which indicates how many tokens in the request were cached and therefore will be charged at the lower price," Google added.
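Based on the announced 75% discount for cached tokens, a back-of-the-envelope cost estimate could look like the following sketch. The per-token price here is a placeholder, not an actual Gemini rate:

```python
def billed_input_cost(prompt_tokens: int, cached_tokens: int,
                      price_per_token: float) -> float:
    # Uncached tokens are billed at the full rate; cached tokens at 25%
    # of that rate (i.e., the 75% discount Google describes).
    uncached = prompt_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * 0.25

# Example: a 4,000-token request where 3,000 tokens hit the cache,
# at a hypothetical price of $1 per million input tokens.
cost = billed_input_cost(4000, 3000, 1e-6)
```

The cached_content_token_count field from the usage metadata would supply the cached_tokens value in a real integration.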