What are the three main caching strategies this skill provides?

Anthropic native prompt caching (marking stable segments with cache_control), full response caching (using SHA-256 hashing or semantic similarity via embeddings), and Cache Augmented Generation (CAG) which pre-builds and caches formatted document context instead of using RAG retrieval.

When should I use Cache Augmented Generation (CAG) instead of RAG?

Use CAG when your document corpus is under ~100K tokens, update frequency is low, latency is critical, and queries are general rather than specific. RAG is better for larger corpora, frequent updates, flexible latency, and specific queries.

What cost and latency improvements does Anthropic prompt caching provide?

Anthropic prompt caching yields roughly 90% cost reduction on cached tokens and up to 2x latency improvement by marking stable prompt segments with cache_control and exposing cache read/write token usage on the response.

What are the three sharp edges documented with fixes?

Cache misses causing worse latency than no caching (fixed via non-blocking cache checks with timeout), cached responses becoming stale (fixed via version-based or event-based invalidation), and prompt caching failing due to prefix changes (fixed by keeping dynamic content out of cached prefixes and ensuring consistent ordering).

Does this skill handle CDN caching or database query caching?

No, this skill is strictly focused on prompt and response caching for LLMs and does not cover CDN caching, database query caching, or static asset caching.

Skill

Optimize LLM Prompts with Caching Strategies

Name: Prompt Caching
Availability: OnlineOnly
Author: Antigravity

Implements LLM prompt caching: Anthropic prompt caching, response caching, and Cache Augmented Generation (CAG).

Get skill

Works with anthropicredis openai

Antigravity

Own this? Claim it

Spark score

out of 100

Updated last month

Version 13.1.0

Models

claude

Add to Favorites

Why it matters

Reduce LLM costs and latency by implementing advanced caching techniques for prompts and responses. This asset optimizes interactions with LLMs by intelligently storing and retrieving previously generated content.

Outcomes

What it gets done

Implement Anthropic native prompt caching for stable context.

Utilize Redis for efficient response caching of identical queries.

Apply Cache Augmented Generation (CAG) for stable document corpora.

Manage cache invalidation and optimize for cache miss scenarios.

Install

Add it to your toolbox

Run in your project directory:

curl -fsSL https://spark.entire.vc/get/ag-prompt-caching | bash

Overview

Prompt Caching

Implements LLM-specific caching strategies covering Anthropic native prompt caching, response caching, and Cache Augmented Generation. Use when caching stable LLM prompt prefixes, caching full responses, or replacing RAG with a pre-cached document corpus.

What it does

Provides LLM-specific caching strategies - Anthropic native prompt caching, full response caching, and Cache Augmented Generation (CAG) - to cut cost and latency on repeated or stable prompt content.

When to use - and when NOT to

Use this skill when using Claude with a stable system prompt or large static context, caching full responses for repeated queries, or replacing RAG retrieval with a pre-cached document corpus that fits in context. Does not cover CDN caching, database query caching, or static asset caching - strictly focused on prompt and response caching for LLMs.

Inputs and outputs

Anthropic prompt caching marks stable prompt segments (system instructions, a large static knowledge base) with cache_control: { type: "ephemeral" } while dynamic user content stays in the messages array, with cache read/write token usage exposed on the response - yielding roughly 90% cost reduction on cached tokens and up to 2x latency improvement.

Response caching implements exact-match caching via SHA-256 hashing of the prompt with a Redis-backed TTL store, semantic similarity caching using embeddings and a similarity threshold (e.g. 0.95) to serve cached responses for near-duplicate queries, and temperature-aware caching that only caches low-temperature (<=0.5) deterministic responses.

Cache Augmented Generation (CAG) pre-builds and caches a formatted document context directly in the prompt instead of using RAG retrieval, with periodic staleness-based refresh (e.g. re-building after 1 hour). A CAG-vs-RAG decision matrix compares corpus size (CAG better under ~100K tokens, RAG better above), update frequency (CAG for low, RAG for high), latency needs (CAG for critical, RAG for flexible), and query specificity (CAG for general, RAG for specific).

Three sharp edges are documented with fixes: cache misses causing latency spikes worse than no caching at all (fixed via non-blocking cache checks racing against the LLM call with a short timeout, or selective caching only for high-frequency query patterns); cached responses becoming stale/incorrect over time (fixed via version-based invalidation, content-hash validation against current source data, or event-based invalidation tied to source updates); and prompt caching failing to hit due to prefix changes (fixed by keeping dynamic content like timestamps out of the cached system prefix, keeping static content first, and ensuring consistent component ordering since Anthropic caching requires an exact prefix match).

Four validation checks flag common misconfigurations: caching non-deterministic high-temperature responses (warning), caching without a TTL risking indefinitely stale data (warning), dynamic content inside a cached prefix causing misses (warning), and missing cache hit/miss metrics preventing effectiveness measurement (info).

Integrations

Uses the Anthropic SDK's native cache_control mechanism, Redis for response caching, and vector similarity search for semantic cache matching; delegates to context-window-management for broader context optimization, rag-implementation for retrieval systems, and conversation-memory for persistence.

Who it's for

Developers building high-performance LLM applications who need concrete caching implementation patterns and cache-invalidation strategies rather than naive full-response caching that goes stale or misses on near-identical prompts.

system: [{ type: "text", text: LONG_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }]

FAQ

Common questions

Discussion