GKE Inference Gateway Prefix Caching: Don't Sign Up for the Cache Tax

A staff engineer I trust pinged me last quarter about prefix-cache-aware routing. He had read Google's benchmark cover to cover, had the throughput math exactly right, and had drawn the wrong conclusion from it: that flipping it on would hand his RAG assistant a permanent latency win. He shipped it. For two days the dashboards were beautiful. Then a Friday template tweak shaved his hit rate in half over the weekend and his pager went off Monday morning with nobody able to say why the latency had crept back to baseline. The headline was correct. The operating model behind it was not, and that gap is the whole story.

That gap is the real subject of the GKE Inference Gateway benchmark Google published on June 10, 2026. Google Cloud engineers Bob Tian and Susan Wu report that the gateway, a native extension of the GKE Gateway, routes generative AI traffic by real-time model-server metrics rather than blind round-robin, and an independent Principled Technologies test backs it with hard numbers. The numbers are real and they are good.

My argument is narrower and more useful to anyone who has to operate this. The gateway moves your hardest tuning problem from the cluster into the application. Adopt it without owning that shift and you give back most of the gain.

I run platform infrastructure for AI/ML workloads, so I read this the way I read any cost-and-reliability claim: where does the saving actually come from, what new failure mode does it introduce, and how do I verify it before I route production traffic through it.

What the gateway actually does, stripped of the marketing

Prefix caching is not exotic. A model server stores the KV cache, the attention key and value tensors, for a block of tokens it has already processed. When the next request starts with that exact prefix, token for token, the server skips reprocessing it and computes only the new suffix. The expensive part of inference, the prefill over a long shared context, simply does not run again.
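To make the "exact prefix" requirement concrete, here is a minimal sketch of block-level prefix matching. It is my own illustration, not any model server's implementation: the whitespace split stands in for a real tokenizer, `BLOCK_SIZE` of 4 is an arbitrary small value, and `block_hashes` and `cached_prefix_tokens` are hypothetical helper names.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; arbitrary value chosen for this sketch


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each hash covers its block plus everything before it, so two prompts
    share the hash for block i only if their first (i + 1) blocks match.
    """
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_tokens(tokens, cache, block_size=BLOCK_SIZE):
    """Count the leading tokens whose blocks are already in the cache."""
    matched = 0
    for digest in block_hashes(tokens, block_size):
        if digest not in cache:
            break  # the first miss ends the reusable prefix
        matched += block_size
    return matched


# Whitespace splitting is a stand-in for a real tokenizer.
system = "You are a support assistant for the Acme docs".split()
first = system + "How do I reset my password".split()
second = system + "Where is the billing page".split()

cache = set(block_hashes(first))  # the first request warms the cache
reused = cached_prefix_tokens(second, cache)
print(f"{reused} of {len(second)} tokens skip prefill")
```

Because each hash chains in its parent, a single changed token early in the prompt invalidates every block after it, which is why the match has to be exact from the first token.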

The GKE Inference Gateway turns that model-level trick into a routing decision. It sits in front of your model servers as a proxy paired with an Endpoint Picker extension, and rather than spreading requests evenly it reads the incoming prefix and steers the request to a pod that already holds that prefix in memory. To pick a pod it tracks four signals per server: KV cache utilization, queue length of pending requests, prefix cache indexes that show which prefixes are resident, and active LoRA adapters loaded on the accelerator.
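A toy version of that decision helps me reason about it. The sketch below is not the Endpoint Picker's actual algorithm, which the benchmark write-up does not specify; the `PodState` fields mirror the four signals above, while the weights, the `max_queue` cap, and the function names are all assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class PodState:
    name: str
    kv_cache_utilization: float  # 0.0 (empty) to 1.0 (full)
    queue_length: int            # pending requests
    resident_prefixes: set = field(default_factory=set)
    loaded_adapters: set = field(default_factory=set)


def score(pod, prefix_id, adapter, max_queue=32):
    """Higher is better. The weights are illustrative assumptions only."""
    prefix_hit = 1.0 if prefix_id in pod.resident_prefixes else 0.0
    adapter_hit = 1.0 if adapter is None or adapter in pod.loaded_adapters else 0.0
    queue_room = 1.0 - min(pod.queue_length, max_queue) / max_queue
    cache_room = 1.0 - pod.kv_cache_utilization
    # A resident prefix outweighs the load signals, but does not ignore them.
    return 3.0 * prefix_hit + adapter_hit + queue_room + cache_room


def pick_pod(pods, prefix_id, adapter=None):
    """Return the highest-scoring pod for this request."""
    return max(pods, key=lambda pod: score(pod, prefix_id, adapter))


pods = [
    PodState("pod-a", 0.80, 6, {"docs-v1"}, {"support-lora"}),
    PodState("pod-b", 0.20, 1, set(), {"support-lora"}),
]
print(pick_pod(pods, "docs-v1", "support-lora").name)  # the warm pod wins
print(pick_pod(pods, "docs-v2", "support-lora").name)  # no warm pod, so load decides
```

The second call is the interesting one: when no pod holds the prefix, the score degenerates into plain load balancing, which is the fallback behavior the rest of this post worries about.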

Plumbing-wise it ships as two new Gateway API resource definitions, `InferencePool` and `InferenceModel`, automatically managed from GKE version 1.34.0-gke.1626000 onward. It can also route prompts and responses through Google Cloud Model Armor or NVIDIA NeMo Guardrails for safety filtering.

The thing to internalize is what the routing assumes. It is only useful when requests genuinely share long, identical prefixes. That assumption is doing all the work, and it is also where deployments break.

The benchmark: strong numbers on a workload built to win

Principled Technologies ran GKE with the Inference Gateway against a standard third-party managed Kubernetes service using conventional round-robin HTTP load balancing, on identical hardware: eight NVIDIA A100 40GB GPUs serving Llama 3.1 8B Instruct. The press release framing the test names Amazon EKS as the comparison platform. The workload was deliberately a shared-prefix scenario, which is the case prefix caching is designed for.

| Metric | GKE Inference Gateway | Round-robin baseline | Gap |
| --- | --- | --- | --- |
| Mean output throughput | 7,169.21 tokens/sec | 6,042.05 tokens/sec | 15.7% higher |
| Mean time to first token | 188.36 ms | 2,624.73 ms | 92.8% shorter |
| Mean inter-token latency | 30.20 ms | 81.03 ms | 62.6% lower |

The headline 92.8% TTFT reduction is the one that will end up on a slide, and it accurately reports what it measures. But notice it is the metric most sensitive to cache hits: time to first token collapses when the prefill is already cached, and barely moves when it is not. The throughput gain of 15.7% is the steadier number to plan capacity around, because it survives a mix of hits and misses better than TTFT does. If your real traffic shares prefixes 40% of the time instead of nearly always, your blended result lands far closer to the throughput figure than the TTFT figure. Plan with the conservative number.
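A back-of-envelope model shows how fast the TTFT headline erodes with hit rate. This is a sketch under a strong assumption: I treat the benchmark's gateway mean as an all-hit latency and the round-robin mean as an all-miss latency, then blend them linearly. Real systems have queueing effects this ignores, so read the output as a direction, not a forecast.

```python
HIT_TTFT_MS = 188.36    # benchmark gateway mean, used here as an all-hit proxy
MISS_TTFT_MS = 2624.73  # benchmark round-robin mean, used here as an all-miss proxy


def blended_ttft_ms(hit_rate):
    """Expected mean TTFT when a fraction hit_rate of requests find a warm prefix."""
    return hit_rate * HIT_TTFT_MS + (1.0 - hit_rate) * MISS_TTFT_MS


def reduction_vs_baseline(hit_rate):
    """Fractional TTFT reduction relative to the all-miss baseline."""
    return 1.0 - blended_ttft_ms(hit_rate) / MISS_TTFT_MS


for rate in (1.0, 0.8, 0.4):
    print(f"hit rate {rate:.0%}: {blended_ttft_ms(rate):8.2f} ms, "
          f"{reduction_vs_baseline(rate):.1%} below baseline")
```

Under this linear model the reduction is simply the hit rate times the headline figure, so a 40% hit rate yields roughly a 37% TTFT improvement rather than 92.8%.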

One reference point I would not lean on hard: Google's broader claim of a 30% serving-cost reduction versus other managed offerings is a vendor figure, not a line item from this benchmark. Treat it as a hypothesis to validate on your own workload before you bank on it.

The cache tax nobody benchmarks

Here is the failure mode that does not appear in the report and that I have hit in production. Prefix caching rewards prompt uniformity and punishes prompt drift, and prompt drift is the natural state of any real application.

The hit rate that makes the whole architecture pay off depends on requests sharing byte-identical prefixes. The moment a team A/B-tests a new system instruction, injects a per-user preamble, or reorders a few lines in a template, those requests no longer match the warmed pods. Routing falls back toward round-robin, the prefill runs again, and your latency quietly reverts to baseline. The infrastructure did not fail. Your prompt discipline did. This is the operational cost the marketing calls a feature: you have traded cluster-side tuning for application-side standardization, and the bill for the second is paid by whoever owns the prompt templates, who usually does not know they are now operating a cache.
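Here is a small illustration of how little it takes. The sketch compares an incoming prompt against one that is already warm and reports how much of it matches from the first token; the prompts, the email address, and the whitespace "tokenizer" are all made up for the example.

```python
def shared_prefix_fraction(warm, incoming):
    """Fraction of the incoming prompt's tokens that match the warm prompt from the start."""
    matched = 0
    for a, b in zip(warm, incoming):
        if a != b:
            break  # everything after the first mismatch must be recomputed
        matched += 1
    return matched / len(incoming) if incoming else 0.0


# Whitespace splitting stands in for a real tokenizer.
context = ("You are a helpful assistant . Answer only from the handbook . " * 20).split()
question = "How many vacation days do I get ?".split()

warm = context + question
same_template = context + "What is the travel policy ?".split()
user_preamble = "User: dana@example.com".split() + context + question

print(f"{shared_prefix_fraction(warm, same_template):.2f}")  # long shared prefix
print(f"{shared_prefix_fraction(warm, user_preamble):.2f}")  # diverges at the first token
```

The per-user preamble adds two tokens and throws away the entire cached context, because matching stops at the first difference. Moving variable content to the end of the template keeps the shared prefix intact.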

Snap Inc. is the customer Google cites, and their number is instructive precisely because of how it is phrased. Vinay Kola, a senior engineering manager at Snap, says they achieved prefix cache hit rates "ranging up to 75-80%" using prefix-cache-aware routing with llm-d on their Envoy-based service mesh. Notice the phrasing: a range with an upper bound, which is a careful way of saying the high end is the best case rather than a guarantee. That is what disciplined prompt structure plus a mature mesh gets you. Read "up to 75-80%" as the ceiling a team reaches after investing in keeping prefixes stable. On day one you inherit something well below it.

There is a second tax: maintaining the cache indexes is not free. The gateway has to poll cache state and queue depth across the cluster to route accurately, and that polling competes for control-plane resources during traffic spikes. And cache capacity is finite. If your prefix diversity outgrows the memory budget, you get thrashing, and a thrashing prefix cache can perform worse than the round-robin you replaced, because now you pay the routing overhead and still miss.
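Thrashing is easy to reproduce on paper. The sketch below replays prefix IDs through a least-recently-used cache; LRU eviction is my assumption for illustration, not a documented property of the gateway or of any particular model server, and the cyclic request stream is a deliberately adversarial pattern.

```python
from collections import OrderedDict


def lru_hit_rate(requests, capacity):
    """Replay prefix IDs through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for prefix_id in requests:
        if prefix_id in cache:
            hits += 1
            cache.move_to_end(prefix_id)   # mark as most recently used
        else:
            cache[prefix_id] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests) if requests else 0.0


def cyclic_prefixes(distinct, total):
    """A request stream that cycles through `distinct` prefix IDs in order."""
    return [i % distinct for i in range(total)]


capacity = 8
fits = lru_hit_rate(cyclic_prefixes(8, 800), capacity)
thrashes = lru_hit_rate(cyclic_prefixes(9, 800), capacity)
print(f"8 prefixes, room for 8: {fits:.2f}")
print(f"9 prefixes, room for 8: {thrashes:.2f}")
```

One prefix too many takes this pattern from nearly all hits to none, because each new entry evicts exactly the prefix that will be requested next. Production traffic is rarely this adversarial, but the cliff is real and it is why cache capacity belongs on the dashboard.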

When it pays, and when it is dead weight

The decision is not whether the gateway works. It is whether your traffic shape earns the overhead.

| Workload | Prefix sharing | Worth enabling? |
| --- | --- | --- |
| RAG over a fixed corpus or codebase | High, large static context | Yes, the strongest case |
| Multi-turn chat with a fixed system persona | High, stable preamble | Yes |
| Tool-calling agents with rotating context | Mixed, depends on design | Test before committing |
| One-shot generation, unique prompts each time | Low to none | No, pure overhead |

The pattern is consistent. Workloads with a high ratio of static-to-dynamic tokens, a long fixed prefix and a short variable suffix, are where cache locality dominates serving economics. RAG that pins a large documentation set as an immutable prefix and appends only the user's question is the textbook win. Conversely, applications that generate something unique on every call carry the index-maintenance and polling cost and collect none of the reuse, so you are spending control-plane budget for nothing. Adding the gateway to a low-repetition workload does not just fail to help; it adds statefulness to a service that was happily stateless.

The cost angle is real but not where the gateway lives

Control-plane pricing is a footnote in this decision, and it is worth saying so plainly because vendor comparisons love to lead with it. Industry pricing surveys for 2026 put a managed Kubernetes control plane in the same general band across providers, roughly the $73-to-$74-per-month-per-cluster range, with various free-tier credits muddying any clean comparison. That is noise next to accelerator spend. The real money in inference is GPU-hours, and the real cost lever is utilization: keeping expensive accelerators busy with useful work rather than recomputing context they already had.

The same surveys note that EKS bills commonly land 40% to 60% over projection once auxiliary fees like NAT Gateway and cross-AZ transfer are counted. That is a useful reminder that the control-plane sticker price is the least interesting number on the invoice. If the gateway raises your accelerator utilization, that dwarfs any control-plane delta. Spend your optimization energy on the GPUs. The line item everyone quotes will not move your bill.

Before you route production traffic through it

Do not take this on faith, yours or Google's. Run this checklist against your own cluster first.

  1. Measure your actual prefix-sharing rate on a representative day of real traffic. A synthetic shared-prefix test will flatter you; the production mix is what decides whether anything below matters.
  2. Stand up the gateway in shadow or against a fraction of traffic and compare hit rate, TTFT, throughput, and inter-token latency to your round-robin baseline on the same hardware.
  3. Make prefix cache hit rate a first-class dashboard metric with an alert, so a slip shows up as a page rather than a mystery latency regression.
  4. Add a guardrail to your prompt-template change process: treat any edit to system instructions, preambles, or template ordering as a deploy that can move latency, and review it as one.
  5. Watch KV cache utilization and prefix-cache index size for thrashing; if prefix diversity outgrows the memory budget, you are paying routing overhead for misses.
  6. Validate the broader 30% serving-cost claim against your own GPU-hour spend before you put it in a budget, since it is a vendor figure rather than a benchmark line item.
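For the first item, a rough measurement is enough to start. The sketch below is one way to estimate a prefix-sharing rate from a log of prompts: it counts a request as shared when its first `prefix_tokens` tokens have already appeared earlier in the log. The 64-token window, the whitespace tokenizer, the sample log, and the 0.5 threshold (my reading of "roughly half") are all assumptions to adjust for your own traffic.

```python
def prefix_sharing_rate(prompts, prefix_tokens=64):
    """Fraction of requests whose first `prefix_tokens` tokens were seen earlier in the log."""
    seen = set()
    shared = 0
    for prompt in prompts:
        # Prompts shorter than the window are keyed on their full text.
        key = tuple(prompt.split()[:prefix_tokens])
        if key in seen:
            shared += 1
        else:
            seen.add(key)
    return shared / len(prompts) if prompts else 0.0


# Hypothetical sample log: seven RAG requests over one context, three one-shot prompts.
handbook = "Answer only from the handbook . " * 30
log = [handbook + f"Question {i} ?" for i in range(7)]
log += [f"Write a unique poem about topic {i}" for i in range(3)]

rate = prefix_sharing_rate(log)
print(f"prefix-sharing rate: {rate:.0%}")
print("trial the gateway" if rate >= 0.5 else "keep round-robin")
```

This overstates what a real cache would deliver, since it assumes unlimited capacity and ignores which pod served each request. Treat it as an upper bound that tells you whether the trial in item 2 is worth setting up.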

About

I am Alex Kumar, a Senior Platform Engineer and Infrastructure Architect at Rabata.io. My days center on Kubernetes persistent storage, keeping disaster-recovery posture battle-tested, and making cloud spend legible enough that a team can find and cut the waste. A benchmark that promises a big speedup rarely lies about the speedup itself; what it leaves out is the operational cost that surfaces a quarter later, and that omission is the first thing I go looking for.

Prefix caching is a clean example: the later cost is that your prompt structure has quietly turned into a production dependency. Years of running storage and serving systems at scale taught me one durable rule: if you cannot reproduce a performance gain on demand and cannot watch it on a dashboard, you do not own it, you are only borrowing it. Because Rabata.io builds S3-compatible storage for exactly these AI/ML workloads, the data-plumbing layer of inference clusters lands on my desk constantly, which is why the cache-index overhead registered with me well before the throughput chart did.

Conclusion

The GKE Inference Gateway earns its benchmark. KV-cache-aware routing is a genuinely better answer than round-robin for inference, and on a shared-prefix workload the throughput, TTFT, and latency gains are large and well-measured. What the benchmark cannot show you is the obligation that rides along with it. Prefix caching only pays while your prompts stay uniform, and keeping them uniform is now an operational job that lands on the team least equipped to notice when it slips.

So here is the concrete next step. Before you flip it on for production, pull a day of real traffic and compute your prefix-sharing rate. If it clears roughly half, run the six-item checklist above, ship the gateway behind a hit-rate alert, and wire prompt-template changes into your deploy review. If it does not clear that bar, leave round-robin in place and spend the effort on accelerator utilization instead. Either way, let your own numbers make the call. A vendor's benchmark is not the thing serving your users in production; your cluster is.

Frequently Asked Questions

Does prefix-cache-aware routing help every LLM workload?

No. It helps only when requests share long, identical prompt prefixes, such as RAG over a fixed corpus or chat with a stable system persona. For one-shot prompts that differ every time, there is little or nothing to cache, so you pay the routing and index-maintenance overhead without the reuse benefit. Match it to your traffic shape first.

Why won't I see Snap's 75-80% cache hit rate right away?

Because cache hit rate is not a fixed property of the gateway; it depends on how consistently your prompts share prefixes. Snap's "up to 75-80%" reflects a mature service mesh and disciplined prompt structure with llm-d and Envoy. Treat it as an achievable ceiling for a team that invests in prompt uniformity, not a default you get automatically.

What are the hidden operational costs?

Two things. First, prompt drift quietly erodes your hit rate, so any change to system instructions or templates becomes a latency-affecting deploy. Second, the gateway polls cache and queue state across the cluster, which consumes control-plane resources during spikes, and an oversized prefix set can thrash and perform worse than round-robin. Both are manageable, but only if you monitor for them.

Is the 92.8% TTFT reduction realistic for my workload?

It is accurate for the tested shared-prefix workload, but TTFT is the metric most sensitive to cache hits, so it overstates the gain for mixed traffic. The 15.7% throughput improvement is the steadier number to plan capacity around. Benchmark your own representative prompt mix before promising any specific figure to stakeholders.

Does the gateway save money compared with other managed Kubernetes platforms?

The control-plane prices sit in a similar band across providers, so that is not where savings come from. Real inference cost is dominated by GPU-hours, and the gateway's value is raising accelerator utilization by avoiding redundant recomputation. Validate Google's broader 30% cost-reduction claim against your own workload rather than assuming it, since it is a vendor figure, not a benchmark line item.