Cloud Storage FUSE Is the Hidden Bottleneck in Multi-Region TPU


On the surface this reads like another routine multi-region resilience demo: deploy Gemma 3 across two GKE clusters in different regions, wire them together with the multi-cluster Inference Gateway, pull a primary region offline, watch traffic fail over to the survivor. It worked. Ammett Williams, a Developer Relations Engineer at Google, wrote it up in a full walkthrough on June 2, 2026, and the networking story is genuinely impressive: managed DRANET assigning RDMA-capable interfaces to TPU pods, an anycast gateway shifting prompts to the nearest healthy cluster, no dropped requests. File it under "neat, not news," and you would miss the actual stakes.

The stakes sit in one line about where the model weights live. The part everyone will quote is the failover. The part that will actually wake your team at 3 AM is that Williams stores Gemma 3 in Cloud Storage and mounts it into the pods with Cloud Storage FUSE. That single decision is the load-bearing choice in the whole design, and it is the one the post spends the least time on.

I spend my days running S3-compatible storage under exactly these workloads, so let me argue the case the original blog post had no reason to make: the network was never your inference bottleneck. The storage layer is, and a file-system abstraction over object storage is the most expensive convenience in this architecture.

The networking is solved; that is the problem

Here is what the experiment proves, and it is worth stating plainly. GKE managed DRANET reached general availability on version 1.35.2-gke.1842000, and it lets pods request network interfaces, including RDMA, the same way they request CPU. The multi-cluster Inference Gateway, fronted by the `gke-l7-cross-regional-internal-managed-mc` load balancer, presents one anycast IP and reroutes around a dead region automatically. The cross-region failover test passed without manual intervention. That is real, and it is the kind of thing that used to take custom scripting and a pager.

But solving the network surfaces the next bottleneck, the way clearing a clogged drain reveals that the pipe behind it is also corroded. When two regions can fail over in seconds, the question stops being "can we reroute traffic" and becomes "how fast can the surviving region load a 27-billion-parameter model and start serving." That second question is answered entirely by your storage path, and Cloud Storage FUSE is a POSIX shim over an object store that was never designed for low-latency, small-file reads.

Why FUSE over object storage taxes every cold start

Object storage is brilliant at exactly one thing: durable, cheap, massively parallel reads of large blobs. It is the right home for model checkpoints. The trouble starts when you mount it as if it were a disk. FUSE has to translate POSIX file semantics (`stat`, directory listings, partial reads, metadata lookups) into HTTP requests against a bucket.
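To see what that translation costs on your own mount, a small timing harness helps. The sketch below is mine, not something from the walkthrough: it times a directory listing, per-file `stat` calls, and a short partial read for every regular file in one directory. The function name `time_posix_ops` and the 4 KiB read size are assumptions chosen for illustration. It makes no claim about how many HTTP requests the mount issues; it only measures what the loader would experience.

```python
import os
import stat
import time


def time_posix_ops(root: str, partial_bytes: int = 4096) -> dict:
    """Time the POSIX calls a model loader typically issues against `root`.

    On a bucket mounted through FUSE these calls are served over the
    network; on local disk they are served by the kernel. Comparing the
    two gives a rough sense of the translation overhead.
    """
    timings = {"listdir_s": 0.0, "stat_s": 0.0, "partial_read_s": 0.0,
               "files": 0, "bytes_read": 0}

    # Directory listing.
    start = time.perf_counter()
    names = sorted(os.listdir(root))
    timings["listdir_s"] = time.perf_counter() - start

    for name in names:
        path = os.path.join(root, name)

        # Metadata lookup.
        start = time.perf_counter()
        info = os.stat(path)
        timings["stat_s"] += time.perf_counter() - start

        # Only regular files get a read; subdirectories are skipped.
        if not stat.S_ISREG(info.st_mode):
            continue

        # Partial read of the first `partial_bytes` bytes.
        start = time.perf_counter()
        with open(path, "rb") as handle:
            chunk = handle.read(partial_bytes)
        timings["partial_read_s"] += time.perf_counter() - start

        timings["files"] += 1
        timings["bytes_read"] += len(chunk)

    return timings
```

Run it once against the mounted model directory on a freshly started node and once against a local copy of the same files. The first run on a cold node is the one that matters, because later runs may be served from whatever caches the mount has enabled.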

Williams flags the symptom himself. Cloud Storage FUSE can stall on cold starts if the node's DNS cache is not tuned, and GKE shipped a built-in DNS cache in 1.34.1-gke.3899000 specifically to blunt that. The metadata prefetch feature that helps large many-file datasets needs GKE version 1.32.1 or later. None of that is incidental. It is the platform paying down the inherent cost of pretending a bucket is a filesystem.

Practitioners have measured the gap. In community reports comparing GCS FUSE against S3 CSI drivers under GKE, the POSIX-completeness that FUSE chases is precisely what adds overhead; drivers built for read throughput that deliberately drop some POSIX semantics tend to come out faster on small-file and random-I/O workloads. No amount of tuning fully hides object-storage latency for either provider, but you can choose how much of it you inherit. The moment your storage interface is optimizing for directory listings working correctly, it has stopped optimizing for your model loading quickly.

This is the trade-off the failover demo quietly assumes away. A region flips healthy, the gateway sends it prompts, and the pods are still streaming weights out of a bucket through a POSIX translation layer. Your token-throughput advantage at the network edge, the kind of single-digit-percent edge benchmarks like to celebrate, evaporates if checkpoint restoration on the cold region runs slow. Network gains and storage gains do not add up; the slowest stage caps the pipeline.
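The arithmetic behind that claim is simple enough to write down. This is a toy model of my own, under one loud assumption: the gateway reroute and the cold region's weight load both start at the moment of failure and run in parallel, so the region serves only when the slower of the two finishes. The function names and the numbers in the usage note are hypothetical.

```python
def time_until_serving(reroute_s: float, weight_load_s: float) -> float:
    """Seconds from failure until the surviving region can answer a prompt.

    Assumes reroute and weight loading start together and run in
    parallel, so the slower stage decides the result.
    """
    return max(reroute_s, weight_load_s)


def pipeline_throughput(stage_rates: list[float]) -> float:
    """Steady-state rate of a serial pipeline: the slowest stage caps it."""
    if not stage_rates:
        raise ValueError("need at least one stage")
    return min(stage_rates)
```

With a hypothetical 5-second reroute and a 180-second cold load, `time_until_serving(5.0, 180.0)` gives 180.0. Shaving the reroute to 4.5 seconds still gives 180.0, while halving the load time to 90 seconds halves the result. Only the storage number moves the outcome.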

A deployment lesson the happy-path walkthrough skips

I have watched a "highly available" multi-region setup deliver worse user-facing latency after a failover than before it, and the cause was never the gateway. It was the secondary region doing a cold model load over a file-system mount while the primary's warm cache sat uselessly in a now-dead zone. Failover that reroutes traffic faster than the target can serve it does not buy you resilience. It buys you a slower error.

So when you copy this architecture, the decision that matters is not which gateway class to use. It is how each region gets a model into TPU memory fast, every time, including the time a region you have not touched in a week suddenly has to carry production. That is a storage-interface decision, and these are the axes I weigh:

| Decision | File-system mount (FUSE) | Read-optimized object driver | When it wins |
| --- | --- | --- | --- |
| Small-file / metadata-heavy reads | Slow; POSIX translation overhead | Faster; sheds POSIX semantics | Tokenizers, configs, sharded weights |
| Legacy tooling that expects a path | Works unmodified | May need adaptation | You cannot change the loader |
| Cold-start checkpoint restore | DNS/prefetch tuning required | Fewer moving parts | Failover-critical regions |
| Operational surface | More knobs (cache, prefetch, version pins) | Smaller | Teams without storage specialists |

The pattern that holds up: reserve POSIX mounts for the things that genuinely need a filesystem, meaning small config files or anything a legacy loader insists on, and put large model artifacts behind the read-optimized path. Mixing the two on purpose beats standardizing on either one.
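That per-artifact rule can be written as a small routing function. Treat this as a sketch under stated assumptions: the `Artifact` type, the path labels, and the 64 MiB cutoff between "small config file" and "large model artifact" are all mine, chosen only to make the rule concrete. Pick the threshold from your own measurements.

```python
from dataclasses import dataclass

# Hypothetical cutoff between "small config" and "large artifact".
SMALL_FILE_BYTES = 64 * 1024 * 1024

POSIX_MOUNT = "posix-mount"
READ_OPTIMIZED = "read-optimized"


@dataclass(frozen=True)
class Artifact:
    name: str
    size_bytes: int
    needs_posix_path: bool = False  # True if a legacy loader insists on a path


def choose_storage_path(artifact: Artifact) -> str:
    """Route one artifact to the mount or to the read-optimized driver."""
    # A loader that cannot be changed decides the question on its own.
    if artifact.needs_posix_path:
        return POSIX_MOUNT
    # Small files stay on the mount; large weights go to the fast path.
    if artifact.size_bytes < SMALL_FILE_BYTES:
        return POSIX_MOUNT
    return READ_OPTIMIZED
```

The point of the design is that the decision is made per file, so a single model directory can end up split across both paths on purpose.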

The storage-path due diligence the walkthrough skipped

Before you replicate the failover demo, do the storage-layer homework the networking walkthrough had no reason to cover. The first thing I measure is cold-start checkpoint restore time in a region with no warm cache. That number, rather than the gateway's reroute time, is your real failover SLA. From there, if you stay on a file-system mount, confirm your GKE version actually carries the pieces that make FUSE survivable: the built-in DNS cache shipped in 1.34.1-gke.3899000 and the metadata prefetch that needs 1.32.1 or later. Without those, the cold-start stall is not a tuning problem you discover later. It is one you signed up for.
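The version check is easy to automate. The sketch below compares a cluster version string against the two minimums quoted above. The parsing and the `missing_features` helper are my own illustration, and they assume a plain numeric comparison of `major.minor.patch-gke.build` is good enough. GKE sometimes backports features into older patch lines, so treat a "missing" result as a prompt to read the release notes, not as proof.

```python
import re

# Minimum versions as quoted in this article.
MINIMUM_VERSIONS = {
    "built-in DNS cache": "1.34.1-gke.3899000",
    "metadata prefetch": "1.32.1",
}

_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-gke\.(\d+))?$")


def parse_gke_version(version: str) -> tuple[int, int, int, int]:
    """Turn '1.34.1-gke.3899000' into (1, 34, 1, 3899000)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, build = match.groups()
    # A version without a '-gke.N' suffix is treated as build 0.
    return int(major), int(minor), int(patch), int(build or 0)


def missing_features(cluster_version: str) -> list[str]:
    """List the features whose minimum version the cluster does not meet."""
    current = parse_gke_version(cluster_version)
    return [
        feature
        for feature, minimum in MINIMUM_VERSIONS.items()
        if current < parse_gke_version(minimum)
    ]
```

Run it against the version of every failover target, not only the primary. A secondary cluster that lags one minor release behind is exactly where a cold-start stall would hide.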

Security is the place I would copy Williams exactly. Bind a dedicated IAM service account to a Kubernetes Workload Identity for bucket access, so no static keys ever ride inside a container image. He does this in the post, and it is the one storage decision I would change nothing about. Everything else I would decide per artifact rather than per cluster: large weights on the read-optimized path, small POSIX-dependent files on the mount.

Two checks remain, and both are about the regions you are not looking at. Verify storage quota and bucket region placement in every failover target, not only the primary, because a region that cannot read the weights is not a failover target. Then re-test the failover with a genuinely cold secondary and compare end-to-end latency against steady state. If the failed-over path is slower, the storage layer is your cap, and no gateway tuning will move it.
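For the last comparison, a few lines are enough to turn two sets of latency samples into a verdict. This is a minimal sketch: the nearest-rank percentile, the choice of p95, and the 10% tolerance are assumptions of mine, and the samples are whatever end-to-end request latencies your own load test records.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of `samples` for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def compare_failover(steady_ms: list[float], failover_ms: list[float],
                     pct: int = 95, tolerance: float = 1.10) -> dict:
    """Compare failed-over latency against steady state at one percentile."""
    steady = percentile(steady_ms, pct)
    failed_over = percentile(failover_ms, pct)
    ratio = failed_over / steady
    return {
        "steady_ms": steady,
        "failover_ms": failed_over,
        "ratio": ratio,
        # True when the failed-over path is slower than the tolerance allows.
        "slower": ratio > tolerance,
    }
```

Collect the failover samples starting from the first request the cold secondary receives. Discarding a warm-up window would hide the exact cost this test exists to expose.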

About

I am Alex Kumar, a Senior Platform Engineer and Infrastructure Architect at Rabata.io, where I design Kubernetes storage and disaster-recovery architecture for AI/ML teams. My three years as a staff SRE on a SaaS platform serving more than a million daily users taught me one durable lesson: the failure that takes you down lives in the layer everyone assumed was fine. For AI inference, that layer is almost always storage.

Let me declare the bias up front, because it shapes what I noticed here. Rabata.io builds S3-compatible object storage, so read paths and egress are what I think about for a living. That same lens is why this blog post stopped me: a careful networking experiment resting on a storage choice no one had stress-tested for the cold-failover case. My goal in writing this is simple. Copy the architecture, but go in with eyes open on the storage layer instead of meeting it for the first time in an incident review.

Conclusion

The Google Cloud experiment is a good blueprint, and the multi-cluster Inference Gateway with managed DRANET is a real advance in how AI workloads survive regional loss. Take it. But take it knowing that the demo optimizes the part that was already getting solved, the network, and leaves the part that will actually govern your failover latency lightly examined. When a cold region has to load tens of gigabytes of weights before it can serve a single prompt, the interface between your TPUs and your bucket decides whether failover feels seamless or just feels slow.

The bottom line is short enough to remember. Failover speed is a storage problem dressed up as a networking one. The number that decides whether your design is resilient is the cold checkpoint restore time in a region with no warm cache, and the work that protects it is keeping large artifacts off the POSIX mount and confirming every failover region can actually read the weights. The gateway is the part that already works. The storage path is the part you have to verify yourself.

Frequently Asked Questions

Does the cross-region failover in the experiment actually work?

Yes. In the source experiment, taking the primary region offline triggered the gateway to detect the failure and reroute requests to the secondary cluster automatically, with no dropped traffic and no manual intervention. The networking layer is the well-solved part of the design.

Is Cloud Storage FUSE a durability risk for model weights?

Durability is not the issue; latency is. FUSE translates POSIX file operations into HTTP requests against a bucket, which adds overhead on small-file and metadata-heavy reads. That overhead surfaces on cold starts, exactly when a freshly failed-over region needs to load model weights fast.

Should I use a file-system mount or a read-optimized object driver?

Decide per artifact rather than per cluster. Put large model weights behind the read-optimized path where shedding some POSIX semantics buys throughput, and reserve file-system mounts for small config files or legacy loaders that require a filesystem path. Mixing the two deliberately beats standardizing on one.

Which number is my real failover SLA?

Cold-start checkpoint restore time in a region with no warm cache, not the gateway's reroute time. The gateway can shift traffic in seconds, but if the surviving region then spends minutes loading weights over a slow storage path, your users feel the storage latency, not the network speed.

How should pods authenticate to the model bucket?

Bind a dedicated IAM service account to a Kubernetes Workload Identity and grant it read-only access to the model bucket. The pods then mount and read weights using that identity, so no static credentials live inside any container image. This is the storage decision in the original walkthrough worth keeping unchanged.