Spanner Graph Algorithms: Where the Storage Bill Actually Hides
Here is a pattern I kept seeing before this announcement made sense of it. A team moves graph analytics off the transactional database to relieve it, does everything the vendor deck recommends, and then opens an invoice where the analytics line is smaller than expected and a storage line nobody named is the one that grew.
One customer asked me last quarter why their fraud-scoring pipeline cost more than the database it was supposed to relieve. They had moved scoring off the transactional path, pushed results to object storage, scaled to billions of edges. The surprise was not the database. It was the export bucket, churning rewritten PageRank scores every run, plus the dedicated compute that read the whole graph back into memory each time. The storage side of "native graph algorithms" was never in the deck, so nobody had modeled it.
On June 3, 2026, at Google Cloud Next, Bei Li and Vahab Mirrokni previewed graph algorithms running natively inside Spanner Graph: centrality, community detection, similarity, and path finding invoked directly through ISO GQL, with execution on dedicated compute via Data Boost so live transactions stay untouched. The engineering is real and the use cases (fraud rings, entity resolution, supply-chain bottlenecks) are legitimate.
I architect the storage layers that feed and catch these workloads for a living, though, and the question that decides my readers' bills sits where the blog post goes quiet. What data does this engine read, how often, and where does its output land? That is a storage-design decision, and getting it wrong is how a "near-zero impact" feature grows a five-figure bill.
What Spanner Graph actually moved, and what it left for you
The headline claim is that you no longer build ETL pipelines to ship transactional data into a separate analytics cluster. Spanner runs the algorithm in place, on dedicated compute, and writes the result back to the graph or to a Cloud Storage bucket. For tens of billions of edges, it does this in minutes by encoding topology in a dense format optimized for random access.
That eliminates the *pipeline*. It does not eliminate the *storage problem*; it relocates it. Two surfaces inherit the work. The first is compute: Data Boost reads the working set on every run, and on a dense, randomly accessed graph that read is large and billed by consumption. The second is export: the source is explicit that results can be written to Cloud Storage buckets.
That export is now part of your data lifecycle: object count, versioning, and retention on large result sets that regenerate on every scoring pass. Teams treat the bucket as a throwaway and then discover it was the fastest-growing thing in the account.
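To put rough numbers on that growth, here is a minimal sketch of how many full result sets stay in the bucket with and without an expiry window. The function names, the 50 GiB snapshot size, and the 14-day window are illustrative assumptions, not figures from the announcement or from any customer.

```python
from typing import Optional


def retained_snapshots(runs_so_far: int, runs_per_day: float,
                       expiry_days: Optional[float] = None) -> int:
    """Full result sets still stored after `runs_so_far` export runs."""
    if runs_so_far < 0 or runs_per_day <= 0:
        raise ValueError("runs_so_far must be >= 0 and runs_per_day > 0")
    if expiry_days is None:
        # No lifecycle rule: every run ever written is still in the bucket.
        return runs_so_far
    # With expiry, only the runs written inside the window are still alive.
    # Simplification: at least one snapshot is assumed to be present.
    cap = max(int(runs_per_day * expiry_days), 1)
    return min(runs_so_far, cap)


def retained_gib(runs_so_far: int, runs_per_day: float, snapshot_gib: float,
                 expiry_days: Optional[float] = None) -> float:
    """Approximate stored size, assuming every snapshot is the same size."""
    return retained_snapshots(runs_so_far, runs_per_day, expiry_days) * snapshot_gib


if __name__ == "__main__":
    SNAPSHOT_GIB = 50.0  # hypothetical size of one full result set
    for label, expiry in (("no expiry", None), ("14-day expiry", 14)):
        gib = retained_gib(365, 1, SNAPSHOT_GIB, expiry)
        print(f"{label}: {gib:,.0f} GiB retained after one year of daily runs")
```

Under those made-up inputs, a year of daily runs leaves 365 snapshots in the bucket without expiry and 14 with it. The model ignores object versioning and compression, both of which would shift the totals.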
Where I would put each graph artifact
A graph workload produces four distinct data classes, and conflating them in one storage tier is the most common mistake I see. The decision is rarely "Spanner versus a bucket." It is which artifact goes where, on what lifecycle.
| Artifact | Read/write pattern | Where it belongs | Lifecycle control |
|---|---|---|---|
| Live graph (nodes, edges) | Transactional, hot | Spanner, interleaved tables | Database retention |
| In-run intermediate state | Burst read during Data Boost | Ephemeral compute, never persisted | None - discard |
| Scores written back (community_id, pagerank_score) | Updated each run, queried hot | Spanner columns | Overwrite-in-place |
| Exported result sets | Append per run, batch-read downstream | Object storage bucket | Lifecycle expiry + dedup |
The fourth row is where the money leaks. Spanner's interleaved-table design physically co-locates a profile, its edges, and its transactions, so a traversal becomes a local lookup instead of a fan-out. That is genuinely efficient for the live graph. The *exported* PageRank or clustering output has none of that locality, though, and if every run drops a fresh full snapshot into a bucket with no expiry policy, storage grows by one full snapshot per run, indefinitely. Set a lifecycle rule before the first export, while it is still cheap to undo.
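As a sketch of what "a lifecycle rule before the first export" can look like, the snippet below builds a delete-after-N-days rule scoped to one prefix. The JSON shape follows Cloud Storage's lifecycle configuration as I understand it (a `rule` list with `action` and `condition`, using the `age` and `matchesPrefix` conditions); verify it against the current documentation before applying it. The prefix and the 14-day age are hypothetical.

```python
import json


def export_expiry_rule(prefix: str, max_age_days: int) -> dict:
    """Build a lifecycle config that deletes objects under `prefix` after N days."""
    if not prefix:
        raise ValueError("scope the rule to a prefix so it cannot hit other data")
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {
                    "age": max_age_days,        # days since object creation
                    "matchesPrefix": [prefix],  # only the export prefix
                },
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical prefix and retention window, for illustration only.
    config = export_expiry_rule("graph-exports/pagerank/", 14)
    print(json.dumps(config, indent=2))
```

Scoping the rule to the export prefix is the important design choice: a bucket-wide delete rule would also expire any versioned training data stored alongside the regenerable scoring output.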
The dormant cost in "results to Cloud Storage"
The source frames bucket export as a convenience: weave algorithms with queries, persist the output, move on. Treat it instead as a recurring batch-write workload, because that is what it is. Before enabling any algorithm that exports, I run the export decision through a short table, and each answer changes whether and how I write to the bucket at all.
| What to check | A good answer | Why it changes the call |
|---|---|---|
| Who consumes the result? | Spanner for interactive queries; a bucket for downstream batch | Settles where output belongs and stops you double-writing for no reader |
| Is there a lifecycle rule on the export prefix? | Expiry set on day one | Scoring output is regenerable, so old runs rarely earn their storage |
| Are exports deduplicated by run identifier? | One object set per run id | Without it, retries and backfills stack duplicates on top of the 365 full result sets a daily fraud job already writes each year |
| What is the export object layout? | Sized against how the consumer actually reads | Many small objects parallelize reads but raise per-request operation costs; a few large ones reverse that trade |
| Does the consumer cross a region or cloud? | Egress modeled separately from per-GB price | When it does, per-GB storage is the smallest part of the bill |
None of this is a knock on the feature. It is the storage-architecture work the announcement assumes you already did. For AI and ML teams especially, these exported topological features (a PageRank value per node, a cluster label) are exactly the kind of derived training data that has to be versioned and reproducible, which is a storage discipline rather than a database one.
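The run-identifier and reproducibility checks above can be sketched together: derive object keys deterministically from the run id, so a retry overwrites instead of duplicating, and write a small manifest with content hashes, so a training job can pin the exact feature set it read. The key layout, the `run_id=` path segment, and the manifest fields are my own illustrative conventions, not anything Spanner Graph emits.

```python
import hashlib
from typing import Dict


def export_key(prefix: str, algorithm: str, run_id: str, part: int) -> str:
    """Deterministic object key: the same run id always maps to the same keys."""
    return f"{prefix.rstrip('/')}/{algorithm}/run_id={run_id}/part-{part:05d}.jsonl"


def build_manifest(algorithm: str, run_id: str, parts: Dict[str, bytes]) -> dict:
    """Describe one run's objects so a consumer can verify what it trained on."""
    return {
        "algorithm": algorithm,
        "run_id": run_id,
        "objects": [
            {
                "key": key,
                "bytes": len(payload),
                "sha256": hashlib.sha256(payload).hexdigest(),
            }
            for key, payload in sorted(parts.items())
        ],
    }


if __name__ == "__main__":
    # Hypothetical run id and payload, for illustration only.
    key = export_key("graph-exports/", "pagerank", "2026-06-03", 0)
    payload = b'{"node_id": "a1", "pagerank_score": 0.0042}\n'
    print(build_manifest("pagerank", "2026-06-03", {key: payload}))
```

Because the key depends only on the prefix, algorithm, run id, and part number, rerunning a failed job rewrites the same objects rather than adding a second result set.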
Vendor coupling is a storage question too
The genuine controversy in this launch is lock-in, and I want to take a clear position. Spanner Graph's offload mechanism is bound to Google Cloud's networking and Data Boost; you cannot run this specific isolation model in a hybrid or multi-cloud topology without committing the dataset to that environment. The blog frames this as an operational simplification, and for a single-cloud shop it is.
The standards that protect you, though, are uneven. GQL is an ISO query standard, so your *query logic* stays more portable than a Gremlin or Cypher codebase would. Your *data* is the heavy thing with gravity. The exported results landing in Cloud Storage buckets are the one artifact that is trivially portable, because object storage is the layer with a working cross-vendor contract, even if a de facto one: the S3 API.
My standing advice to storage teams is to keep the durable, downstream-consumed graph output on an S3-compatible layer you control, so the analytics engine can be swapped without restaging terabytes. The query stays portable by standard. The data stays portable only if you put it somewhere with a portable interface.
A grounded read on the customer proof
The source quotes real deployments, and one detail is worth pulling out for storage people. SoundCloud says it ran graph algorithms in batch mode for years, with multi-hour jobs on custom clusters over a multi-billion-edge music graph, before moving to the managed service. That sets the real baseline: the alternative to Data Boost was never free. It was a self-operated analytics cluster with its own storage, its own data movement, and its own staff. DaVita's Patient 360 and Yahoo's Unified User Profile tell the same shape of story: consolidation onto one engine to stop shuttling data between systems.
I will not extrapolate beyond what those teams stated. The published quotes describe consolidation and managed operation; the teams do not publish their storage bills, and I will not invent one. The lesson is directional: the win is removing a parallel analytics stack, and the cost you inherit is the consumption-priced compute and the export storage described above.
About
I'm Marcus Chen, Cloud Solutions Architect and Developer Advocate at Rabata.io. Most of my week is S3-compatible object storage, Kubernetes persistent volumes, and the data plumbing that quietly feeds AI and ML workloads.
Earlier I was a Solutions Engineer at Wasabi and ran DevOps at a Kubernetes-native startup, which is where I learned to read a database announcement by its second-order effects. With Spanner Graph I went straight to the same question I bring to any of them: where does the data live, and what does the storage cost once the feature runs at scale? My instinct is to trust reproducible benchmarks and total cost of ownership over per-GB sticker price, and I will say plainly when AWS S3 or another option is the better call for your use case. The opening anecdote is a composite of real conversations; the surprised-by-the-bucket moment happens more often than the marketing admits.
Conclusion
Let me restate the position I have been defending, in plain terms. Spanner Graph algorithms are legitimate engineering that removes the analytics-ETL pipeline for connected data, and for single-cloud teams the integration is clean. What they do not do for you is storage architecture.
Three costs survive the announcement. The dense in-memory read that Data Boost performs is consumption-priced. The results-to-bucket export is a recurring batch-write workload. And the dataset's gravity, not the GQL query, is what binds you to one cloud. Model those three before you enable the first algorithm.
Decide per-artifact where output lives, put a lifecycle rule on the export prefix on day one, and keep the durable downstream results on a storage layer with a portable interface. The one number I would put on a dashboard is the object count and total size of the export prefix, tracked run over run: if it climbs in step with your scoring frequency instead of holding flat, your lifecycle rule is missing or wrong, and that is the line on the invoice that grows before anyone names it.
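A minimal sketch of that dashboard check, assuming you already record the export prefix's total size (or object count) after each run. The function name, the seven-run window, and the 10% tolerance are arbitrary choices for illustration.

```python
from typing import Sequence


def export_prefix_is_flat(per_run_totals: Sequence[int], window: int = 7,
                          tolerance: float = 0.10) -> bool:
    """True if the last `window` measurements stay within `tolerance` of the
    first measurement in that window, i.e. expiry is keeping pace with writes."""
    if window < 2:
        raise ValueError("window must cover at least two runs")
    if len(per_run_totals) < window:
        raise ValueError("not enough measurements for the requested window")
    recent = list(per_run_totals[-window:])
    baseline = recent[0]
    if baseline == 0:
        return max(recent) == 0
    worst_drift = max(abs(total - baseline) for total in recent)
    return worst_drift / baseline <= tolerance


if __name__ == "__main__":
    # Hypothetical measurements in GiB, one per scoring run.
    no_rule = [50 * run for run in range(1, 15)]             # climbs every run
    with_rule = [50 * min(run, 7) for run in range(1, 15)]   # plateaus at 7 runs
    print("no lifecycle rule flat?", export_prefix_is_flat(no_rule))
    print("with lifecycle rule flat?", export_prefix_is_flat(with_rule))
```

The same function works on object counts. A series that never passes this check after the expiry window has elapsed is the signal that the rule is missing or scoped to the wrong prefix.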
Frequently Asked Questions
Does running graph algorithms inside Spanner Graph eliminate the storage cost of analytics?
It eliminates the pipeline that shipped data to a separate analytics cluster, which is a real win. The storage cost relocates, though; it does not disappear. Data Boost still reads the working set as consumption-priced compute, and any results exported to Cloud Storage buckets become a recurring batch-write workload you have to manage on its own lifecycle.
Should algorithm results be written back to Spanner or exported to a bucket?
It depends on the consumer. Scores queried interactively, like a per-account pagerank_score, belong in Spanner columns and should be overwritten in place. Result sets consumed downstream in batch belong in an object-storage bucket with a lifecycle expiry rule and run-level deduplication. Writing to both is justified only when something genuinely reads both.
What is the most common storage mistake with exported results?
Treating the export bucket as a throwaway. If every scoring run drops a full result set into a bucket with no expiry and no deduplication, storage grows by one full snapshot per run and never stops. A daily job leaves 365 full snapshots a year. Set the lifecycle rule before the first export, not after the first surprising invoice.
How portable is a Spanner Graph workload?
Your query logic is fairly portable because GQL is an ISO standard, unlike a Gremlin or Cypher codebase. Your data is the heavy, sticky part: the Data Boost offload is bound to Google Cloud, so the live graph commits you to that environment. The exported results are the exception, since object storage has a portable cross-vendor contract through the S3 API.
Do exported graph features need versioning when they feed ML training?
Yes. A PageRank value or cluster label per node is derived training data, and reproducible models need it versioned like any other dataset. Store those exports on an S3-compatible layer with object versioning instead of overwriting in place, so a model trained last month can be traced back to the exact feature set it learned from.