gcs-analytics-core: Stop the Sequential-Read Tax on Iceberg

A staff engineer on a team I was helping had the diagnosis half right and the fix dead wrong. He'd profiled a slow Spark-on-Iceberg job, seen the executors pinned for the duration, and concluded the cluster was undersized. The flame graph said otherwise. The executors were not CPU-bound; they were idle, waiting. Each Parquet file open fanned out into a string of tiny sequential reads against Google Cloud Storage, a footer seek here, a column chunk there, and every one of those reads paid the full round-trip cost to the object store before the next could start. He had read the symptom correctly. He'd just bought more compute to fix a problem that was never about compute.

On June 3, 2026, Google Cloud engineers Ajay Yadav and Nivedita Aggarwal shipped a fix for exactly that pattern: gcs-analytics-core, an open-source Java library that centralizes I/O optimizations between analytics engines and the GCS Java SDK. It is available natively in the Apache Iceberg runtime starting with version 1.11.0. For anyone running Iceberg on GCS, this is the rare optimization that costs almost nothing to adopt and pays back fastest on the workloads that hurt most.

The headline TPC-DS numbers come with a catch, though. The gain is enormous on small datasets and shrinks steadily as data grows, and I have watched teams turn "71% faster" into capacity plans that the benchmark never promised. So before the configuration flags, let me be precise about where this library moves the needle and where it quietly does not.

What the library actually changes, and where it sits

The gcs-analytics-core library is a thin interception layer. It sits between your analytics engine (Apache Spark, Trino, or Apache Hive) and the underlying GCS Java SDK, catching read calls and rewriting how they hit the object store. For Iceberg users it slots into the GCSFileIO implementation, replacing the default sequential read path with two specific optimizations.

The first is vectored I/O. Instead of issuing one GCS call per data range and waiting on each before starting the next, the library requests multiple ranges together and fetches them in parallel as a single vectored operation. That collapses what used to be N back-to-back round-trips into roughly one round-trip of wall-clock wait, which is why open-file latency drops sharply on read-heavy scans.
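To make the round-trip arithmetic concrete, here is a minimal sketch of the idea in Python. It is not the library's implementation (that is Java, inside GCSFileIO); the in-memory blob, the 20 ms latency, and the function names are all assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated object: 1 MiB of deterministic bytes standing in for a GCS blob.
BLOB = bytes(i % 251 for i in range(1 << 20))
ROUND_TRIP_SECONDS = 0.02  # assumed per-request latency, for illustration only


def fetch_range(offset: int, length: int) -> bytes:
    """Stand-in for one ranged GET: pay a round trip, then return the bytes."""
    time.sleep(ROUND_TRIP_SECONDS)
    return BLOB[offset:offset + length]


def read_sequential(ranges):
    """One request at a time: total wait is roughly N round trips."""
    return [fetch_range(offset, length) for offset, length in ranges]


def read_vectored(ranges, max_in_flight: int = 8):
    """Ranges issued together: total wait is roughly one round trip
    per batch of max_in_flight requests."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() preserves input order, so results line up with ranges.
        return list(pool.map(lambda r: fetch_range(*r), ranges))


if __name__ == "__main__":
    ranges = [(i * 100_000, 4_096) for i in range(8)]
    for reader in (read_sequential, read_vectored):
        start = time.perf_counter()
        reader(ranges)
        print(f"{reader.__name__}: {time.perf_counter() - start:.3f}s")
```

Both readers return identical bytes; only the waiting pattern differs. With eight ranges and eight requests in flight, the vectored reader should spend about one simulated round trip waiting where the sequential one spends about eight.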

The second is smart Parquet prefetching. Reading Parquet normally starts with the engine seeking to the footer at the end of the file to learn where the column data lives, which often costs several small network calls per file. The library prefetches that footer in a single 50KB–100KB chunk the moment the file is opened, so the engine never stalls hunting for metadata.
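The footer trick is easier to see against the Parquet trailer layout: the file ends with the footer metadata, then a 4-byte little-endian footer length, then the magic bytes PAR1. The sketch below compares a naive two-step footer read with a single tail prefetch. The 64 KB tail size, the CountingStore class, and the fake file builder are my assumptions for illustration, not the library's code.

```python
import struct

MAGIC = b"PAR1"
TAIL_BYTES = 64 * 1024  # assumed prefetch size, inside the 50KB-100KB range cited above


class CountingStore:
    """In-memory stand-in for an object store that counts ranged reads."""

    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0

    def read(self, offset: int, length: int) -> bytes:
        self.requests += 1
        return self.data[offset:offset + length]


def fake_parquet(footer: bytes, body_size: int = 200_000) -> bytes:
    """Build bytes with the Parquet trailer layout: footer, 4-byte length, magic."""
    return MAGIC + b"\x00" * body_size + footer + struct.pack("<I", len(footer)) + MAGIC


def footer_naive(store: CountingStore, size: int) -> bytes:
    """Two dependent reads: the 8-byte trailer, then the footer it points to."""
    trailer = store.read(size - 8, 8)
    if trailer[4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", trailer[:4])
    return store.read(size - 8 - footer_len, footer_len)


def footer_prefetched(store: CountingStore, size: int, tail_bytes: int = TAIL_BYTES) -> bytes:
    """One read of the file tail; the footer is sliced out locally if it fits."""
    start = max(0, size - tail_bytes)
    tail = store.read(start, size - start)
    if tail[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    if footer_len + 8 <= len(tail):
        return tail[-8 - footer_len:-8]
    # Footer larger than the prefetched tail: fall back to one more read.
    return store.read(size - 8 - footer_len, footer_len)
```

The naive path cannot issue its second read until the first returns, because the footer length lives in the trailer. Prefetching a generous tail removes that dependency whenever the footer fits, which is the common case for footers well under the chunk size.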

What I appreciate as an operator is that none of this touches application code. Because the optimizations live inside the updated Iceberg runtime and the GCS connector, you opt in through catalog configuration rather than a rewrite. It is also catalog-agnostic (REST catalog, Hive metastore, or anything else), so it does not force changes on the metadata side of your stack.

There is a real boundary worth stating plainly: this is Iceberg-and-GCS-native. The throughput advantage applies specifically to teams already on that pairing. If you run Delta Lake or Hudi, or you read the same tables through a path that bypasses GCSFileIO, you do not inherit these gains for free; those paths would need their own equivalent low-level work. That coupling is the cost of the speed, and it is worth saying out loud before anyone treats this as a universal storage accelerant.

Read the benchmark like an operator, not a marketer

Google ran end-to-end TPC-DS benchmarks on an Apache Spark cluster with an Iceberg catalog on GCSFileIO, comparing gcs-analytics-core against the default implementation across dataset sizes from 1 GB to 10 TB. Here is the full result, because the shape of the curve is the entire lesson. The published table reports exact percentages, so there is no excuse for the vague "massive reduction / marginal shift" summaries I keep seeing.

TPC-DS schema size   Scan time improvement   Execution time improvement
1 GB                 71.51%                  32.61%
10 GB                48.48%                  18.94%
100 GB               40.98%                  10.95%
1 TB                 35.86%                  3.38%
10 TB                18.40%                  1.58%

Two things jump out. Scan time, the wait-on-the-network component, improves everywhere, from 71.51% at 1 GB down to a still-useful 18.40% at 10 TB. But execution time, which is what your query latency and your bill actually track, improves dramatically only at small scale and nearly flattens by 1 TB: 32.61% down to 3.38%, and just 1.58% at 10 TB.

That divergence is physics doing its job. On small datasets, fixed per-request overhead dominates total runtime, so killing round-trips is transformative. As data grows, the compute engine becomes CPU-bound and network wait stops being the bottleneck; the I/O layer can only give back the time the engine was actually spending on I/O. The practical reading: this library is a high-leverage win for interactive, many-small-files, dashboard-and-exploration workloads, and a modest gain for multi-terabyte ETL where CPU already rules. Plan capacity around the execution-time column rather than the scan-time headline.
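A rough way to put numbers on that argument is an Amdahl-style back-calculation. The sketch below assumes, as a simplification that is mine and not Google's, that runtime splits cleanly into scan time plus everything else and that only the scan part shrinks. Under that assumption, execution improvement equals scan share times scan improvement, so the share of runtime spent scanning can be read off the published table.

```python
# Published improvements from the benchmark table: (scan %, execution %).
RESULTS = {
    "1 GB": (71.51, 32.61),
    "10 GB": (48.48, 18.94),
    "100 GB": (40.98, 10.95),
    "1 TB": (35.86, 3.38),
    "10 TB": (18.40, 1.58),
}


def implied_scan_share(scan_gain_pct: float, exec_gain_pct: float) -> float:
    """Fraction of baseline runtime spent scanning, assuming only scan time shrinks.

    With T = S + C and new scan time S * (1 - s), the execution gain is (S / T) * s,
    so S / T = exec_gain / scan_gain. Real engines overlap I/O and compute, so
    treat the result as a rough indicator rather than a measurement.
    """
    return exec_gain_pct / scan_gain_pct


def projected_exec_gain(scan_share: float, scan_gain_pct: float) -> float:
    """Execution-time gain (in percent) to expect for a given scan share."""
    return scan_share * scan_gain_pct


if __name__ == "__main__":
    for size, (scan, execution) in RESULTS.items():
        share = implied_scan_share(scan, execution)
        print(f"{size:>6}: implied scan share of runtime {share:.1%}")
```

Under that simplified model the implied scan share falls from roughly 46% of runtime at 1 GB to roughly 9% at 10 TB. To size your own expectations, measure how much of your query's wall-clock time is scan wait and feed that share into projected_exec_gain instead of borrowing the benchmark's headline.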

Enabling it without the silent failure

Adoption is two configuration flags plus the right bundle, and the most common operational mistake is setting one and forgetting the other. The library can be present on the classpath and intercepting calls while still doing nothing, because the parallel fetch path never turns on. That failure is silent: no error, no warning, just unchanged latency and a team convinced the library "didn't help."

Run this checklist before you declare it working:

  1. Bundle. Confirm iceberg-gcp-bundle and the Iceberg Spark runtime are both 1.11.0 or later. Earlier releases lack the embedded hooks entirely; this is the actual version gate, and a mismatched jar is the usual cause of a silent fallback to sequential reads.
  2. IO impl. Confirm the catalog's io-impl points at GCSFileIO. Reads must route through the connector the library patches, or none of the optimization applies.
  3. Library flag. Confirm the gcs.analytics-core.enabled property is true. This activates the centralized optimization layer.
  4. Vectorization flag. Confirm the Iceberg vectorization property is true. Without it the library is loaded but the parallel fetch path stays idle, which is the classic silent failure.
  5. Verification under load. Watch the Spark UI for concurrent connection spikes during footer reads. This proves vectored fetches are actually firing, beyond what the config claims.

That last step is the one I insist on. Do not trust the config file; trust the connection pattern under load. If you do not see parallel reads in flight during a scan, you are still paying the sequential tax regardless of what the properties say. New users can validate all of this on a throwaway cluster under the standard $300 new-customer credit before touching anything production-bound, which is exactly where you want to discover a missing flag.
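The static half of that checklist (steps 1 through 4) is easy to script before you ever look at the Spark UI. Here is a minimal sketch that assumes you have already collected the catalog properties into a plain dict. The io-impl and gcs.analytics-core.enabled keys and the GCSFileIO class name come from Iceberg; the vectorization key is deliberately a parameter, because the exact property depends on your engine and setup and I do not want to guess it for you. The function names are hypothetical.

```python
GCS_FILE_IO = "org.apache.iceberg.gcp.gcs.GCSFileIO"
MIN_VERSION = (1, 11, 0)


def parse_version(text: str) -> tuple:
    """Turn '1.11.0' or '1.11.0-SNAPSHOT' into a comparable (1, 11, 0) tuple."""
    return tuple(int(part) for part in text.split("-")[0].split(".")[:3])


def is_true(props: dict, key: str) -> bool:
    """Treat a property as enabled only if it is set to 'true' (case-insensitive)."""
    return str(props.get(key, "")).strip().lower() == "true"


def check_catalog(props: dict, bundle_version: str, runtime_version: str,
                  vectorization_key: str) -> list:
    """Return the checklist items that fail; an empty list means steps 1-4 pass.

    vectorization_key is supplied by the caller: it is whichever property
    enables vectorized reads in your deployment (an assumption, not a fixed name).
    """
    problems = []
    versions = (("iceberg-gcp-bundle", bundle_version),
                ("Iceberg Spark runtime", runtime_version))
    for name, version in versions:
        if parse_version(version) < MIN_VERSION:
            problems.append(f"{name} {version} is older than 1.11.0")
    if props.get("io-impl") != GCS_FILE_IO:
        problems.append("io-impl does not point at GCSFileIO")
    if not is_true(props, "gcs.analytics-core.enabled"):
        problems.append("gcs.analytics-core.enabled is not true")
    if not is_true(props, vectorization_key):
        problems.append(f"{vectorization_key} is not true")
    return problems
```

A passing result only proves the configuration is consistent. It says nothing about whether parallel reads are actually in flight, so step 5 still has to happen under load.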

One more deployment lesson from the network side: vectored I/O trades request count for concurrency, and concurrency has its own ceiling. Aggressive parallel fetches can brush against object-store request rate limits or simply saturate your egress link, and a saturated pipe gives the gains back with interest. The library manages concurrency internally, but on dense clusters with constrained networking, watch for throttling and executor memory pressure during the initial prefetch burst. Tune to the network tier you actually provisioned rather than the one in the benchmark.
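The general shape of that trade-off, a hard cap on requests in flight, can be sketched in a few lines. This illustrates the pattern only and makes no claim about how the library implements its internal limits; the tracker class, the fake fetch function, and the cap of 4 are assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InFlightTracker:
    """Context manager that records the peak number of concurrent requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1
        return False


def fake_fetch(offset: int, length: int) -> tuple:
    """Stand-in for a ranged GET; sleeps briefly to simulate network wait."""
    time.sleep(0.005)
    return (offset, length)


def fetch_all(ranges, fetch, max_in_flight: int):
    """Fetch every range with at most max_in_flight requests outstanding.

    Returns the results in input order plus the observed peak concurrency.
    """
    tracker = InFlightTracker()

    def tracked(rng):
        with tracker:
            return fetch(*rng)

    # The pool size is the cap: extra ranges queue instead of opening connections.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        results = list(pool.map(tracked, ranges))
    return results, tracker.peak


if __name__ == "__main__":
    ranges = [(i * 4096, 4096) for i in range(32)]
    _, peak = fetch_all(ranges, fake_fetch, max_in_flight=4)
    print(f"peak requests in flight: {peak}")
```

The number worth watching is the peak multiplied across executors. A per-task cap that looks modest in isolation can still add up to more concurrent requests than your network tier or request quota comfortably absorbs.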

About

I am Alex Kumar, a Senior Platform Engineer and Infrastructure Architect at Rabata.io, working remotely out of Toronto. My day job is S3-compatible object storage for AI/ML and data teams whose first question is always cost, and whose second is throughput. The work I gravitate to lives below the application: making storage layers fast and predictable without asking anyone to rewrite their code, and keeping the monthly bill something a finance team can actually read.

My background runs through SRE and platform engineering, and one pattern showed up often enough to become a reflex: the job everyone calls slow is usually an I/O job wearing compute's clothes. That is why something like gcs-analytics-core holds my attention more than yet another pass at node-count tuning. It goes after the round-trip itself, the part most teams never put on a flame graph and then steadily overpay for.

Conclusion

The gcs-analytics-core library is the kind of optimization I like. It removes work instead of throwing hardware at it, and it ships behind a flag so the blast radius is tiny. If you run Iceberg on GCS, especially interactive or small-to-mid-scale Parquet workloads, the scan-time reduction is close to free once you get both flags right and verify the parallel reads are real. The discipline is in reading the benchmark for what it actually measures: huge wins where round-trips dominate, diminishing returns where CPU does, and a genuine lock-in to the Iceberg-on-GCS pairing that you should weigh against future engine diversity.

Here is the do-this-next. Spin up a throwaway Spark cluster on the $300 credit, set io-impl to GCSFileIO, flip both the analytics-core and vectorization flags to true on 1.11.0+ bundles, and run one of your real interactive queries while you watch the Spark UI for parallel reads. If the connection spikes appear, promote the config to the workloads where your curve favors you. If they don't, fix the flag before you blame the library.

Frequently Asked Questions

Which workloads gain the most?

Small-to-mid-scale, read-heavy, many-small-files workloads - interactive queries, dashboards, exploration - gain the most, because fixed per-request overhead dominates their runtime. The benchmark shows execution time improving 32.61% at 1 GB but only 3.38% at 1 TB, so the value concentrates where scans, not CPU, are the bottleneck.

Why did I enable the library and see no improvement?

Almost always because the vectorization flag is off while the analytics-core flag is on. The library loads and intercepts calls but never executes parallel multi-range fetches, so latency stays flat with no error. Confirm both flags are true and watch the Spark UI for concurrent connections during footer reads to prove the parallel path is firing.

Does this help other table formats or other clouds?

Not automatically. The optimization is native to Iceberg's GCSFileIO on Google Cloud Storage, so the throughput advantage applies to that pairing specifically. Other table formats or cross-cloud setups would need their own equivalent low-level I/O work; treat this as an Iceberg-on-GCS feature, not a universal accelerant.

Which versions do I need?

Apache Iceberg 1.11.0 or later, with both the Iceberg Spark runtime and the iceberg-gcp-bundle at 1.11.0+. Older releases do not contain the embedded hooks, so no flag will activate the optimization. Verify the bundle version first - a mismatched jar is a common cause of silent fallback to sequential reads.

Can vectored I/O make performance worse?

Yes, on constrained networks. Vectored I/O trades many small requests for high concurrency, which can hit object-store request limits or saturate your egress link, and a saturated pipe erodes the gains. The library caps concurrency internally, but on dense clusters watch for throttling and executor memory pressure during the prefetch burst, and tune to your actual network tier.