Storage Insights Datasets: Read the Access Log Before You Move Bytes


Picture the day someone on your team has to decide whether a 40-terabyte training corpus is safe to archive. The age-based lifecycle rule says it is cold. Nobody in the room can prove it. The corpus is, in fact, read four times a year, by one quarterly retraining job, from one region, but that pattern lives nowhere a policy engine can see it. The decision comes down to whoever remembers the access pattern, or whoever kept a stale access-log export in a spreadsheet. That spreadsheet is exactly the artifact Google Cloud is now trying to replace.

On June 10, 2026, Google Cloud's storage team (Misha Sheth and Kumar Nachiketa) announced general availability of activity insights inside Storage Insights datasets, a feature of Storage Intelligence for Cloud Storage. The pitch is straightforward: instead of knowing only *what* objects you hold, you now get a query-ready BigQuery index of *how and when* they are accessed, written, deleted, and erroring.

For anyone running AI training data at the scale of billions of objects, that shift from static inventory to live access intelligence is the difference between guessing a lifecycle policy and proving one.

I work on S3-compatible storage for a living, so my interest here is not which logo is on the bucket. It is the operational pattern: stop tiering by object age and start tiering by observed access. Google has shipped the telemetry to do that natively. The catch is that the telemetry arrives on its own schedule, costs money to query, and is only as safe as your discipline in reading it. Here is how I would actually use it, and where I would not trust it.

Static Inventory Tells You What You Own; Activity Insights Tells You What You Use

The older half of Storage Insights is an inventory scanner. It reads object metadata (storage class, location, age, custom metadata) and lands it in a BigQuery-linked dataset you can query with SQL or with Gemini. Useful, but it answers an inventory question: what is in the warehouse.

The new views answer the operational question: what moves, and when. Per the source announcement, they capture object-level activity (writes, updates, deletes, errors via `object_events_view`), bucket-level aggregates (total operations, error counts, most-active prefixes), bucket-level regional traffic (ingress and egress bytes per region), and project-level rollups. That last one, regional traffic, is the quiet headline. Egress is where object storage bills go to die, and until now you mostly inferred it.

| Question you are asking | Static inventory | Activity insights |
| --- | --- | --- |
| What storage class is this object in? | Yes | Yes |
| How many reads did this bucket take last quarter? | No | Yes |
| Which region is actually pulling this data? | No | Yes |
| Where are 429 errors concentrated? | No | Yes |

The announcement's own framing gets it right: this is the difference between knowing what is in your warehouse and knowing what is used and when. The catch is that "when" has a clock attached to it.

The 48-Hour Cold Start Governs When You Can Provision This

Two latency numbers govern everything you do with this feature, and the announcement states both plainly. After you configure a dataset, it can take up to 48 hours for the first data to appear in BigQuery. Once it is flowing, new activity typically lands within four hours of the event.

Treat the first number as a hard operational constraint. For up to 48 hours after you configure a dataset, you have no SQL analytics, no baseline, and no ability to validate your own configuration scope. If you switch this on mid-incident hoping to diagnose a live problem, you have switched on a tool that may not return its first row until two days later, by which time the fire is usually out.

Enable it during a maintenance window, before you need it. Keep whatever logging you already trust running in parallel until the dataset is populated and you have eyeballed a day of data.

The four-hour steady-state latency sets a different limit. It is fast enough to feel real-time and slow enough to burn you if you wire it directly into automated lifecycle deletion. During a flash crowd, millions of failed requests can accumulate inside that window before they surface.

Activity insights is excellent for forensic and trend analysis. It is not a real-time alerting bus. Anything that deletes or transitions objects on its signal should run on a deliberately slow cadence (weekly rather than hourly), with a grace period and a sanity check against another source before it acts.
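A minimal sketch of how that lag can be respected when choosing a lookback window. The helper names are hypothetical, and the four-hour margin is the announcement's "typical" figure used as a planning assumption, not a guarantee.

```python
from datetime import datetime, timedelta, timezone

# Assumption: treat the "typically within four hours" figure as a safety margin.
INGEST_LAG = timedelta(hours=4)


def settled_window(now, days, lag=INGEST_LAG):
    """Return (start, end) for a lookback that skips the still-landing tail."""
    end = now - lag
    return end - timedelta(days=days), end


def is_settled(event_time, now, lag=INGEST_LAG):
    """True if an event is old enough that it has probably been ingested."""
    return event_time <= now - lag


now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
start, end = settled_window(now, days=7)
print(start.isoformat(), "to", end.isoformat())
print(is_settled(datetime(2026, 7, 1, 10, 0, tzinfo=timezone.utc), now))
```

Anything newer than the window's end is treated as unknown, not as silence, which is the distinction that keeps a weekly job from acting on half-ingested data.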

Tier by Observed Access, and the Egress Math Decides Whether It Pays

This is the use case I would deploy first, and it is the one the age-based rule keeps getting wrong. The right-sizing play: query the activity views for buckets with minimal read and write activity over a 30-, 60-, or 90-day window, then transition genuinely cold data from Standard or Nearline toward Coldline or Archive. The advantage over a static age rule is that you are acting on observed silence, not on a timestamp that says nothing about whether a quarterly job still touches the data.
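To make the window choice concrete, here is a small sketch over hypothetical rows shaped as if they had been exported from a bucket-level activity view. The tuple layout, bucket names, and `cold_buckets` helper are illustrative assumptions, not the dataset's actual schema.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical export: (bucket, day, read_ops, write_ops).
# Real view and column names will differ.
rows = [
    ("training-corpus", date(2026, 3, 31), 1200, 0),  # quarterly retraining job
    ("old-exports", date(2026, 1, 15), 3, 0),
    ("scratch", date(2026, 6, 28), 50, 400),
]


def cold_buckets(rows, today, window_days, max_ops=0):
    """Buckets whose reads + writes inside the window total <= max_ops.

    Only buckets that appear in `rows` at all are considered.
    """
    cutoff = today - timedelta(days=window_days)
    ops = defaultdict(int)
    for bucket, day, reads, writes in rows:
        ops[bucket] += (reads + writes) if day >= cutoff else 0
    return sorted(b for b, n in ops.items() if n <= max_ops)


today = date(2026, 6, 30)
for window in (30, 60, 90, 120):
    print(window, cold_buckets(rows, today, window))
```

With this sample data the quarterly job's bucket looks cold at 30, 60, and 90 days and only drops off the list at 120, which is why the window should be at least as long as the slowest job you know about.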

But cold is not the same as cheap to move, and this is where the cost analysis earns its keep. Retrieval and egress fees can erase the storage saving entirely. Per 2026 pricing guides, Google Cloud Standard sits around $0.020/GB per month while egress runs roughly $0.12/GB transferred, and that egress number is the one that wrecks budgets. A bucket with near-zero writes but occasional cross-region reads can cost more in Archive retrieval and transfer than it ever cost sitting in Nearline.

The regional traffic view exists precisely to catch this: before you relocate, confirm where the reads originate. If the access is concentrated in one region, co-locating compute and storage usually beats any tier change.
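The arithmetic is worth running before any move. The sketch below uses the two rates quoted above ($0.020/GB-month for Standard, $0.12/GB egress); the Archive storage and retrieval rates, the monthly read volume, and the assumption that co-located reads incur no egress are placeholders to swap for your own rate card and traffic numbers.

```python
def monthly_cost(size_gb, read_gb, storage_rate, retrieval_rate, egress_rate):
    """Monthly USD: bytes at rest plus every GB read (retrieval + egress)."""
    return size_gb * storage_rate + read_gb * (retrieval_rate + egress_rate)


SIZE_GB = 40_000   # the 40 TB corpus from the opening example
READ_GB = 10_000   # hypothetical cross-region reads per month

# (storage $/GB-month, retrieval $/GB, egress $/GB)
# 0.020 and 0.12 are the article's figures; the Archive rates are placeholders.
OPTIONS = {
    "Standard, cross-region reads": (0.020, 0.00, 0.12),
    "Archive, cross-region reads": (0.0012, 0.05, 0.12),
    "Standard, compute co-located": (0.020, 0.00, 0.00),
}

costs = {name: monthly_cost(SIZE_GB, READ_GB, *rates) for name, rates in OPTIONS.items()}
for name, usd in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:32s} ${usd:,.2f}/month")
```

Under these inputs the co-located option comes to about $800 a month, Archive to about $1,748, and the unchanged layout to about $2,000. The tier change saves a little; removing the egress saves far more.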

Shipt, cited in the announcement, is the cleanest real-world example. Managing over two billion objects, they used Insights datasets to detect egress charges from multi-region buckets, then relocated 1.3 petabytes from multi-region to regional storage. Observed traffic drove that placement, where an age-based tier rule would have been blind to it. That is the pattern worth copying: settle the access geography, and let the storage class follow from it.

The Hidden Line Item Is Compute, and It Scales With Every Query

Here is the part the marketing underplays. Running analytics on Storage Insights datasets accrues standard BigQuery query costs that scale with the data you scan. The architecture trades a fixed-fee metrics product for unlimited SQL flexibility, and that trade has a sharp edge.

The contrast with a competing model is instructive. AWS S3 Storage Lens charges by the number of objects monitored (per 2026 pricing references, on the order of $0.20 per million objects monitored per month for advanced metrics), so the bill stays predictable whether you query it once or a thousand times. Google's BigQuery-backed approach gives you far more analytical reach but converts a predictable line item into a variable one. A careless `SELECT *` across billions of activity rows can quietly eclipse the storage savings the whole exercise was meant to produce.

Query discipline is what keeps this in check. Filter every query with tight `snapshotEndTime` constraints. Estimate scan volume before you run broad queries. Treat query optimization as an ongoing cost-control task rather than a one-time setup. The feature pays for itself only if the cost of asking questions stays well below the cost of the inefficiencies it reveals, and nothing about that ratio holds itself in place on its own.
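One way to enforce that ratio is a budget check that runs before the query does. In the sketch below, the scan estimate is just an input; with the BigQuery Python client it can come from a dry-run job (`QueryJobConfig(dry_run=True)`), whose `total_bytes_processed` reports the bytes a query would scan. The $6.25 per TiB rate and the 5% ceiling are assumptions for illustration, so substitute your own pricing and tolerance.

```python
TIB = 2**40


def scan_cost_usd(bytes_scanned, usd_per_tib):
    """On-demand cost of a scan, given a per-TiB rate."""
    return bytes_scanned / TIB * usd_per_tib


def within_budget(bytes_scanned, usd_per_tib, monthly_savings_usd, max_share=0.05):
    """Allow a query only if it costs a small share of the savings it targets."""
    return scan_cost_usd(bytes_scanned, usd_per_tib) <= monthly_savings_usd * max_share


RATE = 6.25      # assumed on-demand $/TiB; check your region and edition
SAVINGS = 500.0  # hypothetical monthly saving the analysis is chasing

for tib in (1, 5):
    est = tib * TIB
    print(tib, "TiB:", scan_cost_usd(est, RATE), within_budget(est, RATE, SAVINGS))
```

Here a 1 TiB scan ($6.25) passes and a 5 TiB scan ($31.25) does not, because the ceiling is 5% of a $500 saving, or $25.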

A Verification Routine Before You Trust the Data

Before any of this drives a state change, run the same short routine I use on a new dataset. None of it requires anything beyond the console and BigQuery. The table below pairs each setup choice with the failure it prevents.

| Setup choice | What to set | What it prevents |
| --- | --- | --- |
| Scope | Point the dataset at the org, folder, project set, or buckets that matter | Too wide ingests noise you pay to scan; too narrow hides cross-project dependencies that distort cost allocation |
| Retention | Thirty days for monthly lifecycle reviews; a longer window for quarterly audits on seasonal workloads, at proportional storage cost | Activity logs expiring before the review cycle that needs to read them |
| Cold start | Confirm the first data has landed and a full day of activity looks plausible | Drawing conclusions from a dataset that is still backfilling |
| Cross-verification | Pair an activity-view finding with the latest attribute snapshot and your application logs | Letting automation delete on a zero-activity reading that a second source would contradict |
| Cadence | Validate destructive actions on a weekly window | Acting on a bucket that looks idle for a day but is load-bearing for the quarter |
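Most of that routine can be expressed as a gate that a destructive job must pass. This is a sketch under stated assumptions: the `Finding` fields and thresholds are hypothetical names for values you would gather yourself, and scope is left out because it is a configuration review and not something a runtime check can confirm.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    bucket: str
    hours_since_configured: float  # age of the dataset configuration
    retention_days: int            # dataset retention setting
    review_cycle_days: int         # how far back the review must look
    activity_ops: int              # operations seen in the activity view
    app_log_ops: int               # operations seen in your application logs
    days_since_last_run: int       # days since the last destructive run


def blockers(f: Finding) -> list[str]:
    """Reasons not to act on this finding; an empty list means proceed."""
    reasons = []
    if f.hours_since_configured < 48 + 24:
        reasons.append("cold start: wait for first data plus a full day")
    if f.retention_days < f.review_cycle_days:
        reasons.append("retention is shorter than the review cycle")
    if f.activity_ops == 0 and f.app_log_ops > 0:
        reasons.append("application logs contradict the zero-activity reading")
    if f.days_since_last_run < 7:
        reasons.append("weekly cadence has not elapsed")
    return reasons


finding = Finding("training-corpus", 200, 30, 90, 0, 4, 10)
for reason in blockers(finding):
    print("BLOCKED:", reason)
```

The sample finding is blocked twice: a 30-day retention cannot support a 90-day review, and the application logs show reads the activity view missed.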

About

I am Marcus Chen, a Cloud Solutions Architect and Developer Advocate at Rabata.io, working remotely out of Singapore. My days center on S3-compatible object storage, Kubernetes persistent volumes, and the data infrastructure that feeds AI and ML training. The conviction I keep coming back to is that a storage decision should rest on a reproducible benchmark and a full total-cost-of-ownership figure, one that prices in egress, retrieval, and the engineer-hours a migration eats, rather than on the per-gigabyte number printed at the top of a rate card.

My path to that view ran through Wasabi Technologies, a Kubernetes-native AI startup before it, and a media-streaming shop before that, where serving petabytes of video taught me how fast a careless egress assumption compounds. Certifications (AWS Solutions Architect Professional, CKA, Terraform Associate) and a cloud-computing master's are the credentials; the working habit they leave me with is simpler. When a feature like Storage Insights datasets lets me measure access instead of estimating it, I reach for the measurement every time.

Conclusion

Storage Insights datasets are the right instrument for a real problem. At billions of objects, age-based tiering and manual log inspection both break, and you end up either overpaying for warm storage or archiving data something still needs. The activity views close that visibility gap, and they hand you three constraints in the same box: the 48-hour cold start makes this a tool you provision ahead of an incident, the four-hour latency keeps it informing decisions instead of triggering real-time automation, and the BigQuery cost model means the questions themselves carry a bill that query discipline has to contain.

Once it is running, the signal I would watch first is the regional traffic view: where the reads on your largest buckets actually originate, tracked over a full review cycle. That number moving is what tells you a placement change will pay before a tier change ever could. After that, watch your monthly BigQuery scan bytes against the storage savings the queries surface; the day the former starts approaching the latter is the day the exercise has stopped earning its keep. Those two trends, more than any single snapshot, tell you whether the tool is working for you.

Frequently Asked Questions

How long does it take for activity data to appear in BigQuery?

Initial population takes up to 48 hours after you configure a dataset, during which no SQL analytics are possible. Once data is flowing, new activity typically appears within four hours of the event. Enable the feature ahead of need, not during a live incident, because a dataset still in its cold start cannot help you diagnose something happening right now.

Can activity insights drive real-time alerting or automated deletion?

No. The four-hour latency makes it excellent for forensic and trend analysis but unsafe as a real-time trigger, since a flash crowd can accumulate millions of events inside that window. Run destructive automation on a weekly cadence with a grace period, and cross-check any zero-activity reading against application logs before transitioning or deleting objects.

Does moving cold data to a colder storage class always save money?

Not always. Storage savings can be erased by retrieval and egress fees, with egress (around $0.12/GB per 2026 pricing guides) being the usual budget-killer. A bucket with rare cross-region reads can cost more in Archive than it did in Nearline, so check the regional traffic view first and prefer co-locating compute with storage over a blind tier downgrade.

How does the query cost compare with AWS S3 Storage Lens?

Storage Insights queries accrue standard BigQuery costs that scale with scanned data, unlike AWS S3 Storage Lens, which charges a predictable per-object fee (roughly $0.20 per million objects for advanced metrics, per 2026 references). Google's model gives more analytical reach but variable cost, so filter with tight `snapshotEndTime` constraints and estimate scan volume before broad queries.

Which should I fix first, the storage class or the region?

Region first. Shipt used Insights datasets to detect egress from multi-region buckets and moved 1.3 petabytes to regional storage based on where traffic actually originated, not on object age. Confirm access geography in the regional traffic view before considering a storage-class change, because misplaced data costs more than mis-tiered data.