PostgreSQL storage: Why S3 fails WAL flushes
PostgreSQL stalls because WAL flushes demand microsecond latency that cheap storage cannot provide.
The prevailing thesis for 2026 is clear: attempting to force S3 object storage to handle high-frequency transactional writes is a fundamental architectural error that sacrifices availability for false economy. As TechTarget notes, while AI and hybrid multi-cloud strategies drive cost governance, the physical reality of disk I/O latency remains the ultimate bottleneck for database durability. Alasdair Brown's analysis confirms that the primary challenge in running Postgres is not the volume of bytes stored, but surviving the moments when the database must stop and wait for durable commits.
Readers will learn why conflating distinct storage jobs leads to catastrophic checkpointing delays and how to properly segregate workloads. The discussion dissects the internal mechanics of Write-Ahead Logging, demonstrating why frequent small writes fail on standard tiers. It further outlines a pragmatic architecture that places NVMe drives strictly on the critical path while using S3 integration for everything else.
This approach aligns with the broader industry shift toward specialized Storage Area Network designs that prioritize throughput where it matters most. By separating the ephemeral needs of the hot path from the infinite durability of object stores, engineers can avoid the latency spikes that plague monolithic storage configurations.
Defining the Hot Path and Storage Tiers in Modern Postgres
Defining the Hot Path via WAL Flush Latency in Postgres
Blocking duration inside XLogFlush() constitutes the hot path where backends wait for durable storage confirmation. Execution halts within the database backend until the kernel signals that data is physically safe on disk. This mechanism defines the operational limit of transaction throughput more strictly than CPU capacity or network bandwidth ever could. Operators often mistake storage capacity for performance, yet the real constraint lies in these micro-second pauses. High input-output rates matter only if the underlying media sustains low tail latency during flush events. A system handling 1 million records might still choke if the storage tier introduces millisecond-scale delays during these critical writes. Responsive OLTP engines rely on this distinction to avoid becoming unresponsive.
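These flush waits are directly observable from inside the database. As a minimal sketch, assuming PostgreSQL 14 or later with `track_wal_io_timing = on` (the timing columns have moved in later major versions, so check your release's `pg_stat_wal` layout), the average fsync cost can be derived like this:

```sql
-- Average WAL fsync latency; the *_time columns require track_wal_io_timing = on
SELECT wal_sync,                                                  -- total fsync calls
       wal_sync_time,                                             -- total ms spent syncing
       round((wal_sync_time / nullif(wal_sync, 0))::numeric, 3) AS avg_sync_ms
FROM pg_stat_wal;
```

An `avg_sync_ms` drifting into whole milliseconds on a supposedly fast tier is the signature of the stalls described above.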
Placing all data on high-speed NVMe drives wastes money on cold datasets; cost efficiency and consistent commit latency pull in opposite directions when a single tier must serve everything. Network-attached object storage fails here because it cannot guarantee the immediate durability signal that Postgres requires. Architects must isolate this write-intensive workflow onto local NVMe volumes while offloading older partitions to cheaper tiers, as sketched below. Failure to segregate these paths forces expensive storage to hold idle data or slows active transactions to object-storage speeds.
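One hedged way to express this segregation is with tablespaces; the mount points and table names below are hypothetical, and note that the WAL directory itself is relocated at `initdb` time (`--waldir`) or via a symlink rather than through a tablespace:

```sql
-- Hypothetical mount points: local NVMe for hot data, cheap bulk storage for cold
CREATE TABLESPACE hot_nvme  LOCATION '/mnt/nvme/pgdata';
CREATE TABLESPACE cold_bulk LOCATION '/mnt/bulk/pgdata';

-- Keep the active partition on NVMe, demote an aged partition
-- (SET TABLESPACE rewrites the table, so schedule it off-peak)
ALTER TABLE orders_2026_01 SET TABLESPACE hot_nvme;
ALTER TABLE orders_2024_07 SET TABLESPACE cold_bulk;
```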
Matching NVMe Speeds to PostgreSQL 8 KB Page Reads
Sub-millisecond latency from NVMe storage supports the blocking 8 KB page reads that define OLTP performance. According to Read Performance and Page Sizes, Postgres stores heap data and indexes in 8 KB pages, and a buffer cache miss forces a backend to block on a small random read. Object storage like S3 lacks the random-read characteristics for these operations, creating a hard ceiling on query throughput. Cold tiers cannot serve active index lookups without introducing unacceptable wait states due to this architectural mismatch.
Transient compute layers require separation from persistent object stores to maintain viable transaction speeds. This separation creates a distinct cost boundary where local NVMe handles the hot path while S3 archives historical data. Hybrid approaches fail when applications inadvertently trigger scans across the cold boundary during peak load, and the penalty manifests as steadily growing latency rather than clean timeout errors.
According to Read Performance and Page Sizes, NVMe is well-suited for these operations whereas object storage is not due to inherent protocol overhead. Active datasets require local attachment to avoid I/O starvation in cloud-native databases. Placing warm data on network volumes risks cascading delays during checkpoint spikes. Bandwidth metrics often disguise the true cost of seek time on distributed systems. Each engine handles the hot path differently during storage interaction. PostgreSQL leverages NVMe concurrency to saturate I/O queues. MySQL often stalls on mutex contention during high-frequency writes. Benchmark results indicate a fundamental architectural divergence in handling small, random reads typical of OLTP workloads.
| Storage Type | Random Read Latency | Suitability for Hot Path |
|---|---|---|
| Local NVMe | Microseconds | Necessary |
| Network Block | Milliseconds | Marginal |
| Object Storage | Seconds | Unsuitable |
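Whether a workload is actually hitting the buffer cache or falling through to these slow random reads can be checked with a standard catalog query; this sketch uses cumulative counters, so sample deltas over time for live analysis:

```sql
-- Share of reads served from shared_buffers for the current database
SELECT blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS buffer_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```

A hit ratio falling much below ~0.99 on an OLTP workload means backends are regularly blocking on the 8 KB reads described above, and the storage tier's random-read latency becomes the query latency.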
Meanwhile, as reported by Read Performance and Page Sizes, PostgreSQL sustained 21,338 single-row INSERT operations per second compared to 4,383 QPS for MySQL in identical sysbench environments. Write amplification limits performance here; MySQL's double-writing to undo logs creates additional disk pressure that erodes the throughput gains from faster storage hardware. Operators prioritizing raw ingestion rates on limited NVMe footprints face a tangible ceiling with MySQL that Postgres does not encounter under similar load profiles. Mission and Vision recommends isolating these write-heavy paths to dedicated low-latency volumes to prevent queue depth saturation. Ignoring this tier separation lets elevated tail latency propagate through dependent microservices that rely on synchronous database responses.

| Metric | PostgreSQL | MySQL |
| :--- | :--- | :--- |
| Select Latency (1M records) | 0.6–0.8 ms | 9–12 ms |
| Single-row INSERT Throughput | 21,338 QPS | 4,383 QPS |
| Storage Optimization Target | NVMe Latency | Buffer Pool Size |
Internal Mechanics of Write-Ahead Logging and I/O Latency
How futex_wait and Context Switches Drive Postgres Off-CPU Time
Approximately 28% of off-CPU time is consumed by `futex_wait` during transaction commits, forcing backends to block on storage acknowledgment. This synchronization primitive halts process execution while the kernel waits for the Write-Ahead Log to flush to durable media. The mechanism creates a direct dependency between disk latency and database throughput. When storage responds slowly, the backend remains in a wait state rather than consuming CPU cycles. The data shows the remaining 72% of off-CPU time involves involuntary context switches and other system-level delays caused by I/O contention. High concurrency exacerbates this behavior as multiple backends compete for the same WAL write lock. The operating system scheduler must then perform expensive context switches to move runnable threads onto available cores. This overhead scales poorly when storage cannot acknowledge writes within microseconds.
Storage speed dictates CPU efficiency more than core count does. Adding processing power fails to address the root cause if the disk subsystem cannot clear the `futex_wait` queue. Operators must prioritize low-latency NVMe over CPU count to minimize these specific wait states. Mission and Vision recommends isolating high-commit workloads on local storage to prevent context switch storms from degrading overall cluster performance. This performance gap exists because OLTP index lookups trigger random heap fetches and visibility checks that stall backends during disk waits. Traditional storage queues these small requests sequentially, creating a bottleneck where the CPU sits idle while waiting for mechanical arms or slow flash controllers to locate scattered 8 KB pages. NVMe drives resolve this by parallelizing thousands of simultaneous queue entries, allowing the database to pipeline read-ahead operations effectively.
| Storage Type | Random Read Pattern | Backend State |
|---|---|---|
| HDD/SATA SSD | Sequential Queue | Blocked |
| NVMe | Parallel Queue | Active |
Deployments ignoring this distinction force the query planner to avoid beneficial indexes, leading to full table scans that consume excessive memory. The cost of mismatched storage is not merely slower queries but a fundamental shift in execution plans that degrades overall system stability. Operators fixing slow Postgres write performance must prioritize local NVMe over network-attached block storage to minimize commit latency variance. When storage latency spikes, the resulting futex_wait events accumulate, causing cascading timeouts in application layers that expect consistent response times. Mission and Vision recommends isolating high-churn tables onto dedicated NVMe volumes to guarantee sub-millisecond access for critical transaction paths.
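The database-side counterpart of those OS-level futex_wait samples is the wait-event view; a quick sketch for spotting storage-bound backends follows, where events such as `WALSync`, `WALWrite`, or `DataFileRead` in the IO class point at the disk rather than the CPU:

```sql
-- Snapshot of what active backends are currently waiting on
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
```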
Mechanics: PostgreSQL vs MySQL Latency Profiles in High-Concurrency Sysbench Tests
Based on high-concurrency sysbench tests, PostgreSQL sustained 21,338 single-row INSERT operations per second while MySQL capped at 4,383 QPS in identical environments. This divergence originates from how each engine serializes Write-Ahead Log flushes during high-concurrency commit phases. PostgreSQL pipelines these writes to exploit NVMe queue depth, whereas MySQL often serializes access via global mutexes that throttle throughput under load. The mechanism relies on non-blocking I/O to keep backends running rather than waiting for disk acknowledgment. However, this architecture assumes the underlying storage can absorb bursty write patterns without introducing tail latency spikes. Operators deploying on standard SSDs rather than NVMe may observe diminished returns due to queue saturation limits.
Troubleshooting efforts must prioritize distinguishing between CPU-bound query execution and storage-bound commit waits. Mission and Vision recommends isolating futex_wait signals to confirm whether disk latency drives the slowdown. A hidden tension exists between maximizing concurrency and maintaining low tail latency: adding more workers increases context-switching overhead once storage cannot keep pace. The cost is measurable in CPU cycles wasted spinning on locks rather than processing transactions.
Architecting Tiered Storage with NVMe and S3 Integration
Defining the NVMe and S3 Storage Boundary in Postgres

Database performance often hinges on surviving moments when the system must pause for disk acknowledgment rather than simply storing bytes. This reality forces a strict architectural split where NVMe handles latency-sensitive Write-Ahead Logs while S3 manages durable historical data. PostgreSQL used only 3.8 GB to store 10 million records, whereas MongoDB required 8.4 GB for the same dataset, evidence that efficient on-disk formatting reduces the absolute capacity needed for hot storage tiers. Keeping the active working set on local flash avoids network latency during commit phases. Conflating these storage classes introduces significant risk because placing WAL on object storage creates unacceptable commit delays due to network round-trip times. Operators often hesitate because io1 Dedicated Log Volumes can cost $428/month for modest configurations, yet the performance penalty of misplacement exceeds this operational expense.
Mission and Vision recommends isolating write paths to local media regardless of total dataset size. The boundary definition remains binary: any data required for immediate transactional consistency resides on NVMe, while everything else migrates to object stores. This pricing structure forces a strict separation where Write-Ahead Log files reside on local NVMe while archival data moves to cheaper object stores. The mechanism relies on `wal_sync_method` to serialize commits directly to block devices, avoiding the high latency penalties inherent in networked object storage paths. Operators configure `pg_basebackup` to stream completed segments to S3, decoupling durable transaction logs from long-term retention requirements. Relying solely on cloud-managed storage for active logs introduces variable tail latency that disrupts commit consensus during peak loads. Paying premium rates for NVMe ensures predictable flush times, whereas shifting all I/O to network storage risks stalling the entire backend cluster.
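One common way to implement that decoupling is continuous WAL archiving; the sketch below assumes the `aws` CLI is installed on the database host and uses a hypothetical bucket name. Archiving runs once per completed segment and never sits on the commit path:

```sql
-- archive_mode needs a server restart; archive_command only needs a reload
ALTER SYSTEM SET archive_mode = 'on';
-- %p expands to the path of the completed segment, %f to its file name
ALTER SYSTEM SET archive_command = 'aws s3 cp %p s3://example-pg-archive/wal/%f';
SELECT pg_reload_conf();
```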
Deployment strategies must prioritize low-latency paths for active transaction logs over raw capacity expansion.
- Isolate WAL directories on dedicated NVMe volumes to prevent checkpoint contention.
- Schedule incremental backups to S3 during off-peak windows to preserve write bandwidth.
- Monitor queue depth metrics to detect early signs of storage-induced backpressure.
- Validate backup restoration procedures quarterly to ensure data recoverability from cold storage tiers.
The limitation of this tiered approach involves the complexity of managing distinct lifecycle policies across block and object storage interfaces. Mission and Vision recommends validating backup restoration procedures quarterly to ensure data recoverability from cold storage tiers.
Avoiding Latency Traps When Integrating S3 with PostgreSQL
Directing Write-Ahead Log writes to object storage creates immediate transaction stalls because the backend blocks inside XLogFlush() until the kernel confirms durability. This synchronization point forces the database to wait for network round-trips that exceed acceptable commit budgets by orders of magnitude. Operators often conflate durable archival with active logging, tempted by the infinite scale and low cost of S3 for all data types. This architectural error ignores that PostgreSQL performance relies on microsecond-level flush times found only in local NVMe configurations.
| Operation Type | Storage Requirement | Latency Sensitivity |
|---|---|---|
| WAL Commits | Local Block Device | Critical |
| Hot Data Pages | Local NVMe | High |
| Archival Backups | Object Storage | Low |
The correct deployment limits S3 usage to non-blocking operations like `pg_basebackup` streams or cold checkpoint archives. AWS RDS pricing models illustrate the cost of dedicated log volumes needed to avoid these traps, contrasting sharply with flat object storage rates. A fatal oversight occurs when backup verification processes inadvertently trigger random reads against cold S3 objects, stalling recovery threads just as they would active transactions. Mission and Vision recommends isolating the write path entirely from object interfaces to prevent latency spikes from cascading into application timeouts. Only completed segments should traverse the network, leaving the critical commit path on local disk.
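Archiving health can be watched from the catalog so that shipping to S3 stays verifiably off the commit path; a minimal check:

```sql
-- A rising failed_count means completed segments are piling up on local disk
SELECT archived_count, last_archived_wal, last_archived_time,
       failed_count, last_failed_wal
FROM pg_stat_archiver;
```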
Implementing Optimized I/O Paths and WAL Configuration
Defining wal_sync_method for NVMe Flush Behavior

Linux systems rely on the `wal_sync_method` parameter to dictate synchronization protocols for sequential WAL writes, according to data from enterprisedb.com/blog/postgres-storage-system-raid-levels-compare-performance-costs-das-nas-iscsi-nfs. This setting determines whether Postgres uses `fdatasync` or `open_datasync` to acknowledge durability. The mechanism forces the kernel to flush dirty pages before returning control to the backend process. Selecting `open_datasync` on modern NVMe drives often introduces redundant metadata operations that inflate commit latency without adding safety. Measurable IOPS are wasted during high-throughput ingestion phases, where microsecond delays accumulate into visible application stutter. Operators must align this configuration with the specific power-loss protection capabilities of their underlying hardware rather than relying on distribution defaults.
- Identify the current synchronization method in `postgresql.conf` (sketched after this list).
- Test `fdatasync` versus `open_datasync` under sustained write loads.
- Verify drive firmware acknowledges flush commands immediately.
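A minimal sketch of that inspect-and-switch cycle, assuming superuser access; raw per-method flush timings are usually gathered with the bundled `pg_test_fsync` utility before committing to a change:

```sql
-- Inspect the active method and confirm it is reload-time changeable
SELECT name, setting, context
FROM pg_settings
WHERE name = 'wal_sync_method';

-- Switch methods, reload, then re-run the write benchmark
ALTER SYSTEM SET wal_sync_method = 'fdatasync';
SELECT pg_reload_conf();
```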
Mission and Vision recommends validating these settings after any storage subsystem upgrade. A mismatch between configured sync methods and drive capabilities creates a false sense of security while degrading performance.
Configuring Azure Premium SSD v2 for 80,000 IOPS Throughput
Azure Database for PostgreSQL Flexible Server with Premium SSD v2 reaches 80,000 IOPS and 1,200 MiB/s throughput. Operators must size storage volumes to unlock these performance tiers without overspending on unused capacity. The mechanism ties the attainable IOPS ceiling to the allocated storage size, requiring precise calculation of the active dataset footprint. Scalability extends to 64 TiB, allowing massive buffers for hot data resident in memory-mapped files. This scaling model can mean paying for capacity primarily to unlock the IOPS rates that latency-sensitive workloads require. Cost implications force a strict separation where only the most active indexes remain on high-performance tiers while older partitions migrate to cold storage.
- Calculate the required IOPS based on peak transaction volume rather than average load to prevent queueing during bursts.
- Provision the storage volume size specifically to meet the IOPS threshold set by the cloud provider's performance curve.
- Configure the database to place Write-Ahead Log files on the highest performance slice to minimize commit wait times.
Mission and Vision recommends isolating the WAL directory on the fastest available disk to reduce fsync latency penalties. Such a configuration ensures that the synchronous write path never contends with background checkpoint flushing or bulk loading operations. Increased operational complexity arises when managing multiple storage classes within a single cluster topology.
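Whether the provisioned tier is actually absorbing the flush traffic can be verified from inside the database; this sketch assumes PostgreSQL 16 or later (where `pg_stat_io` exists) with `track_io_timing = on`:

```sql
-- Cumulative write and fsync timing per backend type
SELECT backend_type, object, context,
       writes, write_time, fsyncs, fsync_time
FROM pg_stat_io
WHERE fsyncs > 0
ORDER BY fsync_time DESC;
```

A climbing `fsync_time` against a flat `fsyncs` count indicates the volume is delivering less than its advertised IOPS ceiling.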
Mitigating futex_wait Delays During Checkpoint Storms
Commit latency establishes a hard ceiling on how quickly even a single session can commit transactions in lightly loaded OLTP systems. Accelerating storage media alone fails to resolve scheduling bottlenecks created by aggressive checkpointing. The operating system scheduler often deprioritizes writer processes during I/O spikes, causing threads to sleep unexpectedly on mutex locks. Installing faster NVMe drives ignores this kernel-level contention, where CPU cycles are wasted waiting for lock acquisition rather than disk completion. High-concurrency environments require tuned `checkpoint_timeout` values to spread write volume evenly across time. Operators must balance durability guarantees against the risk of creating synchronization storms that freeze worker threads.
- Monitor pg_stat_activity for extended waits on `BfSync` or `CLogControlLock`.
- Adjust `checkpoint_completion_target` to smooth out background write pressure.
- Isolate checkpoint processes using CPU affinity masks to reduce context switch frequency.
P99 latencies spike regardless of underlying disk speed when OS scheduling is ignored. Production stability depends on managing thread coordination as rigorously as storage throughput, according to Mission and Vision analysis.
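A hedged starting point for that smoothing, using reload-time settings and the cumulative checkpoint counters (which live in `pg_stat_bgwriter` through PostgreSQL 16 and in `pg_stat_checkpointer` from 17):

```sql
-- Spread checkpoint writes across ~90% of each interval
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET checkpoint_timeout = '15min';
SELECT pg_reload_conf();

-- Frequent requested (non-timed) checkpoints signal undersized intervals
SELECT checkpoints_timed, checkpoints_req,
       checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;
```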
About
Alex Kumar, Senior Platform Engineer and Infrastructure Architect at Rabata.io, brings deep practical expertise to the discussion on PostgreSQL storage architectures. His daily work designing Kubernetes persistent storage solutions and optimizing disaster recovery strategies directly informs his analysis of why NVMe is critical for hot data paths while S3 excels elsewhere. At Rabata.io, a specialized provider of high-performance S3-compatible object storage, Alex engineers cost-effective infrastructure for AI/ML startups and enterprises, giving him firsthand insight into the balance between throughput, latency, and budget. Having previously served as an SRE for high-traffic SaaS platforms, he understands the operational pressures of managing database performance at scale. This article reflects his professional focus on eliminating vendor lock-in and maximizing efficiency through hybrid cloud strategies. By using Rabata.io's GDPR-compliant infrastructure, Alex demonstrates how modern teams can achieve superior mixed-operation speeds without compromising on cost governance or data sovereignty in today's complex multi-cloud environment.
Conclusion
Scaling PostgreSQL beyond moderate concurrency reveals that raw disk speed becomes irrelevant when kernel-level contention dominates. The data confirms that futex waits and involuntary context switches consume the vast majority of off-CPU time, creating a hard ceiling on throughput that faster NVMe drives cannot breach. As organizations pivot toward hybrid multi-cloud architectures in 2026, this synchronization overhead will compound across distributed nodes, turning minor latency spikes into systemic durability failures. Ignoring OS scheduler behavior while chasing storage IOPS is a strategic error that guarantees diminishing returns on infrastructure spend.
You must prioritize kernel-level tuning over hardware upgrades immediately if your write-heavy workloads exhibit P99 latency spikes despite having high-performance storage. Specifically, before Q2 2026, audit your `checkpoint_completion_target` settings and implement CPU affinity masks to isolate critical writer threads from background noise. Do not wait for a crisis; the window to optimize thread coordination before AI-driven load patterns stress your cluster closes rapidly.
Start this week by capturing a ten-minute `perf` profile during peak traffic to quantify exactly how much time your database spends sleeping on mutex locks versus performing actual I/O. This single metric will dictate whether your next budget request should target storage expansion or engineering time for configuration refinement.