
PgBackRest Continuity: When Incremental Backups Break Your Recovery Point Objective
Key Takeaways
PgBackRest incremental backups, while storage-efficient, create complex recovery chains that are brittle. A single point of failure in the chain or WAL archiving can result in extended downtime and potential data loss, directly impacting RPO and RTO.
- Incremental backups in PgBackRest rely on WAL segments and base backups.
- Chain corruption (base backup + incremental + WAL) leads to failed recovery.
- RPO is bounded by the last WAL segment that actually reached the repository, so it is directly tied to WAL archiving reliability.
- Recovery Time Objective (RTO) can be severely impacted by the need to stitch together many incremental backups.
- Misconfiguration of retention policies or WAL archiving can lead to data loss.
PgBackRest Continuity: When archive-push-queue-max Trashes Your Recovery Point Objective
The recent announcement of sustained funding for PgBackRest, heralded by a coalition of industry titans, promises “long-term sustainability” and “reliable disaster recovery.” While any infusion of capital into a critical open-source project warrants a nod, such narratives frequently obscure the gritty details: the precise architectural guarantees and, more crucially, the explicit failure modes that can shatter a production system’s recovery point objective (RPO). As a systems architect whose daily grind involves optimizing C implementations at the byte level, I care about a different kind of “continuity”: not funding rounds, but the unbroken chain of WAL segments.
WAL Archiving: The Asynchronous Pitfall
PgBackRest, having shed its Perl origins for the deterministic performance of C in 2019, offers block-level incremental backups. The feature is opt-in: it is enabled with the repo-block option, which in turn requires repo-bundle. This isn’t mere file-level delta tracking. Instead, it dissects files into variable-sized blocks (8KiB to 88KiB, adapting to file size and age) and tracks only the changed ones. These changed blocks are then aggregated into “super blocks” (256KiB to 1MiB) to maximize compression efficiency and I/O throughput. A per-file block map with checksums underpins this approach, so that changed blocks can be identified and restored blocks verified.
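To make the idea of a per-file checksum map concrete, here is a minimal sketch of changed-block detection. It is not pgBackRest's on-disk format or algorithm: the fixed 8KiB block size, the SHA-1 digest, and the function names `block_map` and `changed_blocks` are all assumptions chosen for illustration.

```python
import hashlib

# Illustrative fixed block size. pgBackRest chooses block sizes per file,
# so this constant is an assumption made only to keep the sketch small.
BLOCK_SIZE = 8 * 1024


def block_map(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one checksum per fixed-size block of data."""
    return [
        hashlib.sha1(data[off:off + block_size]).hexdigest()
        for off in range(0, len(data), block_size)
    ]


def changed_blocks(old_map: list[str], new_map: list[str]) -> list[int]:
    """Indexes of blocks in new_map that differ from, or are absent in, old_map."""
    return [
        i for i, digest in enumerate(new_map)
        if i >= len(old_map) or old_map[i] != digest
    ]


if __name__ == "__main__":
    before = bytes(4 * BLOCK_SIZE)          # four zero-filled blocks
    after = bytearray(before)
    after[2 * BLOCK_SIZE + 10] = 0xFF       # touch one byte inside block 2
    print(changed_blocks(block_map(before), block_map(bytes(after))))
```

Only the blocks returned by `changed_blocks` would need to be stored in the next incremental; everything else can be referenced from an earlier backup, which is exactly what makes the chain a dependency.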
The critical mechanism for replication and point-in-time recovery is PostgreSQL’s Write-Ahead Log (WAL). PostgreSQL hands each completed WAL segment (16MB by default) to pgBackRest’s dedicated archive-push command through its archive_command setting, and archive-push transfers the segment to the repository. It employs a custom protocol designed for high-throughput, parallel operation. A backup is only considered transactionally complete, and thus the RPO satisfied up to that point, when all the WAL segments it needs have reliably reached the repository.
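Herein lies the rub, particularly with asynchronous archiving (archive-async). In this mode a background pgBackRest process pushes WAL segments to the repository in parallel, ahead of PostgreSQL’s individual requests, and records a small acknowledgement file for each segment in the spool-path. When PostgreSQL later invokes archive-push for a segment, the foreground process only has to find that acknowledgement, so it can answer quickly. This helps archiving keep pace with WAL generation and keeps the PostgreSQL WAL directory from filling up and halting the database, a critical survival mechanism. However, when the repository is slow or unreachable, WAL still accumulates, and this introduces a subtle yet catastrophic dependency on queue depth.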
The culprit is the archive-push-queue-max configuration parameter. This setting, which is unset by default, defines the maximum size (a byte size, not a segment count) of PostgreSQL’s archive queue: the WAL in pg_wal that is still waiting to be archived. When this limit is reached and WAL generation continues unabated, pgBackRest faces a hard choice: either allow PostgreSQL to fill its pg_wal directory and halt, or sacrifice the continuity of the archive. The implemented behavior, chosen to keep the database running, is to notify PostgreSQL that the WAL segment was successfully archived and then DROP it. The only trace is a warning in the PostgreSQL log, and in asynchronous mode the entire queue is dropped rather than just the segment that crossed the limit.
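One way to watch how close a system is to that limit is to count the `.ready` markers PostgreSQL leaves in `pg_wal/archive_status` for segments that have not yet been archived. The sketch below is a rough estimate under stated assumptions: the default 16MiB segment size, a hypothetical warning ratio of 50%, and the function names `archive_queue_bytes` and `queue_state`, none of which come from pgBackRest itself.

```python
import re
from pathlib import Path

# PostgreSQL default segment size; adjust if the cluster was created
# with a different --wal-segsize.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024

# Full WAL segments have 24 hex characters; this skips .history,
# .backup and .partial markers, which are small or not full segments.
_READY_SEGMENT = re.compile(r"^[0-9A-F]{24}\.ready$")


def archive_queue_bytes(pg_wal: Path, segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Estimate the bytes of WAL still waiting for archive_command."""
    status_dir = pg_wal / "archive_status"
    ready = [p for p in status_dir.glob("*.ready") if _READY_SEGMENT.match(p.name)]
    return len(ready) * segment_bytes


def queue_state(queue_bytes: int, queue_max_bytes: int, warn_ratio: float = 0.5) -> str:
    """Classify the queue relative to the configured archive-push-queue-max."""
    if queue_bytes >= queue_max_bytes:
        return "critical"   # at or past the point where WAL would be dropped
    if queue_bytes >= warn_ratio * queue_max_bytes:
        return "warning"    # hypothetical early-warning threshold
    return "ok"
```

A check like this would typically run from the same external scheduler as the backups, with `queue_max_bytes` set to whatever value archive-push-queue-max has in your configuration. Alerting at a fraction of the limit gives an operator time to act before any WAL is discarded.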
This is not a bug; it is a deliberate, if devastating, design choice. The consequence is stark: an explicit gap is created in the WAL stream. Any point-in-time recovery that starts from a backup taken before the gap and targets a point beyond the dropped segment will fail. The effective RPO is retroactively pushed back to the last archived WAL segment before the gap. Regaining full restore capability requires taking a new backup after the gap. This fundamental trade-off, sacrificing guaranteed RPO for database availability under load, is a critical detail often lost in the broad strokes of project funding announcements.
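A gap of this kind can be detected from segment names alone, because WAL file names encode a timeline and a sequential position. The following sketch assumes the default 16MiB segment size (256 segments per log file number) and takes bare 24-character segment names as input, with any repository-specific suffixes already stripped by the caller. The function names are illustrative.

```python
import re

# With 16MiB segments there are 256 segments per 4GiB "log" number.
SEGMENTS_PER_LOG = 0x100000000 // (16 * 1024 * 1024)
_WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name: str) -> tuple[int, int]:
    """Split a WAL segment name into (timeline, linear segment number)."""
    if not _WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def segment_name(timeline: int, number: int) -> str:
    """Inverse of segment_number for a given timeline."""
    log, seg = divmod(number, SEGMENTS_PER_LOG)
    return f"{timeline:08X}{log:08X}{seg:08X}"


def find_gaps(names: list[str]) -> list[str]:
    """Missing segments between the lowest and highest segment of each timeline."""
    by_timeline: dict[int, set[int]] = {}
    for name in names:
        timeline, number = segment_number(name)
        by_timeline.setdefault(timeline, set()).add(number)
    missing = []
    for timeline in sorted(by_timeline):
        present = by_timeline[timeline]
        for number in range(min(present), max(present) + 1):
            if number not in present:
                missing.append(segment_name(timeline, number))
    return missing
```

Any non-empty result on the current timeline means that recovery from an older backup cannot proceed past the first missing segment, which is the retroactive RPO regression described above.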
The Illusion of Verification
While PgBackRest strives for data integrity, its verification mechanisms can mask underlying corruption until it’s too late. The system does checksum every file in a backup, and these checksums are re-validated during a verify operation or a restore. It also checks PostgreSQL page-level checksums, provided they are enabled in the database configuration.
However, the crucial point is that checksum failures detected during the backup process do not automatically abort the backup. Instead, PgBackRest logs warnings. This means a backup operation can complete with a status of “success,” despite containing pages that have failed checksum verification. The integrity issue is then deferred, potentially to a future, manually initiated verify command, or, in the worst-case scenario, discovered only during a critical restore operation when there is no time for remediation. This selective logging of critical failures is a common pattern in systems prioritizing throughput, but it places a heavier burden on the operator to actively monitor for these specific warnings.
Consider the PostgreSQL settings involved and how pgBackRest interacts with them. Note that page checksums are not switched on in postgresql.conf: they are enabled when the cluster is created (initdb --data-checksums) or, from PostgreSQL 12 onward, offline on a stopped cluster with pg_checksums --enable. The data_checksums parameter is read-only and merely reports the current state.
# postgresql.conf
wal_level = replica
fsync = on
synchronous_commit = on
# data_checksums cannot be set here; check it with: SHOW data_checksums;
When data checksums are enabled on the cluster, pgBackRest validates them as it copies each page. However, even if a page fails this check during the backup, the backup continues and only a warning is logged. The absence of an explicit pgBackRest configuration option to fail the backup on page checksum errors means that manual intervention or vigilant log scraping becomes essential for detecting these integrity compromises early. The absence of such a fail-fast mechanism is a direct architectural choice that can obscure data corruption.
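A log scrape of this kind can be as small as the sketch below. The regular expression is an assumption about how the warning is worded, not a quotation of pgBackRest's output, so it should be adjusted against real log lines from your version. The sample lines in the usage section are likewise hypothetical.

```python
import re
from typing import Iterable

# ASSUMPTION: the warning line contains "WARN" and the phrase
# "invalid page checksum". Verify against your own pgBackRest logs.
PAGE_CHECKSUM_WARNING = re.compile(r"WARN.*invalid page checksum", re.IGNORECASE)


def page_checksum_warnings(log_lines: Iterable[str]) -> list[str]:
    """Return every log line that matches the page checksum warning pattern."""
    return [line.rstrip("\n") for line in log_lines if PAGE_CHECKSUM_WARNING.search(line)]


def backup_is_clean(log_lines: Iterable[str]) -> bool:
    """True only if no page checksum warning appears in the log."""
    return not page_checksum_warnings(log_lines)


if __name__ == "__main__":
    # Hypothetical lines for illustration, not verbatim pgBackRest output.
    sample = [
        "INFO: backup command begin\n",
        "WARN: invalid page checksum found in file base/1/1234\n",
        "INFO: backup command end: completed successfully\n",
    ]
    print(backup_is_clean(sample))
    print(page_checksum_warnings(sample))
```

Wiring `backup_is_clean` into the job that launches the backup turns a "successful" backup with bad pages into a visible failure, which is the fail-fast behavior the tool itself does not provide.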
Operational Dependencies and Interdependencies
PgBackRest’s operational model carries further complexities. It lacks an integrated scheduler, necessitating reliance on external tools such as cron. This introduces an additional layer of operational overhead and potential failure points – a misconfigured cron job is as effective at preventing backups as a kernel panic.
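Because a silent scheduler failure looks exactly like "no news", a staleness check that runs independently of the backup job is a useful guard. The sketch below assumes you can obtain the stop time of the most recent backup as epoch seconds (for example from the JSON output of the info command, if your version exposes it that way). The function name and the threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_stop_epoch: int,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True if the most recent backup finished longer ago than max_age."""
    current = now if now is not None else datetime.now(timezone.utc)
    last_stop = datetime.fromtimestamp(last_stop_epoch, tz=timezone.utc)
    return current - last_stop > max_age


if __name__ == "__main__":
    finished = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
    check_time = datetime(2024, 1, 2, tzinfo=timezone.utc)
    # A daily schedule with two hours of slack: 24h old is still acceptable.
    print(backup_is_stale(finished, timedelta(hours=26), now=check_time))
```

The check should run from a different host or scheduler than the backups themselves; otherwise the same misconfiguration can disable both.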
Furthermore, the spool-path directory, which holds the acknowledgement files for asynchronous archive-push and the WAL queue for asynchronous archive-get, is intended to reside on a local, POSIX-compatible filesystem. The pgBackRest documentation explicitly steers users away from remote file systems such as NFS or CIFS. Deviating from this recommendation can lead to unpredictable performance degradation and, more critically, reliability issues in archiving that directly impact the integrity of the backup chain.
The fundamental dependency chain of incremental and differential backups also presents a significant RPO risk. A differential backup depends on the most recent full backup, while an incremental backup depends on the most recent backup of any type, which may itself be another incremental. If any backup in this lineage is corrupted or lost, every backup that depends on it becomes effectively useless. The last valid independent full backup then becomes the sole recourse, dramatically pushing back the RPO. This cascade of failures underscores the importance of robust verification and proactive monitoring of the entire backup chain, not just the most recent operation.
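The blast radius of a single lost backup can be computed from the "prior" relationship between backups. The sketch below is a conservative approximation under stated assumptions: each backup is represented only by a label and the label of the backup it was based on, the labels are made up for illustration, and `unusable_after_loss` is a hypothetical helper rather than anything pgBackRest provides.

```python
from typing import Optional


def unusable_after_loss(lost: str, prior: dict[str, Optional[str]]) -> set[str]:
    """Labels that can no longer be restored once the backup `lost` is gone.

    `prior` maps each backup label to the label it was based on
    (None for a full backup).
    """
    unusable = {lost}
    changed = True
    while changed:                      # repeat until no new dependents appear
        changed = False
        for label, parent in prior.items():
            if parent in unusable and label not in unusable:
                unusable.add(label)
                changed = True
    return unusable


if __name__ == "__main__":
    # Hypothetical chain: two incrementals and one differential on full-1.
    chain = {
        "full-1": None,
        "incr-1": "full-1",
        "incr-2": "incr-1",
        "diff-1": "full-1",
        "full-2": None,
    }
    print(sorted(unusable_after_loss("incr-1", chain)))
    print(sorted(unusable_after_loss("full-1", chain)))
```

Losing one incremental in the middle takes out everything stacked on top of it, while losing the full backup takes out the whole lineage and leaves only the next independent full backup.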
Opinionated Verdict
The recent funding for PgBackRest is a positive development for its long-term viability. However, for the engineer on the ground responsible for RPO, the project’s sustainability is secondary to its operational guarantees. The explicit RPO compromise baked into the archive-push-queue-max mechanism is not a theoretical concern; wherever the option is set, it is a disaster waiting for a sustained high-transaction load. Architects must treat this DROP WAL behavior as a critical failure mode. Implementing aggressive monitoring for WAL archiving delays and queue depth saturation is not optional, and that monitoring must go beyond the cursory pgBackRest info command, which may not always perform deep archive consistency checks for performance reasons. Your RPO is only as resilient as your guaranteed WAL stream. A system that keeps running by discarding critical data segments with nothing louder than a log warning, even to prevent an immediate halt, has fundamentally failed to meet the stringent requirements of disaster recovery where transactional consistency is paramount.




