Business backups by 3-2-1: what to actually test, and what even the big companies don't

The 3-2-1 rule plus a 2026 addendum, RTO/RPO without the fluff, a test cadence that catches what cron jobs miss, and the 5 patterns we see at every audit. For companies that run backups but have never restored one.

The first on-call night you remember forever is the one when the client calls at 23:47 asking "do you have the Tuesday backup?". You do. The .sql files sit in S3, the latest dump from 03:00. Except nobody has ever tried to bring one back up. An hour later you know the dump came from a replica that drifted out of sync on Saturday, and the database you're trying to restore never existed in production in that shape.

A backup you don't test doesn't exist. That sentence probably bores everyone in the industry by now, but 8 in 10 companies that come to us for an infrastructure audit have the same problem: backups run every night, restores have never run.

This piece isn't about "you should have backups." It's about what to do concretely so that on the night the phone rings, there's actually something to restore from.

The 3-2-1 rule, plus one condition from 2026

A classic from the mid-2000s, usually credited to photographer Peter Krogh, and still current:

  • 3 copies of data (original + 2 backups)
  • 2 different media (e.g. production disk + cloud object)
  • 1 copy off-site

In 2026 we add a leading zero: 0 copies that can be modified after they're written. Immutability (Object Lock in S3, retention policies with Bucket Lock in GCS) isn't a luxury today, it's a defense against ransomware. An attacker who gets into your AWS account will destroy the backups first, then encrypt production. If all your copies sit in the same account with full operational rights, the attacker gets the backups along with production.

In practice for a small company this means:

  • backup in an S3 bucket with Object Lock in Compliance mode, retention of at least 30 days
  • separate AWS Account for backups with cross-account replication
  • IAM credentials for the backup job that can only write: no read, no delete
  • once a quarter, the "total loss of the main account, how much do we recover from the separate account" test
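
A write-only credential can be sketched as an IAM policy document. The snippet below is a minimal illustration, not a reviewed production policy: the function name and the bucket name in the usage note are hypothetical, and it assumes the bucket applies a default Object Lock retention, so the uploader never needs permission to set retention itself.

```python
import json


def write_only_backup_policy(bucket: str) -> dict:
    """Build an IAM policy that lets a backup job upload objects and nothing else."""
    object_arn = f"arn:aws:s3:::{bucket}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowUploadOnly",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": object_arn,
            },
            {
                # An explicit Deny wins over any Allow attached elsewhere.
                "Sid": "DenyReadAndDelete",
                "Effect": "Deny",
                "Action": [
                    "s3:GetObject",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:BypassGovernanceRetention",
                ],
                "Resource": object_arn,
            },
        ],
    }


def policy_json(bucket: str) -> str:
    """Serialize the policy so it can be pasted into IAM or an IaC template."""
    return json.dumps(write_only_backup_policy(bucket), indent=2)
```

Calling `policy_json("example-backup-bucket")` yields a document you can attach to the backup job's role. The restore path then needs a different, separately guarded credential, which is the point.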

RTO and RPO: two numbers you don't know and you should

Most people say "we run backups every night" as if that were an answer. It isn't an answer. It's two questions.

RPO (Recovery Point Objective) is how much data you can afford to lose. A nightly backup means RPO = 24h in the worst case. With the dump taken at 03:00, a disk that dies at 18:00 costs you 15 hours of customer work. For an e-commerce shop in the middle of Black Friday that can be hundreds of thousands in transactions you won't recover, plus chargebacks, plus customers who never come back.
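
The arithmetic is simple enough to script. A minimal sketch, with hypothetical function names and an arbitrary example date:

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much work is lost when restoring from last_backup after a failure."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True if the lost window stays within the agreed RPO."""
    return data_loss_window(last_backup, failure) <= rpo


# Example values only: dump at 03:00, disk failure at 18:00 the same day.
last_dump = datetime(2026, 1, 13, 3, 0)
disk_failure = datetime(2026, 1, 13, 18, 0)
lost = data_loss_window(last_dump, disk_failure)  # 15 hours
```

Those 15 hours fit inside a 24h RPO and blow straight through a 1h one, which is why the number has to be agreed before the incident, not during it.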

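RTO (Recovery Time Objective) is how long you can be offline. A backup in S3 Glacier is cheap (cents per GB), but getting it back is slow: a standard retrieval typically takes 3–5 hours from Glacier Flexible Retrieval and up to 12 hours from Glacier Deep Archive, and only then does the actual restore start. If your RTO is 1h, Glacier won't help you, even if you have it.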

Before you pick technology, set these two numbers per system:

| System | RPO | RTO | What it means |
|---|---|---|---|
| Company website | 24h | 4h | Nightly backup to S3 Standard is enough |
| E-commerce shop | 5 min | 30 min | Synchronous replication + warm standby |
| Customer database (CRM) | 1h | 2h | Continuous backup + fast restore |
| Marketing files | 24h | 24h | Daily backup to S3 Glacier is fine |

Without these numbers you pick blind and almost always either overpay or underpay relative to real risk.
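
Once the targets are written down, checking a candidate strategy against them is mechanical. The sketch below uses the RPO/RTO targets from the table; the strategy's figures (24h worst-case loss, 2h restore time) are illustrative assumptions, not measurements, and all names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class Target:
    system: str
    rpo: timedelta
    rto: timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_loss: timedelta  # longest gap between two restorable points
    restore_time: timedelta     # measured in a real restore test, not guessed


def fits(strategy: Strategy, target: Target) -> bool:
    """A strategy fits only if it satisfies both numbers at once."""
    return (strategy.worst_case_loss <= target.rpo
            and strategy.restore_time <= target.rto)


def systems_covered(strategy: Strategy, targets: List[Target]) -> List[str]:
    return [t.system for t in targets if fits(strategy, t)]


TARGETS = [
    Target("Company website", timedelta(hours=24), timedelta(hours=4)),
    Target("E-commerce shop", timedelta(minutes=5), timedelta(minutes=30)),
    Target("Customer database (CRM)", timedelta(hours=1), timedelta(hours=2)),
    Target("Marketing files", timedelta(hours=24), timedelta(hours=24)),
]

# Assumed numbers for a nightly dump; replace with your own measurements.
NIGHTLY_DUMP = Strategy("nightly dump to S3 Standard",
                        timedelta(hours=24), timedelta(hours=2))
```

Under those assumed numbers, `systems_covered(NIGHTLY_DUMP, TARGETS)` returns only the company website and the marketing files. The shop and the CRM need something else.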

What to actually test and how often

A backup doesn't exist until someone has restored a working system from it in an environment that isn't production. Four tiers of tests, each at a different cadence:

Weekly: file-level restore. Pick a random file from 7 days ago, try to recover it. Takes 5 minutes, catches 80% of problems: broken permissions, wrong compression, missing metadata.
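
The "did it actually come back intact" part of the weekly check can be automated with a checksum comparison. A minimal sketch, assuming a SHA-256 of each file was recorded at backup time (that manifest and the function names are hypothetical):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large restores don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(expected_sha256: str, restored: Path) -> bool:
    """True only if the restored file exists and matches the recorded checksum."""
    return restored.is_file() and sha256_of(restored) == expected_sha256
```

A matching checksum shows the bytes survived. It says nothing about permissions or ownership, so those still deserve a manual look.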

Monthly: full database restore. Take a full database dump from the backup, load it into a separate environment, start the app, click through critical paths. Catches problems with database engine versions, missing extensions, schemas drifted from production.

Quarterly: tabletop exercise. The team sits down at a table, someone throws a scenario ("disk theft from the office, how do we recover?"), everyone says out loud what they'd do. No clicking on anything. Catches gaps in runbooks, in documentation, in team knowledge: someone is on vacation, someone doesn't know the procedure, only one person knows the vault password.

Annually: chaos exercise. Real disaster drill. The production server gets shut down (leadership notified, ops team not), time to full restore is measured. Unpopular. Very effective.

Most companies only do tier one. Usually automatically and without checking whether it worked.

5 mistakes we see most often at new clients

1. "We have backups in S3", with no word on when a restore was last tested. Common answer: never.

2. Everything in one cloud account. An attacker gets root or administrator credentials, deletes production, deletes backups, deletes logs. Three minutes of work, your company disappears. Backups must live in a separate account with separate billing and separate keys.

3. No alarm on "the backup didn't run". A cron script that died 3 months ago and nobody noticed. Discord, Slack, or Sentry notifications on every failed job are standard, not extra.
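
The failure mode here is silence, so the check has to run independently of the backup job and treat "no recent success" as the alarm condition. A minimal sketch with hypothetical names; how the last success time is obtained and where the message is sent are left to your setup:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(last_success: Optional[datetime],
                    now: datetime,
                    max_age: timedelta) -> bool:
    """A backup that never succeeded is as stale as one that stopped running."""
    if last_success is None:
        return True
    return now - last_success > max_age


def alert_text(job: str, last_success: Optional[datetime], now: datetime) -> str:
    """Build the message to post to the team's alert channel."""
    if last_success is None:
        return f"Backup job '{job}' has no recorded successful run."
    hours = (now - last_success).total_seconds() / 3600
    return f"Backup job '{job}' last succeeded {hours:.0f}h ago."
```

For a nightly job, a `max_age` somewhat above 24 hours (for example 26) leaves room for a slow run without hiding a missed one.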

4. Backup credentials in the same vault as production. The 1Password master credential sits next to the password of the database being backed up. An attacker who gets the vault gets everything, including the ability to delete or shorten retention on any backup that isn't locked in Compliance mode.

5. "An RDS snapshot is a backup." A snapshot is a backup only if it's copied to another account and another region and made immutable there (for snapshots, for example, with AWS Backup Vault Lock). An AWS snapshot in the same account isn't a backup, it's a convenient rollback.

Per-engine pitfalls

Postgres. pg_dump is a logical backup: slow on big databases, but portable across versions. pg_basebackup is physical: faster to restore, but the target must run the same major version. Continuous backup (WAL archiving, e.g. via pgBackRest or wal-g) is a must for RPO below 1h. Test the restore on a separate host, not on a replica.

MySQL/MariaDB. mysqldump for small databases; for larger ones, Percona XtraBackup (free) on MySQL and mariabackup on MariaDB. Remember --single-transaction for InnoDB so the dump doesn't block production. Keep the binary logs for PITR (point-in-time recovery).
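
For the small-database case, a wrapper that always includes the right flags removes one way to get it wrong. A minimal sketch; the function name is hypothetical, and it assumes credentials come from a MySQL option file rather than the command line:

```python
from typing import List


def mysqldump_command(database: str, result_file: str) -> List[str]:
    """Build a mysqldump invocation suitable for InnoDB tables."""
    return [
        "mysqldump",
        "--single-transaction",  # consistent InnoDB snapshot without locking tables
        "--quick",               # stream rows instead of buffering whole tables
        "--routines",            # include stored procedures and functions
        f"--result-file={result_file}",
        database,
    ]
```

The list can be passed straight to `subprocess.run(cmd, check=True)`, so a non-zero exit raises instead of leaving a truncated dump behind unnoticed.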

MongoDB. mongodump is the obvious choice, but on larger deployments it puts noticeable load on the cluster. Ops Manager / Cloud Manager take volume snapshots instead. For replica sets: back up from a secondary node, never from the primary.

SQLite. Yes, it needs backups too. Use .backup instead of copying the file, because a write during the copy can leave you with a corrupted backup. In a container, litestream replicates to S3 in near real time, which is far better than cron with .backup.
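
The same online backup mechanism is available from Python's standard `sqlite3` module through `Connection.backup()`. A minimal sketch with hypothetical function names, including an integrity check on the copy:

```python
import sqlite3
from pathlib import Path


def backup_sqlite(source: Path, destination: Path) -> None:
    """Copy a live SQLite database using the online backup API, not a file copy."""
    src = sqlite3.connect(source)
    dst = sqlite3.connect(destination)
    try:
        src.backup(dst)  # safe even while other connections write to source
    finally:
        dst.close()
        src.close()


def integrity_ok(db_path: Path) -> bool:
    """Run SQLite's built-in consistency check on the backup copy."""
    conn = sqlite3.connect(db_path)
    try:
        (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    finally:
        conn.close()
    return result == "ok"
```

Running `integrity_ok` on every fresh copy is cheap and turns a silent corruption into a failed job you can alert on.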

Checklist: 10 things to verify in your company

  1. You know the RPO and RTO for every critical system, written down somewhere, not just in your head.
  2. Backups live in a separate cloud account or physically outside the production location.
  3. Backups are immutable (Object Lock, WORM, immutable snapshot).
  4. At least one restore test ran in the last 30 days.
  5. Failed backups send an alert to Slack, Discord, or email, not just to a log file on disk.
  6. Backups are encrypted at rest with keys separate from production keys.
  7. Backup credentials and keys live outside the production vault.
  8. A restore runbook exists and was updated this year.
  9. Two people in the company know how to recover data (not one DevOps).
  10. Backups cover more than the database: user files, configs, secrets, certificates.

If you checked fewer than 6, you have a problem you don't see yet.

What we do for clients

As part of managed hosting and cloud setup we typically configure:

  • database backups with continuous archiving (RPO below 5 min)
  • cross-account replication to a separate AWS Account with a different owner
  • Object Lock with 30–90 day immutable retention
  • automated monthly restore tests (a script restores to staging, verifies, reports)
  • a disaster recovery runbook, updated on every major infrastructure change
  • 24/7 monitoring of backup jobs with alerts to Slack or Discord

If your company runs backups but isn't sure they actually work, write to us. We do a free backup audit: 1h meeting plus a report by the end of the week on where the risk sits and what can be improved without ripping the infrastructure out.


Frequently asked questions

How often should I test backup restores?

At minimum once a month for a full database restore in an environment other than production, and once a week for a quick file-level restore of a random file. Add a quarterly tabletop exercise with the team and an annual real disaster drill. A backup nobody has ever restored is unlikely to work on the night of the actual outage.

Is an AWS snapshot a backup?

A snapshot is a backup only if it is copied to a separate account and a different region and made immutable there (for example with AWS Backup Vault Lock). A snapshot in the same account is a convenient rollback, but does not protect against full account loss or against ransomware that has obtained administrator rights.

What are RPO and RTO and does my company need to know them?

RPO (Recovery Point Objective) is how much data the company can afford to lose in an outage — a nightly backup means RPO 24h. RTO (Recovery Time Objective) is how long the company can be offline. Without these two numbers you pick a backup solution blind and almost always either overpay or underpay relative to the real risk.

Is the 3-2-1 rule enough in 2026?

It is enough in theory, but in practice we add a leading zero today: zero copies that can be modified after they are written. Immutability defends against ransomware that targets backups before encrypting production. In AWS that is S3 Object Lock; in GCS, retention policies with Bucket Lock. Without it, three copies in one account are three copies of the same target.
