
How to design network resilience without overbuying equipment.

Resilience is not a procurement decision. Buying two of everything is the most expensive way to build a fragile network — and one of the most common. Real resilience comes from understanding how links terminate, whether providers share the same physical path, whether failover has ever actually been tested, and whether the team knows what to do at 2 AM when the primary path drops.

In this article

Resilience is not bought — it is designed
Single points of failure people often miss
Dual ISP is not always true diversity
Firewall HA: when it helps and when it does not
Backup connectivity is only as good as the last test
Virtualization as a shared-resource resilience strategy
What survives the outage matters more than what fails
A practical checklist before approving a resilient design
Closing

Resilience is not bought — it is designed

The instinct, when a network has had an outage, is to buy more of what was running when it failed. Two firewalls instead of one. Two ISPs instead of one. A second core switch. A backup SAN.

None of those purchases are wrong on their own. They are just rarely sufficient, and frequently misdirected. A network with two of every device but a single power supply running the rack is not a resilient network — it is an expensive one that fails just as completely as the cheap version when the power supply trips. Real resilience is an operational property of the whole system, not a hardware count.

Single points of failure people often miss

The interesting failures are rarely the ones the procurement budget anticipates. The list we end up walking through with customers tends to include:

The single UPS feeding both "redundant" firewalls.
The single switch carrying both ISP uplinks because the patch panel ran out of ports.
The single DNS resolver everyone in the office (and every server) points at.
The single VPN concentrator that grants remote access to the "resilient" cloud environment.
The single admin who knows where the backup configurations actually live.
The single dependency on a SaaS identity provider, with no break-glass account anywhere.

These are not exotic problems. They are what we find when we do a current-state review of a network that on paper looks very redundant.
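One way to make that review concrete is to write down every component each "redundant" path touches and look for the overlap. The following is a minimal sketch, assuming a hand-maintained inventory; the path names and component names are hypothetical.

```python
from typing import Dict, List, Set


def shared_components(paths: Dict[str, List[str]]) -> Set[str]:
    """Return the components that appear on every path.

    Anything in this set takes all paths down at once, so it is a
    single point of failure no matter how many paths exist.
    """
    component_sets = [set(components) for components in paths.values()]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


# Hypothetical inventory: two "redundant" internet paths.
paths = {
    "primary": ["ups-1", "firewall-a", "switch-1", "isp-a"],
    "secondary": ["ups-1", "firewall-b", "switch-1", "isp-b"],
}

spofs = shared_components(paths)
print(sorted(spofs))  # ['switch-1', 'ups-1']
```

The check only finds what the inventory records, so it is worth listing power, patching and people alongside the network devices.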

Dual ISP is not always true diversity

Having two internet providers is the most commonly recommended resilience move. It is also the one most likely to disappoint when tested. The reason is that "two providers" and "two physical paths" are very different things.

Two ISPs may quietly share:

The same incoming copper or fibre into the building.
The same conduit and same entry point into the room.
The same last-mile carrier (because one provider is reselling the other's circuit).
The same neighborhood power feed.
The same regional fibre route between exchanges.

The question worth asking, when the second circuit is being installed, is the unglamorous one: if a backhoe cuts the cable in front of the building, does the secondary path stay up? If the honest answer is "probably not," the second circuit is not really a second path — it is an extra cost.

True diversity usually means combining technologies: a fibre primary with a 4G/5G or microwave secondary, or splitting between a wireline carrier and a satellite link for the most critical sites.

A paired design holds up only when each layer has independent paths — including ISPs that don't share the same last mile.
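The shared-fate question can be asked of any two circuits once their physical attributes are written down. This is a sketch under stated assumptions: the `Circuit` fields mirror the list of things two ISPs may share, and every value shown is a made-up example that would in practice come from the carriers and a site survey.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class Circuit:
    provider: str
    building_entry: str
    conduit: str
    last_mile_carrier: str
    power_feed: str
    regional_route: str


def shared_fate(a: Circuit, b: Circuit) -> List[str]:
    """List the physical attributes that two circuits have in common."""
    return [
        f.name
        for f in fields(Circuit)
        if f.name != "provider" and getattr(a, f.name) == getattr(b, f.name)
    ]


# Hypothetical circuits for illustration only.
fibre = Circuit("ISP A", "north wall", "conduit-1", "Carrier X", "grid-feed-1", "route-east")
resold = Circuit("ISP B", "north wall", "conduit-1", "Carrier X", "grid-feed-1", "route-east")
cellular = Circuit("Mobile C", "rooftop antenna", "none", "Mobile C", "grid-feed-1", "radio")

print(shared_fate(fibre, resold))    # every physical attribute is shared
print(shared_fate(fibre, cellular))  # ['power_feed']
```

Two provider names with five shared attributes is one path with two invoices. The fibre and cellular pairing still shares a power feed, which is the kind of residual risk worth writing down.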

Firewall HA: when it helps and when it does not

Active-passive HA on a pair of firewalls is genuinely useful for a narrow set of failures: a hardware fault on the primary, a planned firmware upgrade, a power supply failing in the chassis. The standby takes over, traffic continues, the on-call engineer sleeps through the event.

It does not help with the failures that cause most real outages:

A misconfiguration that gets synchronized to the standby in real time.
An upstream ISP outage that takes out both legs.
An ISP modem on one leg dying silently because nothing alerted on it.
A licence expiry that hits both units at the same time.
A subtle bug in a feature that triggers under specific traffic conditions on both peers.

The point is not that HA is wrong. The point is that two firewalls do not automatically mean twice the uptime. For some businesses, the right answer is one well-monitored, well-managed firewall plus a tested failover plan — cheaper to buy, cheaper to operate, and not significantly less resilient against the failures that actually happen.
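The gap between "two firewalls" and "twice the uptime" can be put into rough numbers. The sketch below assumes failures split into independent faults (hardware, power supply) and common-cause faults (a synchronized misconfiguration, a licence expiry, a shared bug) that hit both peers at once. The availability figures are illustrative placeholders, not measurements.

```python
def pair_availability(unit: float, common_cause: float) -> float:
    """Availability of an active-passive pair.

    unit:         availability of one firewall against independent faults
    common_cause: availability against faults that hit both peers together
    """
    # The pair is down for independent reasons only if both units are down.
    independent = 1 - (1 - unit) ** 2
    # Common-cause faults are not helped by the second unit at all.
    return common_cause * independent


# Illustrative assumptions, not vendor or field data.
UNIT = 0.999          # one firewall, independent hardware faults
COMMON_CAUSE = 0.998  # config errors, licence expiry, shared bugs

single = UNIT * COMMON_CAUSE
pair = pair_availability(UNIT, COMMON_CAUSE)

print(f"single firewall: {single:.6f}")
print(f"HA pair:         {pair:.6f}")
print(f"ceiling set by common-cause faults: {COMMON_CAUSE:.6f}")
```

Under these assumptions the pair improves on the single unit, but it can never exceed the common-cause figure. If that figure dominates, spending on change control and monitoring moves the number more than a second chassis does.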

Backup connectivity is only as good as the last test

The most expensive resilience investment in many networks is the secondary link that has never actually carried real traffic. "We have a 4G failover" sounds reassuring until the day it needs to kick in: the SIM expired six months ago, the modem reboots without coming back, the routing protocol does not converge because the policy was wrong, and nobody noticed because nothing ever exercised the path.

The discipline that turns backup connectivity into real resilience is unglamorous:

A scheduled failover drill at least quarterly — ideally during business hours, with the team that would be paged.
Synthetic monitoring that sends real traffic over the secondary path on a known interval.
Alerting on the secondary going dark, not only on the primary. A backup link that is silently down is worse than no backup at all.
A written record of the last successful failover. If you cannot point at a date, you do not have a tested backup.
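Two of these habits, alerting on the secondary and keeping a dated record of the last drill, are easy to automate. The following is a minimal sketch, assuming probe results are collected elsewhere and passed in; the `LinkStatus` type, the link names, and the 92-day threshold for "quarterly" are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# "At least quarterly", expressed as a maximum age (assumed threshold).
MAX_DRILL_AGE = timedelta(days=92)


@dataclass
class LinkStatus:
    name: str
    role: str       # "primary" or "secondary"
    probe_ok: bool  # result of the latest synthetic probe over this link


def backup_findings(
    links: List[LinkStatus], last_drill: Optional[date], today: date
) -> List[str]:
    """Return human-readable findings; an empty list means nothing to flag."""
    findings = []
    # Every link is checked, so a dark secondary is reported
    # even while the primary is healthy.
    for link in links:
        if not link.probe_ok:
            findings.append(f"{link.role} link {link.name} is failing its probe")
    if last_drill is None:
        findings.append("no recorded failover drill")
    elif today - last_drill > MAX_DRILL_AGE:
        age_days = (today - last_drill).days
        findings.append(f"last failover drill was {age_days} days ago")
    return findings


links = [
    LinkStatus("fibre", "primary", True),
    LinkStatus("lte", "secondary", False),
]
findings = backup_findings(links, last_drill=date(2024, 1, 10), today=date(2024, 6, 1))
for finding in findings:
    print(finding)
```

In this example the primary is healthy, yet two findings are raised: the secondary is dark and the last drill is older than a quarter.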

Virtualization as a shared-resource resilience strategy

Buying a second physical server for every workload is the resilience answer that does not survive a budget review. Virtualization is often the better one. Two physical hosts running a hypervisor (VMware, Proxmox, Hyper-V) with shared or replicated storage give you:

Hardware redundancy at the host level — a failed motherboard or PSU no longer takes down a service.
Workload migration during planned maintenance — patch one host while the other runs the workloads.
Hypervisor-level HA — VMs restart on the surviving host within minutes of a failure.
Resource consolidation — fewer physical boxes, less power, fewer things to fail.

For an SMB with a handful of internal services (file share, line-of-business app, internal database, identity, monitoring), two virtualization hosts will usually beat six single-purpose physical servers on both resilience and cost, provided the storage they share is not itself a single point of failure. The architectural rule of thumb: share the physical hardware deliberately and replicate at the layer that matters.
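Hypervisor-level HA only works if the surviving host has room for everything. A quick capacity check, sketched here on memory alone, catches the common case of a pair that is redundant on paper but oversubscribed in practice. The host sizes and VM names are hypothetical, and the check assumes RAM is the binding constraint.

```python
from typing import Dict


def survives_host_loss(
    host_ram_gb: Dict[str, int], vm_ram_gb: Dict[str, int]
) -> Dict[str, bool]:
    """For each host, report whether the remaining hosts can hold every VM if it fails."""
    demand = sum(vm_ram_gb.values())
    total = sum(host_ram_gb.values())
    return {
        host: (total - ram) >= demand
        for host, ram in host_ram_gb.items()
    }


# Hypothetical two-host cluster and workload list.
hosts = {"host-a": 128, "host-b": 128}
vms = {"file-share": 16, "lob-app": 32, "database": 48, "identity": 8, "monitoring": 8}

result = survives_host_loss(hosts, vms)
print(result)  # {'host-a': True, 'host-b': True}

# The same workloads on an uneven pair: losing the larger host is not survivable.
print(survives_host_loss({"host-a": 128, "host-b": 96}, vms))
```

With two hosts the sum comparison is exact, because a single survivor has to hold everything. With three or more hosts it is only a necessary condition, since an individual VM cannot be split across hosts.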

A well-designed network with one firewall can be more resilient than a poorly-designed one with two — because resilience is an operational property, not a hardware property.

What survives the outage matters more than what fails

The most useful mental model we use in design conversations: instead of asking "what do we duplicate?", ask "what services must keep working when the primary path fails?" The list is usually shorter than the procurement spec would suggest:

Can remote staff still reach the things they need to do their job?
Does DNS resolve? (DNS failures are a frequent cause of outages on networks that otherwise look redundant.)
Do cloud applications still authenticate against the identity provider?
Does the phone system survive — and is the on-call number on a different network than the one that just died?
Can the support team SSH into the device that broke, through a path that does not depend on the broken thing?

Designs that pass this test usually do not require buying everything twice. They require knowing where the dependencies actually live and arranging the architecture so the critical ones survive.
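The same question can be run as a small simulation: record what each critical service depends on, mark a component as failed, and see what is still standing. This is a sketch under stated assumptions; the dependency map is hypothetical, and every dependency is treated as a hard requirement with no fallback.

```python
from typing import Dict, FrozenSet, List, Set


def surviving_services(
    depends_on: Dict[str, List[str]], critical: List[str], failed: Set[str]
) -> Dict[str, bool]:
    """Report which critical services still work after the given components fail."""

    def is_up(node: str, seen: FrozenSet[str]) -> bool:
        if node in failed:
            return False
        if node in seen:  # guard against circular dependencies
            return True
        seen = seen | {node}
        # A node is up only if everything it depends on is up.
        return all(is_up(dep, seen) for dep in depends_on.get(node, []))

    return {service: is_up(service, frozenset()) for service in critical}


# Hypothetical dependency map for a small office.
depends_on = {
    "remote-access": ["vpn-concentrator"],
    "vpn-concentrator": ["primary-isp"],
    "cloud-login": ["identity-provider", "dns"],
    "dns": ["office-resolver"],
    "office-resolver": ["primary-isp"],
    "phones": ["lte-link"],
    "oob-ssh": ["lte-link"],
}
critical = ["remote-access", "cloud-login", "phones", "oob-ssh"]

result = surviving_services(depends_on, critical, failed={"primary-isp"})
print(result)
```

In this example the phones and out-of-band SSH survive a primary ISP failure because they ride the LTE link, while remote access and cloud login do not. The fix suggested by the output is a second resolver and a second way in, not a second copy of every device.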

A practical checklist before approving a resilient design

Before signing off on any architecture that is being sold as "resilient," we walk through the following list. Anything that cannot be answered with a confident yes is a design risk worth surfacing.

Every "two of X" in the diagram has been validated as two physically independent paths, not two boxes on the same circuit.
The proposed failover has been tested at least once end-to-end, on the actual equipment, with the actual configuration.
The team that will operate the network has a written runbook for the most likely outage scenarios.
The secondary path is monitored independently. An alert fires when it goes dark — not only when the primary does.
Single points of failure outside the network (DNS, identity, the cloud entry point, the SaaS dashboard the team relies on) are documented and accepted as risks, or mitigated.
The cost of the architecture is justified by the cost of the outage it prevents — not the marketing definition of "five nines."
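The last item is easier to argue with numbers on the table. An availability target translates directly into permitted downtime per year, and that downtime can be priced. The conversion below is plain arithmetic; the `pays_for_itself` helper and its inputs are hypothetical and stand in for whatever outage cost the business actually estimates.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Permitted downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")


def pays_for_itself(
    annual_cost: float, outage_cost_per_hour: float, outage_hours_avoided: float
) -> bool:
    """True if the outage cost avoided per year covers the architecture's annual cost."""
    return outage_cost_per_hour * outage_hours_avoided >= annual_cost


# Hypothetical figures: a 20,000/year design that avoids about 8 hours of outage.
print(pays_for_itself(20_000, outage_cost_per_hour=1_500, outage_hours_avoided=8))
print(pays_for_itself(20_000, outage_cost_per_hour=4_000, outage_hours_avoided=8))
```

"Five nines" allows roughly 5.3 minutes of downtime a year. Whether that target is worth paying for depends on the second function, not the first.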

Closing

Network resilience is rarely about buying more. It is about understanding where the real risk lives, designing around it deliberately, and validating that the design actually holds. The right architecture for a particular business depends on what it can tolerate losing, what it can afford to operate, and what the team can run on the worst night of the year. A resilient network is one where the answer to those questions has been written down — and tested — before it mattered.
