How to design network resilience without overbuying equipment.

Resilience is not a procurement decision. Buying two of everything is the most expensive way to build a fragile network — and one of the most common. Real resilience comes from understanding how links terminate, whether providers share the same physical path, whether failover has ever actually been tested, and whether the team knows what to do at 2 AM when the primary path drops.

Resilience is not bought — it is designed

The instinct, when a network has had an outage, is to buy more of what was running when it failed. Two firewalls instead of one. Two ISPs instead of one. A second core switch. A backup SAN.

None of those purchases is wrong on its own. They are just rarely sufficient, and frequently misdirected. A network with two of every device but a single power feed into the rack is not a resilient network. It is an expensive one that fails just as completely as the cheap version when that feed trips. Real resilience is an operational property of the whole system, not a hardware count.

Single points of failure people often miss

The interesting failures are rarely the ones the procurement budget anticipates. The list we end up walking through with customers tends to include:

These are not exotic problems. They are what we find when we do a current-state review of a network that on paper looks very redundant.
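One way to make a current-state review concrete is to treat the network as a graph and ask which single component, if removed, cuts users off from the internet. The sketch below is a minimal illustration: the topology and every node name in it are hypothetical, and a real review would also need to model power, DNS, and other dependencies that do not sit on the traffic path.

```python
from collections import deque

# Hypothetical topology. Every name here is an illustrative assumption.
# Each pair is a link between two components on the traffic path.
LINKS = [
    ("users", "switch-a"), ("users", "switch-b"),
    ("switch-a", "fw-a"), ("switch-b", "fw-b"),
    ("fw-a", "isp-1"), ("fw-b", "isp-2"),
    ("isp-1", "building-conduit"), ("isp-2", "building-conduit"),
    ("building-conduit", "internet"),
]


def reachable(links, src, dst, removed):
    """Breadth-first search from src to dst with one node taken out."""
    if removed in (src, dst):
        return False
    adjacency = {}
    for a, b in links:
        if removed in (a, b):
            continue  # links touching the removed node are gone too
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


def single_points_of_failure(links, src, dst):
    """Nodes whose loss alone disconnects src from dst."""
    nodes = {n for link in links for n in link} - {src, dst}
    return sorted(n for n in nodes if not reachable(links, src, dst, n))


print(single_points_of_failure(LINKS, "users", "internet"))
```

In this made-up example every device is duplicated, yet the shared building conduit is still reported as a single point of failure.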

Dual ISP is not always true diversity

Having two internet providers is the most commonly recommended resilience move. It is also the one most likely to disappoint when tested. The reason is that "two providers" and "two physical paths" are very different things.

Two ISPs may quietly share:

The question worth asking, when the second circuit is being installed, is the unglamorous one: if a backhoe cuts the cable in front of the building, does the secondary path stay up? If the honest answer is "probably not," the second circuit is not really a second path — it is an extra cost.

True diversity usually means combining technologies: a fibre primary with a 4G/5G or microwave secondary, or splitting between a wireline carrier and a satellite link for the most critical sites.
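The backhoe question can be asked of an inventory as well as of a provider. The sketch below compares two circuits attribute by attribute and reports what they have in common. The field names (`last_mile_owner`, `building_entry`, and so on) and all values are assumptions chosen for illustration; the real answers have to come from each provider's delivery details.

```python
# Hypothetical circuit records. Field names and values are assumptions.
PRIMARY = {
    "medium": "fibre",
    "last_mile_owner": "carrier-x",
    "building_entry": "north-duct",
    "exchange": "exchange-1",
}
SECOND_WIRELINE_ISP = {
    "medium": "fibre",
    "last_mile_owner": "carrier-x",   # resold access over the same last mile
    "building_entry": "north-duct",
    "exchange": "exchange-1",
}
CELLULAR_BACKUP = {
    "medium": "5g",
    "last_mile_owner": "mobile-operator-y",
    "building_entry": "rooftop-antenna",
    "exchange": "cell-site-7",
}


def shared_risks(circuit_a, circuit_b):
    """Return the attributes on which both circuits have the same value."""
    common_keys = circuit_a.keys() & circuit_b.keys()
    return {k: circuit_a[k] for k in sorted(common_keys)
            if circuit_a[k] == circuit_b[k]}


print(shared_risks(PRIMARY, SECOND_WIRELINE_ISP))
print(shared_risks(PRIMARY, CELLULAR_BACKUP))
```

With these assumed records, the second wireline ISP shares every attribute with the primary, while the cellular backup shares none. An empty result is not proof of diversity, only the absence of known overlap.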

A paired design holds up only when each layer has independent paths — including ISPs that don't share the same last mile.

Firewall HA: when it helps and when it does not

Active-passive HA on a pair of firewalls is genuinely useful for a narrow set of failures: a hardware fault on the primary, a planned firmware upgrade, a power supply failing in the chassis. The standby takes over, traffic continues, the on-call engineer sleeps through the event.

It does not help with the failures that cause most real outages:

The point is not that HA is wrong. The point is that two firewalls do not automatically mean twice the uptime. For some businesses, the right answer is one well-monitored, well-managed firewall plus a tested failover plan — cheaper to buy, cheaper to operate, and not significantly less resilient against the failures that actually happen.
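A little availability arithmetic shows why a second firewall does not double uptime. The sketch below uses the standard series and parallel formulas and assumes independent failures. Both input figures are invented for illustration: `FIREWALL` stands for one unit's hardware availability, and `SHARED` stands for everything an HA pair still has in common, such as configuration and the upstream link.

```python
HOURS_PER_YEAR = 8760


def series(*availabilities):
    """All components are required, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Any one component is enough, assuming independent failures."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


FIREWALL = 0.999  # assumed: hardware availability of one firewall
SHARED = 0.995    # assumed: config, upstream link, anything both units share

single = series(FIREWALL, SHARED)
ha_pair = series(parallel(FIREWALL, FIREWALL), SHARED)

print(f"single firewall: {downtime_hours(single):.1f} h/year")
print(f"HA pair:         {downtime_hours(ha_pair):.1f} h/year")
```

Under these assumptions the pair brings expected downtime from about 52.5 hours a year to about 43.8. The remaining hours all come from the shared term, which the second unit does nothing to improve.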

Backup connectivity is only as good as the last test

The most expensive resilience investment in many networks is the secondary link that has never actually carried real traffic. "We have a 4G failover" sounds reassuring until the day it needs to kick in: the SIM expired six months ago, the modem reboots without coming back, the routing protocol does not converge because the policy was wrong, and nobody noticed because nothing ever exercised the path.
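A simple guard against that scenario is to record when each backup path last carried real traffic and to flag anything stale. The sketch below is one hypothetical way to do it: the 90-day threshold, the record fields, and the path names are all assumptions, not a recommendation from any standard.

```python
from datetime import date, timedelta

# Assumed policy: every backup path is exercised at least once a quarter.
MAX_AGE = timedelta(days=90)

# Hypothetical test log. Names, dates, and fields are illustrative.
BACKUP_PATHS = [
    {"name": "hq-4g-failover", "last_tested": date(2024, 1, 10),
     "carried_traffic": True},
    {"name": "branch-4g-failover", "last_tested": None,
     "carried_traffic": False},
]


def stale_paths(paths, today, max_age=MAX_AGE):
    """Names of paths never tested, tested without traffic, or tested too long ago."""
    stale = []
    for path in paths:
        tested = path["last_tested"]
        if (tested is None
                or not path["carried_traffic"]
                or today - tested > max_age):
            stale.append(path["name"])
    return stale


print(stale_paths(BACKUP_PATHS, today=date(2024, 3, 1)))
```

Passing `today` explicitly keeps the check deterministic. In a scheduled job it would be `date.today()`.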

The discipline that turns backup connectivity into real resilience is unglamorous:

Virtualization as a shared-resource resilience strategy

Buying a second physical server for every workload is the resilience answer that does not survive a budget review. Virtualization is often the better one. Two physical hosts running a hypervisor (VMware, Proxmox, Hyper-V) with shared or replicated storage give you:

For an SMB with a handful of internal services (file share, line-of-business app, internal database, identity, monitoring), two virtualization hosts will usually beat six single-purpose physical servers on both resilience and cost, provided either host can carry the full load alone. The architectural rule of thumb: share the physical hardware deliberately and replicate at the layer that matters.
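A two-host design is only resilient if the surviving host can run everything after its partner fails. The sketch below checks that with simple sums of RAM and vCPUs. All sizes are invented for illustration, and the check deliberately ignores overcommit, storage, and licensing, which a real capacity plan would have to cover.

```python
# Hypothetical host and VM sizes. All numbers are assumptions.
HOSTS = {
    "host-a": {"ram_gb": 128, "vcpus": 32},
    "host-b": {"ram_gb": 128, "vcpus": 32},
}
VMS = [
    {"name": "file-share", "ram_gb": 16, "vcpus": 4},
    {"name": "lob-app", "ram_gb": 32, "vcpus": 8},
    {"name": "database", "ram_gb": 48, "vcpus": 8},
    {"name": "identity", "ram_gb": 8, "vcpus": 2},
    {"name": "monitoring", "ram_gb": 8, "vcpus": 2},
]


def survives_host_loss(hosts, vms, failed):
    """True if the hosts left after `failed` goes down can hold every VM.

    Compares totals only, which is exact when one host remains.
    """
    remaining = [spec for name, spec in hosts.items() if name != failed]
    for resource in ("ram_gb", "vcpus"):
        needed = sum(vm[resource] for vm in vms)
        available = sum(spec[resource] for spec in remaining)
        if needed > available:
            return False
    return True


for host in HOSTS:
    print(host, "can fail:", survives_host_loss(HOSTS, VMS, failed=host))
```

With these numbers the workloads need 112 GB and 24 vCPUs, so either host can fail. Shrinking one host to 96 GB would make the loss of the other one an outage.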

A well-designed network with one firewall can be more resilient than a poorly-designed one with two — because resilience is an operational property, not a hardware property.

What survives the outage matters more than what fails

The most useful mental model we use in design conversations: instead of asking "what do we duplicate?", ask "what services must keep working when the primary path fails?" The list is usually shorter than the procurement spec would suggest:

Designs that pass this test usually do not require buying everything twice. They require knowing where the dependencies actually live and arranging the architecture so the critical ones survive.
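The "what must keep working" question can be written down as data. In the sketch below each service lists its dependency groups, and a group is satisfied as long as any one member is still up. The services, components, and groupings are hypothetical examples, not a template for any particular business.

```python
# Hypothetical dependency map. Each service needs every group,
# and a group is satisfied if at least one of its members is up.
SERVICES = {
    "point-of-sale": [{"primary-fibre", "4g-backup"}, {"fw-a"}],
    "voip": [{"primary-fibre"}, {"fw-a"}],
    "file-share": [{"vm-host-a", "vm-host-b"}],
}


def surviving(services, failed):
    """Services that still have a live member in every dependency group."""
    return sorted(
        name for name, groups in services.items()
        if all(group - failed for group in groups)
    )


print(surviving(SERVICES, failed={"primary-fibre"}))
print(surviving(SERVICES, failed={"fw-a"}))
```

In this example, losing the primary fibre takes down only VoIP, while losing the single firewall takes down everything that leaves the building. Which of those outcomes is acceptable is a business decision, which is the point of asking the question in this form.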

A practical checklist before approving a resilient design

Before signing off on any architecture that is being sold as "resilient," we walk through the following list. Anything that cannot be answered with a confident yes is a design risk worth surfacing.

Closing

Network resilience is rarely about buying more. It is about understanding where the real risk lives, designing around it deliberately, and validating that the design actually holds. The right architecture for a particular business depends on what it can tolerate losing, what it can afford to operate, and what the team can run on the worst night of the year. A resilient network is one where the answer to those questions has been written down — and tested — before it mattered.
