SPOF — The Silent Killer of Cloud Infrastructure

Ido Frizler
5 min read · Dec 28, 2021

Today’s world relies heavily on the Internet; that’s obvious. What has possibly gone somewhat unnoticed is that it is also heavily reliant on cloud infrastructure, which accounts for about 95% of Internet traffic today.

That reliance is in every aspect of our lives, from social media and news, to critical infrastructure, 911 dispatch systems, aerial navigation, medical infrastructure, and more.

With that in mind, I want to give you a peek into the world of cloud infrastructure: how companies make it so resilient, and why you still end up hearing, every now and then, about some giant suffering a massive outage that blacks out portions of the Internet for hours.

To understand the complexity of building and managing cloud infrastructure, we should first recall what the cloud is all about: customers delegating everything to the cloud vendor. So, setting up a cloud environment means actually building it physically. I’m talking about land, warehouses, electricity, HVAC, water, as well as cables, servers, disks, racks, switches, routers and more. And this is without even mentioning any of the software running on top of it all, connecting devices and users to the platform and to one another.

Some of these facilities are huge in size and energy consumption. Now, imagine that each respectable cloud vendor builds such datacenters everywhere it can, with the goal of being no more than a few milliseconds away from all of the world’s traffic and data. In numbers, we’re talking about hundreds of datacenters for each of the big names. Now you start to understand why their valuations are so high. Managing this for the world is priceless.

At this scale, the challenges these vendors cope with are not the same ones other software companies deal with. Mostly, because it’s not only about writing quality code that has no bugs; it’s about building a system that is resilient enough to withstand the failures that will absolutely happen. Whether in code or in the physical world, within the datacenters or between them, due to a faulty action, a malicious one or an act of God, cloud infrastructure teams build their code with that state of mind.

A beautiful quote I love in this regard: “At the scale we are operating at, the improbable happens every day, and the impossible happens every week.”

This is a good time to bring up SPOF (Single Point of Failure). These are the components in your system that, when they fail, have a huge blast radius of damage, because they sit on some critical path. It could be anything from an authentication service that is not load-balanced, to your entire system being hosted on a single physical rack that suffers a power outage.
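
To make the idea concrete, here is a toy Python sketch (the service names and the topology are entirely made up): given a service dependency graph, it flags every component whose failure cuts the request path from clients to the backend, which is exactly what a SPOF is.

```python
# A toy SPOF finder: flag every component whose failure disconnects
# clients from the backend. All names and edges are hypothetical.

from collections import deque

# edges: service -> services it routes requests to
GRAPH = {
    "client": ["lb"],
    "lb":     ["web-1", "web-2"],
    "web-1":  ["auth"],   # both web servers depend on a
    "web-2":  ["auth"],   # single, non-replicated auth service
    "auth":   ["db"],
    "db":     [],
}

def reachable(graph, src, dst, down=None):
    """BFS from src to dst, pretending the 'down' node has failed."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph.get(node, []):
            if nxt != down and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def find_spofs(graph, src="client", dst="db"):
    """A node is a SPOF if taking it down breaks the src->dst path."""
    return [n for n in graph
            if n not in (src, dst) and not reachable(graph, src, dst, down=n)]

print(find_spofs(GRAPH))  # prints ['lb', 'auth']: each sits alone on the critical path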

So, what do cloud infrastructure teams do to make sure their platform is resilient? The name of the game is finding and eliminating every possible SPOF in your system.

In the big league, you won’t believe how far that goes. Everything can be a SPOF:

  • Logical components: the usual suspects are software components that may suffer from scaling issues. Each such component needs a many-to-many relationship with the components it communicates with, so that if one instance goes bad, the entire pipeline keeps working (see the failover sketch right after this list).
  • Geography: building services in a multi-region fashion, so that if a region goes down, the application keeps functioning. Even within a single region, cloud vendors build physically separate availability zones, so that your virtual machines are guaranteed to be distributed across them if one of the facilities experiences any issues.
  • Networking: cloud networking has several layers of switches, routers and load balancers. So, what happens when a device malfunctions? If it’s a top-level router, you can’t allow it to drop the traffic of an entire server rack. So, each server, each rack and each gateway are load-balanced across multiple network devices, so that if one goes down, everything remains (almost) the same.
  • Storage: not unique to the world of cloud (you have probably heard of RAID), but every piece of storage is replicated with minimal latency, to endure a sporadic meltdown of any physical storage device.
  • Power/cable: each physical facility has an array of generators and failsafe mechanisms to keep the location up for a meaningful period in case of a natural disaster or any infrastructure failure (e.g., a regional power outage).
  • Communication: what happens when you run the infrastructure for a world-leading communications platform (Zoom, Teams, WebEx, Google Meet, …), and the system has a major outage? How do you communicate with your team to quickly resolve the issue? The answer is, you make sure you have a backup system (probably one of your competitors’) that does not rely on your infrastructure, and you test it once in a while.
  • Authentication: one of the amazing stories told about a recent Facebook outage was that employees rushing to fix the global outage were locked out of the Facebook offices because their keycards did not work. The reason was that badge access was managed by the very systems they had come to repair. Denied physical access was definitely an amusing anecdote in this case (and was reportedly solved by “bringing in the person with the key” to open the door), but online authentication is no less critical and challenging. Imagine if they could not connect to the cloud system, because authentication requires that very same cloud in order to operate. A classic SPOF example.
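
Here is the failover sketch promised above: a minimal Python illustration of the many-to-many idea, assuming a hypothetical fetch_token RPC and made-up replica hostnames. The caller holds several interchangeable instances of a dependency, so one bad instance doesn’t break the pipeline; only losing all of them does.

```python
# Minimal failover sketch. fetch_token() and the hostnames are
# hypothetical stand-ins for a real RPC against real replicas.

import random

REPLICAS = ["auth-1.internal", "auth-2.internal", "auth-3.internal"]

class ReplicaDown(Exception):
    pass

def fetch_token(host: str) -> str:
    """Stand-in for a real RPC; randomly fails to simulate a bad instance."""
    if random.random() < 0.3:
        raise ReplicaDown(host)
    return f"token-from-{host}"

def call_with_failover(replicas):
    """Try replicas in a shuffled order; fail only if *all* of them are down."""
    last_error = None
    for host in random.sample(replicas, k=len(replicas)):
        try:
            return fetch_token(host)
        except ReplicaDown as err:
            last_error = err  # one instance went bad; move on to the next
    raise RuntimeError("all replicas down") from last_error

print(call_with_failover(REPLICAS))
```

The same pattern scales up from a single client loop like this to the load balancers and anycast routing that cloud vendors put in front of every layer.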

In addition to building your system to be ready for any failure, there are other interesting ways of inducing failures to test it, such as power outage drills that exercise the generators and make sure everything remains intact.

One concept worth mentioning here is Chaos Engineering: purposely introducing chaos into the system to test its resilience. It could take the form of simulating a regional outage, or of injecting faults (such as a storage outage, expiring certificates, or network failures) into the system and measuring how it reacts.
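
As a rough illustration of the fault-injection idea (this is not any particular chaos tool, and read_blob is a hypothetical stand-in for a real storage call), an injector can be as simple as a wrapper that randomly raises errors or adds latency, while you measure how callers cope:

```python
# A tiny chaos-injection sketch: wrap any callable so it sometimes
# fails or stalls, then observe the caller's retries and fallbacks.

import random
import time
from functools import wraps

def chaotic(failure_rate=0.1, max_delay_s=2.0):
    """Decorator that randomly raises or adds latency, to exercise
    the caller's retries, timeouts, and fallbacks."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"chaos: injected failure in {fn.__name__}")
            time.sleep(random.uniform(0, max_delay_s))  # injected latency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@chaotic(failure_rate=0.2, max_delay_s=0.5)
def read_blob(key: str) -> bytes:
    return b"..."  # stand-in for a real storage read

# Measure how the system behaves under injected faults.
ok = failed = 0
for _ in range(100):
    try:
        read_blob("some/key")
        ok += 1
    except ConnectionError:
        failed += 1
print(f"succeeded={ok} failed={failed}")
```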

Having said all that, the world is complex and unpredictable, and people will always make mistakes, so cloud outages are very much a thing. In fact, the impact of these outages on our daily lives only grows with time.

So, on top of the cloud vendors’ own reliability efforts, almost every large company has a BCDR (Business Continuity and Disaster Recovery) plan, which follows many of the above principles to make sure its services running on the cloud can endure an outage and continue serving clients.

An interesting trend that has emerged from all this is the rise of multi-cloud. One SPOF that I haven’t mentioned yet is the vendor itself. If a vendor is experiencing a global issue, what better can a customer do than distribute their software across multiple cloud vendors? And many of them do, running the same apps on multiple cloud platforms, ensuring their fate is not tied to that of a single vendor, and eliminating yet another SPOF :)
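
A sketch of that pattern, with hypothetical vendor classes standing in for real SDK clients: the application talks to one interface, and any single vendor can disappear without taking the service down.

```python
# Multi-cloud failover sketch. The "vendor" classes are made-up
# stand-ins, not real cloud SDKs; the point is the pattern.

class VendorAStorage:
    def put(self, key: str, data: bytes) -> None:
        raise ConnectionError("vendor A is having a global outage")

class VendorBStorage:
    def put(self, key: str, data: bytes) -> None:
        print(f"stored {key!r} on vendor B")

class MultiCloudStorage:
    """Writes go to the first healthy vendor, so the vendor itself
    is no longer a single point of failure."""
    def __init__(self, backends):
        self.backends = backends

    def put(self, key: str, data: bytes) -> None:
        errors = []
        for backend in self.backends:
            try:
                return backend.put(key, data)
            except ConnectionError as err:
                errors.append(err)  # this vendor is down; try the next one
        raise RuntimeError(f"all vendors down: {errors}")

storage = MultiCloudStorage([VendorAStorage(), VendorBStorage()])
storage.put("report.pdf", b"%PDF-...")  # survives vendor A's outage
```

The hard part in practice is keeping data and deployments in sync across vendors, which is why many companies settle for an active-passive setup rather than running everything everywhere.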
