The world assumes that Google services will always be there. From Gmail and YouTube to Cloud Storage, Search and IoT management, we take them for granted. Every hour, 30,000 hours of video are uploaded to YouTube, Google receives almost 230 million search queries, and an enormous number of emails are sent. In terms of connections, Google controls about a third of the surface internet. But on Monday, the 14th of December 2020, all of Google's services suddenly disappeared across the world. Users were unable to access their email and were kicked out of ongoing Google Meet sessions.
One Twitter user even said that he was left sitting in the dark with his toddler because his Google Home system had failed. The crash quickly became one of the biggest trends on social media and sent waves of panic through businesses in many parts of the world. How could one of the largest companies suddenly go dark on all of its services at once? What happened? Was this a hack?
In total, the outage lasted only about an hour, but it caused a lot of chaos. When it was all over, most people forgot about it and went about their day. But when we analyze the situation, some interesting things emerge. Not only are consumers dependent on Google, but so are many businesses like us. And how does Google avoid downtime in the first place?
The Google outage caused pandemonium across the world. Some of the biggest companies in the world use Google, including Uber, Airbnb, Pinterest, Netflix, Spotify, Twitter and Instacart, and the list goes on. Their employees were not only unable to reach services such as Gmail; in some cases they could not get into their systems at all. Many services, such as Salesforce and Dropbox, let users sign in with their Google accounts, and in certain cases these became inaccessible too. The outage also hit IoT devices: smart thermostats and smart lights appeared to be down as well.
Gmail, Google Search, YouTube, Google Docs, Google Drive, Nest home systems, Google Play and even the Google Stadia cloud gaming service were all gone. The WSJ newsroom was dependent on Google services, so during the outage some reporters had to resort to using telephones to collaborate on writing stories. Some schools in the US had to close for the day. Wayne-Westland Community Schools in Michigan gave its combined 10,000 students the day off after Google crashed, because the district relied on Google Meet for classes. Many other educational institutions would have been affected too, given the prevalence of online classes during the ongoing pandemic. There were also cases of management at medical companies being unable to check the schedules of physicians, and of other medical staff being unable to contact customers.
Remote work and learning have left individuals and businesses more dependent on online services than ever, and in this domain Google is the most widely used provider. All in all, the outage affected billions of people worldwide. So, what happened?
Google’s spokeswoman told the WSJ that there was a problem with the company’s system that authenticates log-in credentials. She stated that the problem was due to internal servers and that the issues weren’t the result of a cyber-attack. This explanation doesn’t give us much, but that’s just about all that Google wants to say about the issue. It really is rare for Google to have a global outage like this, because even a single geographic region is served by multiple servers across the world, and each of these servers has multiple backups that rapidly come online if there’s a problem. As we’ve seen, many businesses and people’s livelihoods rely on Google. That raises a serious question: what if next time Google were down not just for an hour but for days? Billions of dollars in revenue could be lost by companies around the world.
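The redundancy described above, where a backup takes over as soon as a server fails, can be illustrated with a minimal Python sketch. This is only an illustration of the general failover idea and not Google's actual design; the names call_with_failover, primary and backup are hypothetical, and each replica is modelled as a plain function that either answers or raises ConnectionError.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # This replica is unavailable; remember why and try the next one.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {request!r}")


# Hypothetical replicas (assumption): the primary is down, the backup is healthy.
def primary(request: str) -> str:
    raise ConnectionError("primary unreachable")


def backup(request: str) -> str:
    return f"backup served {request}"


print(call_with_failover([primary, backup], "GET /inbox"))
```

A design like this protects against the failure of individual servers. It does not help when every replica depends on one shared system, such as the log-in system that Google's spokeswoman pointed to, because then all replicas fail together.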
So how does Google prevent this? How does Google basically never go down? Google calls its plan to keep its services up and running Site Reliability Engineering (SRE). Dating all the way back to 2003, SRE is a digital design philosophy. For Google, the basic idea is to have software developers run the operations side of its systems.
People often call this kind of philosophy DevOps: essentially, writing software that does the work a system administrator would otherwise do by hand. The thinking goes as follows: software developers get bored performing tasks by hand, so they naturally build tools that automate the process without the involvement of actual people. In fact, Google has written a book about this.
Google states that SRE is its most fundamental feature. In 2016, Todd Underwood of Google's SRE team told WIRED magazine, “We long for the day when nobody runs anything”. This is interesting because development and operations were traditionally opposing forces. The developers always wanted to build new software and get the changes out to the public as fast as possible, while the operations team wanted to ensure that nothing went wrong, and the best way to do that is to keep changes to a minimum.
The trick that Google found is that combining development and operations produces a powerful synergy for a reliable system. It makes sense: Google is the world’s largest online empire, and the more humans there are running things, the higher the probability of mistakes. So, just have code run everything.
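As a toy illustration of "have code run everything", the Python sketch below replaces a manual "notice the failure, then restart it" chore with a function. It is a sketch under stated assumptions, not a description of Google's tooling: the Service class, the remediate function and the service names are all hypothetical, and a restart is assumed to always succeed.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for a process that can be probed and restarted."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        # Assumption: in this toy model a restart always brings the service back.
        self.healthy = True
        self.restarts += 1


def remediate(services: list[Service]) -> list[str]:
    """Restart every unhealthy service and return the names that were fixed."""
    fixed = []
    for service in services:
        if not service.healthy:
            service.restart()
            fixed.append(service.name)
    return fixed


fleet = [Service("mail"), Service("video", healthy=False), Service("search")]
print(remediate(fleet))  # ['video']
```

In a real system, a function like this would be run on a schedule or triggered by monitoring, so that no person has to watch a dashboard and restart things by hand.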
What about hacking? To combat the threat, Google often runs hacking competitions in which hackers report security problems so that they can be fixed before bad actors exploit them. Google calls this the Vulnerability Reward Program, and it was first launched in 2010.
This next part is where things get interesting. The Google outage occurred just a few hours after it was discovered that the US Government had been targeted by a foreign cyber-attack. The hack was so serious that it led to an emergency National Security Council meeting at the White House. Experts are calling it one of the most sophisticated hacks ever seen. It was carried out through something called a supply-chain hack.
Orion, a software tool made by SolarWinds and used by government departments, was infected with malware through an update. After this, the hackers were able to monitor internal emails and do some general snooping. The infected software update in question was released all the way back in March of 2020 and lay undetected until last week. Thousands of companies and American government departments use some form of SolarWinds software. Those affected by the hack include the Department of Homeland Security, Department of Justice, Department of Defense, Treasury Department, NASA, NSA, and more. All of the top 10 US telecom companies and 425 of the US Fortune 500 companies are said to be at risk. It is estimated that 18,000 clients had installed the infected update. Ironically, SolarWinds software monitors the computer networks of businesses and governments for outages.
But what if this massive Google outage was a response to the SolarWinds attack? Google’s staff may have been hurriedly shoring up their security, and the scramble may have resulted in some downtime across all of their services. For a massive worldwide outage to occur just a few hours after a massive cyber-attack on some of America’s biggest companies is pretty interesting timing, to put it mildly.
Google’s outage can be seen as a stark reminder of our hyper-connectedness. The company has become a bottleneck for so many of the world’s processes. It is part of a massive system, and when it breaks, the consequences are massive too. For just one company to become an unexpected chokepoint for global productivity is pretty unnerving.
So, what can be done? Well, the solution is obvious: there are alternatives to Google for every service it provides. It really comes down to the individual person or business, and to trading in some of the convenience that we’ve all gotten used to.