Engineering prowess famously catapulted the 14-year-old search giant into its place as one of the world’s most successful, influential, and frighteningly powerful companies. Its constantly refined search algorithm changed the way we all access and even think about information. Its equally complex ad-auction platform is a perpetual money-minting machine. But other, less well-known engineering and strategic breakthroughs are arguably just as crucial to Google’s success: its ability to build, organize, and operate a huge network of servers and fiber-optic cables with an efficiency and speed that rocks physics on its heels. Google has spread its infrastructure across a global archipelago of massive buildings—a dozen or so information palaces in locales as diverse as Council Bluffs, Iowa; St. Ghislain, Belgium; and soon Hong Kong and Singapore—where an unspecified but huge number of machines process and deliver the continuing chronicle of human experience.
This is what makes Google Google: its physical network, its thousands of fiber miles, and those many thousands of servers that, in aggregate, add up to the mother of all clouds. This multibillion-dollar infrastructure allows the company to index 20 billion web pages a day. To handle more than 3 billion daily search queries. To conduct millions of ad auctions in real time. To offer free email storage to 425 million Gmail users. To zip millions of YouTube videos to users every day. To deliver search results before the user has finished typing the query. In the near future, when Google releases the wearable computing platform called Glass, this infrastructure will power its visual search results.
The problem for would-be bards attempting to sing of these data centers has been that, because Google sees its network as the ultimate competitive advantage, only critical employees have been permitted even a peek inside, a prohibition that has most certainly included bards. Until now.
Here I am, in a huge white building in Lenoir, standing near a reinforced door with a party of Googlers, ready to become that rarest of species: an outsider who has been inside one of the company’s data centers and seen the legendary server floor, referred to simply as “the floor.” My visit is the latest evidence that Google is relaxing its black-box policy. My hosts include Joe Kava, who’s in charge of building and maintaining Google’s data centers, and his colleague Vitaly Gudanets, who populates the facilities with computers and makes sure they run smoothly.
A sign outside the floor dictates that no one can enter without hearing protection, either salmon-colored earplugs that dispensers spit out like trail mix or panda-bear earmuffs like the ones worn by airline ground crews. (The noise is a high-pitched thrum from fans that control airflow.) We grab the plugs. Kava holds his hand up to a security scanner and opens the heavy door. Then we slip into a thunderdome of data …
Urs Hölzle had never stepped into a data center before he was hired by Sergey Brin and Larry Page. A hirsute, soft-spoken Swiss, Hölzle was on leave as a computer science professor at UC Santa Barbara in February 1999 when his new employers took him to the Exodus server facility in Santa Clara. Exodus was a colocation site, or colo, where multiple companies rent floor space. Google’s “cage” sat next to servers from eBay and other blue-chip Internet companies. But the search company’s array was the most densely packed and chaotic. Brin and Page were looking to upgrade the system, which often took a full 3.5 seconds to deliver search results and tended to crash on Mondays. They brought Hölzle on to help drive the effort.
It wouldn’t be easy. Exodus was “a huge mess,” Hölzle later recalled. And the cramped hodgepodge would soon be strained even more. Google was not only processing millions of queries every week but also stepping up the frequency with which it indexed the web, gathering every bit of online information and putting it into a searchable format. AdWords—the service that invited advertisers to bid for placement alongside search results relevant to their wares—involved computation-heavy processes that were just as demanding as search. Page had also become obsessed with speed, with delivering search results so quickly that it gave the illusion of mind reading, a trick that required even more servers and connections. And the faster Google delivered results, the more popular it became, creating an even greater burden. Meanwhile, the company was adding other applications, including a mail service that would require instant access to many petabytes of storage. Worse yet, the tech downturn that left many data centers underpopulated in the late ’90s was ending, and Google’s future leasing deals would become much more costly.
For Google to succeed, it would have to build and operate its own data centers—and figure out how to do it more cheaply and efficiently than anyone had before. The mission was codenamed Willpower. Its first built-from-scratch data center was in The Dalles, a city in Oregon near the Columbia River.
Hölzle and his team designed the $600 million facility in light of a radical insight: Server rooms did not have to be kept so cold. The machines throw off prodigious amounts of heat. Traditionally, data centers cool them off with giant computer room air conditioners, or CRACs, typically jammed under raised floors and cranked up to arctic levels. That requires massive amounts of energy; data centers consume up to 1.5 percent of all the electricity in the world.
Google realized that the so-called cold aisle in front of the machines could be kept at a relatively balmy 80 degrees or so—workers could wear shorts and T-shirts instead of the standard sweaters. And the “hot aisle,” a tightly enclosed space where the heat pours from the rear of the servers, could be allowed to hit around 120 degrees. That heat could be absorbed by coils filled with water, which would then be pumped out of the building and cooled before being circulated back inside. Add that to the long list of Google’s accomplishments: The company broke its CRAC habit.
Google also figured out money-saving ways to cool that water. Many data centers relied on energy-gobbling chillers, but Google’s big data centers usually employ giant towers where the hot water trickles down through the equivalent of vast radiators, some of it evaporating and the remainder attaining room temperature or lower by the time it reaches the bottom. In its Belgium facility, Google uses recycled industrial canal water for the cooling; in Finland it uses seawater.
The company’s analysis of electrical flow unearthed another source of waste: the bulky uninterrupted-power-supply systems that protected servers from power disruptions in most data centers. Not only did they leak electricity, they also required their own cooling systems. But because Google designed the racks on which it placed its machines, it could make space for backup batteries next to each server, doing away with the big UPS units altogether. According to Joe Kava, that scheme reduced electricity loss by about 15 percent.
All of these innovations helped Google achieve unprecedented energy savings. The standard measurement of data center efficiency is called power usage effectiveness, or PUE. A perfect number is 1.0, meaning all the power drawn by the facility is put to use. Experts considered 2.0—indicating half the power is wasted—to be a reasonable number for a data center. Google was getting an unprecedented 1.2.
For years Google didn’t share what it was up to. “Our core advantage really was a massive computer network, more massive than probably anyone else’s in the world,” says Jim Reese, who helped set up the company’s servers. “We realized that it might not be in our best interest to let our competitors know.”
But stealth had its drawbacks. Google was on record as being an exemplar of green practices. In 2007 the company committed formally to carbon neutrality, meaning that every molecule of carbon produced by its activities—from operating its cooling units to running its diesel generators—had to be canceled by offsets. Maintaining secrecy about energy savings undercut that ideal: If competitors knew how much energy Google was saving, they’d try to match those results, and that could make a real environmental impact. Also, the stonewalling, particularly regarding The Dalles facility, was becoming almost comical. Google’s ownership had become a matter of public record, but the company still refused to acknowledge it.
In 2009, at an event dubbed the Efficient Data Center Summit, Google announced its latest PUE results and hinted at some of its techniques. It marked a turning point for the industry, and now companies like Facebook and Yahoo report similar PUEs.
Make no mistake, though: The green that motivates Google involves presidential portraiture. “Of course we love to save energy,” Hölzle says. “But take something like Gmail. We would lose a fair amount of money on Gmail if we did our data centers and servers the conventional way. Because of our efficiency, we can make the cost small enough that we can give it away for free.”
Google’s breakthroughs extend well beyond energy. Indeed, while Google is still thought of as an Internet company, it has also grown into one of the world’s largest hardware manufacturers, thanks to the fact that it builds much of its own equipment. In 1999, Hölzle bought parts for 2,000 stripped-down “breadboards” from “three guys who had an electronics shop.” By going homebrew and eliminating unneeded components, Google built a batch of servers for about $1,500 apiece, instead of the then-standard $5,000. Hölzle, Page, and a third engineer designed the rigs themselves. “It wasn’t really ‘designed,’” Hölzle says, gesturing with air quotes.
More than a dozen generations of Google servers later, the company now takes a much more sophisticated approach. Google knows exactly what it needs inside its rigorously controlled data centers—speed, power, and good connections—and saves money by not buying unnecessary extras. (No graphics cards, for instance, since these machines never power a screen. And no enclosures, because the motherboards go straight into the racks.) The same principle applies to its networking equipment, some of which Google began building a few years ago.
So far, though, there’s one area where Google hasn’t ventured: designing its own chips. But the company’s VP of platforms, Bart Sano, implies that even that could change. “I’d never say never,” he says. “In fact, I get that question every year. From Larry.”
Even if you reimagine the data center, the advantage won’t mean much if you can’t get all those bits out to customers speedily and reliably. And so Google has launched an attempt to wrap the world in fiber. In the early 2000s, taking advantage of the failure of some telecom operations, it began buying up abandoned fiber-optic networks, paying pennies on the dollar. Now, through acquisition, swaps, and actually laying down thousands of strands, the company has built a mighty empire of glass.
But when you’ve got a property like YouTube, you’ve got to do even more. It would be slow and burdensome to have millions of people grabbing videos from Google’s few data centers. So Google installs its own server racks in various outposts of its network—mini data centers, sometimes connected directly to ISPs like Comcast or AT&T—and stuffs them with popular videos. That means that if you stream, say, a Carly Rae Jepsen video, you probably aren’t getting it from Lenoir or The Dalles but from some colo just a few miles from where you are.
Over the years, Google has also built a software system that allows it to manage its countless servers as if they were one giant entity. Its in-house developers can act like puppet masters, dispatching thousands of computers to perform tasks as easily as running a single machine. In 2002 its scientists created Google File System, which smoothly distributes files across many machines. MapReduce, a Google system for writing cloud-based applications, was so successful that an open source version called Hadoop has become an industry standard. Google also created software to tackle a knotty issue facing all huge data operations: When tasks come pouring into the center, how do you determine instantly and most efficiently which machines can best afford to take on the work? Google has solved this “load-balancing” issue with an automated system called Borg.
These innovations allow Google to fulfill an idea embodied in a 2009 paper written by Hölzle and one of his top lieutenants, computer scientist Luiz Barroso: “The computing platform of interest no longer resembles a pizza box or a refrigerator but a warehouse full of computers … We must treat the data center itself as one massive warehouse-scale computer.”
This is tremendously empowering for the people who write Google code. Just as your computer is a single device that runs different programs simultaneously—and you don’t have to worry about which part is running which application—Google engineers can treat seas of servers like a single unit. They just write their production code, and the system distributes it across a server floor they will likely never be authorized to visit. “If you’re an average engineer here, you can be completely oblivious,” Hölzle says. “You can order x petabytes of storage or whatever, and you have no idea what actually happens.”
But of course, none of this infrastructure is any good if it isn’t reliable. Google has innovated its own answer for that problem as well—one that involves a surprising ingredient for a company built on algorithms and automation: people.
At 3 am on a chilly winter morning, a small cadre of engineers begin to attack Google. First they take down the internal corporate network that serves the company’s Mountain View, California, campus. Later the team attempts to disrupt various Google data centers by causing leaks in the water pipes and staging protests outside the gates—in hopes of distracting attention from intruders who try to steal data-packed disks from the servers. They mess with various services, including the company’s ad network. They take a data center in the Netherlands offline. Then comes the coup de grâce—cutting most of Google’s fiber connection to Asia.
Turns out this is an inside job. The attackers, working from a conference room on the fringes of the campus, are actually Googlers, part of the company’s Site Reliability Engineering team, the people with ultimate responsibility for keeping Google and its services running. SREs are not merely troubleshooters but engineers who are also in charge of getting production code onto the “bare metal” of the servers; many are embedded in product groups for services like Gmail or search. Upon becoming an SRE, members of this geek SEAL team are presented with leather jackets bearing a military-style insignia patch. Every year, the SREs run this simulated war—called DiRT (disaster recovery testing)—on Google’s infrastructure. The attack may be fake, but it’s almost indistinguishable from reality: Incident managers must go through response procedures as if they were really happening. In some cases, actual functioning services are messed with. If the teams in charge can’t figure out fixes and patches to keep things running, the attacks must be aborted so real users won’t be affected. In classic Google fashion, the DiRT team always adds a goofy element to its dead-serious test—a loony narrative written by a member of the attack team. This year it involves a Twin Peaks-style supernatural phenomenon that supposedly caused the disturbances. Previous DiRTs were attributed to zombies or aliens.
As the first attack begins, Kripa Krishnan, an upbeat engineer who heads the annual exercise, explains the rules to about 20 SREs in a conference room already littered with junk food. “Do not attempt to fix anything,” she says. “As far as the people on the job are concerned, we do not exist. If we’re really lucky, we won’t break anything.” Then she pulls the plug—for real—on the campus network. The team monitors the phone lines and IRC channels to see when the Google incident managers on call around the world notice that something is wrong. It takes only five minutes for someone in Europe to discover the problem, and he immediately begins contacting others.
“My role is to come up with big tests that really expose weaknesses,” Krishnan says. “Over the years, we’ve also become braver in how much we’re willing to disrupt in order to make sure everything works.” How did Google do this time? Pretty well. Despite the outages in the corporate network, executive chair Eric Schmidt was able to run a scheduled global all-hands meeting. The imaginary demonstrators were placated by imaginary pizza. Even shutting down three-fourths of Google’s Asia traffic capacity didn’t shut out the continent, thanks to extensive caching. “This is the best DiRT ever!” Krishnan exclaimed at one point.
The SRE program began when Hölzle charged an engineer named Ben Treynor with making Google’s network fail-safe. This was especially tricky for a massive company like Google that is constantly tweaking its systems and services—after all, the easiest way to stabilize it would be to freeze all change. Treynor ended up rethinking the very concept of reliability. Instead of trying to build a system that never failed, he gave each service a budget—an amount of downtime it was permitted to have. Then he made sure that Google’s engineers used that time productively. “Let’s say we wanted Google+ to run 99.95 percent of the time,” Hölzle says. “We want to make sure we don’t get that downtime for stupid reasons, like we weren’t paying attention. We want that downtime because we push something new.”
Nevertheless, accidents do happen—as Sabrina Farmer learned on the morning of April 17, 2012. Farmer, who had been the lead SRE on the Gmail team for a little over a year, was attending a routine design review session. Suddenly an engineer burst into the room, blurting out, “Something big is happening!” Indeed: For 1.4 percent of users (a large number of people), Gmail was down. Soon reports of the outage were all over Twitter and tech sites. They were even bleeding into mainstream news.
The conference room transformed into a war room. Collaborating with a peer group in Zurich, Farmer launched a forensic investigation. A breakthrough came when one of her Gmail SREs sheepishly admitted, “I pushed a change on Friday that might have affected this.” Those responsible for vetting the change hadn’t been meticulous, and when some Gmail users tried to access their mail, various replicas of their data across the system were no longer in sync. To keep the data safe, the system froze them out.
The diagnosis had taken 20 minutes, designing the fix 25 minutes more—pretty good. But the event went down as a Google blunder. “It’s pretty painful when SREs trigger a response,” Farmer says. “But I’m happy no one lost data.” Nonetheless, she’ll be happier if her future crises are limited to DiRT-borne zombie attacks.
One scenario that dirt never envisioned was the presence of a reporter on a server floor. But here I am in Lenoir, earplugs in place, with Joe Kava motioning me inside.
We have passed through the heavy gate outside the facility, with remote-control barriers evoking the Korean DMZ. We have walked through the business offices, decked out in Nascar regalia. (Every Google data center has a decorative theme.) We have toured the control room, where LCD dashboards monitor every conceivable metric. Later we will climb up to catwalks to examine the giant cooling towers and backup electric generators, which look like Beatle-esque submarines, only green. We will don hard hats and tour the construction site of a second data center just up the hill. And we will stare at a rugged chunk of land that one day will hold a third mammoth computational facility.
But now we enter the floor. Big doesn’t begin to describe it. Row after row of server racks seem to stretch to eternity. Joe Montana in his prime could not throw a football the length of it.
During my interviews with Googlers, the idea of hot aisles and cold aisles has been an abstraction, but on the floor everything becomes clear. The cold aisle refers to the general room temperature—which Kava confirms is 77 degrees. The hot aisle is the narrow space between the backsides of two rows of servers, tightly enclosed by sheet metal on the ends. A nest of copper coils absorbs the heat. Above are huge fans, which sound like jet engines jacked through Marshall amps.
The huge fans sound like jet engines jacked through Marshall amps.
We walk between the server rows. All the cables and plugs are in
front, so no one has to crack open the sheet metal and venture into the
hot aisle, thereby becoming barbecue meat. (When someone does have to
head back there, the servers are shut down.) Every server has a sticker
with a code that identifies its exact address, useful if something goes
wrong. The servers have thick black batteries alongside. Everything is
uniform and in place—nothing like the spaghetti tangles of Google’s
long-ago Exodus era.Blue lights twinkle, indicating … what? A web search? Someone’s Gmail message? A Glass calendar event floating in front of Sergey’s eyeball? It could be anything.
Every so often a worker appears—a long-haired dude in shorts propelling himself by scooter, or a woman in a T-shirt who’s pushing a cart with a laptop on top and dispensing repair parts to servers like a psychiatric nurse handing out meds. (In fact, the area on the floor that holds the replacement gear is called the pharmacy.)
How many servers does Google employ? It’s a question that has dogged observers since the company built its first data center. It has long stuck to “hundreds of thousands.” (There are 49,923 operating in the Lenoir facility on the day of my visit.) I will later come across a clue when I get a peek inside Google’s data center R&D facility in Mountain View. In a secure area, there’s a row of motherboards fixed to the wall, an honor roll of generations of Google’s homebrewed servers. One sits atop a tiny embossed plaque that reads july 9, 2008. google’s millionth server. But executives explain that this is a cumulative number, not necessarily an indication that Google has a million servers in operation at once.
Wandering the cold aisles of Lenoir, I realize that the magic number, if it is even obtainable, is basically meaningless. Today’s machines, with multicore processors and other advances, have many times the power and utility of earlier versions. A single Google server circa 2012 may be the equivalent of 20 servers from a previous generation. In any case, Google thinks in terms of clusters—huge numbers of machines that act together to provide a service or run an application. “An individual server means nothing,” Hölzle says. “We track computer power as an abstract metric.” It’s the realization of a concept Hölzle and Barroso spelled out three years ago: the data center as a computer.
As we leave the floor, I feel almost levitated by my peek inside Google’s inner sanctum. But a few weeks later, back at the Googleplex in Mountain View, I realize that my epiphanies have limited shelf life. Google’s intention is to render the data center I visited obsolete. “Once our people get used to our 2013 buildings and clusters,” Hölzle says, “they’re going to complain about the current ones.”
Asked in what areas one might expect change, Hölzle mentions data center and cluster design, speed of deployment, and flexibility. Then he stops short. “This is one thing I can’t talk about,” he says, a smile cracking his bearded visage, “because we’ve spent our own blood, sweat, and tears. I want others to spend their own blood, sweat, and tears making the same discoveries.” Google may be dedicated to providing access to all the world’s data, but some information it’s still keeping to itself.
No comments:
Post a Comment