Building Rome in a Day
If you’ve been following the launch of Wolfram|Alpha, then you have probably heard that two supercomputer-class systems are a big part of what is behind the scenes. One of them is the R Smarr system, belonging to our good friends at R Systems, which is featured in this video. The other is our custom Dell system, highlighted in the Rack ‘n’ Roll video. (That’s me in the blue shirt and the crazy blond hair.) Between the two of them, we can handle around 1800 queries per second (qps). Many people have asked about how we pulled together all of this infrastructure.
First, some background.
Back in mid-March our development team was intensely focused on building Wolfram|Alpha. As each day went by, the pace of development accelerated, and the further we progressed, the faster Wolfram|Alpha grew in both content and functionality. On the infrastructure side, we had put in place a prudent plan. We knew the rollout would draw early adopters from the professional audiences that our company is very familiar with, and we had planned accordingly for a capacity of 200 queries per second. A few colocations spread throughout the United States should do the job; we were well on track to set them up in plenty of time. And we thought that our “I’m sorry, Dave, I’m afraid I can’t do that” message would be seen occasionally in the first few weeks if there was overflow beyond our capacity.
Following our first few public announcements about Wolfram|Alpha, we began to see broad global interest. As the product swelled in capability and scope faster than any of us had anticipated, we periodically revisited our assumptions, and the infrastructure team decided to see what we could do to spike the launch capacity to be as strong as the product was becoming. Perhaps even get to the point where nobody would have to see the “Dave” message at all, no matter how busy the launch-day activity.
Stephen Wolfram was highly supportive: “The full support of the company is behind you; do whatever it takes.” Given the broader audience the product was becoming viable for and given the public response that we had seen so far, what should we forecast as peak launch demand? How about being able to handle a peak of 2000 queries per second, ten times the earlier plan? Since we hadn’t even talked to a supercomputer vendor yet with about two months to go until launch, we had moved from prudent to very aggressive on both time frame and target. Good thing we had a crackerjack team and strong partners in Dell and R Systems.
The time between our initial talks with Dell and the delivery of our system was only a few weeks. Dell was just amazing in how they worked with our schedule. After the delivery we had just three weeks left to build, install, deploy, test, and tune a 500-node system for the launch of a high-profile, highly anticipated, computationally intensive website. What could possibly go wrong?
The short answer to that question is “surprisingly little,” given the scale of what we were trying to accomplish. The initial hardware build-out took Jeff, Ken, Rusty, John, the two Chrisses, the three Matts, Steve, and others (henceforth referred to as “the systems engineering team”) just 18 hours, start to finish, with the first truck arriving at 8am, and the last guy turning the lights off at 2 the following morning. I stopped counting the pizza boxes halfway through. All 500 nodes powered on without a problem. We did lose one node after a few days due to a bad onboard network card, fixed easily enough with Dell’s stellar onsite parts service. One motherboard swap later and we were back to full capacity. Chris, Jamie, and Grant took over from there for the software install and deployment. They were building the tools and learning the techniques to distribute and manage software and a multi-terabyte database across the 500 nodes at the same time that they were doing the deployment. In less than a week and a half, they accomplished a task that should have taken a month and a half. These guys are simply amazing.
If you’re counting, we were then a week and a half from launch, and ready for the first load test on our brand new Dell system. The results from a handful of computers firing requests off to one rack of the Dell weren’t pretty, and the test computers were straining under the load. The obvious conclusion was that it takes a supercomputer to test a supercomputer. So, we used a few nodes of the Dell system to create queries for Wolfram|Alpha running on the rest of the Dell system, and it scaled beautifully. Victory? Not quite; we needed to test capacity of every infrastructure layer, and that meant load coming in from the outside. Chris, Jamie, and Grant were still working on deploying software on R Smarr, so that supercomputer was not yet available to generate load for the Dell. One quick call to R Systems, and they were kind enough to let us borrow a separate 140-node system for a few days. If you’re counting supercomputer-class systems involved in Wolfram|Alpha, we were then up to three: two soon to be in production, and one being used as a test rig to generate load for the production systems.
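For readers who want to see the basic shape of a load test in code, here is a minimal sketch. It is an illustration only, not the actual test harness: `fake_query` is a hypothetical stand-in for a real request to the system under test, and the worker count and query text are arbitrary assumptions. It fires a batch of requests through a pool of concurrent workers and reports the achieved queries per second.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_query(query):
    """Hypothetical stand-in for one request to the system under test."""
    time.sleep(0.001)  # pretend the server takes about 1 ms to answer
    return len(query)


def run_load(handler, queries, workers):
    """Send every query through a thread pool; return (completed, qps)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(handler, queries))
    elapsed = time.perf_counter() - start
    return len(results), len(results) / elapsed


# Arbitrary illustrative workload: 200 identical queries, 20 workers.
queries = ["population of Rome"] * 200
completed, qps = run_load(fake_query, queries, workers=20)
print(f"{completed} queries completed at roughly {qps:.0f} qps")
```

The measured rate is capped by whichever side is slower, so a generator that cannot outpace its target ends up measuring itself. That is the sense in which it takes a supercomputer to test a supercomputer.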
We were then just days before launch, with 140 nodes at our disposal to generate load, and final load testing could proceed. One cluster of the big Dell system handled 130 qps: check. Two clusters got 260 qps. We were cooking. Three clusters, 210. Uh oh. Four clusters, 120. !?@#%^. Maybe it was just a glitch. We tried it again, but round two didn’t fare any better. It was time for an emergency meeting. Everyone was on the case (Jeff and his systems engineering guys, plus Chris, Jamie, Grant, Mike, Oyvind, and many other folks), working non-stop to figure out the bottleneck. Something must have been thrashing, but what was the problem? The test rig? It checked out. The edge switch? That checked out, too. Ditto on the other end of the line. Core switch? Also fine. Was logging slowing us down? Nope. Were any of the databases saturated? Looked okay. The test log implied packet loss, as did the web server logs.
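To make the shape of that first Dell test concrete, here is a small sketch that compares each measurement against perfect linear scaling and finds the point where total throughput first drops. The qps figures are the ones quoted above; the function names are made up for this illustration and are not part of any tool we used.

```python
def scaling_report(measurements):
    """Return (clusters, qps, efficiency) tuples, where efficiency is the
    measured qps divided by what linear scaling from the first run predicts."""
    base_clusters, base_qps = measurements[0]
    per_cluster_baseline = base_qps / base_clusters
    report = []
    for clusters, qps in measurements:
        ideal = per_cluster_baseline * clusters
        report.append((clusters, qps, qps / ideal))
    return report


def first_regression(measurements):
    """Return the cluster count at which total qps first drops, or None."""
    for (_, prev_qps), (clusters, qps) in zip(measurements, measurements[1:]):
        if qps < prev_qps:
            return clusters
    return None


# Figures from the first Dell load test: (clusters, measured qps).
dell_first_run = [(1, 130), (2, 260), (3, 210), (4, 120)]

for clusters, qps, efficiency in scaling_report(dell_first_run):
    print(f"{clusters} cluster(s): {qps} qps, {efficiency:.0%} of linear")
print("throughput first dropped at", first_regression(dell_first_run), "clusters")
```

Throughput that falls as capacity is added, rather than merely flattening, suggests contention on something shared between the clusters, not a shortage of compute.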
By then it was Thursday evening and we’d called in the network engineers to go step by step through every piece of the outside connection, having eliminated everything else. By Friday morning, we’d verified that every link on the primary network was fine. Greg from R Systems popped in and ended up basically staying for the whole weekend. After he helped us bring R Smarr online, he kept asking, “Anything I can do to help over here?” There always was. These guys are awesome. Greg has deep, deep knowledge of supercomputer infrastructure. We’re all glad he’s on our side.
Eight hours before the Friday evening pre-launch, we refocused on the switch configuration. It’s complex, and there was a good chance cross-chatter between the racks under load was gumming up the works. Greg also found that the router nodes were dropping packets. So we disconnected a few of the racks from each other, changed an obscure setting in the Linux network stack, and tried again: one cluster gave 130 qps, two clusters yielded 260 qps, and three brought us to 400 qps. That was promising. Unfortunately, we were out of time. Two hours to the pre-launch, and we put the Dell back the way it was. There were too many unknown unknowns for a live webcast, but we were confident that we were on a path to resolving the issue.
If you watched the webcast, then you know that we continued load testing through the pre-launch weekend. We had also been testing the R Smarr system, which had been going well. We did one live test on the webcast that went very well. Then we did a second live test that was less encouraging. Still, our confidence remained strong: one quick sanity check of R Smarr at scale, and we’d be in good shape…160 qps for 1 cluster, 300 qps for 2 clusters, 320 qps for 3 clusters. !?@#%^. R Smarr doesn’t have the complex switch configuration that we have in the Dell system, so it was back to square one. Exhausted, but not beaten, we all agreed to regroup in the morning.
Fast forward to Saturday morning and we were digging deep into the guts of the live-running system, continuously load testing to observe how it was behaving in real time. (This saturation load testing was the source of the vast majority of “Dave” messages for members of the public participating in our weekend test phase.) Apache was throwing 503 errors. Apache itself was a red herring. Rechecked the bandwidth on all connections; looked okay. webMathematica appeared to be refusing connections to Apache. Theories emerged and fixes were implemented, deployed, and tested. No good. Wash, rinse, repeat. It was a very long night, which happened to be documented in the “Burning the Midnight Oil” blog post. (That’s me in the third picture, scratching my head and looking flummoxed.)
On Sunday, we were exhausted and confused, but we were stubbornly determined to figure this out. Greg was still hanging with us. We made him an honorary Wolfram employee. Going back to the system logs, we stepped through each piece of the architecture. What did we miss? Apache really was a red herring (although the 503s were real). We walked through the HA design, and it turned out we could saturate a webMathematica server before the HA noticed. Probably not the root cause as it only happened in exceptional circumstances, but we noted it for follow-up. The next candidate was the database. Database experts Joshua and Mike ran the numbers. The volume of connections looked high, but that was only because we weren’t used to seeing so many systems hit it at once. The database hardware seemed to be handling everything just fine. Back to logging. We turned it off again and retested. No change. Very frustrating.
During the previous test run one of the team members complained about the network performance back to the main office, which was via an auxiliary network (a network connection between the facility and the Wolfram headquarters office, not a network connection for the Wolfram|Alpha supercomputers’ connection to the public). The auxiliary network had been struggling under its workload all weekend, for a variety of reasons, and was one of the many other things being tweaked while we worked on the big machines. This time, though, there seemed to be a correlation between the performance of the auxiliary network and the load tests. Rechecked the bandwidth. Still looked fine. Curious. The Wolfram|Alpha logging data was being transmitted across the auxiliary network to the main office for aggregation before being sent to the monitoring systems to make those nice visualizations you see in the video. Chris from the systems engineering team ran a ping test on the auxiliary network during a load test. Latency skyrocketed. Bingo! Not enough allowed connections, so we were saturating the proxy.
After raising the number of allowed connections to something ludicrous, we tested again. No dice. Joshua and Mike continued monitoring all of the auxiliary traffic, and in this test the logging system was saturated. It wasn’t doing that before. There weren’t enough connections to the logging database. After Joshua and Mike implemented a fix, we did one more test. One cluster: 140 qps. Two clusters: 280 qps. Three clusters: 400 qps. Then we decided to go for broke. Six clusters: 750 qps. Then for R Smarr: 160 qps, 300 qps, 500 qps, 900 qps. Eureka!
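Why would a connection limit flatten throughput no matter how many clusters were added? A rough way to see it is Little's law: if every request holds one connection for some time, the number of connections in use equals the request rate times that hold time. The sketch below works through the arithmetic. The hold time and connection limit are hypothetical numbers chosen only for illustration; they are not the real settings of our proxy or logging database.

```python
def max_throughput(allowed_connections, seconds_per_request):
    """Upper bound on requests per second when each request holds one
    connection for seconds_per_request."""
    return allowed_connections / seconds_per_request


def connections_needed(target_qps, seconds_per_request):
    """Concurrent connections required to sustain target_qps (Little's law)."""
    return target_qps * seconds_per_request


# Hypothetical figures, chosen only to illustrate the shape of the problem.
HOLD_TIME = 0.25  # assumed seconds that one log write holds its connection
ALLOWED = 64      # assumed limit on simultaneous connections

ceiling = max_throughput(ALLOWED, HOLD_TIME)
needed = connections_needed(1800, HOLD_TIME)
print(f"ceiling with {ALLOWED} connections: {ceiling:.0f} requests per second")
print(f"connections needed for 1800 qps: {needed:.0f}")
```

Under these assumed numbers the ceiling sits well below the target however much compute stands behind it, and raising one limit simply moves the ceiling to the next component in the chain.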
Then we just needed to run both the Dell and R Smarr at the same time. At 4am on Monday, we launched the final test, achieving just shy of the 1800 qps goal, and the machines have been humming along ever since. In fact, at the time of that final successful load test, we watched real launch-day traffic from Europe start to ramp up to the rate of many hundreds of queries per second. Talk about just in time!
Many thanks go out to Dell and the DCS team for getting our system to us under an impossible time frame, delivering a system that has worked so reliably, and providing relentless support the few times something has gone awry. R Systems deserves special recognition for going way above and beyond the call of duty to help us debug, deconstruct, and tune our system.
PS: For the record, we had one week to install, deploy, test, and tune R Smarr at the same time we were working on the Dell system. We managed to complete the first two steps in a day and rolled the testing and tuning in with our Dell system. I would tell you that story, too, but it just worked, so there’s not much to tell. And our other smaller locations spread throughout the country with the original 200 qps capacity? Problem-free; they just got done without a squeak. That is a testament to folks at R Systems and their machines (which also happen to be Dell) and to our systems engineering, software deployment, and database teams (Jeff, Ken, Rusty, John, Chris, Chris, Matt, Matt, Matt, Steve, Chris, Jamie, Grant, Joshua, and Mike – I’m looking at you!)