Building Rome in a Day

June 10, 2009 — Schoeller Porter

If you’ve been following the launch of Wolfram|Alpha, then you have probably heard that two supercomputer-class systems are a big part of what is behind the scenes. One of them is the R Smarr system, belonging to our good friends at R Systems, which is featured in this video. The other is our custom Dell system, highlighted in the Rack ‘n’ Roll video. (That’s me in the blue shirt and the crazy blond hair.) Between the two of them, we can handle around 1800 queries per second (qps). Many people have asked about how we pulled together all of this infrastructure.

First, some background.

Back in mid-March our development team was intensely focused on building Wolfram|Alpha. As each day went by, the pace of development was accelerating and the further we progressed, the faster Wolfram|Alpha was growing in both content and functionality. On the infrastructure side, we had put in place a prudent plan. We knew the rollout would have an audience of early adopters amongst the professional audiences that our company is very familiar with, and we had planned accordingly for a capacity of 200 queries per second. A few colocations spread throughout the United States should do the job; we were well on track to set them up in plenty of time. And we thought that our “I’m sorry, Dave, I’m afraid I can’t do that” message would be seen occasionally in the first few weeks if there was overflow beyond our capacity.

Following our first few public announcements about Wolfram|Alpha, we began to see broad global interest. As the product swelled in capability and scope faster than any of us had anticipated, we periodically revisited our assumptions, and the infrastructure team decided to see what we could do to spike the launch capacity to be as strong as the product was becoming. Perhaps even get to the point where nobody would have to see the “Dave” message at all, no matter how busy the launch-day activity.

Stephen Wolfram was highly supportive: “The full support of the company is behind you; do whatever it takes.” Given the broader audience the product was becoming viable for and given the public response that we had seen so far, what should we forecast as peak launch demand? How about being able to handle a peak of 2000 queries per second, ten times the earlier plan? Since we hadn’t even talked to a supercomputer vendor yet with about two months to go until launch, we had moved from prudent to very aggressive on both time frame and target. Good thing we had a crackerjack team and strong partners in Dell and R Systems.

The time between our initial talks with Dell and the delivery of our system was only a few weeks. Dell was just amazing in how they worked with our schedule. After the delivery we had just three weeks left to build, install, deploy, test, and tune a 500-node system for the launch of a high-profile, highly anticipated, computationally intensive website. What could possibly go wrong?

The short answer to that question is “surprisingly little,” given the scale of what we were trying to accomplish. The initial hardware build-out took Jeff, Ken, Rusty, John, the two Chrisses, the three Matts, Steve, and others (henceforth referred to as “the systems engineering team”) just 18 hours, start to finish, with the first truck arriving at 8am, and the last guy turning the lights off at 2 the following morning. I stopped counting the pizza boxes halfway through. All 500 nodes powered on without problem. We did lose one node after a few days due to a bad onboard network card, fixed easily enough with Dell’s stellar onsite parts service. One motherboard swap later and we were back to full capacity. Chris, Jamie, and Grant took over from there for the software install and deployment. They were building the tools and learning techniques to deploy and manage distributing software and a multi-terabyte database throughout the 500 nodes at the same time that they were doing the deployment. In less than a week and a half, they accomplished a task that should have taken a month and a half. These guys are simply amazing.

If you’re counting, we were then a week and a half from launch, and ready for the first load test on our brand new Dell system. The results from a handful of computers firing requests off to one rack of the Dell weren’t pretty, and the test computers were straining under the load. The obvious conclusion was that it takes a supercomputer to test a supercomputer. So, we used a few nodes of the Dell system to create queries for Wolfram|Alpha running on the rest of the Dell system, and it scaled beautifully. Victory? Not quite; we needed to test capacity of every infrastructure layer, and that meant load coming in from the outside. Chris, Jamie, and Grant were still working on deploying software on R Smarr, so that supercomputer was not yet available to generate load for the Dell. One quick call to R Systems, and they were kind enough to let us borrow a separate 140-node system for a few days. If you’re counting supercomputer-class systems involved in Wolfram|Alpha, we were then up to three: two soon to be in production, and one being used as a test rig to generate load for the production systems.

We were then just days before launch, with 140 nodes at our disposal as a test rig, and final load testing could proceed. One cluster of the big Dell system handled 130 qps: check. Two clusters got 260 qps. We were cooking. Three clusters, 210. Uh oh. Four clusters, 120. !?@#%^. Maybe it was just a glitch. We tried it again, but round two didn’t fare any better. It was time for an emergency meeting. Everyone was on the case (Jeff and his systems engineering guys, plus Chris, Jamie, Grant, Mike, Oyvind, and many other folks), working non-stop to figure out the bottleneck. Something must have been thrashing, but what was the problem? The test rig? It checked out. The edge switch? That checked out, too. Ditto on the other end of the line. Core switch? Also fine. Was logging slowing us down? Nope. Were any of the databases saturated? Looked okay. The test log implied packet loss, as did the web server logs.
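For readers who want to see why those numbers set off alarms, here is a minimal sketch (not anything we actually ran) that compares the measured throughput against ideal linear scaling. The qps figures come straight from the test run above; the function names are made up for illustration.

```python
# Throughput (qps) by number of Dell clusters under load, from the first test run.
first_run = {1: 130, 2: 260, 3: 210, 4: 120}


def scaling_efficiency(results):
    """Map cluster count -> measured qps divided by ideal linear qps.

    The ideal is (cluster count) x (single-cluster qps); 1.0 means perfect scaling.
    """
    base = results[1]
    return {n: round(qps / (n * base), 2) for n, qps in sorted(results.items())}


def first_regression(results):
    """Return the first cluster count where throughput fell, or None if it never did."""
    counts = sorted(results)
    for prev, cur in zip(counts, counts[1:]):
        if results[cur] < results[prev]:
            return cur
    return None


if __name__ == "__main__":
    print(scaling_efficiency(first_run))
    print(first_regression(first_run))
```

Throughput that falls as capacity is added, rather than merely flattening, suggests contention on something shared between clusters, which is why the hunt went through switches, logging, and databases instead of the nodes themselves.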

By then it was Thursday evening and we’d called in the network engineers to go step by step through every piece of the outside connection, having eliminated everything else. By Friday morning, we’d verified that every link on the primary network was fine. Greg from R Systems popped in and ended up basically staying for the whole weekend. After he helped us bring R Smarr online, he kept asking, “Anything I can do to help over here?” There always was. These guys are awesome. Greg has deep, deep knowledge of supercomputer infrastructure. We’re all glad he’s on our side.

By eight hours before the Friday evening pre-launch, we’d refocused on the switch configuration. It’s complex, and there was a good chance cross-chatter between the racks under load was gumming up the works. Greg also found that the router nodes were dropping packets. So we disconnected a few of the racks from each other, changed an obscure setting in the Linux network stack, and tried again: one cluster gave 130 qps, two clusters yielded 260 qps, and three brought us to 400 qps. That was promising. Unfortunately, we were out of time. Two hours to the pre-launch, and we put the Dell back the way it was. There were too many unknown unknowns for a live webcast, but we were confident that we were on a path to resolving the issue.

If you watched the webcast, then you know that we continued load testing through the pre-launch weekend. We had also been testing the R Smarr system, which had been going well. We did one live test on the webcast that went very well. Then we did a second live test that was less encouraging. Still, our confidence remained strong: one quick sanity check of R Smarr at scale, and we’d be in good shape…160 qps for 1 cluster, 300 qps for 2 clusters, 320 qps for 3 clusters. !?@#%^. R Smarr doesn’t have the complex switch configuration that we have in the Dell system, so it was back to square one. Exhausted, but not beaten, we all agreed to regroup in the morning.

Fast forward to Saturday morning and we were digging deep into the guts of the live-running system, continuously load testing to observe how it was behaving in real time. (This saturation load testing was the source of the vast majority of “Dave” messages for members of the public participating in our weekend test phase.) Apache was throwing 503 errors. Apache itself was a red herring. Rechecked the bandwidth on all connections; looked okay. webMathematica appeared to be refusing connections to Apache. Theories emerged and fixes were implemented, deployed, and tested. No good. Wash, rinse, repeat. It was a very long night, which happened to be documented in the “Burning the Midnight Oil” blog post. (That’s me in the third picture, scratching my head and looking flummoxed.)

On Sunday, we were exhausted and confused, but we were stubbornly determined to figure this out. Greg was still hanging with us. We made him an honorary Wolfram employee. Going back to the system logs, we stepped through each piece of the architecture. What did we miss? Apache really was a red herring (although the 503s were real). We walked through the HA design, and it turned out we could saturate a webMathematica server before the HA noticed. Probably not the root cause as it only happened in exceptional circumstances, but we noted it for follow-up. The next candidate was the database. Database experts Joshua and Mike ran the numbers. The volume of connections looked high, but that was only because we weren’t used to seeing so many systems hit it at once. The database hardware seemed to be handling everything just fine. Back to logging. We turned it off again and retested. No change. Very frustrating.

During the previous test run one of the team members complained about the network performance back to the main office, which was via an auxiliary network (a network connection between the facility and the Wolfram headquarters office, not a network connection for the Wolfram|Alpha supercomputers’ connection to the public). The auxiliary network had been struggling under its workload all weekend, for a variety of reasons, and was one of the many other things being tweaked while we worked on the big machines. This time, though, there seemed to be a correlation between the performance of the auxiliary network and the load tests. Rechecked the bandwidth. Still looked fine. Curious. The Wolfram|Alpha logging data was being transmitted across the auxiliary network to the main office for aggregation before being sent to the monitoring systems to make those nice visualizations you see in the video. Chris from the systems engineering team ran a ping test on the auxiliary network during a load test. Latency skyrocketed. Bingo! Not enough allowed connections, so we were saturating the proxy.

After raising the number of allowed connections to something ludicrous, we tested again. No dice. Joshua and Mike continued monitoring all of the auxiliary traffic, and in this test the logging system was saturated. It wasn’t doing that before. There weren’t enough connections to the logging database. After Joshua and Mike implemented a fix, we did one more test. One cluster: 140 qps. Two clusters: 280 qps. Three clusters: 400 qps. Then we decided to go for broke. Six clusters: 750 qps. Then for R Smarr: 160 qps, 300 qps, 500 qps, 900 qps. Eureka!
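A back-of-the-envelope way to see why a connection limit can cap throughput is Little's law: a stage that allows at most C concurrent connections, each held for t seconds, cannot pass more than C / t requests per second. The sketch below is only an illustration of that bound. The connection counts and hold times in it are hypothetical, not values from our proxy or logging database.

```python
import math


def max_throughput(max_connections, seconds_per_request):
    """Upper bound (requests/second) for a stage limited to max_connections
    concurrent connections, each held for seconds_per_request."""
    if seconds_per_request <= 0:
        raise ValueError("seconds_per_request must be positive")
    return max_connections / seconds_per_request


def connections_needed(target_qps, seconds_per_request):
    """Smallest connection limit that does not cap target_qps at this hold time."""
    return math.ceil(target_qps * seconds_per_request)


if __name__ == "__main__":
    # Hypothetical: 100 allowed connections, each held for a quarter of a second.
    print(max_throughput(100, 0.25))
    # Hypothetical: the limit needed to sustain 1800 qps at the same hold time.
    print(connections_needed(1800, 0.25))
```

The same bound shows why rising latency is dangerous: with a fixed connection limit, every increase in how long a request holds its connection lowers the ceiling further.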

Then we just needed to run both the Dell and R Smarr at the same time. At 4am on Monday, we launched the final test, achieving just shy of the 1800 qps goal, and the machines have been humming along ever since. In fact, at the time of that final successful load test, we watched real launch-day traffic from Europe start to ramp up to the rate of many hundreds of queries per second. Talk about just in time!

Many thanks go out to Dell and the DCS team for getting our system to us under an impossible time frame, delivering a system that has worked so reliably, and providing relentless support the few times something has gone awry. R Systems deserves special recognition for going way above and beyond the call of duty to help us debug, deconstruct, and tune our system.

PS: For the record, we had one week to install, deploy, test, and tune R Smarr at the same time we were working on the Dell system. We managed to complete the first two steps in a day and rolled the testing and tuning in with our Dell system. I would tell you that story, too, but it just worked, so there’s not much to tell. And our other smaller locations spread throughout the country with the original 200 qps capacity? Problem-free; they just got done without a squeak. That is a testament to folks at R Systems and their machines (which also happen to be Dell) and to our systems engineering, software deployment, and database teams (Jeff, Ken, Rusty, John, Chris, Chris, Matt, Matt, Matt, Steve, Chris, Jamie, Grant, Joshua, and Mike – I’m looking at you!)

21 Comments

A thrilling story in itself.
You mention that “I’m sorry, Dave, I’m afraid I can’t do that” message was sent if the system was overloaded. Until now I had understood that this meant that W|A could not understand the query and from the Forum so did many others.
If there are two slightly different messages I suggest you reword both to clear up the confusion.

Posted by Brian Gilbert June 10, 2009 at 2:56 am

Incredible story !!! Good job you guys !!! Amazing work !!!

Posted by Ionut Danet June 10, 2009 at 4:20 am

A great report on a wonderful collaborative team effort with your vendors and your staff. A few glitches aside, Day One seems to have been a huge operational success. Congratulations to everyone at Wolfram|Alpha.

So what’s in store operationally for the next chapter (“Day Two”)? Any thoughts about putting in more server locations around the U.S. or in Europe, Asia or the Far East? What happens as the curated databases expand along with linguistic understanding? As the popularity of Wolfram|Alpha spreads with that expansion, there will likely be a need for more “farms” to handle the increased appetites of users to maintain QPS levels without the “Daves” showing up again.

And what about putting those “NASA” like screens up live on the website for all of us to see how the QPS hits are going, along with information on where (countries/states) queries are coming from on an information tab or from the side bar? That would be neat for many of us to see and good for users to determine usage near their location. A current status screen shot with a date and time stamp might do as a second choice, if desired live action is not practical.

Thanks for the look back and keep up the informational posts.

Posted by Bob D. June 10, 2009 at 8:17 am

Can you give us more information about the hardware used?

Posted by Roger Birong, Jr. June 10, 2009 at 8:20 am

Great Read! Such experiences refine engineering teams in promising ways for future projects. Keep @ it…

Cheers!

Posted by Jarvis June 10, 2009 at 9:10 am

When should we expect to see advertising pay for the growth of WA?
I guess seeing a link to a real estate agency when I type “apartment prices New York” wouldn’t be a problem for anyone, just the same as ads on the side of Gmail : non invasive and frequently relevant. And if it pays for growth of that fantastic thing and to keep it free, no problem.

Posted by Nicolas June 10, 2009 at 9:58 am

Any chance that we could use your system to run our own Mathematica code?

Posted by Paul June 10, 2009 at 10:02 am

Thanks for the information, please keep telling us more and more… The baby seems healthy 🙂
Will this technology be available for non-English speakers? Spanish, for example?

Posted by Jorge García Gil June 10, 2009 at 10:20 am

I can’t sign in to the community forum. It says my details are incorrect but they are not. When on the same page I enter my email address for a reset it says “Message Not SenT” If I try to set up my account again it says it already exists.
Is Wolfram support closed down during, say, 9am to 6pm UK time because the USA is asleep then? If so I suggest you locate a support team at your Oxford office.

Posted by Brian Gilbert June 10, 2009 at 1:09 pm

I had the exact same trouble with the Community sign-on yesterday here in Pennsylvania. Haven’t tried again because of frustration. That is something they should fix ASAP rather than waiting for a weekly update.

Posted by Bob D. June 10, 2009 at 2:09 pm

It’s fixed thanks to W|A support!

Posted by Brian Gilbert June 10, 2009 at 2:14 pm

Brian: That’s good, but I can’t check it yet. Now the Forgotten Password returns a Message Delivery Failure when I entered my email address? And that didn’t work yesterday either. At least you got in.

Posted by Bob D. June 10, 2009 at 3:44 pm

Bob D – A new password has been sent to your email account.

Posted by The PR Team June 10, 2009 at 3:59 pm

A brilliant work!!!! I admire you guys! Such an enthusiasm, devotion, dedication and passion! You can move the mountain! Wishing you success and happiness with this and every project you do! 🙂

Posted by Djordje June 11, 2009 at 4:18 am

It’s like witnessing the Big Bang. The narrative is as exciting as an old Western, yet it’s just yesterday! I have never been this close to the creativity and ingeniousness of pros like you. To have all this out in the public is truly a landmark occasion. Maybe we ARE in the 21st century. Does “wolframalpha” do barbecues? Hope not, I like to burn my own. At any rate, congratulations for this superb and seemingly endless site.

Posted by Harry Stelling June 11, 2009 at 3:30 pm

I’d suggest you make another post where you explain what you have tweaked in the Linux kernel in order to achieve optimal results. I’m sure this will be very interesting information; perhaps a publication at highscalability.com would be nice, too. Just think about this.

Posted by LordDoskias June 12, 2009 at 2:32 pm

Did you guess what the “obscure setting in the Linux network stack” was?

Posted by eternal1 September 16, 2010 at 9:30 am

Great story…. I’m a retired programmer, and this brought back memories of endless hours chasing problems and the exhilaration of solving them! Love the comment about losing count of the pizza boxes….

Bill

Posted by Bill in Dallas June 13, 2009 at 3:41 am

Wolfram Alpha is good news in the area of search engines. It has some advantages compared to the other search engines. Still, it is not competition for Google or other search engines, since it uses a totally different approach. It is more like a dictionary or encyclopedia than a search engine.

Posted by Introspective June 13, 2009 at 6:53 am

But don’t you think this is too difficult a search engine for normal people? This search engine just shows the definition of the keywords we search.

Posted by Whirlpool Baths June 13, 2009 at 7:23 am

What was the “obscure setting in the Linux network stack”?

Posted by eternal1 September 16, 2010 at 9:29 am