Stephen Wolfram

Launching a Democratization of Data Science

February 9, 2012

It’s a sad but true fact that most data that’s generated or collected—even with considerable effort—never gets any kind of serious analysis. But in a sense that’s not surprising. Because doing data science has always been hard. And even expert data scientists usually have to spend lots of time wrangling code and data to do any particular analysis.

I myself have been using computers to work with data for more than a third of a century. And over that time my tools and methods have gradually evolved. But this week—with the release of Wolfram|Alpha Pro—something dramatic has happened, that will forever change the way I approach data.

The key idea is automation. The concept in Wolfram|Alpha Pro is that I should just be able to take my data in whatever raw form it arrives, and throw it into Wolfram|Alpha Pro. And then Wolfram|Alpha Pro should automatically do a whole bunch of analysis, and then give me a well-organized report about my data. And if my data isn’t too large, this should all happen in a few seconds.

And what’s amazing to me is that it actually works. I’ve got all kinds of data lying around: measurements, business reports, personal analytics, whatever. And I’ve been feeding it into Wolfram|Alpha Pro. And Wolfram|Alpha Pro has been showing me visualizations and coming up with analyses that tell me all kinds of useful things about the data.

Data input

In the past, when I’d really been motivated, I’d take some data here or there, read it into Mathematica, and use some of the powerful tools there to do some analysis or another. But what’s new and exciting with Wolfram|Alpha Pro is that it is all so automatic. On a whim I can throw my data in, and expect to see something useful come out.

The basic idea is very much in line with the whole core mission of Wolfram|Alpha: to take expert-level knowledge, and create a system that can apply it automatically whenever and wherever it’s needed. Here the expert-level knowledge is the collection of methods that a team of good data scientists would have, and what Wolfram|Alpha Pro does is to take that knowledge and use it to analyze whatever data you feed in.

There are many challenges, and we’re still at an early stage in addressing all of them. But with the whole Wolfram|Alpha technology stack, as well as with the underlying Mathematica language, we were able to start from a very strong foundation. And in the course of building Wolfram|Alpha Pro we’ve invented all kinds of new methods.

Categories-number-gender

There are several pieces to the whole problem. The first is just to get the data into Wolfram|Alpha in any kind of well-structured form. And as anyone who’s actually worked with real data knows, that’s often not as easy as it sounds.

You think you’ve got data that’s arranged in columns. But what about those weird separators? What about those headers? What about those delimiters that occur inside data elements? What about those missing elements? What about those lines that were stripped when copying from a browser? What about that second table in the same spreadsheet? And so on.
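As a toy illustration of just two of these problems (guessing the separator, and coping with delimiters inside quoted elements and with missing elements), here is a minimal Python sketch. It is not how Wolfram|Alpha Pro works; the function names and the list of candidate separators are assumptions made up for this example.

```python
import csv
from collections import Counter

# Hypothetical list of separators worth trying; real data may use others.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]

def guess_delimiter(raw_text):
    """Pick the delimiter that gives the most consistent multi-column rows."""
    lines = [line for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return ","
    best, best_score = ",", -1
    for delim in CANDIDATE_DELIMITERS:
        # csv.reader respects double quotes, so "Smith; John" stays one cell.
        widths = [len(row) for row in csv.reader(lines, delimiter=delim)]
        width, count = Counter(widths).most_common(1)[0]
        # Score = number of rows sharing the most common width;
        # a parse that yields a single column everywhere scores zero.
        score = count if width > 1 else 0
        if score > best_score:
            best, best_score = delim, score
    return best

def parse_table(raw_text):
    """Parse the text into rows, turning empty cells into None."""
    lines = [line for line in raw_text.splitlines() if line.strip()]
    delim = guess_delimiter(raw_text)
    return [[cell.strip() or None for cell in row]
            for row in csv.reader(lines, delimiter=delim)]

raw = 'name;city;score\n"Smith; John";Boston;12\nLee;;7\n'
rows = parse_table(raw)
# rows[1] is ['Smith; John', 'Boston', '12'] and rows[2] is ['Lee', None, '7']
```

Even this small sketch ignores headers, multi-line cells and second tables in the same file, which gives some sense of why the general problem is hard.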

It’s a little like what Wolfram|Alpha has to do in understanding free-form natural language, with all its variations and redundancies. But the grammar for structured data is different, and in some ways less forgiving. And just as in the original development of Wolfram|Alpha, what we’ve done is to take a large corpus of examples, and try to deduce the appropriate grammar from what we see—with the knowledge that as we get large volumes of actual queries, we’ll gradually be able to improve this. (Needless to say, we use the analysis capabilities of Wolfram|Alpha Pro itself to do much of this analysis.)

OK, so we’ve figured out where the individual elements in our data are. Now we have to figure out what they are. And here’s where Wolfram|Alpha’s linguistic prowess is crucial. Because it immediately allows us to understand all those weird formats for numbers and dates and so on. And more than that, it lets us recognize units and place names and lots of other things, and automatically put them into a standard computable form.

Sometimes in ordinary Wolfram|Alpha, when there’s a date or unit or place that’s given in the input, it can be ambiguous. But when it’s fed whole columns of data, Wolfram|Alpha Pro can usually automatically resolve these ambiguities (“All dates are probably US style”; “those units are probably all temperature units”; etc.).
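To see why a whole column helps, consider dates written as a/b/yyyy. A single value like 3/4/2012 is ambiguous, but one value like 2/13/2012 anywhere in the column settles the question for all of them. The Python sketch below shows that idea only; it is an assumed, simplified heuristic with hypothetical function names, not Wolfram|Alpha Pro's actual method.

```python
from datetime import date

def infer_date_order(values):
    """Decide whether a column of 'a/b/yyyy' strings is month-first or day-first.

    Returns "MDY", "DMY", "ambiguous" (every value fits both readings),
    or "inconsistent" (no single reading fits every value).
    """
    mdy_ok = dmy_ok = True
    for text in values:
        a, b, _year = (int(part) for part in text.split("/"))
        if a > 12:
            mdy_ok = False  # first field cannot be a month
        if b > 12:
            dmy_ok = False  # second field cannot be a month
    if mdy_ok and dmy_ok:
        return "ambiguous"
    if mdy_ok:
        return "MDY"
    if dmy_ok:
        return "DMY"
    return "inconsistent"

def parse_dates(values):
    """Parse the whole column using the order inferred from all its values."""
    order = infer_date_order(values)
    if order not in ("MDY", "DMY"):
        raise ValueError(f"cannot resolve date order: {order}")
    dates = []
    for text in values:
        a, b, year = (int(part) for part in text.split("/"))
        month, day = (a, b) if order == "MDY" else (b, a)
        dates.append(date(year, month, day))
    return dates

column = ["3/4/2012", "2/13/2012", "11/5/2011"]
print(infer_date_order(column))  # MDY
print(parse_dates(column)[0])    # 2012-03-04
```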

Cities

So let’s say that Wolfram|Alpha Pro knows what all the elements in a table of data are—what their “values” are. Then it has to start figuring out what they “mean”. Does that sequence of numbers represent some kind of labels or coordinates? Or is it just samples from a random distribution? Does that sequence of currency values represent an asset price with random-walk-like variations? Or is it just a sequence of unrelated currency amounts? Are both those columns actually primary data, or is one of them just the rankings for the other? Etc. etc.
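One of these questions, whether a column is just the ranking of another, is simple enough to sketch in a few lines of Python. The check below is an assumption chosen for illustration (largest value gets rank 1, ties share a rank); a real system would presumably have to consider other ranking conventions too.

```python
def is_ranking_of(candidate, values):
    """Return True if `candidate` is the descending rank of `values`.

    Rank 1 goes to the largest value; tied values share the same rank
    and the next rank is skipped (so two firsts are followed by a third).
    """
    if len(candidate) != len(values) or not values:
        return False
    ordered = sorted(values, reverse=True)
    # index() finds the first position of a value, so ties get the same rank.
    expected = [ordered.index(v) + 1 for v in values]
    return list(candidate) == expected

revenue = [50, 200, 125, 200]
print(is_ranking_of([4, 1, 3, 1], revenue))  # True
print(is_ranking_of([1, 2, 3, 4], revenue))  # False
```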

Wolfram|Alpha Pro has a large number of algorithms and heuristics for trying to deduce what the data it’s given represents. And this immediately puts it on track to see what kind of visualizations and analyses it should do.

There are always tricky issues. When does it make sense to join points in a 2D plot? When should one use bar charts versus scatter plots versus pie charts, etc.? What plots have scales that are close enough to combine? How should one set up regression analysis: what variables should one try to predict? And so on.

Wolfram|Alpha Pro inherits from Mathematica many standard kinds of statistical analysis. But what it does is to completely automate these. Sometimes it chooses what kind of analysis makes sense based on looking at the data. But often it will just run a fair number of possible analyses in parallel, then report only the ones that make sense.

At some level, a key objective of Wolfram|Alpha Pro is to be able to take any set of data, and be able to “tell a story” from it. Be able to show what’s interesting or unusual about the data, and what conclusions can be drawn from it.

Dates-currency-2

One example is fits. Given data, Wolfram|Alpha Pro will typically try a large number of different kinds of functional forms. Straight lines. Polynomials. Exponentials. Logistic curves. Sine curves. And so on. And then it has criteria for deciding which, if any, of these represent a reasonable fit to the original data.
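The general pattern (try several functional forms, score each, and report one only if it is good enough) can be sketched in plain Python. This is a minimal illustration under stated assumptions: it tries only a straight line and an exponential, scores them by R² on the original data, and uses an arbitrary threshold of 0.9. None of these choices is claimed to match what Wolfram|Alpha Pro does.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(ys, predicted):
    """Coefficient of determination of predictions against the data."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return 1.0 - ss_res / ss_tot

def best_fit(xs, ys, min_r2=0.9):
    """Try each candidate form; return (name, r2), or (None, r2) if none is good."""
    scores = {}
    a, b = linear_fit(xs, ys)
    scores["linear"] = r_squared(ys, [a + b * x for x in xs])
    if all(y > 0 for y in ys):
        # Fit log(y) = log_a + k*x, then score the curve on the original scale.
        log_a, k = linear_fit(xs, [math.log(y) for y in ys])
        scores["exponential"] = r_squared(
            ys, [math.exp(log_a + k * x) for x in xs])
    name = max(scores, key=scores.get)
    score = scores[name]
    return (name, score) if score >= min_r2 else (None, score)

xs = [0, 1, 2, 3, 4, 5]
print(best_fit(xs, [2 * x + 1 for x in xs])[0])                # linear
print(best_fit(xs, [2 * math.exp(0.5 * x) for x in xs])[0])    # exponential
print(best_fit(xs, [5, 1, 4, 2, 6, 1])[0])                     # None
```

The last case matters as much as the first two: a column of scattered values should produce no reported fit at all rather than a poor one.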

Wolfram|Alpha Pro does the same kind of thing for probability distributions. It also uses all kinds of statistical methods to be able to make statistical conclusions, exclude statistical hypotheses or not, and so on.

Things get even more interesting when the data it’s dealing with doesn’t just consist of numbers.

If it’s given, say, dates and currency values, it can figure out things like currency conversions, and inflation adjustments. If it’s given places, it can plot them on a map, but it can also normalize by properties of a place (like population or area). And if it’s given arbitrary objects with the right level of repetition, it’ll treat them as nodes in a network.
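The last case can be made concrete with a small Python sketch: given two columns of arbitrary objects (say, senders and recipients), decide whether enough of them recur to be worth drawing as a network, and if so build an adjacency structure. The repetition threshold and the function names are assumptions invented for this illustration.

```python
from collections import Counter, defaultdict

def looks_like_network(pairs, min_repeat_fraction=0.5):
    """Heuristic: treat rows as edges if enough entries recur across rows.

    `min_repeat_fraction` is an arbitrary illustrative threshold: the share
    of all cell entries that belong to an object appearing more than once.
    """
    counts = Counter(item for pair in pairs for item in pair)
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return total > 0 and repeated / total >= min_repeat_fraction

def build_adjacency(pairs):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for source, target in pairs:
        graph[source].add(target)
        graph[target].add(source)
    return dict(graph)

rows = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("ann", "dan")]
if looks_like_network(rows):
    graph = build_adjacency(rows)
    print(sorted(graph["ann"]))  # ['bob', 'cat', 'dan']
```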

Email-addresses

For any given data that’s been input, Wolfram|Alpha Pro usually has a very large number of analyses it can run. But the challenge then is to prune, combine and organize the results to emphasize what is important, and to make them as easy for a human to assimilate as possible—appropriately adding textual summaries that are rigorous but understandable to non-experts.

Usually what will happen is that Wolfram|Alpha Pro will give an overall summary as its “default report”, and then have all sorts of buttons and pulldowns that allow drill-down to many variations or details.

In my many years of working with data, I’ve probably at some time or another generated at least a few of most of the kinds of plots, tables and analyses that Wolfram|Alpha Pro shows. But I’m quite certain that in any particular case, I’ve never generated more than a small fraction of what Wolfram|Alpha Pro would produce.

And the important thing is that by automatically generating a whole report with carefully chosen entries, Wolfram|Alpha Pro gives me something where at a glance I can start to understand what’s in my data.

Any particular part of the result, I could no doubt reproduce, with sufficient time spent wrangling code and data. But the whole point is that as a practical matter, I would only end up doing it if I pretty much knew what I was looking for. It just takes too much time to do it “on a whim”, for purely exploratory purposes.

But Wolfram|Alpha Pro changes all of this. Because for the first time, it makes it immediate to get a whole report on any data I have. And what this means is that in practice I’ll actually end up doing this. As is so often the case, a sufficiently large “quantitative” change in how easy it is to do something leads to a qualitative change in what we’ll in practice do.

Now, needless to say, the version of Wolfram|Alpha Pro that arrived this week is just the beginning. There are plenty of additional analyses to include, and plenty of new types of data with special characteristics to handle.

States-genders-counts-currencies

And right now, Wolfram|Alpha Pro is set up just to handle fairly small datasets (thousands of rows, handfuls of columns), where it can generate a meaningful report in a typical “web response time” of a few seconds.

There’s nothing about the architecture or the underlying Mathematica infrastructure, though, that restricts datasets to be this small. And I expect that in the future we’ll be able to handle bigger and bigger datasets using the Wolfram|Alpha Pro technology stack.

But for now I’m just pleased at how easy it’s become to take almost any reasonably small lump of raw data, and use Wolfram|Alpha Pro to start getting meaningful insights from it. It is, I believe, a major democratization of the achievements of data science. And a way that much more of the data that’s generated in the world can be used in meaningful ways.

16 Comments

This is really cooooool!

Posted by Riccardo February 10, 2012 at 4:55 am

In the 1950s I went on a short course on management. The instructor described how a biologist was shown a set of graphs of employee salary growth. These showed that a high flyer climbed very quickly with the growth slowing but continuing up to end of career. A low flyer climbed slowly, levelled off then fell slowly. Much to the surprise of the instructor the biologist said ‘They are Biological growth curves’.

For me the penny dropped that of course people are biological specimens so the observation made sense!

Wolfram Alpha Pro sounds as though it could make the same observation.

Posted by Brian Gilbert February 11, 2012 at 6:19 am

Companies considering buying Wolfram Mathematica, WA Pro or just using WA free of charge need an algorithm to help make the decision.
Until now I have assumed that WA can do what Mathematica can do except the procedural functions and limits on computer time. The choice is therefore simple.

WA changes things. Presumably WA Pro can do everything that WA can do.

Can Mathematica do everything that WA Pro can do?

BG WA Volunteer Curator.

Posted by Brian Gilbert February 11, 2012 at 6:28 am

Spot the algorithm!
It seems that WA Pro is aiming to suggest algorithms that can make use of the input data.

The ‘concept’ behind WA Pro as given above by Stephen Wolfram is a bit loose for me considering the vast scope of WA Pro. I suggest it is made precise so that all concerned can understand it better.

WA accumulates knowledge including algorithms and uses this to answer questions.

How about…
WA Pro also uses that knowledge to accept data and suggest questions that could usefully be asked?

Posted by Brian Gilbert February 11, 2012 at 6:40 am

I am doing these responses one at a time because they are probably best handled separately…

Having begun to appreciate the immense step forward represented by WA Pro, I suggest that in the background it tackles its whole database as the input. As Stephen says, this would take forever as a single input, but it just needs an algorithm to break the input down into the right bite size. As it progresses it can throw out its ideas, hopefully online.

BG Volunteer Curator

Posted by Brian Gilbert February 11, 2012 at 6:46 am

Screenshots truly look fantastic, fun, and easy to use. I can’t wait to get my hands on the Wolfram|Alpha Pro trial once I get home from my holiday. Excellent work, guys.

Posted by Jenny February 13, 2012 at 11:37 pm

Stephen, this is amazing. I’ve said it before, and I say it again. You are the most interesting person in the world of science and business. Thanks Stephen et al.

Posted by Stephan Nicholas Reimers-Dahl February 15, 2012 at 3:53 pm

    +1

    Posted by bigggan February 27, 2012 at 6:58 pm

      Sorry, I +1’d the wrong commenter… I was going to +1 the post before me, by “Nuno”.

      Posted by bigggan February 27, 2012 at 7:01 pm

Hmm. The basic free tool just allows the setting of “Preferences”, “History” and “Favourites”, and you crippled the basic W|A when you transferred some free features to the Pro version (PDF export, zoom, etc). You’re now selling this product: please explain how that can be “democratizing the data science”. I think it’s just frustrating for your users who, until now, provided you with valuable feedback and expected an improvement of the tool — not just for those with money in their pockets.

I don’t use W|A on a daily basis, but when I do I would expect it to show its full potential. Do you expect the great majority of users (who fit into this kind of user profile) to pay you $60/yr? You’re also waving a Pro trial to people that may not have the funds to keep paying. It’s not a nice gesture, not to say it’s very 1995…

This is a lost opportunity, you missed the step of becoming really huge. In a world of Google and Wikipedia, this tool is sadly at the risk of quickly turning into an anachronism.

Posted by Nuno February 15, 2012 at 5:06 pm

You claim that you are democratizing data and science by making us pay for something or create an account to get the same features we didn’t have to pay for a few weeks ago? You launch Wolfram|Alpha Pro which has new features, which is good. But then you charge money and call this democratization? Pathetic.

Love Long and Prosper

Posted by Guest February 15, 2012 at 5:25 pm

@ Stephan Nicholas Reimers-Dahl: Just read what Nuno & Guest wrote…I agree with them 100%!

It’s a shame to charge money for something that was free (and said to be free. Remember, Stephen? “Free knowledge for everyone”?). And now? A big announcement for some new stuff (new stuff is great, of course), but what also comes with that? Charges! And nobody said a word that this was coming until the last minute.

And when this was first announced, no one really said a thing…I thought I was the only one who had a problem with this…but now, slowly but steadily, the votes are rising. I hope WA is going back to “normal” :/
In everyone’s interest!

Posted by Mabie February 16, 2012 at 10:58 am

    I completely agree with you. If one adds new paid features, that can be ok (against the original mission, but the W|A team doesn’t care). If one “crops” free features to make them paid, that’s a different kind of story.
    I honestly admire Stephen, and I’m reading his book. He’s a great scientist and entrepreneur, I truly believe that. But you should always consider he is the head of a proprietary software company and then he has to stick with his role. So the only word is “profit! profit! profit!”.
    Ok fine, but they could profit the same way as Google, Facebook or Twitter, with nice unobtrusive ads. That would be a win-win: profit and honest payment for all the work of the team, and a less proprietary-like situation. I’m personally not against SaaS, but usually paid features are added, not “converted” from previously free. Of course I won’t buy Mathematica, but I’m considering having a Pro account. AFTER the CDF player is released for Linux, of course.
    On a final note, looks like Scott Adams read this blog. 😀
    http://dilbert.com/strips/comic/2012-02-19/

    Posted by Lazza February 19, 2012 at 6:55 am

      They used some unobtrusive ways to get money (well, ok…no ads, but all those iPhone apps!)

      Why won’t you buy Mathematica? Or: Why did you say: “Of course I won’t buy Mathematica…”?

      I won’t buy Pro, because I don’t trust PayPal and I don’t have a credit card…so even if I’d like to buy it…I couldn’t…that’s another point: Why can’t one pay via bank account?

      Posted by Mabie February 21, 2012 at 1:31 pm

Regarding the capabilities of WA Pro. Why can’t you upload databases? Why is there a 1MB file size limit?
Thanks,
Jeff

Posted by Jeff Runde February 16, 2012 at 3:53 pm

The output starts with
“Automatic identification of input”.
Plain WA would have called these ‘Assumptions’ and offered alternative assumptions.
I suggest that WA Pro does the same.

BG WA Volunteer Curator

Posted by Brian Gilbert February 18, 2012 at 5:27 am