Today marks an important milestone for Wolfram|Alpha, and for computational knowledge in general: for the first time, Wolfram|Alpha is now on average giving complete, successful responses to more than 90% of the queries entered on its website (and with “nearby” interpretations included, the fraction is closer to 95%).
I consider this an impressive achievement—the hard-won result of many years of progressively filling out the knowledge and linguistic capabilities of the system.
The picture below shows how the fraction of successful queries (in green) has increased relative to unsuccessful ones (red) since Wolfram|Alpha was launched in 2009. And from the log scale in the right-hand panel, we can see that there’s been a roughly exponential decrease in the failure rate, with a half-life of around 18 months. It seems to be a kind of Moore’s law for computational knowledge: the net effect of innumerable individual engineering achievements and new ideas is to give exponential improvement.
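To make the half-life figure concrete: a failure rate that halves every 18 months follows a simple exponential law, so three years of progress cuts it by a factor of four. Here is a minimal sketch of that arithmetic; the starting value is hypothetical, chosen only to illustrate the shape of the curve, not read off our actual plot:

```python
# Sketch of the decay law implied by an 18-month half-life for query failures.
# Numbers are illustrative only; they are not the measured Wolfram|Alpha data.
HALF_LIFE_MONTHS = 18.0

def projected_failure_rate(initial_failure: float, months_elapsed: float) -> float:
    """Fraction of unsuccessful queries after the given number of months."""
    return initial_failure * 0.5 ** (months_elapsed / HALF_LIFE_MONTHS)

# Hypothetical example: a 40% failure rate would drop to 10% after 36 months
# (two half-lives), and to about 5% after another 18 months.
print(projected_failure_rate(0.40, 36))   # -> 0.1
```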
But to celebrate reaching our 90% query success rate, I thought it’d be fun to take a look at some of what we’ve left behind. Ever since the early days of Wolfram|Alpha, we’ve been keeping a scrapbook of our favorite examples of “artificial stupidity”: places where Wolfram|Alpha gets the wrong idea, and applies its version of “artificial intelligence” to go off in what seems to us humans as a stupid direction.
Here’s an example, captured over a year ago (and now long-since fixed):
When we typed “guinea pigs”, we probably meant those furry little animals (which for example I once had as a kid). But Wolfram|Alpha somehow got the wrong idea, and thought we were asking about pigs in the country of Guinea, and diligently (if absurdly, in this case) told us that there were 86,431 of those in a 2008 count.
At some level, this wasn’t such a big bug. After all, at the top of the output Wolfram|Alpha perfectly well told us it was assuming “‘guinea’ is a country”, and offered the alternative of taking the input as a “species specification” instead. And indeed, if one tries the query today, the species is the default, and everything is fine, as below. But having the wrong default interpretation a year ago was a simple but quintessential example of artificial stupidity, in which a subtle imperfection can lead to what seems to us laughably stupid behavior.
Here’s what “guinea pigs” does today—a good and sensible result:
Below are some other examples from our scrapbook of artificial stupidity, collected over the past 3 years. I’m happy to say that every single one of these now works nicely; many actually give rather impressive results, which you can see by clicking each image below.
There’s a certain humorous absurdity to many of these examples. In fact, looking at them suggests that this kind of artificial stupidity might actually be a good systematic source of things that we humans find humorous.
But where is the artificial stupidity coming from? And how can we overcome it?
There are two main issues that seem to combine to produce most of the artificial stupidity we see in these scrapbook examples. The first is that Wolfram|Alpha tries too hard to please—valiantly giving a result even if it doesn’t really know what it’s talking about. And the second is that Wolfram|Alpha may simply not know enough—so that it misses the point because it’s completely unaware of some possible meaning for a query.
Curiously enough, these two issues come up all the time for humans too—especially, say, when they’re talking on a bad cellphone connection, and can’t quite hear clearly.
For humans, we don’t yet know the internal story of how these things work. But in Wolfram|Alpha it’s very well defined. It’s millions of lines of Mathematica code, but ultimately what Wolfram|Alpha does is to take the fragment of natural language it’s given as input, and try to map it into some precise symbolic form (in the Mathematica language) that represents in a standard way the meaning of the input—and from which Wolfram|Alpha can compute results.
By now—particularly with data from nearly 3 years of actual usage—Wolfram|Alpha knows an immense amount about the detailed structure and foibles of natural language. And of necessity, it has to go far beyond what’s in any grammar book.
When people type input to Wolfram|Alpha, I think we’re seeing a kind of linguistic representation of undigested thoughts. It’s not a random soup of words (as people might feed a search engine). It has structure—often quite complex—but it has scant respect for the niceties of traditional word order or grammar.
And as far as I’m concerned, one of the great achievements of Wolfram|Alpha is the creation of a linguistic understanding system that’s robust enough to handle such things, and to convert them successfully into precise, computable symbolic expressions.
One can think of any particular symbolic expression as having a certain “basin of attraction” of linguistic forms that will lead to it. Some of these forms may look perfectly reasonable. Others may look odd—but that doesn’t mean they can’t occur in the “stream of consciousness” of actual Wolfram|Alpha queries made by humans.
And usually it won’t hurt anything to allow even very odd forms, with quite bizarre distortions of common language. Because the worst that will happen is that these forms just won’t ever actually get used as input.
But here’s the problem: what if one of those forms overlaps with something with a quite different meaning? If it’s something that Wolfram|Alpha knows about, Wolfram|Alpha’s linguistic understanding system will recognize the clash, and—if all is working properly—will choose the correct meaning.
But what happens if the overlap is with something Wolfram|Alpha doesn’t know about?
In the last scrapbook example above (from 2 years ago) Wolfram|Alpha was asked “what is a plum”. At the time, it didn’t know about fruits that weren’t explicitly plant types. But it did happen to know about a crater on the moon named “Plum”. The linguistic understanding system certainly noticed the indefinite article “a” in front of “plum”. But knowing nothing with the name “plum” other than a moon crater (and erring—at least on the website—in the direction of giving some response rather than none), it will have concluded that the “a” must be some kind of “linguistic noise”, gone for the moon crater meaning, and done something that looks to us quite stupid.
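Wolfram|Alpha’s actual linguistic pipeline is of course millions of lines of Mathematica code, but the failure mode just described can be caricatured in a few lines. The sketch below is purely illustrative (the knowledgebase entries and preference order are made up); it shows how a system that only ranks the meanings it knows about, and prefers giving some answer to giving none, ends up at the moon crater:

```python
# Toy illustration of the interpretation clash described above -- NOT
# Wolfram|Alpha's actual algorithm, just a caricature of the failure mode.

# Hypothetical knowledgebase: the meanings the system has data for.
KNOWN_MEANINGS = {
    "plum": ["moon crater"],                      # no "fruit" entry yet
    "guinea pigs": ["species", "pigs in Guinea"],
}

# Hypothetical preference order used when several meanings are available.
PREFERENCE = ["species", "fruit", "pigs in Guinea", "moon crater"]

def interpret(phrase: str) -> str:
    candidates = KNOWN_MEANINGS.get(phrase, [])
    if not candidates:
        return "no interpretation"   # the site would rather answer than stay silent
    # Choose the best-ranked meaning that is actually in the knowledgebase;
    # grammatical hints (like the "a" in "what is a plum") can't rescue a
    # meaning the system simply doesn't have.
    return min(candidates, key=PREFERENCE.index)

print(interpret("plum"))         # -> "moon crater": the only meaning it knows
print(interpret("guinea pigs"))  # -> "species": the sensible default wins
```

The point of the caricature is that the cure is not a cleverer ranking rule: once “fruit” becomes a known meaning of “plum”, the absurd default simply disappears.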
How can Wolfram|Alpha avoid this? The answer is simple: it just has to know more.
One might have thought that doing better at understanding natural language would be about covering a broader range of more grammar-like forms. And certainly this is part of it. But our experience with Wolfram|Alpha is that it is at least as important to add to the knowledgebase of the system.
A lot of artificial stupidity is about failing to have “common sense” about what an input might mean. Within some narrow domain of knowledge an interpretation might seem quite reasonable. But in a more general “common sense” context, the interpretation is obviously absurd. And the point is that as the domains of Wolfram|Alpha knowledge expand, they gradually fill out all the areas that we humans consider common sense, pushing out absurd “artificially stupid” interpretations.
Sometimes Wolfram|Alpha can in a sense overshoot. Consider the query “clever population”. What does it mean? The linguistic construction seems a bit odd, but I’d probably think it was talking about how many clever people there are somewhere. But here’s what Wolfram|Alpha says:
And the point is that Wolfram|Alpha knows something I don’t: that there’s a small city in Missouri named “Clever”. Aha! Now the construction “clever population” makes sense. To people in southwestern Missouri, it would probably always have been obvious. But with typical everyday knowledge and common sense, it’s not. And just like Wolfram|Alpha in the scrapbook examples above, most humans will assume that the query is about something completely different.
There’ve been a number of attempts to create natural-language question-answering systems in the history of work on artificial intelligence. And in terms of immediate user impression, the problem with these systems has usually been not so much a failure to create artificial intelligence but rather the presence of painfully obvious artificial stupidity. In ways much more dramatic than the scrapbook examples above, the system will “grab” a meaning it happens to know about, and robotically insist on using this, even though to a human it will seem stupid.
And what we learn from the Wolfram|Alpha experience is that the problem hasn’t been our failure to discover some particular magic human-thinking-like language understanding algorithm. Rather, it’s in a sense broader and more fundamental: the systems just didn’t know, and couldn’t work out, enough about the world. It’s not good enough to know wonderfully about just some particular domain; you have to cover enough domains at enough depth to achieve common sense about the linguistic forms you see.
I always conceived Wolfram|Alpha as a kind of all-encompassing project. And what’s now clear is that to succeed it’s got to be that way. Solving a part of the problem is not enough.
The fact that as of today we’ve reached a 90% success rate in query understanding is a remarkable achievement, and one that shows we’re definitely on the right track. And indeed, looking at the Wolfram|Alpha query stream, in many domains we’re definitely at least on a par with typical human query-understanding performance. We’re not in the running for the Turing Test, though: Wolfram|Alpha doesn’t currently do conversational exchanges, and, more importantly, it knows and can compute far too much to pass for a human.
And indeed after all these years perhaps it’s time to upgrade the Turing Test, recognizing that computers should actually be able to do much more than humans. And from the point of view of user experience, probably the single most obvious metric is the banishment of artificial stupidity.
When Wolfram|Alpha was first released, it was quite common to run into artificial stupidity even in casual use. And I for one had no idea how long it would take to overcome it. But now, just 3 years later, I am quite pleased at how far we’ve got. It’s certainly still possible to find artificial stupidity in Wolfram|Alpha (and it’s quite fun to try). But it’s definitely more difficult.
With all the knowledge and computation that we’ve put into Wolfram|Alpha, we’re successfully making Wolfram|Alpha not only smarter but also less stupid. And we’re continuing to progress down the exponential curve toward perfect query understanding.
Reading this blog post, I’m left wondering what you count as a “success”.
Anecdotally, I almost always get an answer, and almost always fail to get a useful answer.
As an example, last year I was wondering how large the Millennium Dome/O2 arena would be if its roof arc were drawn into a full circle. Obviously this can be calculated from its width and height with a simple formula.
Considering Wolfram’s math background and the function of Wolfram Alpha as a natural language processor, I entered “circumference of a circle with chord of length 10 and sagitta of length 4” (the numbers are obviously not the actual data for the O2 arena).
I get: “Using closest Wolfram|Alpha interpretation: circumference of a circle”, which is 2πr. Not what I was looking for. Wikipedia also suggests that the sagitta may also be called a versine.
Replacing sagitta with versine prompts the result “Wolfram|Alpha doesn’t understand your query
Showing instead result for query: circumference”. Well, now at least it fails openly, as it ought to. I’ve no idea why it fails, though. Searching for these words by themselves shows that they are known by Wolfram’s MathWorld. And entering “sagitta=4 chord=10” gives a correct answer, apparently. Or at least the radius, which is a useful step toward the desired result. (“what is the circumference of the circle with a chord=10 and sagitta=4” also fails to give a correct answer.)
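For reference, the geometry involved here is standard: a chord of length c and a sagitta of length s determine the radius via r = s/2 + c²/(8s), and the circumference is then 2πr. A minimal sketch with the example numbers above (which, as noted, are not the real O2 arena dimensions):

```python
import math

def circumference_from_chord_and_sagitta(chord: float, sagitta: float) -> float:
    """Circumference of the full circle determined by a chord and its sagitta.

    Uses the standard relation r = s/2 + c^2 / (8*s).
    """
    radius = sagitta / 2 + chord ** 2 / (8 * sagitta)
    return 2 * math.pi * radius

# chord = 10, sagitta = 4  ->  r = 5.125, circumference ≈ 32.2
print(circumference_from_chord_and_sagitta(10, 4))
```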
No wonder there’s little respect for traditional grammar when it regularly fails to give the desired result.
My point here is, as has been frequently pointed out by others in the past, that Wolfram Alpha knows an amazing amount of stuff; I just have no idea how to get it to tell me, to the point where it is easier to read up on the topic myself, on Wikipedia and in books, than to struggle with the so-called “natural” language interface.
And I really want WA to be good. I just don’t know how to help it understand what I want. Why does it think that only the first part is important and half the input can be discarded, even if it includes terms it clearly knows and can use?
Another example:
Say I want to plant chillies. It would be useful to know the USDA hardiness zone of my area. I type “USDA hardiness zone of Linköping”.
Wolfram|Alpha doesn’t understand your query
Great. Let’s simplify this then. I type “USDA hardiness zone”.
Using closest Wolfram|Alpha interpretation: hardiness
Again, not a good answer. I know what the word means. I’m not looking for that. (And it is again ignoring most of the data in my query.)
Obviously this is a somewhat constructed example, because what I did first was to type “Capsicum Annuum” into Wolfram Alpha, which gives me the USDA hardiness range of the plant, and the USDA hardiness zone for where I currently am located (i.e. Linköping). Only later did I want to look up what the hardiness zone for my area was, and I thought it would be faster to just enter “USDA hardiness zone”. No dice, apparently.
Testing now, “Capsicum annuum USDA hardiness zone” gives a correct answer. “Linköping USDA hardiness zone” fails, again providing me with the word definition for hardiness. Good God, how is it supposed to work?
(If I enter any city without Unicode characters, the results I get are a comparison of data for the city and Zone, Lombardy, which in itself is a possible cause of concern, suggesting non-ASCII characters aren’t afforded the same status.)
Apologies for using a perhaps not ideal venue to rant, you just provided an excellent opportunity. 😉
You wanna know how to help W|A when it fails at understanding your query?
Just type what it failed at into the “Send us feedback” section at the bottom of the page 😉
Great achievement! Congratulations!
Hope all of this comes in Spanish too!!!! I really hope Spanish gets supported sooner rather than later.
But I’m glad that this kind of tech is being widely supported, because in a couple of years this tech will be extremely important for every single human being.
I really enjoyed the part where you spoke about dos and don’ts! Keep up the good work!
My favorite stupid example is “walk world” as a short form to find out how long it would take to walk around the world.
Surely this 90% figure is as much about training the users to enter correct queries as it is about Wolfram|Alpha being better at understanding.
Probably, W|A needs a more user-centric feedback mechanism. The one it has now is difficult to locate, and only good for correcting errors.
Even something simple like the prompt below, ignoring all queries that went unanswered.
“Help Improve Wolfram|Alpha: Did you find what you were looking for? Yes or No.”
Really, what you are counting as success is probably more along the lines of “Could W|A answer the question”.
Good old human stupidity here – “It seems to be a kind of Moore’s law for computational knowledge” would imply that the rate of improvement is increasing, but actually we’re seeing exponential decay (the rate of progress is slowing down) – the graph is cunningly placed upside down so it looks like a Moore’s Law.
I must agree with Adrian (first comment above) – I jump on WA to find answers to things, mainly to see if it “gets” it; but I must say I RARELY get the correct answer I need.
The only time I get a correct answer is if I’m following the method that YOU (or someone) has shown me to use in your blog. And most often using WA merely adds minutes to hours to my search effort (counting the tangents I get on).
THEREFORE – how about doing a follow-up post describing precisely how you are obtaining your “success” percentages and defining success. I’ll keep visiting WA to see its progress and following the blog because I too would like this to succeed and am impressed at the data you’ve accumulated; BUT, I don’t see any evidence (yet) of anything even close to 90% – sorry!
Just did a search on WA about “Life” and so far it gave me all the right answers. Good job! I’ll try more complex things with it, like viral diseases or maybe the human anatomy, and see what it comes up with. Still, if this is further developed, I can see the potential with this tool.
This evidence of high quality is great news for W|A. Thanks for sharing those metrics along with the hilarious queries. Fall of Troy especially. I’d hate to be standing under that.
I agree that some Turing tests are needed to measure how the computations have improved from continued development by human beings. Though, I fear that when we face this question, we might ignore another: how has the evolution of technology affected a human being’s ability to behave humanely?