Today if you give input to Wolfram|Alpha in a language other than English, you’ll most likely see something like:
But in making Wolfram|Alpha accessible to as many people around the world as possible, our goal is eventually to have it understand every one of these languages.
A certain amount of Wolfram|Alpha input is actually quite language independent—because it’s really in math, or chemistry, or some other international notation, or because it’s asking about something (like a place) that’s always referred to by the same name.
But inevitably many inputs do depend on human language—and in fact even now about 5% of all inputs that are given try to use a language other than English.
Handling English is of course difficult enough. And each language that is added is a huge project—requiring all kinds of local help, support, and investment. But the good news is that with the core technology of Wolfram|Alpha, any language can in principle be handled.
At the lowest level, Wolfram|Alpha inherits from Mathematica its comprehensive use of Unicode—allowing it immediately to represent any character set. (Try something like unicode 2345 or unicode 1000 through 1050.)
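To make the Unicode point concrete, here is a minimal Python sketch that looks up code points with the standard `unicodedata` module. It assumes the numbers in the sample queries above are read as hexadecimal code points, which is our guess about those inputs. The helper names `describe` and `describe_range` are made up for this illustration and say nothing about how Wolfram|Alpha or Mathematica handle Unicode internally.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX <character> <name>' for a single code point."""
    ch = chr(code_point)
    # Unassigned code points and control characters have no name,
    # so fall back to a placeholder.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{code_point:04X} {ch} {name}"


def describe_range(start: int, stop: int) -> list:
    """Describe every code point from start through stop, inclusive."""
    return [describe(cp) for cp in range(start, stop + 1)]


if __name__ == "__main__":
    # Assumption: "unicode 2345" is read as hexadecimal U+2345.
    print(describe(0x2345))
    # Assumption: "unicode 1000 through 1050" is an inclusive hex range.
    for line in describe_range(0x1000, 0x1050):
        print(line)
```

The names printed depend on the Unicode tables bundled with the Python version in use.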
But what’s more important is that Wolfram|Alpha’s whole approach to linguistic processing is general enough to be adapted to any detailed language structure.
And in fact, the very language that people use to interact with Wolfram|Alpha—even in English—is not really a language that’s been seen before.
Sometimes when people are first introduced to Wolfram|Alpha they’ll use a complete sentence, like What is the population of Italy? But remarkably quickly, they’ll abbreviate down to something that keeps the key concepts, but gets rid of other words, say just Italy population.
One might think this would mean that all one has to do is to spot the key words. But that wouldn’t get very far. Almost always one has to understand how the words are linked, and what actions, as well as objects, are being specified (e.g. 2 feet in inches, big apple population).
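A toy Python sketch shows why spotting key words is not enough. The unit table, the regular expression, and the function names are all hypothetical and exist only for this illustration; the real linguistic processing is far more general. Keyword spotting cannot tell 2 feet in inches from 2 inches in feet, while a reading that tracks how the words are linked can.

```python
import re
from typing import Optional

# Hypothetical unit table: size of each unit in inches (illustration only).
INCHES_PER_UNIT = {"inches": 1.0, "feet": 12.0, "yards": 36.0}

# Structured pattern: "<number> <unit> in <unit>".
CONVERSION = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(\w+)\s+in\s+(\w+)\s*$")


def spot_keywords(text: str) -> set:
    """Keyword spotting: which known unit words appear, ignoring structure."""
    return {word for word in text.lower().split() if word in INCHES_PER_UNIT}


def convert(text: str) -> Optional[float]:
    """Structured reading: treat the input as a request to convert units."""
    match = CONVERSION.match(text.lower())
    if match is None:
        return None
    amount = float(match.group(1))
    source, target = match.group(2), match.group(3)
    if source not in INCHES_PER_UNIT or target not in INCHES_PER_UNIT:
        return None
    return amount * INCHES_PER_UNIT[source] / INCHES_PER_UNIT[target]


if __name__ == "__main__":
    for query in ("2 feet in inches", "2 inches in feet"):
        # Both queries yield the same keyword set but different conversions.
        print(query, spot_keywords(query), convert(query))
```

Both queries contain exactly the same key words; only the links between them say which unit is the source and which is the target.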
The abbreviated “computese” that people enter into Wolfram|Alpha isn’t quite like any existing human language. When people give input to Wolfram|Alpha, they’re usually trying to get their ideas communicated as quickly and directly as possible—and that means that they don’t put on the same gloss as in ordinary human language.
There are often fragments of language left over, as well as pieces of phrase structure and so on. But the forms that occur are not ones that one can learn from traditional grammar books.
The approach our team took during the initial development of Wolfram|Alpha was to accumulate large corpuses of linguistic usage in different areas, then to abstract from these rules and meta-rules that could be slotted into Wolfram|Alpha’s linguistic processing system.
Now that Wolfram|Alpha has been released, our team has a major new—and more accurate—source, at least for English: the millions and millions of actual inputs that are given to the system.
So what’s involved in generalizing to other languages? A certain amount can be done by word- or phrase-wise translation. Often there will be multiple translations at this level. And when there are several words or phrases together, there will often be a combinatorial explosion in the number of possibilities.
But conveniently enough, Wolfram|Alpha’s general ambiguity-handling system already deals very efficiently with some of this—providing an interesting foundation for a first level of language understanding.
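The combinatorial explosion is easy to see in a small Python sketch. The word list below is a hypothetical Spanish-to-English dictionary invented for this example, and the `INTERPRETABLE` set is only a crude stand-in for an ambiguity-handling step. None of it describes Wolfram|Alpha's actual system.

```python
from itertools import product
from math import prod

# Hypothetical word-level dictionary (illustration only).
CANDIDATES = {
    "población": ["population", "town", "settlement"],
    "capital": ["capital city", "financial capital"],
    "italia": ["Italy"],
}

# Hypothetical set of readings that a later stage knows how to interpret.
INTERPRETABLE = {"population Italy", "capital city Italy"}


def count_readings(words):
    """Number of word-wise translations: the product of the option counts."""
    return prod(len(CANDIDATES.get(word, [word])) for word in words)


def readings(words):
    """Enumerate every combination of per-word translations."""
    options = [CANDIDATES.get(word, [word]) for word in words]
    return [" ".join(choice) for choice in product(*options)]


def plausible(words):
    """Keep only the readings the later stage can make sense of."""
    return [reading for reading in readings(words) if reading in INTERPRETABLE]


if __name__ == "__main__":
    query = ["población", "capital", "italia"]
    print(count_readings(query), "candidate readings")
    print(plausible(["población", "italia"]))
```

With three, two, and one options for the three words, there are already six candidate readings, and the count multiplies with every additional ambiguous word. That is why pruning by what can actually be interpreted matters.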
A lot of language, however, does not factor in this kind of way, and it’s inevitable that all sorts of detailed linguistic curation will have to be done for each particular language—to capture all its particular idioms and special forms.
Different human languages often have rather different structures. For example, some languages (like English) have dominant subject-verb-object word orders, while others (like Japanese) have other word orders such as subject-object-verb. Similarly, some languages indicate the role of words by case endings, others by position or by using post- or prepositions.
Interestingly, though, when people write “computese” these differences don’t seem to be as marked as usual: word orders are jumbled; case endings are simplified or omitted. Often this makes understanding individual inputs more difficult, but it will make it easier to generalize Wolfram|Alpha to completely different classes of languages.
Of course, even once Wolfram|Alpha understands input in a particular language, we’re not finished. There’s also the problem of synthesizing correct output text in that language.
The automation of the underlying Mathematica system makes it feasible to have arbitrarily modified text flow immediately into tables, graphics, and everything else. But Wolfram|Alpha is mostly not dealing with literal pieces of text: it’s instead dealing with many small algorithms that form correct phrases from linguistic fragments. And directly or using appropriate meta-algorithms, each of these algorithms has to be converted for each output language.
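As a sketch of what one of these small phrase-forming algorithms might look like, consider attaching a number to a noun. The Python below is purely illustrative: the function names, the noun table, and the two-language registry are hypothetical, and the plural rules are deliberately simplified (English uses the singular only for 1, while French uses it for both 0 and 1).

```python
def english_count(n: int, singular: str, plural: str) -> str:
    """English (simplified): singular only when the count is exactly 1."""
    return f"{n} {singular if n == 1 else plural}"


def french_count(n: int, singular: str, plural: str) -> str:
    """French (simplified): singular for 0 and 1, plural from 2 upward."""
    return f"{n} {singular if n in (0, 1) else plural}"


# Hypothetical registry: one phrase-forming algorithm per output language.
PHRASE_BUILDERS = {"en": english_count, "fr": french_count}

# Hypothetical linguistic fragments: (singular, plural) per language.
NOUNS = {
    "en": ("inhabitant", "inhabitants"),
    "fr": ("habitant", "habitants"),
}


def count_phrase(lang: str, n: int) -> str:
    """Build a counted noun phrase using the rule for the given language."""
    singular, plural = NOUNS[lang]
    return PHRASE_BUILDERS[lang](n, singular, plural)


if __name__ == "__main__":
    for lang in ("en", "fr"):
        print(lang, [count_phrase(lang, n) for n in (0, 1, 2)])
```

Even this tiny rule cannot be carried over by translating the words alone: the algorithm itself has to change for each language, and languages with richer plural or case systems need considerably more.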
The generalization of Wolfram|Alpha to all major human languages is a huge undertaking. But it’s one that we’re committed to pursuing.
We’ve already had many comments and suggestions, as well as offers of help, from the international Wolfram|Alpha community. And we look forward to extensive collaborations with many individuals and organizations as we pursue the goal of making Wolfram|Alpha fully accessible to as many people in the world as possible.