C. Alan Joyce

Compute American Community Survey Data for Every Geographic Area

May 9, 2012

When Wolfram|Alpha launched three years ago, it did so with broad (but not very deep) socioeconomic data for most geographic places on Earth. Since then, each enhancement of this part of our knowledge base has tended to address just one type of place at a time. Sometimes we’ve added an entirely new category (like US congressional districts or school districts); other times, we’ve added a narrowly focused set of properties to an existing category (such as age pyramids for countries or home prices for US metro areas).

I’ve been proud of each of these individual features, but also frustrated by how hard it’s been to get detailed and directly comparable data for many different types of places at once—the kind of data, in other words, that Wolfram|Alpha is perfectly suited to work with.

But thanks to the outstanding work of our friends at the US Census Bureau, we’ve been able to take some big steps toward filling this “data gap.” The annual American Community Survey (ACS) is designed to replace the old long-form decennial census questionnaire, covering information about age, sex, race, ethnicity, education, income, and much more. In 2006, the Census Bureau released the first single-year ACS estimates, but only for areas with populations over 65,000; in 2008, three-year estimates came out for areas with populations of 20,000 or more; and in 2010, the first five-year estimates were released, covering every geographic area in the country.

What does this mean for Wolfram|Alpha? It means that when we add new data from the five-year ACS estimates, we can immediately compute answers to a new set of questions about virtually every city, school district, congressional district, county, metropolitan area, and state in the country—as well as questions about the nation overall. You can ask about a specific place, compare several specific places, or generate distributions and rankings for a single property mapped over a large set of places.

Let’s start with one of the most fundamental—and most frequently requested—demographic breakdowns: population by age and sex. I’ve always been able to ask Wolfram|Alpha simple questions like “What’s the population of the city of Mars, PA?” (my tiny hometown). But I couldn’t dig any deeper into those numbers.

Now that we’ve added some ACS estimates to our knowledge base, I can ask for a population pyramid for Mars, PA, or I could ask what fraction of the population of the city is female, or even what fraction of the population are girls age 0 to 4. But then I might be curious about how the city proper compares to my old school district. Or I might want to analyze and rank the proportion of school-age children among school districts in my home county. Since it’s an election year, I also find myself asking Wolfram|Alpha to do things like compare the middle-aged male population fraction of PA congressional districts or compare the senior citizen population fraction of Florida and Nevada, two other supposed swing states in the upcoming election.
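
As a rough illustration of the arithmetic behind queries like these, the sketch below computes a female fraction, a "girls age 0 to 4" fraction, and a school-age fraction from a small age-by-sex table. The counts, the age bands, and the names `population`, `total`, and `fraction` are hypothetical placeholders chosen for this example; they are not actual ACS estimates and do not describe how Wolfram|Alpha computes its results.

```python
# Hypothetical age-by-sex counts for a small town (NOT real ACS estimates).
population = {
    ("female", "0-4"): 40,
    ("female", "5-17"): 130,
    ("female", "18-64"): 520,
    ("female", "65+"): 160,
    ("male", "0-4"): 45,
    ("male", "5-17"): 140,
    ("male", "18-64"): 500,
    ("male", "65+"): 115,
}


def total(table):
    """Total population across every (sex, age band) cell."""
    return sum(table.values())


def fraction(table, sex=None, age=None):
    """Share of the total population matching the given sex and/or age band.

    Passing None for a filter means "do not restrict on that dimension".
    """
    matching = sum(
        count
        for (s, a), count in table.items()
        if (sex is None or s == sex) and (age is None or a == age)
    )
    return matching / total(table)


female_share = fraction(population, sex="female")
girls_0_4_share = fraction(population, sex="female", age="0-4")
school_age_share = fraction(population, age="5-17")
```

Ranking places by such a fraction, as in the school-district example, would amount to computing the same ratio for each place and sorting the results.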

School-age population fraction of school districts in Butler County, PA

Even limiting myself to questions about population by age and sex, I’ve squandered a probably-unhealthy amount of time comparing the shape of specific cities’ age pyramids. Consider the distinctive spikes of college towns like Champaign, Illinois or Binghamton, NY—or the dramatically different “bulges” for Manhattan and Staten Island.

Age pyramid for Manhattan versus Staten Island

And those are only questions related to a single table of ACS estimates. We’ve already added data on race, Hispanic origin, and poverty; estimates of educational attainment, school enrollment, household income, and more are coming within the next few weeks. Because each of these topics represents such a large volume of data and such a wealth of new things to compute with Wolfram|Alpha, we plan to publish a new blog post each week for the next month or so. We’ll focus on one or two new topics, with lots of examples of new ACS-based queries and other computations that mash up ACS estimates with other datasets in Wolfram|Alpha.

We’ll also be making some subtle improvements to Wolfram|Alpha’s ability to understand complex, natural-language queries about this data, but, as always, it helps to have lots of real test cases from users. So dig in, play around, and let us know what works—or what could work better. We’re excited to make this rich data more accessible to the general public and eager to hear what you think about it.

4 Comments

It was commonly believed that these sorts of statistics would not infringe personal privacy. Then someone pointed out that if you know a few details about someone that make them unique in the group, you can then ask how many people in that group have another attribute. If the answer is one, then it is true of that person.

Posted by Brian Gilbert May 10, 2012 at 11:43 am

    Brian,
    That is a relevant concern, and shows that you care about privacy. Me too. I don’t work for Wolfram Alpha, and am uncertain how WA avoids this situation. But I AM certain enough about how it is often done to respond to your comment.

    The issue of inadvertently identifying an individual as part of what is presumed (and intended) to be anonymous is a concern in many fields. I refer to public health care, in which I worked recently (hopefully, again, and soon!) Often the data was narrowed to children (children with special health care needs was the area I worked in, as a statistician), then further restricted to incidence of certain diseases, broken down by zip code or metropolitan area sub-categories. It was very easy to uniquely identify an individual, I realized. Doing so renders the study useless (we don’t want to have that information, the identifying data), and more important, it is a violation of HIPAA PHI (protected health information) laws.

    There are fixes that can be implemented as part of the data collection to prevent this, though. They are programmatic. For example, combining enough regions so that there will be at least three, or five, or even 10 individuals in that larger area will preserve anonymity.

    I suspect that Wolfram Alpha is aware of these issues. I was merely a master’s degree level statistician and compliance officer at a state health agency, with hardly the analytic expertise of Wolfram Alpha! Secondly, there may not be much (or any) privacy sensitive information in the WA data. It all depends on what fields are available.

    Posted by Ellie Kesselman May 11, 2012 at 2:24 pm
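
The region-combining fix described in the comment above can be sketched roughly as follows. This is a minimal illustration only: the function name `merge_small_regions`, the threshold `k`, and the assumption that neighboring regions appear next to each other in the input list are all hypothetical, and nothing here reflects how the Census Bureau or Wolfram|Alpha actually handles small counts.

```python
def merge_small_regions(counts, k=5):
    """Greedily combine consecutive regions until each group has at least k people.

    `counts` is an ordered list of (region_name, count) pairs. Adjacency is
    assumed to follow list order, which is a simplification of real geography.
    Returns a list of (region_names, combined_count) pairs.
    """
    groups = []
    names, running = [], 0
    for name, count in counts:
        names.append(name)
        running += count
        if running >= k:
            # This group is large enough to report; start a new one.
            groups.append((names, running))
            names, running = [], 0
    if names:
        if groups:
            # Trailing regions are too small on their own: fold them into the last group.
            last_names, last_total = groups.pop()
            groups.append((last_names + names, last_total + running))
        else:
            # The whole dataset is below k; the caller should withhold this result.
            groups.append((names, running))
    return groups


# Hypothetical case counts per region, in geographic order.
example = [("A", 1), ("B", 4), ("C", 7), ("D", 3), ("E", 1)]
merged = merge_small_regions(example, k=5)
```

With these made-up counts, regions A and B are reported together, and the undersized D and E are folded into C, so no reported group has fewer than five individuals.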

    Actually, I just had an idea, Brian, about how your concern could be quite valid. In the post above (nicely written by Wolfram Alpha’er C. Alan Joyce) household income is mentioned as one of the data fields that will be available.

    Here’s the scenario (I live in Arizona, so will use local context to illustrate): Let’s say you were doing analysis in a particular county, and then zip code, looking for distribution by household income for Native Americans, by tribe, and stratifying by political party based on voter registration. Let’s suppose that the query returned only one value for the criteria “member of the Navajo Nation”, where gender was female, age range was 36 to 55 years, with “annual household income over $1 million”, for zip code 85xxx. And that this person was a registered Democrat. This would be a problem, in a variety of ways, based on several of the criteria.

    What to do? Wolfram Alpha will need to have implemented programmatic anonymizing checks, so that the data would be automatically re-run by aggregating over a large enough geographic area so that single or even scant numbers of multiple values were not returned. Machine learning techniques could be used on the data, which would be helpful in facilitating speed of such anonymizing approaches.

    Of course, there is an even simpler method. If this situation were to occur (a single value, or a small count of fewer than 3 or 5 observations), the query would not return a result, but instead, an error message saying “too many criteria, sample size too small, try again”.

    Posted by Ellie Kesselman May 11, 2012 at 2:43 pm
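
The "simpler method" in the comment above, refusing to answer when too few records match, might look something like the sketch below. The names `count_matching`, `SuppressedResult`, and `min_cell_size`, along with the sample records, are hypothetical and exist only for illustration; whether Wolfram|Alpha applies any rule like this is not stated in the post.

```python
class SuppressedResult(Exception):
    """Raised when a query matches too few records to report safely."""


def count_matching(records, min_cell_size=5, **criteria):
    """Count records matching every criterion, refusing to report small cells.

    Assumption for this sketch: a count of zero is reported as-is, and only
    counts between 1 and min_cell_size - 1 are withheld.
    """
    matches = sum(
        1
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    )
    if 0 < matches < min_cell_size:
        raise SuppressedResult("too many criteria, sample size too small, try again")
    return matches


# Hypothetical records, not real survey responses.
records = [
    {"county": "X", "sex": "female", "income_band": "high"},
    {"county": "X", "sex": "female", "income_band": "low"},
    {"county": "X", "sex": "male", "income_band": "low"},
    {"county": "Y", "sex": "male", "income_band": "low"},
]

broad = count_matching(records, min_cell_size=3, county="X")

try:
    count_matching(
        records, min_cell_size=3, county="X", sex="female", income_band="high"
    )
    narrow_was_suppressed = False
except SuppressedResult:
    narrow_was_suppressed = True
```

A threshold like this does not by itself prevent the kind of inference Brian describes, since subtracting the results of two permitted queries can still expose a small group, which is one reason it is usually paired with aggregation.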

Is there any prospect for getting budgetary data from Data.gov connected to WA?

Posted by G. Ryan Faith May 16, 2012 at 1:31 pm