.data

There’s been very little change in top-level internet domains (like .com, .org, .us, etc.) for a long time. But a number of years ago I started thinking about the possibility of having a new .data top-level domain (TLD). And starting this week, there’ll finally be a period when it’s possible to apply to create such a thing.

It’s not at all clear what’s going to happen with new TLDs—or how people will end up feeling about them. Presumably there’ll be TLDs for places and communities and professions and categories of goods and events. A .data TLD would be a slightly different kind of thing. But along with some other interested parties, I’ve been exploring the possibility of creating such a thing.

With Wolfram|Alpha and Mathematica—as well as our annual Data Summit—we’ve been deeply involved with the worldwide data community, and coordinating the creation of a .data TLD would be an extension of that activity.

But what would be the point? For me, it’s about highlighting the exposure of data on the internet—and providing added impetus for organizations to expose data in a way that can efficiently be found and accessed.

In building Wolfram|Alpha, we’ve absorbed an immense amount of data, across a huge number of domains. But—perhaps surprisingly—almost none of it has come in any direct way from the visible internet. Instead, it’s mostly from a complicated patchwork of data files and feeds and database dumps.

But wouldn’t it be nice if there were some standard way to get access to whatever structured data any organization wants to expose?

Right now there are conventions for websites about exposing sitemaps that tell web crawlers how to navigate the sites. And there are plenty of loose conventions about how websites are organized. But there’s really nothing about structured data.

Now of course today’s web is primarily aimed at two audiences: human readers and search engine crawlers. But with Wolfram|Alpha and the idea of computational knowledge, it’s become clear that there’s another important audience: automated systems that can compute things.

There are product catalogs, store information, event calendars, regulatory filings, inventory data, historical reference material, contact information—lots of things that can be very usefully computed from. But even if these things are somewhere on an organization’s website, there’s no standard way to find them, let alone standard structured formats for them.

My concept for the .data domain is to use it to create the “data web”—in a sense a parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like wolfram.com, there’d be wolfram.data.

If a human went to wolfram.data, there’d be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it’d find just what it needs to ingest the data, and begin computing with it.
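As a rough sketch of that idea, the Python below maps a website hostname to a .data counterpart and picks a representation for the visitor. The function names, the naive hostname handling, and the use of the HTTP Accept header to tell a browser from a computational client are all assumptions made for illustration; nothing here is a defined mechanism.

```python
def data_host(web_host: str) -> str:
    """Map a website hostname such as 'wolfram.com' to a hypothetical '.data' counterpart."""
    labels = web_host.lower().rstrip(".").split(".")
    if labels[0] == "www":
        labels = labels[1:]
    if len(labels) < 2:
        raise ValueError(f"expected a name with a top-level domain, got {web_host!r}")
    # Naive assumption: the final label is the TLD.
    # Multi-label suffixes such as 'co.uk' are not handled.
    return ".".join(labels[:-1] + ["data"])


def pick_representation(accept_header: str) -> str:
    """Return 'summary' for clients that accept HTML, 'machine' for everything else."""
    media_types = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    return "summary" if "text/html" in media_types else "machine"


print(data_host("wolfram.com"))
print(pick_representation("text/html,application/xhtml+xml"))
print(pick_representation("application/json"))
```

A real deployment might instead serve both audiences from one structured page, or use separate paths; the Accept header is just one plausible way to draw the line.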

Needless to say, as we’ve learned over and over again in building Wolfram|Alpha, getting the underlying data is just the beginning of the story. The real work usually starts when one wants to compute from it—so that one can answer specific questions, generate specific reports, and so on.

For example, in our recent work on making the Best Buy product catalog computable, the original data (which came to us as a database dump) was perfectly easy to read. The real work came in the whole rest of the pipeline that was involved in making that data computable.

But the first step is to get the underlying data. And my concept for the .data domain is to provide a uniform mechanism—accessible to any organization, of any size—for exposing the underlying data.

Now of course one could just start a convention that organizations should have a “/datamap.xml” file (or somesuch) in the root of their web domains, just like a sitemap, rather than having a whole separate .data site. But I think introducing a new .data top-level domain would give much more prominence to the creation of the data web, and would provide the kind of momentum that’d be needed to get good, widespread standards for the various kinds of data.
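To make the datamap idea a little more concrete, here is a minimal Python sketch of what such a file might contain and how a computational client could read it. No datamap format exists: the element name `dataset`, the attributes `name`, `format` and `href`, and the example paths are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical datamap: every element and attribute name below is invented
# for illustration, loosely modeled on how a sitemap lists pages.
EXAMPLE_DATAMAP = """\
<datamap>
  <dataset name="product-catalog" format="csv" href="/data/products.csv"/>
  <dataset name="store-locations" format="json" href="/data/stores.json"/>
  <dataset name="event-calendar" format="ics" href="/data/events.ics"/>
</datamap>
"""


def list_datasets(datamap_xml: str) -> list[dict[str, str]]:
    """Return one dict of attributes per <dataset> element in the datamap."""
    root = ET.fromstring(datamap_xml)
    return [dict(node.attrib) for node in root.findall("dataset")]


for entry in list_datasets(EXAMPLE_DATAMAP):
    print(entry["name"], entry["format"], entry["href"])
```

The same index could just as well live at the root of a .data site; the point of the sketch is only that a small, predictable listing is enough for a client to find each dataset and its format.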

What is the relation of all this to the semantic web? The central notion of the semantic web is to introduce markup for human-readable web pages that makes them easier for computers to understand and process. And there’s some overlap here with the concept of the data web. But the bulk of the data web is about providing a place for large lumps of structured data that no human would ever directly want to deal with.

A decade ago I suggested to early search engine pioneers that they could get to the deep web by defining standards for how to expose data from databases. For a while there was enthusiasm about exposing “web services”, and now there are all manner of APIs made available by different organizations.

It’s been interesting for me in the past few years to be involved in the emergence of the modern data community. And from what I have seen, I think we’re now just reaching a critical point, where a wide range of organizations are ready to engage in delivering large-scale structured data in standardized forms. So it is a convenient coincidence that this is happening just when it becomes possible to create a .data top-level domain.

We’re certainly not sure what all the issues about a .data TLD will be, and we’re actively seeking input and partners in this effort. But I think there’s a potentially important opportunity, so I’m trying to do what I can to provide leadership, and further help to accelerate the birth of the data web.

7 Comments

I suggest everyone be invited to store their data in any form compatible with Mathematica and thus Wolfram Alpha. The .data domain would be an index/directory of this data and not need to contain the data itself.

Posted by Brian Gilbert January 13, 2012 at 11:22 am

Isn’t the Open Data Protocol (OData) meant to be THE standard for web data feeds? Please don’t create yet another data format… For more on OData see http://www.odata.org/

Posted by David January 13, 2012 at 2:47 pm

Far simpler and cheaper to use a designated prefix to create a sub-domain such as data.domain.com.

Posted by Nmihi January 17, 2012 at 7:45 pm

The ideas and technologies of the Semantic Web have been developing for at least 15 years, and are highly relevant to this post. At the very least, I think they deserved a mention.

See the W3C’s Semantic Web Activity, “What is RDF?” (XML.com), Resource Description Framework (Wikipedia), and SemanticWeb.com.

Posted by Chris W. January 24, 2012 at 11:19 am

PS: I would second David’s comment about OData.

Posted by Chris W. January 24, 2012 at 11:24 am

It would be simpler to just listen to a specific port to serve data instead of having to mess with an entire TLD.

Posted by asdf February 9, 2012 at 8:21 pm

Thanks, Stephen, for sharing this wonderful article. But I want to ask: do domains with the .data extension exist? I believe the introduction of .info domains resembles the same thing.

Posted by Manmeet Singh July 2, 2014 at 6:06 am