Stanford Engineering

   Ask the Expert



Google’s stated mission is to “organize the world’s information and make it universally accessible and useful.” It often seems like they (and other search engines) are well on their way. Enter the right search terms and you’ll often find the information you want in less than a second. Off the Web, in corporations, government agencies, and other “enterprises,” well-trained users can tease data from huge databases pretty quickly, too. Information seems more accessible than ever, but on the Web and at work, different technological hurdles still keep us from easily finding a lot of data, even when it’s meant for us to see.

Using automated “crawlers,” search engines do a great job of indexing the text and (to a lesser extent) graphics that make up typical Web pages. But a large fraction of the data accessible via the Web, roughly 80 percent by some estimates, is not on pages that crawlers can index just by following links. This data instead resides in what some call the “Deep Web.” It is hidden behind forms and menus that must be filled in intelligently, usually by humans. A search engine can’t tell you which California elementary school district has the residents with the shortest average commute time, for example. To find that out, you have to manipulate the menus at the U.S. Census Bureau’s Web site. The information is available to very knowledgeable and patient searchers, but it is not yet within the reach of convenient and speedy search engines.

I’m optimistic that the Deep Web will someday be in reach. Search engine companies are working on the problem and they are not alone. Over the last several years researchers here and elsewhere have devised a variety of techniques for automated filling-out of forms by crawlers to gain access to Deep Web information. In addition, there are clearly incentives for many Web content providers—retailers, for example—to make the job easier for the search engines. They want you to find their product information, even if it happens to reside behind a form or menu. With search engines and content providers equally motivated to make the Deep Web accessible, I feel confident it will happen.

When it comes to enterprise data, I’m less optimistic about a future of unfettered data access for users. The databases maintained by large organizations tend to be very complex, often set up poorly, hard to use, and even harder to migrate to keep pace with developments in software. An even bigger problem than those faced by individual databases is that the data across an enterprise typically resides in numerous different databases (tens or even hundreds of them) that are fundamentally incompatible with each other. This means that while we may have access to individual pieces of the data puzzle, we often don’t really have access to the truly useful information, obtained by putting the pieces together. A lot of people have argued, for example, that the government had all the information needed to prevent the 9/11 terrorist attacks. I don’t know whether this is true, but certainly there is a problem that law enforcement and intelligence databases aren’t integrated well, so no one could see the whole picture all at once.

Even when data from more than one database can be brought together, it is often unclear how the information is related. Which records are talking about the same thing or person? If one piece of information contradicts another piece, which should we trust? The questions around improving enterprise data access—often referred to as the “data integration” problem—are many and often ill-defined or application-specific. In the InfoLab we are chipping away at small parts of the problem. My group’s Trio project, for example, gives database managers the tools to account for uncertainty in their data, and to track the “lineage” or provenance of data. These features smooth the way for handling data merged from many different, possibly conflicting, sources. Another InfoLab project is tackling the issue of “entity resolution”: determining which distinct data records, perhaps drawn from many different databases, actually represent the same real-world entity.
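To make the idea of entity resolution concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the InfoLab project's actual method: the record layout (a `source` and a `name` field), the helper names `normalize`, `similarity`, and `resolve`, and the 0.85 threshold are all hypothetical. It compares every pair of records by name similarity and groups the ones that look like the same real-world entity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the tokens so
    that 'Widom, Jennifer' and 'Jennifer Widom' normalize identically."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: dict, b: dict) -> float:
    """Similarity in [0, 1] between the normalized names of two records."""
    return SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()


def resolve(records: list, threshold: float = 0.85) -> list:
    """Group records whose pairwise similarity meets the (assumed) threshold.

    Matches are merged transitively with a small union-find structure.
    Comparing all pairs is quadratic, which is fine for a sketch but not
    for databases of realistic size.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Walk up to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


if __name__ == "__main__":
    # Hypothetical records drawn from three made-up databases.
    records = [
        {"source": "hr", "name": "Jennifer Widom"},
        {"source": "payroll", "name": "Widom, Jennifer"},
        {"source": "directory", "name": "Hector Garcia-Molina"},
    ]
    for cluster in resolve(records):
        print([f'{r["source"]}: {r["name"]}' for r in cluster])
```

Even this toy version shows why the problem is hard. A single similarity threshold will both miss true matches (abbreviated names, for instance) and merge records that only look alike, so real systems must weigh many attributes and tolerate uncertainty about each match.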

In an era when one can type in “relational database” and find more than a million pages in less than a quarter of a second, it is easy to think that we have universal access to the world’s information. In fact, that day is yet to come.

About Jennifer Widom
Professor Widom received her bachelor’s degree from the Indiana University School of Music in 1982 and her Computer Science PhD from Cornell University in 1987. She was a research staff member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. A current project, Trio, is a database management system for integrated management of data, accuracy, and lineage (provenance). Past projects include “STREAM” for querying continuous streams of data, and “Lore” for managing semi-structured data, a precursor to XML. Widom is an ACM Fellow and a member of the National Academy of Engineering.