Google’s stated mission is to “organize the world’s
information and make it universally accessible and useful.” It often
seems like they (and other search engines) are well on their way. Enter the right
search terms and you’ll often find the information you want in less
than a second. Off the Web, in corporations, government agencies, and other
“enterprises,” well-trained users can tease data from
huge databases pretty quickly, too. Information seems more accessible than ever,
but on the Web and at work, different technological hurdles still keep us from
easily finding a lot of data, even when it’s
meant for us to see.
Using automated “crawlers,” search engines do a
great job of indexing the text and (to a lesser extent) graphics
that make up typical Web pages. But a large fraction of the data
accessible via the Web, roughly 80 percent by some estimates, is
not on pages that crawlers can index just by following links. This
data instead resides in what some call the “Deep Web.”
It is hidden behind forms and menus that must be filled in intelligently,
usually by humans. A search engine can’t tell you which California
elementary school district has the residents with the shortest average
commute time, for example. To find that out, you have to manipulate the
menus at the U.S. Census Bureau’s Web site. The information is
available to very knowledgeable and patient searchers, but it is not yet
within the reach of convenient and speedy search engines.
I’m optimistic that the Deep Web will someday be in reach. Search
engine companies are working on the problem, and they are not alone. Over the
last several years researchers here and elsewhere have devised a variety of
techniques for automated filling-out of forms by crawlers to gain access to
Deep Web information. In addition, there are clearly incentives for many Web
content providers—retailers, for example—to make the job easier for the
search engines. They want you to find their product information, even if it
happens to reside behind a form or menu. With search engines and content
providers equally motivated to make the Deep Web accessible, I feel confident
it will happen.
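To make the form-filling idea concrete, here is a minimal sketch in Python of the simplest possible strategy: trying every combination of menu choices on a form and producing one query URL per combination. The URL, the field names, and the function `enumerate_form_queries` are all hypothetical, invented for illustration; real Deep Web crawlers rely on far more selective techniques than brute-force enumeration, and this sketch does not send any requests.

```python
from itertools import product
from urllib.parse import urlencode


def enumerate_form_queries(action_url, fields):
    """Yield one GET URL for every combination of menu choices.

    `fields` maps a (hypothetical) form field name to the list of
    options its menu offers.
    """
    names = sorted(fields)  # fixed order so the output is reproducible
    for values in product(*(fields[name] for name in names)):
        yield action_url + "?" + urlencode(list(zip(names, values)))


# Hypothetical form with two drop-down menus (assumed, not a real site).
form_fields = {
    "state": ["CA", "NY"],
    "measure": ["commute_time", "income"],
}

urls = list(enumerate_form_queries("https://example.org/lookup", form_fields))
for url in urls:
    print(url)
```

Even this toy form yields four queries, and the count multiplies with every added menu, which is one reason filling in forms "intelligently" is the hard part.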
When it comes to enterprise data, I’m less optimistic about a future
of unfettered data access for users. The databases maintained by large
organizations tend to be very complex, often set up poorly, hard to use, and
even harder to migrate to keep pace with developments in software. An even bigger
problem than those faced by individual databases is that the data across an
enterprise typically resides in numerous different databases (tens or even
hundreds of them) that are fundamentally incompatible with each other. This
means that while we may have access to individual pieces of the data puzzle,
we often don’t really have access to the truly useful information, obtained
by putting the pieces together. A lot of people have argued, for example, that the
government had all the information needed to prevent the 9/11 terrorist attacks.
I don’t know whether this is true, but it is certainly a problem that law
enforcement and intelligence databases aren’t integrated well, so no one can
see the whole picture at once.
Even when data from more than one database can be brought together, it is often
unclear how the information is related. Which records are talking about the same
thing or person? If one piece of information contradicts another piece, which should
we trust? The questions around improving enterprise data access—often referred to as
the “data integration” problem—are many and often ill-defined or
application-specific. In the InfoLab we are chipping away at small parts of the
problem. My group’s Trio project, for example, gives database managers the tools
to account for uncertainty in their data, and to track the “lineage” or
provenance of data. These features smooth the way for handling data merged from many
different, possibly conflicting, sources. Another InfoLab project is tackling the issue
of “entity resolution”: determining which distinct data records, perhaps
drawn from many different databases, actually represent the same real-world entity.
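As a rough illustration of what entity resolution involves, the following Python sketch groups records whose names look alike after simple normalization. It is not the InfoLab's method: the record layout, the `normalize` and `resolve` helpers, the 0.85 similarity threshold, and the sample data are all assumptions made for this example, and the string comparison comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def similar(a, b, threshold):
    """True if the two normalized names are close enough to match."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def resolve(records, threshold=0.85):
    """Cluster records that appear to describe the same entity.

    Compares every pair of records (fine for a sketch, too slow for
    large databases) and merges matches with a union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similar(records[i]["name"], records[j]["name"], threshold):
                parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


# Made-up records from three hypothetical source databases.
records = [
    {"source": "payroll", "name": "John Smith"},
    {"source": "badges", "name": "john  smith."},
    {"source": "travel", "name": "Jon Smith"},
    {"source": "payroll", "name": "Maria Garcia"},
]

clusters = resolve(records)
for cluster in clusters:
    print([r["source"] + ":" + r["name"] for r in cluster])
```

The threshold is where the real difficulty hides: set it too low and distinct people are merged, set it too high and the same person stays split across databases.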
In an era when one can type in “relational database” and find more than
a million pages in less than a quarter of a second, it is easy to think that we have
universal access to the world’s information. In fact, that day is yet to come.