Beyond Search
How Intel and Stanford helped make sense of the Web
The World Wide Web holds billions of pages of data. Searching it to find the one you want is now a major industry. But what if instead of one page, you want ten million, optimized to match your research interests, and served up just as you need them?
The Stanford Digital Library Initiative (DLI) was conceived to meet that kind of challenge. Armed with a generous gift of high-speed hardware from Intel, the project engineered new methods for “optimized querying of unregulated data that can scale up to several million pages,” according to Andreas Papke, a senior research scientist and the initiative's director.
The first incarnation of the project, DLI1, created the technologies that helped found Google. DLI2 then investigated new ways of crawling, storing, indexing, and querying large amounts of Web-derived data through the WebBase project.
By building flexible “smart crawlers” that learn where, how often, and how deep to explore the Web, the project built a huge database of pages. Then, the hard part: through a small piece of software, clients could query WebBase using sophisticated search tools. Large clusters of related pages came back at just the right speed for the user to analyze them without worrying about storage.
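One way to picture a crawler that learns "how often" to revisit a page is the minimal Python sketch below. It is an illustration only, not WebBase's actual method: the `PageState` record, the `update_schedule` function, and the halve-or-double rule with its bounds are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical bounds on the revisit interval, chosen only for illustration.
MIN_INTERVAL_HOURS = 1.0
MAX_INTERVAL_HOURS = 24.0 * 30


@dataclass
class PageState:
    """What this sketch remembers about one URL between visits."""
    url: str
    interval_hours: float = 24.0  # assumed starting revisit interval
    last_digest: str = ""         # empty until the first fetch


def content_digest(body: bytes) -> str:
    """Fingerprint the page body so changes can be detected cheaply."""
    return hashlib.sha256(body).hexdigest()


def update_schedule(state: PageState, body: bytes) -> bool:
    """Record a fetch and adapt the revisit interval.

    Returns True if the page changed since the previous fetch.
    Changed pages are revisited sooner; unchanged pages less often.
    """
    digest = content_digest(body)
    first_visit = state.last_digest == ""
    changed = (not first_visit) and digest != state.last_digest
    if changed:
        state.interval_hours = max(MIN_INTERVAL_HOURS, state.interval_hours / 2)
    elif not first_visit:
        state.interval_hours = min(MAX_INTERVAL_HOURS, state.interval_hours * 2)
    state.last_digest = digest
    return changed
```

In this sketch a page that keeps changing drifts toward the minimum interval, while a static page is checked less and less often, which is one simple reading of a crawler that adapts to the Web it explores.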
Making it work required powerful on-the-fly compression and decompression, which, according to Papke, is “the kind of computation that’s exactly where Intel shines.” Intel's donation of equipment was particularly important because, he notes, “Often research grants are easier to get than the equipment itself. For Intel to donate these machines to us fills a very important gap, one that could otherwise be crippling.”
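As a rough illustration of what on-the-fly compression and decompression can look like, the Python sketch below streams pages through the standard library's `zlib` module without ever holding the whole collection in memory. The function names `compress_stream` and `decompress_stream` are hypothetical, and nothing here is taken from WebBase's own code or wire format.

```python
import zlib
from typing import Iterable, Iterator


def compress_stream(pages: Iterable[bytes]) -> Iterator[bytes]:
    """Compress pages one at a time, yielding chunks as they become ready."""
    compressor = zlib.compressobj()
    for page in pages:
        chunk = compressor.compress(page)
        if chunk:  # zlib may buffer input and emit nothing yet
            yield chunk
    yield compressor.flush()  # emit whatever is still buffered


def decompress_stream(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Decompress chunks as they arrive, yielding raw bytes incrementally.

    Note: output pieces do not line up with the original page boundaries;
    a real protocol would need its own framing for that.
    """
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail
```

Because both sides work chunk by chunk, a client could begin analyzing the first pages while later ones are still in transit, which matches the article's point about not having to worry about storage.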
Researchers from Milan to Harvey Mudd have benefited from the program. Because WebBase has been able to build time-series data, linguists from U.C. Berkeley use it to track how vocabulary evolves differently on the Web and in everyday language. Moreover, the California Digital Library is analyzing massive quantities of government web pages to see how they change following political shifts.
“A lot of our culture today manifests itself on the Web. A lot of history is occurring there, and it’s ephemeral.” With the support of Intel, WebBase helped researchers understand the meaning of the Web as it continued to evolve.
Last Modified: November 8 2007 05:12:05 PM