ITHACA, N.Y. -- The World Wide Web is an endless source of information, butwith literally millions of pages posted by everyone from governments,universities and corporations to sixth-graders and conspiracy theorists,it's getting harder and harder to find precisely the "right" information.
Now a Cornell University researcher has come up with a method of searchingthe web that can return a list of the most valuable sites on a given topic,as well as a list of sites that index the subject. Early tests of themethod have produced highly focused lists of sites on many topics, oftencomparable to lists carefully compiled by web search experts.
The method was developed by Jon Kleinberg, Cornell professor of computerscience. An evaluation of the method was presented at the seventhInternational World Wide Web Conference held April 14-18 in Brisbane,Australia, in a paper by Kleinberg, David Gibson of the Department ofComputer Science, University of California at Berkeley, and several IBMresearchers.
Popular web-searching tools, known as engines, such as Yahoo! andAltaVista, work by hunting for keywords in the text of web pages. On sometopics this can return hundreds or even thousands of pages. The algorithm(a set of rules specifying how to solve the problem) developed by Kleinberginstead works by analyzing the way web pages are linked to one another. Theassumption behind this is that the most authoritative pages on a givensubject will be those that are most often pointed to by other pages.
The web is annotated with "precisely the type of human judgment we need toidentify authority," Kleinberg explains. "It almost says something aboutthe way the web has evolved. I think it's about the way people linkinformation in general, not just on the web."
Kleinberg's method does more than just identify pages with usefulinformation about a topic, which he calls "authorities." The method alsolooks for pages that contain many links to pages with useful information onthe topic, which he calls "hubs."
The best authorities, Kleinberg says, will be those that point to the besthubs, and the best hubs will be the ones that point to the bestauthorities. Kleinberg prevents this from becoming a circular definition byrecalculating the relationship several times, each time moving closer tothe ideal result.
He has written a search program using this technique called HITS (forHyperlink-Induced Topic Search). HITS begins by conducting an ordinarytext-based search on a topic using a search engine such as AltaVista. Thiscollects a "root set" of about 200 pages that contain the entered keywords.It then expands the set to include all the pages linked to by pages in theroot set. The expanded set might include from 1,000 to 3,000 pages.
>From there on, text is ignored, and the application only looks at the waypages in the expanded set are linked to one another. The first timethrough, it identifies the pages that are pointed to most often by otherpages, and assigns them a score, or "weight," indicating that they are morelikely to be authorities. At the same time it notes the pages that containmore links to other pages and gives them more weight as hubs.
This calculation is repeated several times. Each time the program givesmore authority weight to sites that link to sites with more hub weight, andmore hub weight to sites that link to sites with more authority weight.Ten repetitions, Kleinberg says, are enough to return surprisingly focusedlists of authorities and hubs.
The system overcomes several of the problems frequently identified withtext-based searches. For example, at one time a text-based search for"Gates" didn't return the Microsoft Corp. home page because Microsoftchairman Bill Gates wasn't mentioned on the opening page. (He still isn't,but now his biography can be found by following the link "AboutMicrosoft.") A search for "jaguar" returns a jumble of pages about cars,animals, the Jacksonville Jaguars NFL team, and the obsolete but stillmuch-discussed Atari Jaguar computer.
In a case where a word represents more than one topic, Kleinberg's methodautomatically separates sites into "communities" of hubs and authorities,each representing one of the possible topics. Thus a HITS search on"jaguar" lists first a community of sites related to the Jaguar computer,because the number of web sites on this subject predominate. Further down,it listed communities relating to the football team and the car. Finally itfinds sparse information relating to the animal, because this topic issimply not well represented on the web, Kleinberg says.
Communities also form when a topic is polarized: A search on "abortion"returns separate communities of pro-life and pro-choice sites, because thesites within each community link more densely to one other than to sitesadvocating an opposing view.
One disadvantage of the method, Kleinberg says, is that it doesn't alwayswork for sharply focused queries. A search for "Netscape 4.04," forexample, returns a general list of sites about web browsers.
The paper being presented in Brisbane is titled "Automatic Resource ListCompilation by Analyzing Hyperlink Structure and Associated Text." Anotherpaper by Kleinberg, "Authoritative Sources in a Hyperlinked Environment,"was published in the Proceedings of the 9th ACM-SIAM Symposium on DiscreteAlgorithms, 1998. A related paper, "Inferring Web Communities from LinkTopology," by Kleinberg, Gibson and Prabhakar Raghavan of the IBM AlmadenResearch Center, appears in the Proceedings of the 9th ACM Conference onHypertext and Hypermedia, 1998.
The texts of these papers can be found on Kleinberg's web page athttp://www.cs.cornell.edu/home/kleinber/.
Kleinberg developed the method while working as a visiting scientist atIBM's Almaden Research Center, on leave from Cornell. IBM has applied fora patent on the algorithm.
Materials provided by Cornell University. Note: Content may be edited for style and length.
Cite This Page: