Searching the Internet, Brian Davison likes to joke, begins with a popularity contest. You enter a topic, and the search engine produces a list of websites. The pages listed first have received the most votes, or “recommendations,” with each link to a listed page from a credible site representing one vote.
This concept seems simple, but it underpins how web search engines function. Davison theorizes that each web page’s links are made in different contexts, and that recognizing those contexts can lead to improved quality of web search results. This theory is the basis for a five-year CAREER Award he recently received from the National Science Foundation.
“Our goal is to improve Web searches – from eliminating search-engine spam to improving search-engine ranking functions,” says Davison, an assistant professor of computer science and engineering. “One way to begin to identify a good page is to determine how many credible sites are linked to it.”
A search engine might count each link to a site as a recommendation, Davison says, but those recommendations can’t be weighted equally.
“While doing a search for ‘home improvement’ you may discover a page on plumbers,” he says. “While they may be credible plumbers, your objective is to redecorate your house. We’re working to filter out the ‘recommendations’ that cause such a page to be viewed as authoritative on one topic, but are not relevant to your desired topic.”
Davison and his graduate students are examining the topical context of each link and working out how to rank pages by their authority within the query topic, which should improve the overall authority calculation. In time, they expect to find ways to determine context beyond this topical approach and to estimate why a link was created. That would help identify links that were created for advertising purposes and do not match the searcher’s intent.
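The idea of counting only on-topic “recommendations” can be illustrated with a toy example. The sketch below is not Davison’s method. It simply assumes each linking page carries a hand-assigned topic label, and every page name, label, and the `topical_votes` helper are hypothetical.

```python
from collections import Counter

def topical_votes(links, page_topics, query_topic):
    """Count, per target page, the inbound links whose source page is on the query topic."""
    votes = Counter()
    for source, target in links:
        if query_topic in page_topics.get(source, set()):
            votes[target] += 1
    return votes

# Hypothetical link graph: (linking page, linked-to page).
links = [
    ("diy-blog", "paint-guide"),
    ("decor-mag", "paint-guide"),
    ("pipe-forum", "plumber-joe"),
    ("trade-dir", "plumber-joe"),
    ("valve-shop", "plumber-joe"),
    ("decor-mag", "plumber-joe"),
]

# Hypothetical topic labels for the linking pages.
page_topics = {
    "diy-blog": {"decorating"},
    "decor-mag": {"decorating"},
    "pipe-forum": {"plumbing"},
    "trade-dir": {"plumbing"},
    "valve-shop": {"plumbing"},
}

# Plain popularity contest: every link counts as one vote.
raw = Counter(target for _, target in links)

# Topic-aware count: only links from pages about decorating count.
topical = topical_votes(links, page_topics, "decorating")

print("raw votes:    ", dict(raw))
print("topical votes:", dict(topical))
```

With every link counted equally, the plumber’s page collects the most votes. Once only links from decorating pages count, the painting guide moves ahead, which mirrors the “home improvement” example above.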
After finishing its analysis, the team will be able to identify a site’s topical content and the communities in which it is well-respected. Davison mentions the oft-cited “jaguars” example. “Are [Internet users] searching for the car, the animal or the sports team?” He and his students – Lan Nie, Xiaoguang Qi and Baoning Wu – have begun to address this issue.
“The CAREER award will also help us purchase storage equipment,” says Davison. “We begin with 12 terabytes and will later expand it to approximately 50 terabytes.” At 1,000 gigabytes per terabyte, the team estimates that this storage can hold between 500 million and a billion Web pages. “This is a small fraction of Google’s capacity, but it is substantially more than the typical research trial.”
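The storage figures imply a rough per-page budget, which a quick calculation makes concrete. The article does not say whether the page estimate applies to the initial 12 terabytes or the expanded 50, so the sketch below works out both. The `bytes_per_page` helper is hypothetical, and decimal terabytes are assumed to match the 1,000-gigabyte figure.

```python
# Decimal terabyte, matching the article's 1,000 gigabytes per terabyte.
TB = 10**12

def bytes_per_page(capacity_tb, pages):
    """Average storage available per page if the capacity is fully used."""
    return capacity_tb * TB / pages

for capacity_tb in (12, 50):
    for pages in (500_000_000, 1_000_000_000):
        kb = bytes_per_page(capacity_tb, pages) / 1000
        print(f"{capacity_tb} TB over {pages:,} pages: about {kb:.0f} KB per page")
```

If the estimate refers to the full 50 terabytes, it allows roughly 50 to 100 kilobytes per stored page, including any index or link data kept alongside it.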
As part of the grant, Davison will also work with other Lehigh Valley colleges to encourage their students to pursue graduate studies and research opportunities, particularly in computer science. He hopes the effort will also prompt more interaction among educators from different schools.
The benefits will also reach his own classroom, where students will have access to a current, real-world search engine project.
“Eventually, I hope to develop an introductory course on information retrieval,” says Davison. “The equipment we purchase through the grant will help students ask – and answer – web information-retrieval questions. For example, ‘what fraction of articles about President Bush also talk about terrorism?’”
Davison also believes this form of link analysis to be an important way to understand other kinds of interlinked information.
“Web links are a very rich source of information that can help to identify what individuals believe to be important,” he says. “In the academic world, we cite resources in research papers. This similarity in structure means that we can apply the same approach to the analysis of scholarly citations as we do to the Web.”
More on Davison’s research is available via his Web site.