Session
Building a Vertical Search Engine in a Day
Ken Krugler, Co-Founder and CTO, Krugle
Track: Architecture and Technology
Date: Tuesday, April 24
Time: 11:50am - 12:35pm
Location: Ballroom D
Nutch, Hadoop, and Lucene are three Apache projects that provide the basis for web crawling, parsing/indexing the data, and serving up search results.
This talk will teach you (via hands-on hacking) how to customize the code and data for a vertical crawler and search engine, then run the crawl and serve up the results. We'll show you how to train text classifiers to rate web pages, then use the resulting scores to focus the crawl on sites with interesting content, while using similar techniques to avoid many of the common spider traps found in the wild.
The code and techniques covered in this talk are the same ones used by Krugle to crawl the "technical web" and serve up 40M high-quality pages for programmers.
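As a rough illustration of the score-driven crawl described in the abstract, here is a minimal Python sketch. It is not Krugle's code and does not use the Nutch, Hadoop, or Lucene APIs. Every name in it (score_page, looks_like_trap, focused_crawl, TECH_TERMS) is hypothetical, the keyword-ratio scorer is only a stand-in for a trained text classifier, and the trap check is a simple URL-shape heuristic chosen for illustration.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical stand-in for a trained text classifier: the score is the
# fraction of words on the page that appear in a small "technical" vocabulary.
TECH_TERMS = {"code", "api", "compiler", "function", "library"}


def score_page(text):
    """Return a relevance score in [0, 1] for a page's text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TECH_TERMS for w in words) / len(words)


def looks_like_trap(url, max_depth=8, max_repeats=2):
    """Assumed heuristic: very deep paths or repeating path segments
    (e.g. /a/b/a/b/a/b) often indicate a spider trap."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) > max_repeats for s in set(segments))


def focused_crawl(seeds, fetch, max_pages=100):
    """Crawl best-first: outlinks inherit the score of the page that linked
    to them, so links found on relevant pages are fetched first.

    `fetch(url)` must return (text, outlinks), or None if the fetch failed.
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        score = score_page(text)
        results.append((url, score))
        for link in links:
            if link not in seen and not looks_like_trap(link):
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return results
```

In a real crawl, `fetch` would download and parse the page. For experimentation it can be any callable, such as the `get` method of a dictionary that maps URLs to (text, outlinks) pairs.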