Schuyler Erle, MetaCarta
Track: Emerging Topics
Date: Friday, July 11
Time: 10:30am - 11:15am
Location: Salon B
Recent work in statistical machine learning techniques has made it possible for internet applications to actually build and test hypotheses about which kinds of content might be interesting to a particular user, without any a priori knowledge about the domain of inquiry. Methods such as Bayesian categorization and latent semantic indexing have received much attention lately for their effective use in spam filtering applications and website search engines, but we think that may be just the tip of the iceberg.
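To make the idea concrete, here is a minimal sketch of the kind of Bayesian categorization the talk refers to: a naive Bayes classifier that learns word frequencies per category and scores new text by log-probability. The training examples and category names below are invented for illustration; this is not the O'Reilly Network's actual system.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy naive Bayes text categorizer with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> documents seen
        self.vocab = set()

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, float("-inf")
        for category in self.doc_counts:
            # log prior + sum of Laplace-smoothed log likelihoods
            score = math.log(self.doc_counts[category] / total_docs)
            total_words = sum(self.word_counts[category].values())
            for w in words:
                count = self.word_counts[category][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

# Invented example data, echoing the spam-filtering use case:
nb = NaiveBayes()
nb.train("spam", "cheap meds buy now limited offer")
nb.train("spam", "win money now click here")
nb.train("ham", "meeting notes for the project review")
nb.train("ham", "lunch with the project team tomorrow")
print(nb.classify("buy cheap meds now"))  # -> spam
```

Note that nothing here is specific to email or spam; the same classifier could learn any set of document categories from labeled examples, which is the generality the abstract alludes to.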
At the O'Reilly Network, we have been presented with the same problems plaguing any sufficiently large website: Over time, we have accumulated nearly 5,000 technical articles and weblogs, most of which continue to be interesting, relevant, and informative, and our site grows larger by the day. Additionally, our Meerkat RSS news wire service stores over 50,000 RSS items from over 5,000 feeds. How do we present this body of useful information to our users in a straightforward and effective fashion?
First, we explore the basic failures of traditional search tools, such as topic hierarchies and plain text searches. Then we turn to modern machine learning techniques, such as Bayesian networks and vector-space indexing. We discuss how the O'Reilly Network has begun putting some of these techniques to practical use for categorizing documents and expressing relationships between them. We will also look at how cooperative learning techniques can be used to help an application develop hypotheses about a given user's preferences, to home in on that user's specific interests and more effectively put the information that user wants directly at their fingertips.
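The vector-space indexing mentioned above can be sketched very simply: represent each document as a bag-of-words vector and use cosine similarity to score how related two documents are. The document titles below are invented for the example; a real system like the one the talk describes would work over full article text, typically with tf-idf weighting rather than raw counts.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two texts as bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values())) *
            math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Invented article titles, standing in for entries in a site archive:
doc1 = "an introduction to rss news aggregation"
doc2 = "aggregating rss feeds for news readers"
doc3 = "recipes for perl regular expressions"

print(cosine_similarity(doc1, doc2))  # higher: shared rss/news vocabulary
print(cosine_similarity(doc1, doc3))  # lower: little word overlap
```

Latent semantic indexing extends this picture by factoring the whole term-document matrix (via singular value decomposition) so that documents can match even when they share no literal words, only related vocabulary.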