Maciej Ceglowski, National Institute for Technology and Liberal Education
Date: Thursday, February 06
Time: 1:30pm - 3:00pm
Location: California Ballroom A & B
Latent semantic indexing (LSI) is an information retrieval technique known to substantially improve recall in full-text search engines. LSI works by applying a dimensionality reduction technique called singular value decomposition (SVD) to a vector space data model, reducing noise and bringing out latent relationships within the data. Most of the research on LSI has been done in the domain of text search, where LSI search engines can retrieve relevant documents that do not match any keyword in a query. Because the technique rests entirely on linear algebra, however, it is applicable to a wide range of problems in bioinformatics, including gene and protein sequencing, gene regulatory networks, and medical imaging. Many of these potential applications remain completely unexplored.
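As a rough illustration of the idea described above, the following minimal sketch builds a small term-document matrix, truncates its SVD, and scores documents against a query in the reduced space. It is written in Python with NumPy rather than the Perl modules discussed in the tutorial, and the toy corpus, the choice of k = 2, and the names `term_vector` and `lsi_scores` are all assumptions made for illustration, not part of the presenters' software.

```python
import numpy as np

# Toy corpus (hypothetical): three biology documents, two finance documents.
docs = [
    "gene expression regulation",
    "gene protein",
    "protein sequence",
    "stock market prices",
    "market trading",
]

vocab = sorted({w for d in docs for w in d.split()})
term_index = {t: i for i, t in enumerate(vocab)}


def term_vector(text):
    """Raw term-count vector over the vocabulary (unknown words are ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in term_index:
            v[term_index[w]] += 1.0
    return v


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(d) for d in docs])

# SVD, then keep only the k largest singular values (the dimensionality reduction).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed value for this toy example
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T  # rows of V_k are document vectors


def lsi_scores(query):
    """Cosine similarity between the folded-in query and each document in LSI space."""
    q_k = (U_k.T @ term_vector(query)) / s_k  # fold the query into the reduced space
    sims = []
    for d_k in V_k:
        denom = np.linalg.norm(q_k) * np.linalg.norm(d_k)
        sims.append(float(q_k @ d_k / denom) if denom > 0 else 0.0)
    return sims


scores = lsi_scores("gene")
for doc, score in zip(docs, scores):
    print(f"{score:6.3f}  {doc}")
```

In this toy corpus the document "protein sequence" scores close to the documents that contain "gene", even though it shares no word with the query, because "protein" co-occurs with "gene" elsewhere in the collection. The two finance documents score near zero. Real collections would normally use weighted counts and a much larger k.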
Ceglowski and Cuadrado have been working with LSI on both text and scientific data collections, including news stories, journal articles, and mass and NMR spectra, and have created a suite of open-source Perl modules for building LSI search engines. Their tutorial presents the basic algorithms behind LSI, with an emphasis on their practical application to real-world data sets, followed by a detailed demonstration of how to index, visualize, and search actual biological data. The tutorial ends with a discussion of open problems in the field, a brief introduction to distributed indexing of large data collections, and techniques for effectively searching large heterogeneous data sets.
Participants will come away with the concepts and software they need to immediately begin using LSI in their research.