An Open-Source Parallelization of BLAST
By Wu-chun Feng
Team Leader & Adjunct Assistant Professor, Los Alamos National Laboratory
Poster Presenter: Adam Engelhart
Poster Authors: Aaron Darling, Adam Engelhart, and Wu-chun Feng
The BLAST search algorithms consume the bulk of compute cycles for most bioinformatics researchers. The algorithms search for similarities between a short query sequence and a large, unchanging database of DNA or amino acid sequence. As BLAST is both computationally intensive and embarrassingly parallel, several paradigms exist for parallelizing BLAST including multithreading, query segmentation, and database segmentation. Virtually all parallel implementations use multithreading and query segmentation.
This poster presents an open-source parallelization of BLAST called mpiBLAST. Using the ubiquitous parallel-programming library called MPI: Message Passing Interface, mpiBLAST segments a database into several fragments such that each node in a computational cluster searches a unique portion of the database.
Database segmentation offers two primary advantages over existing parallel BLAST algorithms. First, the current size of sequence databases is larger than core memory on most computers, forcing BLAST searches to use disk I/O. Segmenting the database permits each node to search a smaller portion of the database, eliminating disk I/O and vastly improving BLAST performance. Second, because database segmentation does not create heavy interprocessor communication demands, it allows BLAST users to take advantage of power-efficient, space-efficient, low-cost clusters such as Green Destiny (recently featured by CNN and the New York Times).
mpiBLAST requires an initial setup phase during which the sequence databases are formatted, segmented, and distributed amongst the nodes in the cluster. BLAST queries are processed upon completion of the setup phase. Since changes to the database are infrequent the cost of the setup phase can be amortized across all subsequent queries. Setup is accomplished by executing an MPI wrapper for the standard BLAST formatdb program called 'mpiformatdb.'
BLAST queries are executed by running an MPI wrapper program for the standard blastall program that comes with the NCBI BLAST distribution. The 'mpiblast' wrapper places the query onto shared storage and executes blastall on each individual node. Each node is assigned a unique file to store search results on shared storage. Upon completion of the search by all cluster nodes, a master node aggregates the search results files into a single user specified output file. Both mpiformatdb and mpiblast are command-line compatible with formatdb and blastall to ease integration with existing BLAST systems. mpiblast achieves a 170x speedup when running on 128 nodes versus a single node due to the elimination of disk paging.
In addition to a description of the significance of BLAST and the utilility of database segmentation for parallelization, the poster includes several figures and diagrams, presenting mpiBLAST's algorithms and performance. A diagram of the parallel execution flow of mpiBLAST complements the description of the algorithm. Performance is shown on a speedup diagram for 1 through 128 nodes using empirical measurements gathered on our 'Green Destiny' cluster. Finally, a slowdown curve demonstrates how disk I/O adversely affects search speed on a single node as the sequence database size increases.