Poster Session
Biopipe - A flexible workflow framework for bioinformatics analysis
By Kiran Ratnapu
Bioinformatics Programmer, Fugu-Informatics, IMCB
The prominence of the in-silico laboratory, coupled with the explosion of
comparative genomics, has made computational biological analysis
increasingly complex. This is exacerbated by the plethora of software
now available. We increasingly find that multi-genome analyses involve
large amounts of data from disparate sources and in disparate formats,
as well as different programs, each with specific requirements and with
output that must be made suitable for human interpretation.
In addition, there is no easy way of exchanging such workflow definitions
so that they can be reused elsewhere.
We therefore see the need for a flexible workflow framework that hides such complexity, allowing
scientists to focus on their analysis, while giving bioinformaticians a
coherent methodology with which to extend the system and to share their experimental
methodology with other researchers.
It was with this in mind that we developed the bioperl-pipeline system, which is largely
adapted from the Ensembl Pipeline Annotation System. Features of the
current system include:
1) Handling of various input and output data formats from various databases.
2) A bioperl interface to generic load-sharing software (LSF, PBS,
etc.) that ensures the various analysis programs run in the proper
order and complete successfully, re-running those that fail.
3) A flexible pluggable bioperl interface that allows programs to be
'pipeline-enabled'.
4) An easy and intuitive XML interface for defining the pipeline setup, including the workflow,
input fetching mechanisms and output writing mechanisms.
5) XML templates that give a complete, reusable definition of a pipeline setup,
from which a pipeline can be recreated and rerun exactly. Results are
reproducible for the same input data on every re-run and predictable for similar input data.
We have demonstrated this with some common data analysis pipelines: genome annotation,
protein family building, orthologue finding and pseudogene prediction.
6) A user-friendly GUI system for easy workflow
design of XML templates and for job tracking.
7) Web-based access to pre-built pipelines (each represented by an XML template).
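To make features 2, 4 and 5 concrete, the following is a minimal sketch in Python rather than bioperl. It is not the Biopipe implementation. The XML element and attribute names (`pipeline`, `analysis`, `id`, `after`), the `runners` mapping and the `max_attempts` parameter are all assumptions made for illustration. The sketch reads a template, derives an execution order from the declared dependencies, and re-runs any analysis that fails.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical template: element and attribute names are illustrative only,
# not the actual Biopipe XML schema.
TEMPLATE = """
<pipeline name="demo">
  <analysis id="fetch"/>
  <analysis id="blast" after="fetch"/>
  <analysis id="store" after="blast"/>
</pipeline>
"""

def load_order(xml_text):
    """Return analysis ids in an order that respects 'after' dependencies."""
    root = ET.fromstring(xml_text.strip())
    deps = {}
    for node in root.findall("analysis"):
        after = node.get("after", "")
        deps[node.get("id")] = set(after.split(",")) if after else set()
    return list(TopologicalSorter(deps).static_order())

def run_pipeline(xml_text, runners, max_attempts=3):
    """Run each analysis in order, retrying a failed one up to max_attempts."""
    log = []
    for name in load_order(xml_text):
        for attempt in range(1, max_attempts + 1):
            try:
                runners[name]()  # stand-in for submitting a job to LSF/PBS
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # No attempt succeeded: stop instead of running dependent steps.
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
    return log
```

Because the order and the retry policy are derived entirely from the template and its arguments, re-running the same template on the same input repeats the same sequence of steps, which is the reproducibility property described in feature 5.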
We are currently looking at extending the system in the following ways:
1) A 'grid-aware' system that allows jobs to be distributed over a
bio-cluster network, harnessing collective computing power. This will
be especially useful for small groups looking to perform
compute-intensive analyses.
2) Offering pre-built pipelines as web-services.
In addition, we have developed a Generic Feature Database (GFD) that
complements the pipeline by providing a flexible schema for storing the various
features that the pipeline produces.
We are now applying this framework to various data analyses, such as genome annotation, protein family building,
orthologue finding, pseudogene prediction and high-throughput multi-species studies
of regulatory elements. We will report on the status of the pipeline project and show some
examples of how it has been applied.