O'Reilly Bioinformatics Technology Conference.

Practical Innovation at BioCon 2003

Poster Session

Biopipe - A flexible workflow framework for bioinformatics analysis

By Kiran Ratnapu
Bioinformatics Programmer, Fugu-Informatics, IMCB

The prominence of the in-silico laboratory, coupled with the explosion of comparative genomics, has made computational biological analysis increasingly complex. This is exacerbated by the plethora of software tools now available. We increasingly find that multi-genome analyses involve large amounts of data from disparate sources and formats, different programs with specific requirements, and output formats that must be suitable for human interpretation. In addition, there is no easy way to exchange workflow definitions so that they can be reused elsewhere. There is thus a need for a flexible workflow framework that hides this complexity, allowing scientists to focus on their analysis while providing bioinformaticians a coherent methodology for extending the system and sharing their experimental methodology with other researchers.

It was with this in mind that we developed the bioperl-pipeline system. Largely adapted from the Ensembl Pipeline Annotation System, features of the current system include:

1) Handling of various input and output data formats from various databases.

2) A bioperl interface to generic load-sharing software (LSF, PBS, etc.) that ensures the various analysis programs are run in the proper order and complete successfully, re-running those that fail.

3) A flexible, pluggable bioperl interface that allows programs to be 'pipeline-enabled'.

4) An easy and intuitive XML interface for defining the pipeline setup, including the workflow and the input-fetching and output-writing mechanisms.

5) XML templates that give a complete, reusable definition of a pipeline setup, through which a pipeline can be recreated and rerun exactly, with results that are reproducible for the same input data on every rerun and predictable for similar input data. We have demonstrated this with some common data analysis pipelines, such as a genome annotation pipeline, a protein family building pipeline, an orthologue finding pipeline, and a pseudogene prediction pipeline.

6) A user-friendly GUI for designing XML template workflows and tracking jobs.

7) Web-based access to pre-built pipelines, each represented by an XML template.

We are currently looking at extending the system in the following ways:

1) A 'grid-aware' system that distributes jobs over a bio-cluster network, harnessing collective computing power; this will be especially useful for small groups looking to perform compute-intensive analyses.

2) Offering pre-built pipelines as web services.

In addition, we have developed a Generic Feature Database (GFD) that complements the pipeline by providing a flexible schema for storing the various features the pipeline produces. We are now applying this framework to analyses such as genome annotation, protein family building, orthologue finding, pseudogene prediction, and high-throughput multi-species studies of regulatory elements. We will report on the status of the pipeline project and show examples of how it has been applied.
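To make the XML template idea concrete, a pipeline definition might look something like the sketch below. The element names (`pipeline`, `analysis`, `depends_on`, `input_iohandler`, `output_iohandler`) are illustrative assumptions for this abstract, not the actual bioperl-pipeline schema:

```xml
<!-- Hypothetical sketch of a reusable pipeline template; element and
     attribute names are illustrative, not the real biopipe schema. -->
<pipeline name="genome_annotation">
  <analysis id="repeatmask">
    <program>RepeatMasker</program>
    <input_iohandler>fetch_contigs_from_db</input_iohandler>
    <output_iohandler>write_features_to_gfd</output_iohandler>
  </analysis>
  <analysis id="genscan" depends_on="repeatmask">
    <program>Genscan</program>
    <input_iohandler>fetch_masked_contigs</input_iohandler>
    <output_iohandler>write_features_to_gfd</output_iohandler>
  </analysis>
</pipeline>
```

Because the template captures the workflow, input-fetching, and output-writing mechanisms in one document, shipping this single file to another site would be enough to recreate and rerun the pipeline exactly.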
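The job-control behaviour described in feature 2, running analyses in dependency order and re-running those that fail, can be sketched in a few lines of Python. This is a minimal illustration of the concept, not the actual bioperl code; in the real system, jobs are submitted through load-sharing software such as LSF or PBS rather than run in-process:

```python
# Sketch of dependency-ordered execution with retry of failed jobs.
# Illustrative only: the real system dispatches jobs via LSF/PBS.

def run_pipeline(jobs, depends_on, run, max_retries=2):
    """Run each named job after its prerequisites, retrying failures.

    jobs       -- list of job names to complete
    depends_on -- dict mapping a job to its prerequisite jobs
    run        -- callable(job) returning True on success, False on failure
    Returns the order in which jobs completed successfully.
    """
    done, order = set(), []

    def visit(job):
        if job in done:
            return
        for dep in depends_on.get(job, []):
            visit(dep)                      # ensure prerequisites run first
        for _ in range(max_retries + 1):
            if run(job):                    # re-run jobs that fail
                break
        else:
            raise RuntimeError(f"{job} failed after {max_retries + 1} tries")
        done.add(job)
        order.append(job)

    for job in jobs:
        visit(job)
    return order

if __name__ == "__main__":
    # Toy annotation workflow: both gene finders wait for repeat masking.
    deps = {"genscan": ["repeatmask"], "blast": ["repeatmask"]}
    print(run_pipeline(["genscan", "blast"], deps, lambda j: True))
```

The retry loop is the key property for long-running farm jobs: a transient failure (a full disk, a lost node) does not abort the whole pipeline, only the affected job is resubmitted.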


© 2002, O'Reilly Media, Inc.