GenBankParser: When all you want is to parse the flat files.
By John Kloss
Programmer, Database Administrator, Systems Administrator, Genome Sequencing Center, Washington University Medical School in St. Louis
The GenBankParser is a simple parser for the NCBI's GenBank Flat File Format. Parsing is performed by recursive descent based on the structure of the format rather than on particular fields. Fields and subfields are handled by separate, independent parsers, which allows the GenBankParser to adjust quickly to new or changing field formats in subsequent GenBank releases.
Access to the parsed information is provided through a user-defined callback function and an intuitive "point and ask for it" interface, in which accessor methods are named after the fields and subfields of the GenBank Flat File Format. The user selects field and subfield parsers at compile time, at which point the necessary accessor and mutator functions are generated. Unlike other GenBank flat file parsers, the GenBankParser parses only those fields in which the user is interested. All other information is ignored, which provides a significant speed advantage when only a few fields in the flat file are needed.
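The design described above (independent per-field parsers, user-selected fields, and a callback per entry) can be illustrated with a minimal sketch. This is not the GenBankParser's actual interface: it is written in Python purely for illustration, every name in it (`FIELD_PARSERS`, `split_fields`, `parse`, the toy record `TEST0001`) is hypothetical, and it assumes only that top-level keywords start in column 0, continuation lines are indented, and entries end with `//`.

```python
import re

# One small, independent parser per top-level field (hypothetical names).
def parse_locus(text):
    # The first token after the keyword is the locus name.
    return text.split()[0]

def parse_definition(text):
    # Collapse continuation lines into a single line of text.
    return " ".join(text.split())

def parse_origin(text):
    # Drop position numbers and whitespace, keeping only sequence letters.
    return re.sub(r"[^A-Za-z]", "", text)

FIELD_PARSERS = {
    "LOCUS": parse_locus,
    "DEFINITION": parse_definition,
    "ORIGIN": parse_origin,
}

def split_fields(record_lines):
    """Yield (keyword, raw_text) pairs using structure only, not field content."""
    keyword, chunks = None, []
    for line in record_lines:
        if line[:1].strip():  # non-blank column 0 starts a new top-level field
            if keyword is not None:
                yield keyword, "\n".join(chunks)
            parts = line.split(None, 1)
            keyword = parts[0]
            chunks = [parts[1] if len(parts) > 1 else ""]
        elif keyword is not None:
            chunks.append(line.strip())  # indented continuation line
    if keyword is not None:
        yield keyword, "\n".join(chunks)

def parse(lines, wanted, callback):
    """Parse only the wanted fields and call callback(entry) once per entry."""
    parsers = {name: FIELD_PARSERS[name] for name in wanted}
    record = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):  # end-of-entry marker
            entry = {kw: parsers[kw](raw)
                     for kw, raw in split_fields(record) if kw in parsers}
            callback(entry)
            record = []
        elif line.strip():
            record.append(line)

# Toy record, made up for this example.
SAMPLE = """\
LOCUS       TEST0001      20 bp    DNA
DEFINITION  Toy record used only
            for illustration.
ORIGIN
        1 gatcctccat atacaacggt
//
"""

records = []
parse(SAMPLE.splitlines(), ["LOCUS", "ORIGIN"], records.append)
fasta = ">{}\n{}".format(records[0]["LOCUS"], records[0]["ORIGIN"])
```

Because `DEFINITION` was not requested, its parser is never invoked and the field is skipped. In this sketch the selection happens when the parser table is built at run time, whereas the GenBankParser is described as making that selection at compile time.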
In individual tests against the GenBank flat file parsing methods provided by the BioPerl distribution, the GenBankParser proved to be ten to fifteen times faster at generating common parsed output, such as FASTA format for nucleic acid or protein sequences.