O'Reilly Open Source Convention
Sheraton San Diego Hotel, San Diego, CA
July 23-27, 2001


Data Munging

Damian Conway, Thoughtstream

Track: Perl Conference 5
Date: Monday, July 23
Time: 8:45am - 5:15pm
Location: Grande Ballroom C

Target audience:
Novice Perl programmers who are familiar with simple I/O and variables, and who want deeper insight into the techniques of Perl's "core business": extraction, manipulation, and reporting of data.

What attendees will learn:
This tutorial will show you how to use a range of standard Perl features and numerous CPAN modules to read in, decipher, process, and reformat ASCII text data.

You will learn:

  • how regular expressions work, and how to make them work better for you,
  • how to balance nested brackets and match delimiters,
  • how to recognize and process common text formats like CSV and HTML,
  • how to preprocess archived and encoded formats such as (g)zip, tar, uuencoding, and MIME, as well as binary formats,
  • how to handle ambiguity and errors when processing text,
  • how to convert your processed data back into readable text, in either fixed or floating formats,
  • how to extract, process, and generate simple natural language data.

Tutorial outline:
  • Getting at the data in the first place
    • Un(g)zipping, untarring, uudecoding, demiming
    • Compress::Zlib
    • Archive::Tar
    • Convert::UU
    • MIME tools
    • Handling file inclusions
  • Regular expressions
    • How they work
    • How they're used (m//, s///, split, grep)
    • How to build them (easily)
    • Common regexps and Regexp::Common
  • Some useful modules for decoding particular formats
    • Text::CSV_XS for comma separated values
    • Text::Balanced for delimiters, brackets, and tags
    • HTML::TreeBuilder for HTML
    • unpack and vec for binary formats
  • Simple transformations
    • Text::Tabs
    • Text::Autoformat
  • Fuzzy processing
    • String::Approx and String::EditDistance
    • Text::Soundex and Text::Metaphone
  • Natural language
    • Lingua::EN::Inflect
    • Lingua::EN::Infinitive
    • Lingua::EN::Squeeze
  • Reporting
    • printf and sprintf
    • Text::Autoformat::form()
    • Interpolation
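
As a rough illustration of the core-Perl items in the outline above (split, m//, unpack, and sprintf), here is a minimal sketch. It is not taken from the tutorial materials; the sample record, the date string, and the fixed-width layout are all hypothetical values invented for this example, and only built-in Perl features are used.

```perl
use strict;
use warnings;

# Hypothetical sample record, invented for illustration only.
my $line = 'widget, 3, 4.5';

# split with a regex separator: a comma plus any surrounding whitespace.
my ($name, $count, $price) = split /\s*,\s*/, $line;

# m// with captures: pull year, month, and day out of a date string.
my ($year, $month, $day) = '2001-07-23' =~ /^(\d{4})-(\d{2})-(\d{2})$/;

# unpack: decode an assumed fixed-width record.
# 'A5' reads 5 characters and strips trailing spaces; 'A3' reads 3 more.
my ($code, $qty) = unpack 'A5 A3', 'AB   042';

# sprintf: build a fixed-width report line
# (left-justified name, integer count, total with two decimals).
my $report = sprintf '%-10s %3d %8.2f', $name, $count, $count * $price;

print "$report\n";
print "$year/$month/$day $code $qty\n";
```

The CPAN modules named in the outline (for example Text::CSV_XS for CSV) handle cases that a bare split does not, such as quoted fields containing commas, so the split shown here suits only simple, unquoted input.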

© 2001, O'Reilly Media, Inc.