O'Reilly Open Source Convention
Innovate--Collaborate--Discover
Sheraton San Diego Hotel, San Diego, CA
July 23-27, 2001

Tutorial

Data Munging

Damian Conway, Thoughtstream

Track: Perl Conference 5
Date: Monday, July 23
Time: 8:45am - 5:15pm
Location: Grande Ballroom C

Target audience:
Novice Perl programmers who are familiar with simple I/O and variables, and who want deeper insight into the techniques of Perl's "core business": extraction, manipulation, and reporting of data.

What attendees will learn:
This tutorial will show you how to use a range of standard Perl features and numerous CPAN modules to read in, decipher, process, and reformat ASCII text data.

You will learn:

  • how regular expressions work, and how to make them work better for you,
  • how to balance nested brackets and match delimiters,
  • how to recognize and process common text formats like CSV and HTML,
  • how to preprocess archived and encoded data such as (g)zip, tar, uuencoding, MIME, and binary formats,
  • how to handle ambiguity and errors when processing text,
  • how to convert your processed data back into readable text, in either fixed or floating formats,
  • how to extract, process, and generate simple natural language data.

Tutorial outline:
  • Getting at the data in the first place
    • Un(g)zipping, untarring, uudecoding, demiming
    • Compress::Zlib
    • Archive::Tar
    • Convert::UU
    • MIME tools
    • Handling file inclusions
  • Regular expressions
    • How they work
    • How they're used (m//, s///, split, grep)
    • How to build them (easily)
    • Common regexps and Regexp::Common
  • Some useful modules for decoding particular formats
    • Text::CSV_XS for comma separated values
    • Text::Balanced for delimiters, brackets, and tags
    • HTML::TreeBuilder for HTML
    • unpack and vec for binary formats
  • Simple transformations
    • Text::Tabs
    • Text::Autoformat
  • Fuzzy processing
    • String::Approx and String::EditDistance
    • Text::Soundex and Text::Metaphone
  • Natural language
    • Lingua::EN::Inflect
    • Lingua::EN::Infinitive
    • Lingua::EN::Squeeze
  • Reporting
    • printf and sprintf
    • Text::Autoformat::form()
    • Interpolation

© 2001, O'Reilly Media, Inc.
conftech@oreilly.com