O'Reilly Open Source Convention
Innovate--Collaborate--Discover
Sheraton San Diego Hotel, San Diego, CA
July 23-27, 2001

Session

ReBug: A Regular Expression Debugger in Perl

Michel Lambert

Track: Perl Conference 5
Date: Thursday, July 26
Time: 10:45am - 11:15am
Location: Grande Ballroom B

Extended Abstract: ReBug: A Regular Expression Debugger in Perl

Most programmers rarely use the full power of regular expressions, because learning this distinct language and its associated idiosyncrasies just doesn't seem worth the effort. One possible cause is that while other languages have debuggers, and sometimes even IDEs and other niceties, regular expressions have never had such luxuries. Always content to be included in other languages and programs as a subset or an added feature, regular expressions have rarely been given the spotlight, and consequently have never received the attention needed for debuggers and other tools to develop around them. One could argue that regular expressions are too simple a language to need a debugger, but regular expressions have tests, branching (depending on whether a token matches, the engine moves forward in the regex or backtracks), memory retention ($1, $2, and so on), and, as of Perl 5.005, full-fledged Perl code (with the (?{}) construct).

As programmers who uphold Perl's virtues of laziness, impatience, and hubris, we can find it hard to learn about regular expressions. The first interferes with our willingness to learn. The second means we want to just use regular expressions, instead of spending the time to thoroughly understand them. And the third implies that we feel too proud to go and learn about something we consider to be so "beneath us." So instead, we content ourselves with the "leftmost, longest" rules of thumb, and are prepared to chant voodoo spells over the regex as we try to get it to work. The lucky ones among us who haven't followed Perl's virtues religiously, and have spent the time to read through Jeffrey Friedl's "Mastering Regular Expressions" or Mark-Jason Dominus's descriptions in The Perl Journal, have a much better understanding of how regular expressions work, and are able to use this intuitive sense to create powerful regexes more quickly.

Regular expressions need to be considered a true language, and be given the set of tools normally associated with languages. Maybe a full-fledged IDE just for regexes is a bit far-fetched, but one can easily justify a regular expression debugger as a valuable tool. Imagine being able to test one's regex against some text to see the process by which it matches or (more likely, if you're using the debugger) fails. You could watch how the engine works through your regex and see whether it is actually attempting to match the regex token you think it is matching. Or you could set a breakpoint to skip past all the early parts of the match and find the part where it fails. When you find that the regex engine is using the wrong part of your regex to match the text, there's no need to restart with new breakpoint conditions. Because the state data of a regex match is so simple, you can just run or step the regex debugger backwards until things make sense again, and then step forward to find the point of error.

What makes it all possible is a regex construct introduced in Perl 5.005. The (?{}) syntax allows one to execute arbitrary code when the engine reaches a certain point in the regex. With this functionality, it becomes much easier to implement a full-fledged debugger. In a nutshell, you can have the (?{}) construct (with code to call a particular callback) inserted at many points within the regex. As the regex matches, the callback is called many times, as the regex engine travels up and down the target string in its attempt to match. This callback is able to query variables like $1, $`, $&, and so on to get a picture of how far along the regular expression is. By analyzing this information, one is able to show exactly how a regex matches, from its dead-end attempts to its successful completion.
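
The callback idea can be illustrated with a minimal sketch. It is not ReBug's actual code: the probe() callback, the hand-split token list, and the sample text are all assumptions made for illustration, and a real tool would have to parse the regex to decide where probes can safely go. The sketch uses pos() to read the engine's current offset and assumes a reasonably modern Perl.

```perl
use strict;
use warnings;
use re 'eval';    # permit (?{}) blocks in a pattern built at run time

my @trace;        # one [token index, offset] pair per callback invocation

# Hypothetical callback: notes which probe fired and how far the engine has read.
sub probe {
    my ($token_index, $offset) = @_;
    push @trace, [ $token_index, $offset ];
}

# The regex a+bc, split by hand into tokens (an assumption of this sketch).
my @tokens = ('a+', 'b', 'c');

# Append a (?{}) probe after every token.
my $instrumented = '';
for my $i (0 .. $#tokens) {
    $instrumented .= $tokens[$i] . "(?{ probe($i, pos()) })";
}

my $text    = "aabd aabc";
my $matched = $text =~ /$instrumented/;

# Replay what the engine did, one line per callback.
for my $entry (@trace) {
    my ($token_index, $offset) = @$entry;
    printf "after %-2s the engine has read up to offset %d: '%s'\n",
        $tokens[$token_index], $offset, substr($text, 0, $offset);
}
```

Every line printed corresponds to one moment at which the engine got past a token. Attempts that were later abandoned show up in the trace as well, as long as the engine's optimizer does not skip them.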

Because a regex match does not carry along much state information (as opposed to a real program, which has numerous variables and memory usage), it is possible to store the entire state of the regex quite simply. Because of this, we can store the current state of the regex with every callback. This then provides an array of states which we can traverse both up and down, to travel both forwards and backwards in time along the regex matching. Playing, rewinding, stepping, and other such tasks are merely a simple traversal through the array. Try picturing what it would be like to undo what the last 'step' did in a real debugger, and you'll realize how a regex's simplicity allows for this incredible power. This idea has one pitfall, however. A regular expression debugger should be immune to any failures on the part of the code it is testing. While a regex may seem to be self-contained (the (?{}) construct described above aside), a never-ending regex, or at least one that does not complete in any reasonable amount of time, can easily cause a program to appear to hang. Simply running the whole regex and storing the states in an array for later traversal is then no longer an option.
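
Setting the runaway-regex pitfall aside for a moment, the basic state-array idea can be sketched as follows. The helper names (new_history, record, step_forward, step_back, rewind, current) and the fields stored per state are hypothetical, chosen for this illustration rather than taken from ReBug.

```perl
use strict;
use warnings;

# A history is just the recorded states plus a cursor into them.
sub new_history { return { states => [], cursor => -1 } }

# Called from inside (?{}): snapshot whatever the display will need later.
sub record {
    my ($history, %state) = @_;
    push @{ $history->{states} }, {%state};
}

# The state under the cursor, or undef if nothing has been recorded.
sub current {
    my ($history) = @_;
    return $history->{cursor} < 0 ? undef : $history->{states}[ $history->{cursor} ];
}

# Stepping is plain index arithmetic, clamped to the recorded range.
sub step_forward {
    my ($history) = @_;
    $history->{cursor}++ if $history->{cursor} < $#{ $history->{states} };
    return current($history);
}

sub step_back {
    my ($history) = @_;
    $history->{cursor}-- if $history->{cursor} > 0;
    return current($history);
}

sub rewind {
    my ($history) = @_;
    $history->{cursor} = @{ $history->{states} } ? 0 : -1;
    return current($history);
}

# Record one state each time the engine gets past the capture group.
my $history = new_history();
"aaa" =~ /(a+)(?{ record($history, offset => pos(), group1 => $1) })a/;

rewind($history);
while (1) {
    my $state = current($history);
    print "offset $state->{offset}, \$1 = '$state->{group1}'\n";
    last if $history->{cursor} == $#{ $history->{states} };
    step_forward($history);
}
```

Because "undo" is nothing more than decrementing the cursor, stepping backwards costs the same as stepping forwards.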

The approach that was used (which was based on a proof-of-concept prototype of a regex debugger created by Mark-Jason Dominus) was to separate the execution of the script into the GUI front-end and the regex parsing backend. The way it works is more complicated than the simple scheme described in the preceding paragraph, although it adheres to the same underlying principles. The GUI front-end accepts a regex and a string to test it against. Instead of trying to process the regex itself, it sends the regex and the string to a forked copy of the program that accepts the data and waits. When the user hits the forward button, the GUI portion requests the next state from the parser through Inter-Process Communication (IPC). The parser then moves the regex execution along by one step. In other words, it returns from the current (?{}) construct to allow the parsing to continue, and stops when it encounters the next (?{}). The parser then constructs the fields needed by the GUI front-end ($1 to $9, $`, $&, and $') and sends them back through IPC. The GUI then uses this information to update its display: $`.$& shows how far along the regex 'pointer' is, $& shows what the current match consists of, and the numbered variables show what was matched by the first set of parentheses, what has been matched thus far in the second set, and so on. The GUI end then adds this new state information to its list of states as it progresses through a match. Every time the user restarts or backs up to one of the previous states, it doesn't need to ask the parser backend, and can just pull the data from the array. In a way, the GUI creates a lazily-filled array, only pulling in values from the parsing backend as necessary to fulfill the user's desire to see further along the parsing of the regex. If one includes a way to reset which regex and string are being matched by the backend, then one can prevent extremely long-running regex matches from locking up the application.

And finally, both breakpoints and variable watches are implemented, allowing one to set a breakpoint on a particular variable (when $1 eq "abc", stop), or to watch the value of $& change as the regex progresses in pseudo-real-time.
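
The front-end/backend split and the lazily-filled array can be sketched roughly as below. This is an illustration under several assumptions, not ReBug's implementation: it uses plain pipes and fork on a Unix-like system, hard-codes the pattern in the backend, sends only the offset and $1, and uses made-up names (start_backend, state_at, stop_backend) and a made-up line protocol ("step" and "done").

```perl
use strict;
use warnings;
use IO::Handle;
use POSIX ();

# Backend: run the match in a child process, pausing inside every (?{})
# until the front-end asks for one more step.
sub start_backend {
    my ($text) = @_;
    pipe(my $cmd_r,   my $cmd_w)   or die "pipe: $!";
    pipe(my $state_r, my $state_w) or die "pipe: $!";
    defined(my $pid = fork()) or die "fork: $!";
    if ($pid == 0) {                       # child: the regex backend
        close $cmd_w;
        close $state_r;
        $state_w->autoflush(1);
        my $pause = sub {
            my ($offset, $group1) = @_;
            defined(my $cmd = <$cmd_r>) or POSIX::_exit(0);  # front-end is gone
            print {$state_w} "$offset $group1\n";
        };
        my $ok = $text =~ /(a+)(?{ $pause->(pos(), $1) })a/;
        print {$state_w} "done\n" if defined(my $cmd = <$cmd_r>);
        POSIX::_exit(0);
    }
    close $cmd_r;                          # parent: the front-end
    close $state_w;
    $cmd_w->autoflush(1);
    return { pid => $pid, to => $cmd_w, from => $state_r,
             states => [], finished => 0 };
}

# Front-end: return state $i, asking the backend only for states not yet cached.
sub state_at {
    my ($session, $i) = @_;
    while ($i > $#{ $session->{states} } && !$session->{finished}) {
        print { $session->{to} } "step\n";
        my $reply = readline $session->{from};
        if (!defined $reply || $reply eq "done\n") {
            $session->{finished} = 1;
            last;
        }
        chomp $reply;
        my ($offset, $group1) = split ' ', $reply, 2;
        push @{ $session->{states} }, { offset => $offset, group1 => $group1 };
    }
    return $session->{states}[$i];
}

# Abandon the match, even if the backend is stuck in a runaway regex.
sub stop_backend {
    my ($session) = @_;
    kill 'TERM', $session->{pid} unless $session->{finished};
    close $session->{to};
    close $session->{from};
    waitpid $session->{pid}, 0;
}

my $session = start_backend("aaa");
my $first   = state_at($session, 0);   # one round trip to the backend
my $again   = state_at($session, 0);   # served from the cached array
state_at($session, 1_000);             # drives the match to completion
stop_backend($session);
```

Since the front-end never runs the regex itself, a match that never finishes only blocks the child, and stop_backend can discard it at any time.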

CURRENT STATUS

Currently, the regular expression debugger, tentatively titled ReBug, implements the following features:

  • support for almost all Perl 5.005 regular expression constructs
  • support for breakpoints on the variables that are transferred between the backend and front-end
  • support for variable 'watching' that lets you see how the variables change as the regex matches
  • support for stepping both forward and backward through a regex match to see exactly what the engine is thinking
  • support for running both forward and backward through a regex match at a variable speed, to allow you to bypass the beginning portions of a match (if you don't like to, or cannot, use breakpoints). It also helps you quickly identify problems with a regex, such as why a regex takes an extremely long time to complete, or where things generally go wrong
  • color coding: red (string slurped by the regex thus far), blue (unmatched/unslurped string), and green (exactly what was matched, $&)
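
The breakpoint and watch features in the list above can be thought of as simple operations over the recorded states. The following is a sketch under that assumption, with hypothetical helper names (record_state, run_to_breakpoint, watch_changes) and a deliberately tiny pattern.

```perl
use strict;
use warnings;

# Record one state per (?{}) callback while matching.
my @states;
sub record_state {
    my ($offset, $group1) = @_;
    push @states, { offset => $offset, group1 => $group1 };
}
"abc" =~ /(\w)(?{ record_state(pos(), $1) })\z/;

# A breakpoint is a predicate over a state. Running "to the breakpoint" means
# scanning forward from a start index for the first state that satisfies it.
sub run_to_breakpoint {
    my ($states, $from, $condition) = @_;
    for my $i ($from .. $#$states) {
        return $i if $condition->($states->[$i]);
    }
    return;    # no state triggered the breakpoint
}

# A watch reports the value of one field each time it changes.
sub watch_changes {
    my ($states, $field) = @_;
    my @changes;
    for my $i (0 .. $#$states) {
        my $value = $states->[$i]{$field};
        push @changes, [ $i, $value ]
            if !@changes || $changes[-1][1] ne $value;
    }
    return @changes;
}

# Stop at the first state where $1 eq "c", and list how $1 evolved.
my $hit  = run_to_breakpoint(\@states, 0, sub { $_[0]{group1} eq 'c' });
my @seen = watch_changes(\@states, 'group1');

print "breakpoint hit at state $hit (offset $states[$hit]{offset})\n" if defined $hit;
print "state $_->[0]: \$1 became '$_->[1]'\n" for @seen;
```

Expressing a breakpoint as a code reference keeps conditions such as "$1 eq 'abc'" and "offset greater than 10" on equal footing.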

Planned features include:
  • allow for debugging the full set of regex features, including (?{})
  • use files instead of user-entered strings
  • support for the various regex options (i, m, s, g, and so on); some of these can currently be set with the (?imsx-imsx) syntax
  • whatever other features people think up after I release the debugger (should be relatively soon)
  • more graphical representation of regexes, perhaps similar to ActiveState's debugger icons

© 2001, O'Reilly Media, Inc.
conftech@oreilly.com