Question

The engine for parsing strings which is called "regular expressions" in Perl is very different from what is known by the term "regular expressions" in books.

So, my question is: is there some document describing the Perl's regexp implementation and how and in what ways does it really differ from the classic one (by classic I mean a regular expressions that can really be transformed to ordinary DFA/NFA) and how it works?

Thank you.

Was it helpful?

Solution

Perl regular expressions are of course called Perl regular expressions, or regexes for short. They may also be called patterns or rules. But what they are, or at least can be, is recursive descent parsers. They’re implemented using a recursive backtracker, although you can swap in a DFA engine if you prefer to offload DFA‐solvable tasks to it.

Here are some relevant citations about these matters, with all emboldening — and some of the text :) — mine:

You specify a pattern by creating a regular expression (or regex), and Perl’s regular expression engine (the “Engine”, for the rest of this chapter) then takes that expression and determines whether (and how) the pattern matches your data. While most of your data will probably be text strings, there’s nothing stopping you from using regexes to search and replace any byte sequence, even what you’d normally think of as “binary” data. To Perl, bytes are just characters that happen to have an ordinal value less than 256.

If you’re acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First, they aren’t entirely “regular” in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in computer science classes. Second, they are used so often in Perl that they have their own special variables, operators, and quoting conventions which are tightly integrated into the language, not just loosely bolted on like any other library.

      — Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant

This is the Apocalypse on Pattern Matching, generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo‐Saxon mood).

      — Perl6 Apocalypse 5: Pattern Matching, by Larry Wall

There’s a lot of new syntax there, so let’s step through it slowly, starting with:

    $file = rx/ ^  <$hunk>*  $ /;

This statement creates a pattern object. Or, as it’s known in Perl 6, a “rule”. People will probably still call them “regular expressions” or “regexes” too (and the keyword rx reflects that), but Perl patterns long ago ceased being anything like “regular”, so we’ll try and avoid those terms.

[Update: We’ve resurrected the term “regex” to refer to these patterns in general. When we say “rule” now, we’re specifically referring to the kind of regex that you would use in a grammar. See S05.]

      — Perl6 Exegesis 5: Pattern Matching, by Damian Conway

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than “regular expressions” because they haven’t been regular expressions for a long time, and we think the popular term “regex” is in the process of becoming a technical term with a precise meaning of: “something you do pattern matching with, kinda like a regular expression”. On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

      — Perl6 Synopsis 5: Regexes and Rules, by Damian Conway, Allison Randal, Patrick Michaud, Larry Wall, and Moritz Lenz

OTHER TIPS

The O'Reilly book 'Mastering Regular Expressions' has a very good explanation of Perl's and other engines. For me this is the reference book on the topic.

There is no formal mathematical name for the language accepted by PCREs.

The term "regular expressions with backtracking" or "regular expressions with backreferences" is about as close as you will get. Anybody familiar with the difference will know what you mean.

(There are only two common types of regexp implementations: DFA-based, and backtracking-based. The former generally accept the "regular languages" in the traditional Computer Science sense. The latter generally accept... More, and it depends on the specific implementation, but backreferences are always one the non-DFA features.)

I asked the same question on the theoretical CS Stack Exchange (Regular expressions aren't), and the answer that got the most upvotes was “regex.”

  • The dialect is called PCRE (Perl-compatible Regular Expressions).
  • It's documented in the Perl manual.
  • Or in "Programming Perl" by Wall, Orwant and Christiansen
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top