Chapter 1. Advanced Regular Expressions

Regular expressions, or just regexes, are at the core of Perl's text processing, and certainly are one of the features that made Perl so popular. All Perl programmers pass through a stage where they try to program everything as regexes, and when that's not challenging enough, everything as a single regex. Perl's regexes have many more features than I can, or want, to present here, so I include those advanced features I find most useful and expect other Perl programmers to know about without referring to perlre, the documentation page for regexes.

Readable Regexes, /x and (?#…)

Regular expressions have a much-deserved reputation of being difficult to read. Regexes have their own terse language that uses as few characters as possible to represent virtually infinite numbers of possibilities, and that's just counting the parts that most people use every day.

Luckily for other people, Perl gives me the opportunity to make my regexes much easier to read. Given a little bit of formatting magic, not only will others be able to figure out what I'm trying to match, but a couple weeks later, so will I. We touched on this lightly in Learning Perl, but it's such a good idea that I'm going to say more about it. It's also in Perl Best Practices.

When I add the /x flag to either the match or substitution operators, Perl ignores literal whitespace in the pattern. This means that I can spread out the parts of my pattern to make the pattern more discernible. Gisle Aas's HTTP::Date module parses a date by trying several different regexes. Here's one of his regular expressions, although I've modified it to appear on a single line, arbitrarily wrapped to fit on this page:

/^(\d\d?)(?:\s+|[-\/])(\w+)(?:\s+|[-\/])(\d+)(?:(?:\s+|:)
(\d\d?):(\d\d)(?::(\d\d))?)?\s*([-+]?\d{2,4}|(?![APap][Mm]\b)
[A-Za-z]+)?\s*(?:\(\w+\))?\s*$/

Quick: Can you tell which one of the many date formats that parses? Me neither. Luckily, Gisle uses the /x flag to break apart the regex and add comments to show me what each piece of the pattern does. With /x, Perl ignores literal whitespace and Perl-style comments inside the regex. Here's Gisle's actual code, which is much easier to understand:

/^
    (\d\d?)               # day
        (?:\s+|[-\/])
    (\w+)                 # month
        (?:\s+|[-\/])
    (\d+)                 # year
    (?:
        (?:\s+|:)         # separator before clock
        (\d\d?):(\d\d)    # hour:min
        (?::(\d\d))?      # optional seconds
    )?                    # optional clock
        \s*
    ([-+]?\d{2,4}|(?![APap][Mm]\b)[A-Za-z]+)? # timezone
        \s*
    (?:\(\w+\))?          # ASCII representation of timezone in parens.
        \s*$
/x

Under /x, to match whitespace I have to specify it explicitly, using \s, which matches any whitespace; any of \f\r\n\t\R; or their octal or hexadecimal sequences, such as \040 or \x20 for a literal space. Likewise, if I need a literal hash symbol, #, I have to escape it too: \#.
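A minimal sketch of those rules follows. The sample string and variable names are made up for illustration; the point is only that an unescaped space vanishes under /x while \x20, a bracketed space, and \# survive.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only to contain a space and a hash symbol
my $text = "Just another # Perl hacker";

# Under /x the literal space is ignored, so this is really /Justanother/
my $literal_space = $text =~ /Just another/x      ? 1 : 0;

# An explicit space still matches, as an escape or inside a character class
my $hex_space     = $text =~ /Just \x20 another/x ? 1 : 0;
my $class_space   = $text =~ /Just [ ] another/x  ? 1 : 0;

# An unescaped hash would start a comment, so the literal one is escaped
my $hash          = $text =~ /another \s+ \# \s+ Perl/x ? 1 : 0;

print "literal space: $literal_space\n";
print "hex space:     $hex_space\n";
print "class space:   $class_space\n";
print "escaped hash:  $hash\n";
```

Only the first pattern fails to match, because its space is formatting rather than part of the pattern.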

I don't have to use /x to put comments in my regex. The (?#COMMENT) sequence does that for me. It probably doesn't make the regex any more readable at first glance, though. I can mark the parts of a string right next to the parts of the pattern that represent it. Just because you can use (?#) doesn't mean you should. I think the patterns are much easier to read with /x:

my $isbn = '0-596-10206-2';

$isbn =~ m/\A(\d+)(?#group)-(\d+)(?#publisher)-(\d+)(?#item)-([\dX])\z/i;

print <<"HERE";
Group code:     $1
Publisher code: $2
Item:           $3
Checksum:       $4
HERE

Those are just Perl features, though. It's still up to me to present the regex in a way that other people can understand, just as I should do with any other code.

These explicated regexes can take up quite a bit of screen space, but I can hide them like any other code. I can create the regex as a string or create a regular expression object with qr// and return it:

sub isbn_regex {
    qr/ \A
        (\d+)  #group
            -
        (\d+)  #publisher
            -
        (\d+)  #item
            -
        ([\dX])
        \z
    /ix;
    }

I could get the regex and interpolate it into the match or substitution operators:

my $regex = isbn_regex();
if( $isbn =~ m/$regex/ ) {
    print "Matched!\n";
    }

Since I return a regular expression object, I can bind to it directly to perform a match:

my $regex = isbn_regex();
if( $isbn =~ $regex ) {
    print "Matched!\n";
    }

But, I can also just skip the $regex variable and bind to the return value directly. It looks odd, but it works:

if( $isbn =~ isbn_regex() ) {
    print "Matched!\n";
    }

If I do that, I can move all of the actual regular expressions out of the way. Not only that, I now should have a much easier time testing the regular expressions since I can get to them much more easily in the test programs.
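One way such a test program might look is sketched here with the core Test::More module. The isbn_regex subroutine is repeated so the sketch runs on its own, and apart from the ISBN used earlier the sample strings are invented for illustration.

```perl
use strict;
use warnings;
use Test::More;

# Same pattern as the isbn_regex() shown earlier, repeated so this runs alone
sub isbn_regex {
    qr/ \A
        (\d+)  # group
            -
        (\d+)  # publisher
            -
        (\d+)  # item
            -
        ([\dX])
        \z
    /ix;
    }

# Hypothetical test data: strings that should and should not match
my @good = ( '0-596-10206-2', '1-2-3-X' );
my @bad  = ( '0-596-10206', '0-596-10206-22', 'not an isbn', '' );

like(   $_, isbn_regex(), "[$_] matches"        ) for @good;
unlike( $_, isbn_regex(), "[$_] does not match" ) for @bad;

done_testing();
```

Because the pattern lives in a subroutine, the test file needs no copy of the regex text beyond the one it is testing.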

Global Matching

In Learning Perl we told you about the /g flag that you can use to make all possible substitutions, but it's more useful than that. I can use it with the match operator, where it does different things in scalar and list context. We told you that the match operator returns true if it matches and false otherwise. That's still true (we wouldn't have lied to you), but it's not just a Boolean value. The list context behavior is the most useful. With the /g flag, the match operator returns all of the captures:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

Even though I only have one set of captures in my regular expression, it makes as many matches as it can. Once it makes a match, Perl starts where it left off and tries again. I'll say more on that in a moment. I often run into another Perl idiom that's closely related to this, in which I don't want the actual matches, but just a count:

my $word_count = () = /(\S+)/g;

This uses a little-known but important rule: the result of a list assignment in scalar context is the number of elements in the list on the righthand side. In this case, that's the number of elements the match operator returns. This only works for a list assignment, which is assigning from a list on the righthand side to a list on the lefthand side. That's why I have the extra () in there.
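A short sketch makes the difference visible. The variable names are illustrative; the only thing that changes between the two assignments is the extra ().

```perl
use strict;
use warnings;

$_ = "Just another Perl hacker,";

# List assignment evaluated in scalar context: the number of righthand elements
my $word_count = () = /(\S+)/g;

# Without the (), the match itself is in scalar context: just true or false
my $matched = /(\S+)/;

print "Words:   $word_count\n";  # Words:   4
print "Matched: $matched\n";     # Matched: 1
```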

In scalar context, the /g flag does some extra work we didn't tell you about earlier. During a successful match, Perl remembers its position in the string, and when I match against that same string again, Perl starts where it left off in that string. It returns the result of one application of the pattern to the string:

$_ = "Just another Perl hacker,";
my @words = /(\S+)/g; # "Just" "another" "Perl" "hacker,"

while( /(\S+)/g ) { # scalar context
    print "Next word is '$1'\n";
    }

When I match against that same string again, Perl gets the side by side match:

Next word is 'Just'
Next word is 'another'
Next word is 'Perl'
Next word is 'hacker,'

I can even look at the match position as I go along. The built-in pos() operator returns the match position for the string I give it (or $_ by default). Every string maintains its own position. The first position in the string is 0, so pos() returns undef when it doesn't find a match and has been reset, and this only works when I'm using the /g flag (since there's no point in pos() otherwise):

$_ = "Just another Perl hacker,";
my $pos = pos( $_ );            # same as pos()
print "I'm at position [$pos]\n"; # undef

/(Just)/g;
$pos = pos();
print "[$1] ends at position $pos\n"; # 4

When my match fails, Perl resets the value of pos() to undef. If I continue matching, I'll start at the beginning (and potentially create an endless loop):

my( $third_word ) = /(Java)/g;
print "The next position is " . pos() . "\n";

As a side note, I really hate these print statements where I use the concatenation operator to get the result of a function call into the output. Perl doesn't have a dedicated way to interpolate function calls, so I can cheat a bit. I call the function in an anonymous array constructor, [ ... ], then immediately dereference it by wrapping @{ ... } around it:

print "The next position is @{ [ pos( $line ) ] }\n";

The pos() operator can also be an lvalue, which is the fancy programming way of saying that I can assign to it and change its value. I can fool the match operator into starting wherever I like. After I match the first word in $line, the match position is somewhere after the beginning of the string. After I do that, I use index to find the next h after the current match position. Once I have the offset for that h, I assign the offset to pos($line) so the next match starts from that position:

my $line = "Just another regex hacker,";

$line =~ /(\S+)/g;
print "The first word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

pos( $line ) = index( $line, 'h', pos( $line) );

$line =~ /(\S+)/g;
print "The next word is $1\n";
print "The next position is @{ [ pos( $line ) ] }\n";

Global Match Anchors

So far, my subsequent matches can "float," meaning they can start matching anywhere after the starting position. To anchor my next match exactly where I left off the last time, I use the \G anchor. It's just like the beginning of string anchor \A, except for where \G anchors at the current match position. If my match fails, Perl resets pos() and I start at the beginning of the string.

In this example, I anchor my pattern with \G. I have a word match, \w+. I use the /x flag to spread out the parts to enhance readability. My match only gets the first four words, since it can't match the comma (it's not in \w) after the first hacker. Since the next match must start where I left off, which is the comma, and the only thing I can match is whitespace or word characters, I can't continue. That next match fails, and Perl resets the match position to the beginning of $line:

my $line = "Just another regex hacker, Perl hacker,";

while( $line =~ /  \G \s* (\w+)  /xg ) {
    print "Found the word '$1'\n";
    print "Pos is now @{ [ pos( $line ) ] }\n";
    }
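To see the difference between a floating match and an anchored one side by side, here is a small sketch that runs both over the same string. The @floating and @anchored arrays are names invented for this comparison.

```perl
use strict;
use warnings;

my $line = "Just another regex hacker, Perl hacker,";

# Without \G, each match can float past the comma to find the next word
my @floating;
while( $line =~ / (\w+) /xg ) {
    push @floating, $1;
    }

# With \G, each match must start exactly where the previous one ended
my @anchored;
while( $line =~ / \G \s* (\w+) /xg ) {
    push @anchored, $1;
    }

print "Floating: @floating\n";  # Floating: Just another regex hacker Perl hacker
print "Anchored: @anchored\n";  # Anchored: Just another regex hacker
```

The first loop ends with a failed match, which resets pos(), so the second loop starts again from the beginning of the string.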

I have a way to get around Perl resetting the match position. If I want to try a match without resetting the starting point even if it fails, I can add the /c flag, which simply means to not reset the match position on a failed match. I can try something without suffering a penalty. If that doesn't work, I can try something else at the same match position. This feature is a poor man's lexer. Here's a simple-minded sentence parser:

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

while( 1 ) {
    my( $found, $type ) = do {
        if( $line =~ /\G([a-z]+(?:'[ts])?)/igc )
                        { ( $1, "a word"           ) }
        elsif( $line =~ /\G (\n) /xgc             )
                        { ( $1, "newline char"     ) }
        elsif( $line =~ /\G (\s+) /xgc            )
                        { ( $1, "whitespace"       ) }
        elsif( $line =~ /\G ( [[:punct:]] ) /xgc  )
                        { ( $1, "punctuation char" ) }
        else
                        { last;                      }
        };

    print "Found a $type [$found]\n";
    }

Look at that example again. What if I want to add more things I could match? I could add another branch to the decision structure. That's no fun. That's a lot of repeated code structure doing the same thing: match something, then return $1 and a description. It doesn't have to be like that, though. I rewrite this code to remove the repeated structure. I can store the regexes in the @items array. I use qr// to create the regexes, and I put the regexes in the order that I want to try them. The foreach loop goes through them successively until it finds one that matches. When it finds a match, it prints a message using the description and whatever showed up in $1. If I want to add more tokens, I just add their description to @items:

#!/usr/bin/perl
use strict;
use warnings;

my $line = "Just another regex hacker, Perl hacker, and that's it!\n";

my @items = (
    [ qr/\G([a-z]+(?:'[ts])?)/i, "word"        ],
    [ qr/\G(\n)/,                "newline"     ],
    [ qr/\G(\s+)/,               "whitespace"  ],
    [ qr/\G([[:punct:]])/,       "punctuation" ],
    );

LOOP: while( 1 ) {
    MATCH: foreach my $item ( @items ) {
        my( $regex, $description ) = @$item;

        next MATCH unless $line =~ /$regex/gc;

        print "Found a $description [$1]\n";
        last LOOP if $1 eq "\n";

        next LOOP;
        }
    }

Look at some of the things going on in this example. All matches need the /gc flags, so I add those flags to the match operator inside the foreach loop. I add it there because those flags don't affect the pattern, they affect the match operator.

My regex to match a "word," however, also needs the /i flag. I can't add that to the match operator because I might have other branches that don't want it. The code inside the block labeled MATCH doesn't know how it's going to get $regex, so I shouldn't create any code that forces me to form $regex in a particular way.
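The split between pattern flags and operator flags can be sketched in a few lines. The two qr// objects and the sample string here are hypothetical; the /i flag travels inside the compiled pattern, while /gc is applied wherever the match happens.

```perl
use strict;
use warnings;

my $line = "PERL hacker";

my $sensitive   = qr/\G([a-z]+)/;    # no /i compiled into this pattern
my $insensitive = qr/\G([a-z]+)/i;   # /i is part of this pattern

# /gc belongs to the match operator, so it is added at the point of use.
# The first attempt fails, and /c leaves the match position where it was.
my $first  = $line =~ /$sensitive/gc   ? $1 : undef;
my $second = $line =~ /$insensitive/gc ? $1 : undef;

print "Case-sensitive:   ", defined $first ? $first : 'no match', "\n";
print "Case-insensitive: $second\n";
```

The same match operator handles both patterns without knowing which one wants case insensitivity.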

Recursive Regular Expressions

Perl's feature that we call "regular expressions" really aren't; we've known this ever since Perl allowed backreferences (\1 and so on). With v5.10, there's no pretending since we now have recursive regular expressions that can do things such as balance parentheses, parse HTML, and decode JSON. There are several pieces to this that should please the subset of Perlers who tolerate everything else in the language so they can run a single pattern that does everything.

Repeating a Subpattern

Perl v5.10 added the (?PARNO) to refer to the pattern in a particular capture group. When I use that, the pattern in that capture group must match at that spot.

First, I start with a naïve program that tries to match something between quote marks. This program isn't the way I should do it, but I'll get to a correct way in a moment:

#!/usr/bin/perl
# quotes.pl

use v5.10;

$_ =<<'HERE';
Amelia said "I am a camel"
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    ( ['"] )
    /x;

Here I repeated the subpattern ( ['"] ). In other code, I would probably immediately recognize that as a chance to move repeated code into a subroutine. I might think that I can solve this problem with a simple backreference:

#!/usr/bin/perl
# quotes_backreference.pl

use v5.10;

$_ =<<'HERE';
Amelia said "I am a camel"
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    ( \1 )
    /x;

That works in this simple example. The \1 matches exactly the text matched in the first capture group. If it matched a double quote mark, it has to match a double quote mark again. Hold that thought, though, because the target text is not as simple as that. I want to follow a different path. Instead of using the backreference, I'll refer to a subpattern with the (?PARNO) syntax:

#!/usr/bin/perl
# quotes_parno.pl

use v5.10;

$_ =<<'HERE';
Amelia said 'I am a camel'
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    (?1)
    /x;

This works, at least as much as the first try in quotes.pl does. The (?1) uses the same pattern in that capture group, ( ['"] ). I don't have to repeat the pattern. However, this means that it might match a double quote mark in the first capture group but a single quote mark in the second. Repeating the pattern instead of the matched text might be what you want, but not in this case.
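A short sketch shows that difference on a deliberately mismatched string, one that opens with a double quote and closes with a single quote. The string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with mismatched quote marks
my $mismatched = q(Amelia said "I am a camel');

# \1 must repeat the text that group 1 matched, so it needs a second "
my $backref = $mismatched =~ / (['"]) (.*?) \1   /x ? 1 : 0;

# (?1) reruns the pattern of group 1, so either quote mark will do
my $parno   = $mismatched =~ / (['"]) (.*?) (?1) /x ? 1 : 0;

print "Backreference matches: $backref\n";  # 0
print "Subpattern matches:    $parno\n";    # 1
```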

There's another problem though. If the data have nested quotes, repeating the pattern can get confused:

#!/usr/bin/perl
# quotes_nested.pl

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    ( ['"] )
    (?<said>.*?)
    (?1)
    /x;

This matches only part of what I want it to match:

% perl quotes_nested.pl
Matched [Amelia said ]!

One problem is that I'm repeating the subpattern outside of the subpattern I'm repeating; it gets confused by the nested quotes. The other problem is that I'm not accounting for nesting. I change the pattern so I can match all of the quotes, assuming that they are nested:

#!/usr/bin/perl
# quotes_nested.pl

use v5.10;

$_ =<<'HERE';
He said 'Amelia said "I am a camel"'
HERE

say "Matched [$+{said}]!" if m/
    (?<said>                       # $1
    (?<quote>['"])
            (?:
                [^'"]++
                |
                (?<said> (?1) )
            )*
    \g{quote}
    )
    /x;

say join "\n", @{ $-{said} };

When I run this, I get both quotes:

% perl quotes_nested.pl
Matched ['Amelia said "I am a camel"']!
'Amelia said "I am a camel"'
"I am a camel"

This pattern is quite a change, though. First, I use a named capture. The regular expression still makes this available in the numbered capture buffers, so this is also $1:

(?<said>                       # $1
...
)

My next layer is another named capture to match the quote, and a backreference to that name to match the same quote again:

(?<said>                       # $1
(?<quote>['"])
...
\g{quote}
)

Now comes the tricky stuff. I want to match the stuff inside the quote marks, but if I run into another quote, I want to match that on its own as if it were a single element. To do that, I have an alternation I group with noncapturing parentheses:

(?:
    [^'"]++
    |
    (?<said> (?1) )
)*

The [^'"]++ matches one or more characters that aren't one of those quote marks. The ++ quantifier prevents the regular expression engine from backtracking.
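The effect of the possessive quantifier is easiest to see with a pattern that only succeeds if the engine backtracks. This sketch uses a made-up string and two patterns that differ only in + versus ++.

```perl
use strict;
use warnings;

# Hypothetical input: an opening quote mark with no closing one
my $string = q("no closing quote);

# Greedy: [^"]+ gives back its last character so the trailing [^"] can match
my $greedy     = $string =~ / " [^"]+  [^"] \z /x ? 1 : 0;

# Possessive: [^"]++ keeps everything it matched, so nothing is left over
my $possessive = $string =~ / " [^"]++ [^"] \z /x ? 1 : 0;

print "Greedy matches:     $greedy\n";      # 1
print "Possessive matches: $possessive\n";  # 0
```

In the quote-matching pattern that refusal to backtrack is what I want: once a run of non-quote characters matches, there is no reason to try a shorter run.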

If it doesn't match a non-quote mark, it tries the other side of the alternation, (?<said> (?1) ). The (?1) repeats the subpattern in the first capture group ($1). However, it repeats that pattern as an independent pattern, which is:

(?<said>                       # $1
...
)

It matches the whole thing again, but when it's done, it discards its captures, named or otherwise, so it doesn't affect the superpattern. That means that I can't remember its captures, so I have to wrap that part in its own named capture, reusing the said name.

I modify my string to include levels of nesting:

#!/usr/bin/perl
# quotes_three_nested.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

say "Matched [$+{said}]!" if m/
    (?<said>                       # $1
    (?<quote>['"])
            (?:
                [^'"]++
                |
                (?<said> (?1) )
            )*
    \g{quote}
    )
    /x;

say join "\n", @{ $-{said} };

It looks like it doesn't match the innermost quote because it outputs only two of them:

% perl quotes_three_nested.pl
Matched ["Top Level 'Middle Level "Bottom Level" Middle' Outside"]!
"Top Level 'Middle Level "Bottom Level" Middle' Outside"
'Middle Level "Bottom Level" Middle'

However, the pattern repeated in (?1) is independent, so once in there, none of those matches make it into the capture buffers for the whole pattern. I can fix that, though. The (?{ CODE }) construct, an experimental feature, allows me to run code during a regular expression. I can use it to output the substring I just matched each time I run the pattern. Along with that, I'll switch from using (?1), which refers to the first capture group, to (?R), which goes back to the start of the whole pattern:

#!/usr/bin/perl
# nested_show_matches.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

say "Matched [$+{said}]!" if m/
    (?<said>
    (?<quote>['"])
            (?:
                [^'"]++
                |
                (?R)
            )*
    \g{quote}
    )
    (?{ say "Inside regex: $+{said}" })
    /x;

Each time I run the pattern, even through (?R), I output the current value of $+{said}. In the subpatterns, that variable is localized to the subpattern and disappears at the end of the subpattern, although not before I can output it:

% perl nested_show_matches.pl
Inside regex: "Bottom Level"
Inside regex: 'Middle Level "Bottom Level" Middle'
Inside regex: "Top Level 'Middle Level "Bottom Level" Middle' Outside"
Matched ["Top Level 'Middle Level "Bottom Level" Middle' Outside"]!

I can see that in each level, the pattern recurses. It goes deeper into the strings, matches at the bottom level, then works its way back up.

I take this one step further by using the (?(DEFINE)...) feature to create and name subpatterns that I can use later:

#!/usr/bin/perl
# nested_define.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

say "Matched [$+{said}]!" if m/
    (?(DEFINE)
        (?<QUOTE> ['"])
        (?<NOT_QUOTE> [^'"])
    )
    (?<said>
    (?<quote>(?&QUOTE))
            (?:
                (?&NOT_QUOTE)++
                |
                (?R)
            )*
    \g{quote}
    )
    (?{ say "Inside regex: $+{said}" })
    /x;

Inside the (?(DEFINE)...) it looks like I have named captures, but those are really named subpatterns. They don't match until I call them with (?&NAME) later.

I don't like that say inside the pattern, just as I don't particularly like subroutines that output anything. Instead of that, I create an array before I use the match operator and push each match onto it. The $^N variable has the substring from the previous capture buffer. It's handy because I don't have to count or know names, so I don't need a named capture for said:

#!/usr/bin/perl
# nested_carat_n.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
    )
    (
    (?<quote>(?&QUOTE_MARK))
        (?:
            (?&NOT_QUOTE_MARK)++
            |
            (?R)
        )*
    \g{quote}
    )
    (?{ push @matches, $^N })
    /x;

say join "\n", @matches;

I get almost the same output:

% perl nested_carat_n.pl
Matched!
"Bottom Level"
'Middle Level "Bottom Level" Middle'
"Top Level 'Middle Level "Bottom Level" Middle' Outside"

If I can define some parts of the pattern with names, I can go even further by giving a name to not just QUOTE_MARK and NOT_QUOTE_MARK, but everything that makes up a quote:

#!/usr/bin/perl
# nested_grammar.pl

use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
        (?<QUOTE>
            (
                (?<quote>(?&QUOTE_MARK))

                (?:
                    (?&NOT_QUOTE_MARK)++
                    |
                    (?&QUOTE)
                )*
                \g{quote}
            )
            (?{ push @matches, $^N })
        )
    )

    (?&QUOTE)
    /x;

say join "\n", @matches;

Almost everything is in the (?(DEFINE)...), but nothing happens until I call (?&QUOTE) at the end to actually match the subpattern I defined with that name.

Pause for a moment. While worrying about the features and how they work, you might have missed what just happened. I started with a regular expression; now I have a grammar! I can define tokens and recurse.
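The same grammar shape handles the balanced parentheses mentioned at the start of this section. This is a minimal sketch, not code from any module: the NOT_PAREN and GROUP names, the is_balanced subroutine, and the sample strings are all invented for illustration.

```perl
use strict;
use warnings;
use v5.10;

# A two-rule grammar: a GROUP is ( ... ) holding non-parens or more GROUPs
my $balanced = qr/
    \A (?&GROUP) \z

    (?(DEFINE)
        (?<NOT_PAREN> [^()] )
        (?<GROUP>
            \(
                (?:
                    (?&NOT_PAREN)++
                    |
                    (?&GROUP)
                )*
            \)
        )
    )
    /x;

# Hypothetical helper that returns 1 or 0 for a whole string
sub is_balanced { $_[0] =~ $balanced ? 1 : 0 }

foreach my $string ( '(a (b) (c (d)))', '(a (b)', '()' ) {
    my $verdict = is_balanced( $string ) ? 'balanced' : 'not balanced';
    say "[$string] is $verdict";
    }
```

The recursion happens where GROUP calls itself with (?&GROUP), just as QUOTE calls itself in nested_grammar.pl.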

I have one more feature to show before I can get to the really good example. The special variable $^R holds the result of the previously evaluated (?{...}). That is, the value of the last evaluated expression in (?{...}) ends up in $^R. Even better, I can affect $^R how I like because it is writable.

Now that I know that, I can modify my program to build up the array of matches by returning an array reference of all submatches at the end of my (?{...}). Each time I have that (?{...}), I add the substring in $^N to the values I remembered previously. It's a kludgey way of building an array, but it demonstrates the feature:

#!/usr/bin/perl
# nested_grammar_r.pl

use Data::Dumper;
use v5.10;

$_ =<<'HERE';
Outside "Top Level 'Middle Level "Bottom Level" Middle' Outside"
HERE

my @matches;
local $^R = [];

say "Matched!" if m/
    (?(DEFINE)
        (?<QUOTE_MARK> ['"])
        (?<NOT_QUOTE_MARK> [^'"])
        (?<QUOTE>
            (
                (?<quote>(?&QUOTE_MARK))

                (?:
                    (?&NOT_QUOTE_MARK)++
                    |
                    (?&QUOTE)
                )*
                \g{quote}
            )
            (?{ [ @{$^R}, $^N ] })
        )
    )

    (?&QUOTE) (?{ @matches = @{ $^R } })
    /x;


say join "\n", @matches;

Before the match, I set the value of $^R to be an empty anonymous array. At the end of the QUOTE definition, I create a new anonymous array with the values already inside $^R and the new value in $^N. That new anonymous array is the last evaluated expression and becomes the new value of $^R. At the end of the pattern, I assign the values in $^R to @matches so I have them after the match ends.
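Stripped of the grammar, the way $^R carries a value from one (?{...}) to the next can be sketched with plain numbers. The $total package variable and the string are made up for this illustration.

```perl
use strict;
use warnings;

our $total;   # package variable, filled in from inside the pattern

"abc" =~ /
    a (?{ 10 })            # the result of this block becomes the saved value
    b (?{ $^R + 5 })       # reads the saved value and returns a new one
    c (?{ $total = $^R })  # copies the latest value out of the pattern
    /x;

print "Total: $total\n";
```

Each block sees the result of the block before it, which is the same hand-off that builds the array reference in nested_grammar_r.pl.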

Now that I have all of that, I can get to the code I want to show you, which I'm not going to explain. Randal Schwartz used these features to write a minimal JSON parser as a Perl regular expression (but really a grammar); he posted it to PerlMonks as "JSON parser as a single Perl Regex". He created this as a minimal parser for a very specific client need where the JSON data are compact, appear on a single line, and are limited to ASCII:

#!/usr/bin/perl
use Data::Dumper qw(Dumper);

my $FROM_JSON = qr{

(?&VALUE) (?{ $_ = $^R->[1] })

(?(DEFINE)

(?<OBJECT>
  (?{ [$^R, {}] })
  \{
    (?: (?&KV) # [[$^R, {}], $k, $v]
      (?{  #warn Dumper { obj1 => $^R };
     [$^R->[0][0], {$^R->[1] => $^R->[2]}] })
      (?: , (?&KV) # [[$^R, {...}], $k, $v]
        (?{ # warn Dumper { obj2 => $^R };
       [$^R->[0][0], {%{$^R->[0][1]}, $^R->[1] => $^R->[2]}] })
      )*
    )?
  \}
)

(?<KV>
  (?&STRING) # [$^R, "string"]
  : (?&VALUE) # [[$^R, "string"], $value]
  (?{  #warn Dumper { kv => $^R };
     [$^R->[0][0], $^R->[0][1], $^R->[1]] })
)

(?<ARRAY>
  (?{ [$^R, []] })
  \[
    (?: (?&VALUE) (?{ [$^R->[0][0], [$^R->[1]]] })
      (?: , (?&VALUE) (?{  #warn Dumper { atwo => $^R };
             [$^R->[0][0], [@{$^R->[0][1]}, $^R->[1]]] })
      )*
    )?
  \]
)

(?<VALUE>
  \s*
  (
      (?&STRING)
    |
      (?&NUMBER)
    |
      (?&OBJECT)
    |
      (?&ARRAY)
    |
    true (?{ [$^R, 1] })
  |
    false (?{ [$^R, 0] })
  |
    null (?{ [$^R, undef] })
  )
  \s*
)

(?<STRING>
  (
    "
    (?:
      [^\\"]+
    |
      \\ ["\\/bfnrt]
#    |
#      \\ u [0-9a-fA-f]{4}
    )*
    "
  )

  (?{ [$^R, eval $^N] })
)

(?<NUMBER>
  (
    -?
    (?: 0 | [1-9]\d* )
    (?: \. \d+ )?
    (?: [eE] [-+]? \d+ )?
  )

  (?{ [$^R, eval $^N] })
)

) }xms;

sub from_json {
  local $_ = shift;
  local $^R;
  eval { m{\A$FROM_JSON\z}; } and return $_;
  die $@ if $@;
  return 'no match';
}

local $/;
while (<>) {
  chomp;
  print Dumper from_json($_);
}

There are more than a few interesting things in Randal's code that I leave to you to explore:

Part of his intermediate data structure tells the grammar what he just did.
It fails very quickly for invalid JSON data, although Randal says with more work it could fail faster.
Most interestingly, he replaces the target string with the data structure by assigning to $_ in the last (?{...}).

If you think that's impressive, you should see Tom Christiansen's Stack Overflow refutation that a regular expression can't parse HTML, in which he used many of the same features.

Lookarounds

Lookarounds are arbitrary anchors for regexes. We showed several anchors in Learning Perl, such as \A, \z, and \b, and I just showed the \G anchor. Using a lookaround, I can describe my own anchor as a regex, and just like the other anchors, they don't consume part of the string. They specify a condition that must be true, but they don't add to the part of the string that the overall pattern matches.

Lookarounds come in two flavors: lookaheads, which look ahead to assert a condition immediately after the current match position, and lookbehinds, which look behind to assert a condition immediately before the current match position. This sounds simple, but it's easy to misapply these rules. The trick is to remember that it anchors to the current match position, then figure out on which side it applies.

Both lookaheads and lookbehinds have two types: positive and negative. The positive lookaround asserts that its pattern has to match. The negative lookaround asserts that its pattern doesn't match. No matter which I choose, I have to remember that they apply to the current match position, not anywhere else in the string.
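A brief sketch shows both types of lookahead at work on a made-up string. The variable names are illustrative; $& holds the matched text and $-[0] holds the offset where the match started.

```perl
use strict;
use warnings;

# Hypothetical input with foo followed by two different things
my $string = "foobar foobaz";

# Positive lookahead: foo only where bar follows; bar is not part of the match
$string =~ /foo(?=bar)/;
my( $pos_text, $pos_start ) = ( $&, $-[0] );

# Negative lookahead: foo only where bar does not follow
$string =~ /foo(?!bar)/;
my( $neg_text, $neg_start ) = ( $&, $-[0] );

print "Positive: [$pos_text] at offset $pos_start\n";  # [foo] at offset 0
print "Negative: [$neg_text] at offset $neg_start\n";  # [foo] at offset 7
```

In both cases the matched text is only foo, because the assertion checks a condition without adding to the match.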

Lookahead Assertions, (?=PATTERN) and (?!PATTERN)

Lookahead assertions let me peek at the string immediately ahead of the current match position. The assertion doesn't consume part of the string, and if it succeeds, matching picks up right after the current match position.

Positive lookahead assertions

In Learning Perl, we included an exercise to check for both "Fred" and "Wilma" on the same line of input, no matter the order they appeared on the line. The trick we wanted to show to the novice Perler is that two regexes can be simpler than one. One way to do this repeats both Wilma and Fred in the alternation so I can try either order. A second try separates them into two regexes:

#!/usr/bin/perl
# fred_and_wilma.pl

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /Fred.*Wilma|Wilma.*Fred/;
print "Matches: $_\n" if /Fred/ && /Wilma/;

I can make a simple, single regex using a positive lookahead assertion, denoted by (?=PATTERN). This assertion doesn't consume text in the string, but if it fails, the entire regex fails. In this example, in the positive lookahead assertion I use .*Wilma. That pattern must be true immediately after the current match position:

$_ = "Here come Wilma and Fred!";
print "Matches: $_\n" if /(?=.*Wilma).*Fred/;

Since I used that at the start of my pattern, that means it has to be true at the beginning of the string. Specifically, at the beginning of the string, I have to be able to match any number of characters except a newline followed by Wilma. If that succeeds, it anchors the rest of the pattern to its position (the start of the string). Figure 1-1 shows the two ways that can work, depending on the order of Fred and Wilma in the string. The .*Wilma anchors where it started matching. The elastic .*, which can match any number of nonnewline characters, anchors at the start of the string.

Figure 1-1. The positive lookahead assertion (?=.*Wilma) anchors the pattern at the start of the string

It's easier to understand lookarounds by seeing when they don't work, though. I'll change my pattern a bit by removing the .* from the lookahead assertion. At first it appears to work, but it fails when I reverse the order of Fred and Wilma in the string:

    $_ = "Here come Wilma and Fred!";
    print "Matches: $_\n" if /(?=Wilma).*Fred/; # Works

    $_ = "Here come Fred and Wilma!";
    print "Matches: $_\n" if /(?=Wilma).*Fred/; # Doesn't work

Figure 1-2 shows what happens. In the first case, the lookahead anchors at the start of Wilma. The regex tries the assertion at the start of the string, finds that it doesn't work, then moves over a position and tries again. It keeps doing this until it gets to Wilma. When it succeeds it sets the anchor. Once it sets the anchor, the rest of the pattern has to start from that position.

In the first case, .*Fred can match from that anchor because Fred comes after Wilma. The second case in Figure 1-2 does the same thing. The regex tries that assertion at the beginning of the string, finds that it doesn't work, and moves on to the next position. By the time the lookahead assertion matches, it has already passed Fred. The rest of the pattern has to start from the anchor, but it can't match.

Figure 1-2. The positive lookahead assertion (?=Wilma) anchors the pattern at Wilma
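One way to see where each lookahead pins the match is to ask Perl for the offset where the overall match starts, which it keeps in $-[0]. This is a minimal sketch rather than part of the original programs, and match_start is a hypothetical helper name introduced only for this illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the offset where the whole match starts,
# or undef if the regex doesn't match at all.
sub match_start {
    my( $string, $regex ) = @_;
    return $string =~ $regex ? $-[0] : undef;
    }

my @strings = (
    "Here come Wilma and Fred!",
    "Here come Fred and Wilma!",
    );

foreach my $string ( @strings ) {
    foreach my $regex ( qr/(?=.*Wilma).*Fred/, qr/(?=Wilma).*Fred/ ) {
        my $start = match_start( $string, $regex );
        printf "%-27s %-24s %s\n", $string, $regex,
            defined $start ? "match starts at $start" : "no match";
        }
    }
```

With (?=.*Wilma) the match should start at offset 0 for either word order. With (?=Wilma) it should start at the W of Wilma, or fail when Fred has already gone by.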

Since a lookahead assertion doesn't consume any of the string, I can use one in a pattern for split when I don't really want to discard the parts of the pattern that match. In this example, I want to break apart the words in the studly cap string. I want to split it based on the initial capital letter. I want to keep the initial letter, though, so I use a lookahead assertion instead of a character-consuming string. This is different from the separator retention mode because the split pattern isn't really a separator; it's just an anchor:

    my @words = split /(?=[A-Z])/, 'CamelCaseString';
    print join '_', map { lc } @words; # camel_case_string
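To make the contrast with separator retention mode concrete, here is a small sketch that splits the same string both ways. The variable names are only illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'CamelCaseString';

# The lookahead matches at a position, so nothing is removed from the string.
my @anchored = split /(?=[A-Z])/, $string;

# Separator retention mode: the capital letter is a real separator, and the
# capture group hands it back as its own field (after a leading empty field).
my @retained = split /([A-Z])/, $string;

print join( '|', @anchored ), "\n";
print join( '|', @retained ), "\n";
```

The lookahead version keeps each capital letter attached to its word, while the capturing version returns the capitals as separate list elements that I would have to glue back on.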

Negative lookahead assertions

Suppose I want to find the input lines that contain Perl, but only if that isn't Perl6 or Perl 6. I might try a negated character class to specify the pattern right after the l in Perl to ensure that the next character isn't a 6. I also use the word boundary anchors \b because I don't want to match in the middle of other words, such as "BioPerl" or "PerlPoint":

    #!/usr/bin/perl
    # not_perl6.pl

    print "Trying negated character class:\n";
    while( <> ) {
        print if /\bPerl[^6]\b/;
        }

I'll try this with some sample input:

    # sample input
    Perl6 comes after Perl 5.
    Perl 6 has a space in it.
    I just say "Perl".
    This is a Perl 5 line
    Perl 5 is the current version.
    Just another Perl 5 hacker,
    At the end is Perl
    PerlPoint is like PowerPoint
    BioPerl is genetic

It doesn't work for all the lines it should. It only finds four of the lines that have Perl without a trailing 6, and a line that has a space between Perl and 6. Note that even in the first line of output, the match still works because it matches the Perl 5 at the end, which is Perl, a space (which satisfies [^6]), and then the word boundary between that space and the 5 (a word character):

    Trying negated character class:
        Perl6 comes after Perl 5.
        Perl 6 has a space in it.
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,

This doesn't work because there has to be a character after the l in Perl. Not only that, I specified a word boundary. If that character after the l is a nonword character, such as the " in I just say "Perl", the word boundary at the end fails. If I take off the trailing \b, now PerlPoint matches. I haven't even tried handling the case where there is a space between Perl and 6. For that I'll need something much better.

To make this really easy, I can use a negative lookahead assertion. I don't want to consume a character after the l, and since an assertion doesn't consume characters, it's the right tool to use. I just want to say that if there's anything after Perl, it can't be a 6, even if there is some whitespace between them. The negative lookahead assertion uses (?!PATTERN). To solve this problem, I use \s?6 as my pattern, denoting the optional whitespace followed by a 6:

    print "Trying negative lookahead assertion:\n";
    while( <> ) {
        print if /\bPerl(?!\s?6)\b/;
        }

Now the output finds all of the right lines:

    Trying negative lookahead assertion:
        Perl6 comes after Perl 5.
        I just say "Perl".
        This is a Perl 5 line
        Perl 5 is the current version.
        Just another Perl 5 hacker,
        At the end is Perl
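The two patterns can also be compared side by side without an input file. This sketch assumes the sample lines are stored in an array (so, unlike lines read with <>, they have no trailing newline) and uses grep to collect what each pattern accepts:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample input, held in an array instead of being read from <>.
my @lines = (
    'Perl6 comes after Perl 5.',
    'Perl 6 has a space in it.',
    'I just say "Perl".',
    'This is a Perl 5 line',
    'Perl 5 is the current version.',
    'Just another Perl 5 hacker,',
    'At the end is Perl',
    'PerlPoint is like PowerPoint',
    'BioPerl is genetic',
    );

# Collect the lines that each pattern accepts.
my @negated   = grep { /\bPerl[^6]\b/     } @lines;
my @lookahead = grep { /\bPerl(?!\s?6)\b/ } @lines;

print "Negated character class:\n";
print "    $_\n" foreach @negated;

print "Negative lookahead assertion:\n";
print "    $_\n" foreach @lookahead;
```

The negated character class wrongly accepts the Perl 6 line and misses the lines where Perl is followed by punctuation or the end of the line. The lookahead version rejects the Perl 6 line and picks up those two lines.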

Remember that (?!PATTERN) is a lookahead assertion, so it looks after the current match position. That's why this next pattern still matches. The lookahead asserts that right before the b in bar the next thing isn't foo. Since the next thing is bar, which is not foo, it matches. People often confuse this to mean that the thing before bar can't be foo, but each uses the same starting match position, and since bar is not foo, they both work:

    if( 'foobar' =~ /(?!foo)bar/ ) {
        print "Matches! That's not what I wanted!\n";
        }
    else {
        print "Doesn't match! Whew!\n";
        }

Lookbehind Assertions, (?<!PATTERN) and (?<=PATTERN)

Instead of looking ahead at the part of the string coming up, I can use a lookbehind to check the part of the string the regular expression engine has already processed. Due to Perl's implementation details, the lookbehind assertions have to be a fixed width, so I can't use variable-width quantifiers in them as some other languages can.

Now I can try to match bar that doesn't follow a foo. In the last section I couldn't use a negative lookahead assertion because that looks forward in the string. A negative lookbehind assertion, denoted by (?<!PATTERN), looks backward. That's just what I need. Now I get the right answer:

    #!/usr/bin/perl
    # correct_foobar.pl

    if( 'foobar' =~ /(?<!foo)bar/ ) {
        print "Matches! That's not what I wanted!\n";
        }
    else {
        print "Doesn't match! Whew!\n";
        }
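To see the difference between the two assertions across more than one string, here is a short sketch that records whether each pattern matches. The extra test strings bazbar and bar are assumptions added for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %result;

foreach my $string ( qw( foobar bazbar bar ) ) {
    $result{$string} = {
        # Looks forward from the b: "bar" is never "foo", so this always passes.
        lookahead  => ( $string =~ /(?!foo)bar/  ? 1 : 0 ),
        # Looks backward from the b: fails only when "foo" comes right before.
        lookbehind => ( $string =~ /(?<!foo)bar/ ? 1 : 0 ),
        };

    printf "%-8s lookahead: %d  lookbehind: %d\n",
        $string, @{ $result{$string} }{ qw( lookahead lookbehind ) };
    }
```

Only the negative lookbehind distinguishes foobar from the other two strings, which is the behavior I wanted from the start.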

Now, since the regex has already processed that part of the string by the time it gets to bar, my lookbehind assertion can't be a variable-width pattern. I can't use the quantifiers to make a variable-width pattern because the engine is not going to backtrack in the string to make the lookbehind work. I won't be able to check for a variable number of os in fooo:

'foooobar' =~ /(?<!fo+)bar/;

When I try that, I get the error telling me that I can't do that, and even though it merely says not implemented, don't hold your breath waiting for it:

Variable length lookbehind not implemented in regex...

The positive lookbehind assertion also looks backward, but its pattern must match. The only time I seem to use this is in substitutions in concert with another assertion. Using both a lookbehind and a lookahead assertion, I can make some of my substitutions easier to read.

For example, throughout the book I've used variations of hyphenated words because I couldn't decide which one I should use. Should it be "builtin" or "built-in"? Depending on my mood or typing skills, I used either of them (O'Reilly Media deals with this by specifying what I should use).

I needed to clean up my inconsistency. I knew the part of the word on the left of the hyphen, and I knew the text on the right of the hyphen. At the position where they meet, there should be a hyphen. If I think about that for a moment, I've just described the ideal situation for lookarounds: I want to put something at a particular position, and I know what should be around it. Here's a sample program to use a positive lookbehind to check the text on the left and a positive lookahead to check the text on the right. Since the regex only matches when those sides meet, that means that it's discovered a missing hyphen. When I make the substitution, it puts the hyphen at the match position, and I don't have to worry about the particular text:

    my @hyphenated = qw( built-in );

    foreach my $word ( @hyphenated ) {
        my( $front, $back ) = split /-/, $word;

        $text =~ s/(?<=$front)(?=$back)/-/g;
        }
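The fragment above assumes $text already holds the manuscript. A self-contained sketch might look like this, where the sample sentence is made up for illustration and the \Q...\E quoting is my own addition to keep any regex metacharacters in the word parts literal:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text with both spellings.
my $text = 'The builtin functions and the built-in types are both builtin.';

my @hyphenated = qw( built-in );

foreach my $word ( @hyphenated ) {
    my( $front, $back ) = split /-/, $word;

    # Match the empty position between the two halves and put a hyphen there.
    $text =~ s/(?<=\Q$front\E)(?=\Q$back\E)/-/g;
    }

print "$text\n";
```

The spelling that already has its hyphen is left alone because the lookahead sees -in rather than in at that position. Note that this simple version would also hyphenate the same letters inside a longer word.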

If that's not a complicated enough example, try this one. Let's use the lookarounds to add commas to numbers. Jeffrey Friedl shows one attempt in Mastering Regular Expressions, adding commas to the US population. The US Census Bureau has a population clock so you can use the latest number if you're reading this book a long time from now:

    $pop = 316792343;  # that's for Feb 10, 2007

    # From Jeffrey Friedl
    $pop =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;

That works, mostly. The positive lookbehind (?<=\d) wants to match a digit, and the positive lookahead (?=(?:\d\d\d)+$) wants to find groups of three digits all the way to the end of the string. This breaks when I have floating-point numbers, such as currency. For instance, my broker tracks my stock positions to four decimal places. When I try that substitution, I get no comma on the left side of the decimal point and one on the fractional side. It's because of that end-of-string anchor:

    $money = '$1234.5678';

    $money =~ s/(?<=\d)(?=(?:\d\d\d)+$)/,/g;  # $1234.5,678

I can change that a bit. Instead of the end-of-string anchor, I'll use a word boundary, \b. That might seem weird, but remember that a digit is a word character. That gets me the comma on the left side, but I still have that extra comma:

    $money = '$1234.5678';

    $money =~ s/(?<=\d)(?=(?:\d\d\d)+\b)/,/g;  # $1,234.5,678

What I really want for that first part of the regex is to use the lookbehind to match a digit, but not when it's preceded by a decimal point. That's the description of a negative lookbehind, (?<!\.\d). Since all of these match at the same position, it doesn't matter that some of them might overlap as long as they all do what I need:

    $money = '$1234.5678';

    $money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.5678

That looks like it works. Except it doesn't when I track things to five decimal places:

    $money = '$1234.56789';

    $money =~ s/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,/g; # $1,234.56,789

That's the problem with regular expressions. They work for the cases we try them on, but some person comes along with something different to break what I'm proud of.
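The three attempts so far can be lined up against the same inputs to see exactly where each one breaks. This is only a sketch: the add_commas helper and the hash keys are hypothetical names, and the patterns are the ones shown above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three attempts from the text, keyed by illustrative names.
my %attempt = (
    end_anchor    => qr/(?<=\d)(?=(?:\d\d\d)+$)/,
    word_boundary => qr/(?<=\d)(?=(?:\d\d\d)+\b)/,
    no_decimal    => qr/(?<!\.\d)(?<=\d)(?=(?:\d\d\d)+\b)/,
    );

# Hypothetical helper: apply one attempt to a copy of the string.
sub add_commas {
    my( $string, $regex ) = @_;
    $string =~ s/$regex/,/g;
    return $string;
    }

foreach my $input ( '316792343', '$1234.5678', '$1234.56789' ) {
    foreach my $name ( qw( end_anchor word_boundary no_decimal ) ) {
        printf "%-12s %-14s %s\n",
            $input, $name, add_commas( $input, $attempt{$name} );
        }
    }
```

Each refinement fixes the previous failure, but the five-decimal-place input still gets a stray comma in its fractional part.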

I tried for a while to fix this problem but couldn't come up with something manageable. I even asked about it on Stack Overflow as a last resort. I couldn't save the example without using another advanced Perl feature.

The \K, added in v5.10, can act like a variable-width lookbehind, which Perl doesn't do. Michael Carman came up with this regex:

    s/(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)/,/g;

The \K allows the pattern before it to match, but not be replaced. The substitution replaces only the part of the string after \K. I can break the pattern up to see how it works:

    s/
        (?<!\.)(?:\b|\G)\d+?
        \K
        (?=(?:\d\d\d)+\b)
    /,/xg;

The second part, after the \K, is the same thing that I was doing before. The magic comes in the first part, which I break into its subparts:

    (?<!\.)
    (?:\b|\G)
    \d+?

There's a negative lookbehind assertion to check for something other than a dot. After that, there's an alternation that asserts either a word boundary or the \G anchor. That \G is the magic that was missing from my previous tries and lets my regular expression float past that decimal point. After that word boundary or current match position, there has to be one or more digits, nongreedily.
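Wrapped in a subroutine, the pattern is easy to try against the inputs that broke the earlier attempts. This is a sketch that assumes Perl v5.10 or later for \K, and commify is a hypothetical name for the wrapper:

```perl
#!/usr/bin/perl
use v5.10;
use strict;
use warnings;

# Hypothetical wrapper around Michael Carman's substitution.
sub commify {
    my( $number ) = @_;

    $number =~ s/
        (?<!\.)(?:\b|\G)\d+?   # leading digits, never right after a dot
        \K                     # keep everything matched so far
        (?=(?:\d\d\d)+\b)      # groups of three digits up to a word boundary
        /,/xg;

    return $number;
    }

say commify( $_ ) foreach qw( 123 316792343 $1234.5678 $1234.56789 );
```

The integer part gets its commas and the fractional part is left alone, no matter how many decimal places it has.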

Later in this chapter I'll come back to this example when I show how to debug regular expressions. Before I get to that, I want to show a few other simpler regex-demystifying techniques.

Debugging Regular Expressions

While trying to figure out a regex, whether one I found in someone else's code or one I wrote myself (maybe a long time ago), I can turn on Perl's regex debugging mode.

The -D Switch

Perl's -D switch turns on debugging options for the Perl interpreter (not for your program, as in Chapter 3). The switch takes a series of letters or numbers to indicate what it should turn on. The -Dr option turns on regex parsing and execution debugging.

I can use a short program to examine a regex. The first argument is the match string and the second argument is the regular expression. I save this program as explain_regex.pl:

    #!/usr/bin/perl
    # explain_regex.pl

    $ARGV[0] =~ /$ARGV[1]/;

When I try this with the target string 'Just another Perl hacker,' and the regex 'Just another (\S+) hacker,', I see two major sections of output, which the perldebguts documentation explains at length. First, Perl compiles the regex, and the -Dr output shows how Perl parsed the regex. It shows the regex nodes, such as EXACT and NPOSIXD[\s], as well as any optimizations, such as anchored "Just another ". Second, it tries to match the target string, and shows its progress through the nodes. It's a lot of information, but it shows me exactly what it's doing:

    % perl -Dr explain_regex.pl 'Just another Perl hacker,' 'Just another (\S+) hacker,'
    Omitting $` $& $' support (0x0).

    EXECUTING...

    Compiling REx "Just another (\S+) hacker,"
    rarest char k at 4
    rarest char J at 0
    Final program:
       1: EXACT <Just another > (6)
       6: OPEN1 (8)
       8:   PLUS (10)
       9:     NPOSIXD[\s] (0)
      10: CLOSE1 (12)
      12: EXACT < hacker,> (15)
      15: END (0)
    anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
    Guessing start of match in sv for REx "Just another (\S+) hacker," against "Just another Perl hacker,"
    Found anchored substr "Just another " at offset 0...
    Found floating substr " hacker," at offset 17...
    Guessed: match at offset 0
    Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
       0 <> <Just anoth>         |  1:EXACT <Just another >(6)
      13 <ther > <Perl hacke>    |  6:OPEN1(8)
      13 <ther > <Perl hacke>    |  8:PLUS(10)
                                      NPOSIXD[\s] can match 4 times out of 2147483647...
      17 < Perl> < hacker,>      | 10:  CLOSE1(12)
      17 < Perl> < hacker,>      | 12:  EXACT < hacker,>(15)
      25 <Perl hacker,> <>       | 15:  END(0)
    Match successful!
    Freeing REx: "Just another (\S+) hacker,"

The re pragma, which comes with Perl, has a debugging mode that doesn't require a -DDEBUGGING enabled interpreter. Once I turn on use re 'debug', it applies for the rest of the scope. Since v5.10 the pragma is lexically scoped like most pragmata, although it has both compile-time and runtime effects. I modify my previous program to use the re pragma instead of the command-line switch:

    #!/usr/bin/perl

    use re 'debug';

    $ARGV[0] =~ /$ARGV[1]/;

I don't have to modify my program to use re since I can also load it from the command line. When I run this program with a regex as its argument, I get almost the same exact output as my previous -Dr example:

    % perl -Mre=debug explain_regex.pl 'Just another Perl hacker,' 'Just another (\S+) hacker,'
    Compiling REx "Just another (\S+) hacker,"
    Final program:
       1: EXACT <Just another > (6)
       6: OPEN1 (8)
       8:   PLUS (10)
       9:     NPOSIXD[\s] (0)
      10: CLOSE1 (12)
      12: EXACT < hacker,> (15)
      15: END (0)
    anchored "Just another " at 0 floating " hacker," at 14..2147483647 (checking anchored) minlen 22
    Guessing start of match in sv for REx "Just another (\S+) hacker," against "Just another Perl hacker,"
    Found anchored substr "Just another " at offset 0...
    Found floating substr " hacker," at offset 17...
    Guessed: match at offset 0
    Matching REx "Just another (\S+) hacker," against "Just another Perl hacker,"
       0 <> <Just anoth>         |  1:EXACT <Just another >(6)
      13 <ther > <Perl hacke>    |  6:OPEN1(8)
      13 <ther > <Perl hacke>    |  8:PLUS(10)
                                      NPOSIXD[\s] can match 4 times out of 2147483647...
      17 < Perl> < hacker,>      | 10:  CLOSE1(12)
      17 < Perl> < hacker,>      | 12:  EXACT < hacker,>(15)
      25 <Perl hacker,> <>       | 15:  END(0)
    Match successful!
    Freeing REx: "Just another (\S+) hacker,"
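In a larger program I may want this output for only one regex. The sketch below assumes a perl recent enough (v5.10 or later) that use re 'debug' is lexically scoped, as the re documentation describes, and turns it on inside a bare block. The variable names are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 'Just another Perl hacker,';
my $word;

{
    # Debugging output (sent to STDERR) should cover only this block.
    use re 'debug';
    ( $word ) = $target =~ /Just another (\S+) hacker,/;
    }

# This regex is outside the block, so it should run without debugging output.
my $quiet = $target =~ /hacker/ ? 1 : 0;

print "Captured: $word\n";
```

Keeping the pragma in a small block makes the screens of output much easier to tie back to the one pattern I care about.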

I can follow that output easily because it's a simple pattern. I can run this with Michael Carman's comma-adding pattern:

    #!/usr/bin/perl
    # comma_debug.pl

    use re 'debug';

    $money = '$1234.56789';

    $money =~ s/(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)/,/g; # $1,234.56789

    print $money;

There are screens and screens of output. I want to go through the highlights because they show you how to read the output. The first part shows the regex program:

    % perl comma_debug.pl
    Compiling REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)"
    Final program:
       1: UNLESSM[-1] (7)
       3:   EXACT <.> (5)
       5:   SUCCEED (0)
       6: TAIL (7)
       7: BRANCH (9)
       8:   BOUND (12)
       9: BRANCH (FAIL)
      10:   GPOS (12)
      11: TAIL (12)
      12: MINMOD (13)
      13: PLUS (15)
      14:   POSIXD[\d] (0)
      15: KEEPS (16)
      16: IFMATCH[0] (28)
      18:   CURLYM[0] {1,32767} (25)
      20:     POSIXD[\d] (21)
      21:     POSIXD[\d] (22)
      22:     POSIXD[\d] (23)
      23:     SUCCEED (0)
      24:   NOTHING (25)
      25:   BOUND (26)
      26:   SUCCEED (0)
      27: TAIL (28)
      28: END (0)

The next parts represent repeated matches because it's the substitution operator with the /g flag. The first one matches against the entire string, and the match position is at 0:

    GPOS:0 minlen 1
    Matching REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)" against "$1234.56789"

The first group of lines, prefixed with 0, show the regex engine going through the negative lookbehind, the \b, and the \G, matching at the start of the string. The first set of <> has nothing, and the second has the rest of the string:

    0 <> <$1234.5678>         |  1:UNLESSM[-1](7)
    0 <> <$1234.5678>         |  7:BRANCH(9)
    0 <> <$1234.5678>         |  8:  BOUND(12)
                                     failed...
    0 <> <$1234.5678>         |  9:BRANCH(11)
    0 <> <$1234.5678>         | 10:  GPOS(12)
    0 <> <$1234.5678>         | 12:  MINMOD(13)
    0 <> <$1234.5678>         | 13:  PLUS(15)
                                     POSIXD[\d] can match 0 times out of 1...
                                     failed...
                                   BRANCH failed...

The regex engine then moves on, shifting over one position:

    1 <$> <1234.56789>        |  1:UNLESSM[-1](7)

It checks the negative lookbehind, and there's no dot:

    0 <> <$1234.5678>         |  3:  EXACT <.>(5)
                                     failed...

It goes through the anchors again and finds digits:

    1 <$> <1234.56789>        |  7:BRANCH(9)
    1 <$> <1234.56789>        |  8:  BOUND(12)
    1 <$> <1234.56789>        | 12:  MINMOD(13)
    1 <$> <1234.56789>        | 13:  PLUS(15)
                                     POSIXD[\d] can match 1 times out of 1...
    2 <$1> <234.56789>        | 15:    KEEPS(16)
    2 <$1> <234.56789>        | 16:      IFMATCH[0](28)
    2 <$1> <234.56789>        | 18:        CURLYM[0] {1,32767}(25)

It finds more than one digit in a row, bounded by the decimal point, which the regex never crosses:

    2 <$1> <234.56789>        | 20:          POSIXD[\d](21)
    3 <$12> <34.56789>        | 21:          POSIXD[\d](22)
    4 <$123> <4.56789>        | 22:          POSIXD[\d](23)
    5 <$1234> <.56789>        | 23:          SUCCEED(0)
                                             subpattern success...
                                           CURLYM now matched 1 times, len=3...
    5 <$1234> <.56789>        | 20:          POSIXD[\d](21)
                                             failed...
                                           CURLYM trying tail with matches=1...
    5 <$1234> <.56789>        | 25:          BOUND(26)
    5 <$1234> <.56789>        | 26:          SUCCEED(0)
                                             subpattern success...

But I need three digits left over on the right, and that's how it matches:

    2 <$1> <234.56789>        | 28:      END(0)
    Match successful!

At this point, the substitution operator makes its replacement, putting the comma between the 1 and the 2. It's ready for another replacement, this time with a shorter string:

    Matching REx "(?<!\.)(?:\b|\G)\d+?\K(?=(?:\d\d\d)+\b)" against "234.56789"

I won't go through that; it's a long series of trials to find four adjacent digits before the decimal point, which it can't do. It fails and the string ends up $1,234.56789.

That's fine for a core-dump-style analysis, where everything has already happened and I have to sift through the results. There's a better way to do this, but I can't show it to you in this book. Damian Conway's Regexp::Debugger animates the same thing, pointing out and coloring the string as it goes. Like many debuggers (Chapter 3), it allows me to step through a regular expression.

Summary

This chapter covered some of the more useful advanced features of Perl's regex engine. Some of these features, such as the readable regexes, global matching, and debugging, I use all the time. The more complicated ones, like the grammars, I use sparingly no matter how fun they are.

Source: https://www.oreilly.com/library/view/mastering-perl-2nd/9781449364946/ch01.html