001
2021-12-17
jrmu
20 Regular Expressions

Regular expressions are the text-processing workhorse of Perl. With
regular expressions, you can search strings for patterns, find out what
matched the patterns, and substitute the matched patterns with new strings.

There are three different regular expression operators in Perl:

1. match          m{PATTERN}
2. substitute     s{OLDPATTERN}{NEWPATTERN}
3. transliterate  tr{OLD_CHAR_SET}{NEW_CHAR_SET}

Perl allows almost any delimiter in these operators, such as {} or () or
// or ##. The most common delimiters are probably m// and s///, but I
prefer to use m{} and s{}{} because they are clearer for me. There are
two ways to "bind" these operators to a string expression:

1. =~ pattern does match string expression
2. !~ pattern does NOT match string expression

Binding can be thought of as "Object Oriented Programming" for regular
expressions. Generic OOP structure can be represented as

$subject -> verb ( adjectives, adverbs, etc );

Binding in regular expressions can be looked at in a similar fashion:

$string =~ verb ( pattern );

where "verb" is limited to 'm' for match, 's' for substitution, and 'tr'
for transliterate. You may see Perl code that simply looks like this:

m/patt/;

This is functionally equivalent to this:

$_ =~ m/patt/;
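
The two binding operators and the implicit $_ binding can be sketched as follows. This is only an illustration; the variable names ($greeting, $has_hello, and so on) are made up for the example.

```perl
use strict;
use warnings;

my $greeting = "hello world";

# =~ is true when the pattern matches the string
my $has_hello = ($greeting =~ m{hello}) ? 1 : 0;

# !~ is true when the pattern does NOT match the string
my $lacks_bye = ($greeting !~ m{goodbye}) ? 1 : 0;

# With no binding operator, the match is applied to $_
my $default_hit = 0;
for ("hello again") {              # foreach aliases $_ to each element
    $default_hit = 1 if m{hello};  # same as: $_ =~ m{hello}
}

print "has_hello=$has_hello lacks_bye=$lacks_bye default_hit=$default_hit\n";
```

All three flags come out as 1 for these inputs.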

Here are some examples:

# spam filter
my $email = "This is a great Free Offer\n";
if ($email =~ m{Free Offer}) {
    $email = "*deleted spam*\n";
}
print "$email\n";

# upgrade my car
my $car = "my car is a toyota\n";
$car =~ s{toyota}{jaguar};
print "$car\n";

# simple encryption, Caesar cypher
my $love_letter = "How I love thee.\n";
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "encrypted: $love_letter";
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "decrypted: $love_letter\n";

> *deleted spam*
> my car is a jaguar
> encrypted: Ubj V ybir gurr.
> decrypted: How I love thee.

The above examples all look for fixed patterns within the string.
Regular expressions also allow you to look for patterns with different
types of "wildcards".

20.1 Variable Interpolation

The delimiters that surround the pattern act as double-quote marks,
subjecting the pattern to one pass of variable interpolation as if the
pattern were contained in double-quotes. This allows the pattern to be
stored in variables and interpolated when the regular expression is
evaluated.

my $actual = "Toyota";
my $wanted = "Jaguar";
my $car = "My car is a Toyota\n";
$car =~ s{$actual}{$wanted};
print $car;

> My car is a Jaguar
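
One consequence of interpolation is that any metacharacters inside the variable are still treated as metacharacters. The sketch below, with made-up data, shows one common way to guard against that: the \Q...\E escape quotes the interpolated value so it is matched literally.

```perl
use strict;
use warnings;

my $actual = "1.5";                  # the "." is a regex metacharacter
my $price  = "Was 125, now 1.5\n";

# Plain interpolation: "." matches any character, so "125" matches first
(my $loose = $price) =~ s{$actual}{2.0};

# \Q...\E quotes the metacharacters in the interpolated value
(my $strict = $price) =~ s{\Q$actual\E}{2.0};

print $loose;     # Was 2.0, now 1.5
print $strict;    # Was 125, now 2.0
```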

20.2 Wildcard Example

In the example below, we process an array of lines, each containing the
pattern {filename: } followed by one or more non-whitespace characters
forming the actual filename. Each line also contains the pattern {size:
} followed by one or more digits that indicate the actual size of that
file.

my @lines = split "\n", <<"MARKER";
filename: output.txt size: 1024
filename: input.dat size: 512
filename: address.db size: 1048576
MARKER

foreach my $line (@lines) {
    ####################################
    # \S is a wildcard meaning
    # "anything that is not white-space".
    # the "+" means "one or more"
    ####################################
    if ($line =~ m{filename: (\S+)}) {
        my $name = $1;
        ###########################
        # \d is a wildcard meaning
        # "any digit, 0-9".
        ###########################
        $line =~ m{size: (\d+)};
        my $size = $1;
        print "$name,$size\n";
    }
}

> output.txt,1024
> input.dat,512
> address.db,1048576
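
As a variation on the example above, both values can be captured with a single pattern. A match in list context returns the captured strings, so the same "if" guards both captures. This is a sketch using one sample line, not a replacement for the code above.

```perl
use strict;
use warnings;

my $line = "filename: output.txt size: 1024";

# In list context a successful match returns the captured values;
# a failed match returns an empty list, which makes the "if" false.
my ($name, $size);
if (($name, $size) = $line =~ m{filename: (\S+) size: (\d+)}) {
    print "$name,$size\n";    # output.txt,1024
}
```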

20.3 Defining a Pattern

A pattern can be a literal pattern such as {Free Offer}. It can contain
wildcards such as {\d}. It can also contain metacharacters such as
parentheses. Notice in the above example, the parentheses were in the
pattern but did not occur in the string, yet the pattern matched.
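
The sketch below illustrates the point with a made-up string. Unescaped parentheses group and capture without matching any character themselves. To match a literal parenthesis in the string, it has to be escaped with a backslash.

```perl
use strict;
use warnings;

my $text = "call foo(42) now";

# Unescaped ( ) only group and capture, so this pattern means
# "foo" immediately followed by digits, which is not in $text.
my $plain = ($text =~ m{foo(\d+)}) ? 1 : 0;

# \( and \) match literal parentheses; the inner ( ) still captures.
my ($arg) = $text =~ m{foo\((\d+)\)};

print "plain=$plain arg=$arg\n";    # plain=0 arg=42
```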

20.4 Metacharacters

Metacharacters do not get interpreted as literal characters. Instead
they tell Perl to interpret the metacharacter (and sometimes the
characters around the metacharacter) in a different way. The following
are metacharacters in Perl regular expression patterns:

\ | ( ) [ ] { } ^ $ * + ? .

\       (backslash) if the next character combined with this backslash
        forms a character class shortcut, then match that character
        class. If not a shortcut, then simply treat the next character
        as a non-metacharacter.
|       alternation: (patt1 | patt2) means (patt1 OR patt2)
( )     grouping (clustering) and capturing
(?: )   grouping (clustering) only. no capturing. (somewhat faster)
.       match any single character (usually not "\n")
[ ]     define a character class, match any single character in class
*       (quantifier): match previous item zero or more times
+       (quantifier): match previous item one or more times
?       (quantifier): match previous item zero or one time
{ }     (quantifier): match previous item a number of times in given range
^       (position marker): beginning of string (or possibly after "\n")
$       (position marker): end of string (or possibly before "\n")

Examples below. Change the value assigned to $str and re-run the script.
Experiment with what matches and what does not match the different
regular expression patterns.

my $str = "Dear sir, hello and goodday! "
    . " dogs and cats and sssnakes put me to sleep."
    . " zzzz. Hummingbirds are ffffast. "
    . " Sincerely, John";

# | alternation
# match "hello" or "goodbye"
if ($str =~ m{hello|goodbye}) { warn "alt"; }

# () grouping and capturing
# match 'goodday' or 'goodbye'
if ($str =~ m{(good(day|bye))}) {
    warn "group matched, captured '$1'";
}

# . any single character
# match 'cat' 'cbt' 'cct' 'c%t' 'c+t' 'c?t' ...
if ($str =~ m{c.t}) { warn "period"; }

# [] define a character class: 'a' or 'o' or 'u'
# match 'cat' 'cot' 'cut'
if ($str =~ m{c[aou]t}) { warn "class"; }

# * quantifier, match previous item zero or more
# match '' or 'z' or 'zz' or 'zzz' or 'zzzzzzzz'
if ($str =~ m{z*}) { warn "asterisk"; }

# + quantifier, match previous item one or more
# match 'snake' 'ssnake' 'sssssssnake'
if ($str =~ m{s+nake}) { warn "plus sign"; }

# ? quantifier, previous item is optional
# match only 'dog' and 'dogs'
if ($str =~ m{dogs?}) { warn "question"; }

# {} quantifier, match previous, 3 <= qty <= 5
# match only 'fffast', 'ffffast', and 'fffffast'
if ($str =~ m{f{3,5}ast}) { warn "curly brace"; }

# ^ position marker, matches beginning of string
# match 'Dear' only if it occurs at start of string
if ($str =~ m{^Dear}) { warn "caret"; }

# $ position marker, matches end of string
# match 'John' only if it occurs at end of string
if ($str =~ m{John$}) { warn "dollar"; }

> alt at ...
> group matched, captured 'goodday' at ...
> period at ...
> class at ...
> asterisk at ...
> plus sign at ...
> question at ...
> curly brace at ...
> caret at ...
> dollar at ...

20.5 Capturing and Clustering Parentheses

Normal parentheses will both cluster and capture the pattern they
contain. Clustering affects the order of evaluation similar to the way
parentheses affect the order of evaluation within a mathematical
expression. Normally, multiplication has a higher precedence than
addition. The expression "2 + 3 * 4" does the multiplication first and
then the addition, yielding the result of "14". The expression "(2 + 3)
* 4" forces the addition to occur first, yielding the result of "20".

Clustering parentheses work in the same fashion. The pattern {cats?}
will apply the "?" quantifier to the letter "s", matching either "cat"
or "cats". The pattern {(cats)?} will apply the "?" quantifier to the
entire pattern within the parentheses, matching "cats" or the empty
string.
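
The difference between {cats?} and {(cats)?} can be checked with a short sketch. The ^ and $ anchors are added here (an assumption beyond the patterns in the text) so that each pattern must account for the whole string; without them both patterns would match almost anything.

```perl
use strict;
use warnings;

my %result;
for my $word ("cat", "cats", "") {
    my $s_optional    = ($word =~ m{^cats?$})   ? 1 : 0;  # "?" applies to "s"
    my $word_optional = ($word =~ m{^(cats)?$}) ? 1 : 0;  # "?" applies to "cats"
    $result{$word} = "$s_optional$word_optional";
    print "'$word': cats?=$s_optional (cats)?=$word_optional\n";
}
```

With these inputs, "cat" matches only the first pattern, "cats" matches both, and the empty string matches only the second.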

20.5.1 $1, $2, $3, etc Capturing parentheses

Clustering parentheses will also capture the part of the string that
matched the pattern within the parentheses. The captured values are
accessible through some "magical" variables called $1, $2, $3, ... Each
left parenthesis increments the number used to access the captured
string. Left parentheses are counted from left to right as they occur
within the pattern, starting at 1.

my $test="Firstname: John Lastname: Smith";
############################################
# first ( ) fills $1, second ( ) fills $2
############################################
$test=~m{Firstname: (\w+) Lastname: (\w+)};
my $first = $1;
my $last = $2;
print "Hello, $first $last\n";

> Hello, John Smith

Because capturing takes a little extra time to store the captured result
into the $1, $2, ... variables, sometimes you just want to cluster
without the overhead of capturing. In the example below, we want to
cluster "day|bye" so that the alternation symbol "|" chooses only
between "day" and "bye". Without the clustering parentheses, the pattern
would match "goodday" or "bye", rather than "goodday" or "goodbye". The
pattern contains capturing parentheses around the entire pattern, so we
do not need to capture the "day|bye" part of the pattern; therefore we
use cluster-only parentheses.

if ($str =~ m{(good(?:day|bye))}) {
    warn "group matched, captured '$1'";
}

Cluster-only parentheses don't capture the enclosed pattern, and they
don't count when determining which magic variable, $1, $2, $3 ..., will
contain the values from the capturing parentheses.

my $test = 'goodday John';
##########################################
# (?: ) is skipped when numbering, so the
# outer group is $1 and (\w+) is $2
##########################################
if ($test =~ m{(good(?:day|bye)) (\w+)}) {
    print "You said $1 to $2\n";
}

> You said goodday to John

20.5.2 Capturing parentheses not capturing

If a regular expression containing capturing parentheses does not match
the string, the magic variables $1, $2, $3, etc will retain whatever
PREVIOUS value they had from any PREVIOUS successful regular expression.
This means that you MUST check to make sure the regular expression
matches BEFORE you use the $1, $2, $3, etc variables.

In the example below, the second regular expression does not match,
therefore $1 retains its old value of 'be'. Instead of printing
something like "Name is Horatio", or printing "Name is" and warning
about an undefined value, Perl keeps the old value of $1 and prints
"Name is 'be'".

my $string1 = 'To be, or not to be';
$string1 =~ m{not to (\w+)}; # matches, $1='be'
warn "The question is to $1";

my $string2 = 'that is the question';
$string2 =~ m{I knew him once, (\w+)}; # no match
warn "Name is '$1'";
# no match, so $1 retains its old value 'be'

> The question is to be at ./script.pl line 7.
> Name is 'be' at ./script.pl line 11.
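
A minimal sketch of the safer idiom: put the match in the condition of an "if" and read $1 only inside the branch where the match succeeded. The strings are the ones from the example above; the @reports array is made up for illustration.

```perl
use strict;
use warnings;

my @reports;
for my $string ('To be, or not to be', 'that is the question') {
    if ($string =~ m{not to (\w+)}) {
        push @reports, "captured '$1'";   # $1 is fresh: the match succeeded
    } else {
        push @reports, "no match";        # do not touch $1 here
    }
}
print "$_\n" for @reports;
```

The first string reports the capture 'be' and the second reports "no match", so a stale $1 is never printed.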

20.6 Character Classes

The "." metacharacter will match any single character (by default, any
character except "\n"). This is nearly equivalent to a character class
that includes every possible character. You can easily define smaller
character classes of your own using the square brackets []. Whatever
characters are listed within the square brackets are part of that
character class. Perl will then match any one character within that
class.

[aeiouAEIOU] any vowel
[0123456789] any digit
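
The two classes above can be tried out on single characters. This is just a sketch; the %kind hash and the sample characters are made up for the example.

```perl
use strict;
use warnings;

my %kind;
for my $char ('a', 'E', 'x', '7') {
    if    ($char =~ m{[aeiouAEIOU]}) { $kind{$char} = 'vowel'; }
    elsif ($char =~ m{[0123456789]}) { $kind{$char} = 'digit'; }
    else                             { $kind{$char} = 'other'; }
    print "$char is $kind{$char}\n";
}
```

Here 'a' and 'E' are reported as vowels, '7' as a digit, and 'x' as other.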

20.6.1 Metacharacters Within Character Classes

Within the square brackets used to define a character class, all
previously defined metacharacters cease to act as metacharacters and are
interpreted as simple literal characters. Character classes have their
own special metacharacters.

\   (backslash) remove the special meaning of the next character
-   (hyphen) Indicates a consecutive character range, inclusively.
    [a-f] indicates the letters a,b,c,d,e,f.
    Character ranges are based on ASCII numeric values.
^   If it is the first character of the class, then this indicates the
    class is any character EXCEPT the ones in the square brackets.
    Warning: [^aeiou] means anything but a lower case vowel. This
    is not the same as "any consonant". The class [^aeiou] will
    match punctuation, numbers, and unicode characters.
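
The sketch below exercises each of the three class metacharacters. The test strings are made up, and the ^ and $ anchors outside the brackets are added so that the class must account for the whole string.

```perl
use strict;
use warnings;

# hyphen: [a-f] is the range a through f
my $in_range     = ("cafe" =~ m{^[a-f]+$}) ? 1 : 0;   # 1
my $out_of_range = ("cage" =~ m{^[a-f]+$}) ? 1 : 0;   # 0, "g" is outside

# leading caret: anything EXCEPT a lower case vowel, including punctuation
my $bang_matches  = ("!" =~ m{^[^aeiou]$}) ? 1 : 0;   # 1
my $vowel_matches = ("e" =~ m{^[^aeiou]$}) ? 1 : 0;   # 0

# backslash: \- is a literal hyphen, so [a\-z] is "a", "-" or "z"
my $hyphen_matches = ("-" =~ m{^[a\-z]$}) ? 1 : 0;    # 1
my $m_matches      = ("m" =~ m{^[a\-z]$}) ? 1 : 0;    # 0

# "." inside a class is a literal period, not "any character"
my $dot_is_literal = ("x" =~ m{^[.]$}) ? 0 : 1;       # 1

print "$in_range$out_of_range $bang_matches$vowel_matches ",
      "$hyphen_matches$m_matches $dot_is_literal\n";
```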

20.7 Shortcut Character Classes

Perl has shortcut character classes for some more common classes.

shortcut   class            description
\d         [0-9]            any *d*igit
\D         [^0-9]           any NON-digit
\s         [ \t\n\r\f]      any white*s*pace
\S         [^ \t\n\r\f]     any NON-whitespace
\w         [a-zA-Z0-9_]     any *w*ord character (valid perl identifier)
\W         [^a-zA-Z0-9_]    any NON-word character
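
A short sketch that uses several of the shortcuts in one pattern. The record format is invented for the example, and the input is plain ASCII, which is the case the class column of the table describes.

```perl
use strict;
use warnings;

my $record = "id_42:  2021-12-17";

# \w+ word characters, \W one non-word character (the colon),
# \s+ whitespace, \d+ digits, \D one non-digit (each hyphen)
my @fields = $record =~ m{^(\w+)\W\s+(\d+)\D(\d+)\D(\d+)$};

print join("|", @fields), "\n";    # id_42|2021|12|17
```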

20.8 Greedy (Maximal) Quantifiers

Quantifiers are used within regular expressions to indicate how many
times the previous item occurs within the pattern. By default,
quantifiers are "greedy" or "maximal", meaning that they will match as
many characters as possible while still allowing the overall pattern to
match.

*           match zero or more times (match as much as possible)
+           match one or more times (match as much as possible)
?           match zero or one times (match as much as possible)
{count}     match exactly "count" times
{min,}      match at least "min" times (match as much as possible)
{min,max}   match at least "min" and at most "max" times
            (match as much as possible)
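
Greedy behaviour is easiest to see by capturing what a quantified pattern actually matched. The strings and variable names below are made up for this sketch.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# ".+" grabs as much as it can, so the match runs from the first "<"
# all the way to the LAST ">" in the string.
my ($greedy_plus) = $html =~ m{(<.+>)};

my ($at_most_four) = "zzzzzz" =~ m{(z{2,4})};   # takes 4, the maximum allowed
my ($at_least_two) = "zzzzzz" =~ m{(z{2,})};    # takes all 6
my ($exactly_two)  = "zzzzzz" =~ m{(z{2})};     # takes exactly 2

print "$greedy_plus\n$at_most_four\n$at_least_two\n$exactly_two\n";
```

The first capture is the entire string, not just "<b>", which is a common surprise when matching delimited text with greedy quantifiers.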

20.10 Position Assertions / Position Anchors

Inside a regular expression pattern, some symbols do not translate into
a character or character class. Instead, they translate into a
"position" within the string. If a position anchor occurs within a
pattern, the pattern before and after that anchor must occur at a
certain position within the string.

^    Matches the beginning of the string.
     If the /m (multiline) modifier is present, also matches just
     after any "\n".
$    Matches the end of the string (or just before a final "\n").
     If the /m (multiline) modifier is present, also matches just
     before any "\n".
\A   Match the beginning of string only. Not affected by /m modifier.
\z   Match the end of string only. Not affected by /m modifier.
\Z   Matches the end of the string only, but will also match just
     before a "\n" if that was the last character in string.
\b   word "b"oundary
     A word boundary occurs in four places.
     1) at a transition from a \w character to a \W character
     2) at a transition from a \W character to a \w character
     3) at the beginning of the string (if it starts with a \w character)
     4) at the end of the string (if it ends with a \w character)
\G   usually used with /g modifier (probably want /c modifier too).
     Indicates the position after the character of the last pattern
     match performed on the string. If this is the first regular
     expression being performed on the string then \G will match the
     beginning of the string. Use the pos() function to get and set
     the current \G position within the string.
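
The anchors can be compared side by side on a two-line string that ends in "\n". This is a sketch with made-up data; each flag records whether the pattern matched. The last part walks a string with \G and the /gc modifiers and then reads pos().

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

my $caret    = ($text =~ m{^second})      ? 1 : 0;  # 0: ^ is start of string
my $caret_m  = ($text =~ m{^second}m)     ? 1 : 0;  # 1: /m, ^ after a "\n"
my $big_a_m  = ($text =~ m{\Asecond}m)    ? 1 : 0;  # 0: \A ignores /m
my $dollar   = ($text =~ m{line$})        ? 1 : 0;  # 1: before the final "\n"
my $dollar_m = ($text =~ m{first line$}m) ? 1 : 0;  # 1: /m, $ before a "\n"
my $big_z    = ($text =~ m{line\Z})       ? 1 : 0;  # 1: allows the final "\n"
my $small_z  = ($text =~ m{line\z})       ? 1 : 0;  # 0: must be the very end

# \G continues where the previous /g match on the same string stopped.
# With /c a failed match leaves pos() where it was.
my $csv = "12,34,x";
my @numbers;
while ($csv =~ m{\G(\d+),?}gc) {
    push @numbers, $1;
}
my $stopped_at = pos($csv);    # 6, the offset of "x"

print "$caret$caret_m$big_a_m $dollar$dollar_m $big_z$small_z\n";
print "numbers=@numbers stopped_at=$stopped_at\n";
```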

20.10.1 The \b Anchor

Use the \b anchor when you want to match a whole word pattern but not
part of a word. This example matches "jump" but not "jumprope":

my $test1='He can jump very high.';
if ($test1 =~ m{\bjump\b}) {
    print "test1 matches\n";
}