Date:
Tue Dec 21 08:11:28 2021 UTC
Message:
Added README to install dependencies

20 Regular Expressions

Regular expressions are the text processing workhorse of perl. With
regular expressions, you can search strings for patterns, find out what
matched the patterns, and substitute the matched patterns with new strings.

There are three different regular expression operators in perl:

1. match          m{PATTERN}

2. substitute     s{OLDPATTERN}{NEWPATTERN}

3. transliterate  tr{OLD_CHAR_SET}{NEW_CHAR_SET}

Perl allows any delimiter in these operators, such as {} or () or // or
## or just about any character you wish to use. The most common
delimiters are probably m// and s///, but I prefer to use m{} and s{}{}
because they are clearer for me. There are two ways to "bind" these
operators to a string expression:

1. =~   pattern does match string expression

2. !~   pattern does NOT match string expression
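
As a small illustration of the delimiter and binding rules above, the sketch below runs the same match with three different delimiters and then uses !~ to test for a non-match. The string in $path and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $path = "/usr/local/bin/perl";

# The same match written with three different delimiters.
# With m{} or m## the slashes in the pattern need no escaping.
my $slashes = ($path =~ m/\/bin\//) ? 1 : 0;
my $braces  = ($path =~ m{/bin/})   ? 1 : 0;
my $hashes  = ($path =~ m#/bin/#)   ? 1 : 0;

# !~ is true when the pattern does NOT match.
my $no_sbin = ($path !~ m{/sbin/})  ? 1 : 0;

print "slashes=$slashes braces=$braces hashes=$hashes no_sbin=$no_sbin\n";
```

All four values should come out as 1: the three delimiter styles are interchangeable, and !~ succeeds because "/sbin/" does not occur in the string.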

Binding can be thought of as "Object Oriented Programming" for regular
expressions. Generic OOP structure can be represented as

$subject -> verb ( adjectives, adverbs, etc );

Binding in regular expressions can be looked at in a similar fashion:

$string =~ verb ( pattern );

where "verb" is limited to 'm' for match, 's' for substitution, and 'tr'
for transliterate. You may see perl code that simply looks like this:

/patt/;

This is functionally equivalent to this:

$_ =~ m/patt/;
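
The equivalence between a bare /patt/ and an explicit binding to $_ can be checked with a short loop. This is only a sketch; the word list is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical word list, for illustration only.
my @words = ("patty", "cake", "spatter");
my (@implicit, @explicit);

for (@words) {                            # each element is aliased to $_
    push @implicit, $_ if /patt/;         # bare pattern, bound to $_
    push @explicit, $_ if $_ =~ m/patt/;  # the same test written out in full
}

print "implicit: @implicit\n";
print "explicit: @explicit\n";
```

Both lists should contain the same words, since the two forms perform the same match.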

Here are some examples:

# spam filter
my $email = "This is a great Free Offer\n";
if($email =~ m{Free Offer})
    {$email="*deleted spam*\n"; }
print $email;

# upgrade my car
my $car = "my car is a toyota\n";
$car =~ s{toyota}{jaguar};
print $car;

# simple encryption, Caesar cipher (ROT13)
my $love_letter = "How I love thee.\n";
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "encrypted: $love_letter";

# applying the same transliteration again decrypts
$love_letter =~ tr{A-Za-z}{N-ZA-Mn-za-m};
print "decrypted: $love_letter";

> *deleted spam*
> my car is a jaguar
> encrypted: Ubj V ybir gurr.
> decrypted: How I love thee.

The above examples all look for fixed patterns within the string.
Regular expressions also allow you to look for patterns with different
types of "wildcards".

20.1 Variable Interpolation

The braces that surround the pattern act as double-quote marks,
subjecting the pattern to one pass of variable interpolation as if the
pattern were contained in double-quotes. This allows the pattern to be
stored in a variable and interpolated when the regular expression is
evaluated.

my $actual = "Toyota";
my $wanted = "Jaguar";
my $car = "My car is a Toyota\n";
$car =~ s{$actual}{$wanted};
print $car;

> My car is a Jaguar
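
One consequence of interpolation is that any metacharacters inside the variable are treated as pattern syntax. The sketch below shows this with a made-up string, and uses the \Q...\E escape (standard Perl, though not covered in the text above) to force a literal match. The variable names and data are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical data: the search text contains the metacharacter "+".
my $pattern = '1+1';
my $text    = '1+1=2';

# Interpolated as-is, "1+1" means "one or more 1s, followed by 1",
# which does not occur in $text.
my $raw = ($text =~ m{$pattern}) ? 1 : 0;

# \Q...\E backslash-escapes the interpolated text so it matches literally.
my $quoted = ($text =~ m{\Q$pattern\E}) ? 1 : 0;

print "raw=$raw quoted=$quoted\n";
```

Here $raw should be 0 and $quoted should be 1. Whether to quote depends on whether the variable is meant to hold a pattern or plain text.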

20.2 Wildcard Example

In the example below, we process an array of lines, each containing the
pattern {filename: } followed by one or more non-whitespace characters
forming the actual filename. Each line also contains the pattern
{size: } followed by one or more digits that indicate the actual size of
that file.

my @lines = split "\n", <<"MARKER";
filename: output.txt size: 1024
filename: input.dat size: 512
filename: address.db size: 1048576
MARKER

foreach my $line (@lines) {
    ####################################
    # \S is a wildcard meaning
    # "anything that is not white-space".
    # the "+" means "one or more"
    ####################################
    if($line =~ m{filename: (\S+)}) {
        my $name = $1;
        ###########################
        # \d is a wildcard meaning
        # "any digit, 0-9".
        ###########################
        $line =~ m{size: (\d+)};
        my $size = $1;
        print "$name,$size\n";
    }
}

> output.txt,1024
> input.dat,512
> address.db,1048576
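
The same extraction can also be written with a single pattern that captures both fields at once. The variant below is a sketch, not the author's version: it relies on the fact that a match in list context returns its captured groups, and it adds one invented line without file information to show that non-matching lines are skipped.

```perl
use strict;
use warnings;

# Sample lines; the third one is a hypothetical non-matching line.
my @lines = (
    "filename: output.txt size: 1024",
    "filename: input.dat size: 512",
    "this line has no file information",
);

my @report;
foreach my $line (@lines) {
    # In list context a successful match returns the captured groups,
    # and an empty list on failure, so the test and the capture
    # happen in one step.
    if (my ($name, $size) = $line =~ m{filename: (\S+) size: (\d+)}) {
        push @report, "$name,$size";
    }
}

print "$_\n" for @report;
```

Because the captures are only assigned when the match succeeds, $name and $size never carry over values from an earlier line.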

20.3 Defining a Pattern

A pattern can be a literal pattern such as {Free Offer}. It can contain
wildcards such as {\d}. It can also contain metacharacters such as
parentheses. Notice that in the above example the parentheses were in
the pattern but did not occur in the string, yet the pattern matched.

20.4 Metacharacters

Metacharacters do not get interpreted as literal characters. Instead
they tell perl to interpret the metacharacter (and sometimes the
characters around the metacharacter) in a different way. The following
are metacharacters in perl regular expression patterns:

\ | ( ) [ ] { } ^ $ * + ? .

\       (backslash) if the next character combined with this backslash
        forms a character class shortcut, then match that character
        class. If not a shortcut, then simply treat the next character
        as a non-metacharacter.

|       alternation: (patt1 | patt2) means (patt1 OR patt2)

( )     grouping (clustering) and capturing

(?: )   grouping (clustering) only. no capturing. (somewhat faster)

.       match any single character (usually not "\n")

[ ]     define a character class, match any single character in class

*       (quantifier): match previous item zero or more times

+       (quantifier): match previous item one or more times

?       (quantifier): match previous item zero or one time

{ }     (quantifier): match previous item a number of times in given range

^       (position marker): beginning of string (or possibly after "\n")

$       (position marker): end of string (or possibly before "\n")

Examples below. Change the value assigned to $str and re-run the script.
Experiment with what matches and what does not match the different
regular expression patterns.

my $str = "Dear sir, hello and goodday! "
    ." dogs and cats and sssnakes put me to sleep."
    ." zzzz. Hummingbirds are ffffast. "
    ." Sincerely, John";

# | alternation
# match "hello" or "goodbye"
if($str =~ m{hello|goodbye}){warn "alt";}

# () grouping and capturing
# match 'goodday' or 'goodbye'
if($str =~ m{(good(day|bye))})
    {warn "group matched, captured '$1'";}

# . any single character
# match 'cat' 'cbt' 'cct' 'c%t' 'c+t' 'c?t' ...
if($str =~ m{c.t}){warn "period";}

# [] define a character class: 'a' or 'o' or 'u'
# match 'cat' 'cot' 'cut'
if($str =~ m{c[aou]t}){warn "class";}

# * quantifier, match previous item zero or more
# match '' or 'z' or 'zz' or 'zzz' or 'zzzzzzzz'
if($str =~ m{z*}){warn "asterisk";}

# + quantifier, match previous item one or more
# match 'snake' 'ssnake' 'sssssssnake'
if($str =~ m{s+nake}){warn "plus sign";}

# ? quantifier, previous item is optional
# match only 'dog' and 'dogs'
if($str =~ m{dogs?}){warn "question";}

# {} quantifier, match previous, 3 <= qty <= 5
# match only 'fffast', 'ffffast', and 'fffffast'
if($str =~ m{f{3,5}ast}){warn "curly brace";}

# ^ position marker, matches beginning of string
# match 'Dear' only if it occurs at start of string
if($str =~ m{^Dear}){warn "caret";}

# $ position marker, matches end of string
# match 'John' only if it occurs at end of string
if($str =~ m{John$}){warn "dollar";}

> alt at ...
> group matched, captured 'goodday' at ...
> period at ...
> class at ...
> asterisk at ...
> plus sign at ...
> question at ...
> curly brace at ...
> caret at ...
> dollar at ...
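
The examples above cover every metacharacter except the backslash. As a rough sketch of how the backslash turns a metacharacter back into a literal, the code below compares an unescaped "." with "\." on two made-up strings.

```perl
use strict;
use warnings;

# Hypothetical file names, for illustration only.
my $file = "notes.txt";
my $fake = "notes_txt";

# An unescaped "." matches any single character, so both strings match.
my $loose_file = ($file =~ m{notes.txt}) ? 1 : 0;
my $loose_fake = ($fake =~ m{notes.txt}) ? 1 : 0;

# "\." matches only a literal period.
my $strict_file = ($file =~ m{notes\.txt}) ? 1 : 0;
my $strict_fake = ($fake =~ m{notes\.txt}) ? 1 : 0;

print "loose:  $loose_file $loose_fake\n";
print "strict: $strict_file $strict_fake\n";
```

The loose pattern should accept both strings, while the escaped pattern should accept only "notes.txt".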

20.5 Capturing and Clustering Parentheses

Normal parentheses will both cluster and capture the pattern they
contain. Clustering affects the order of evaluation similar to the way
parentheses affect the order of evaluation within a mathematical
expression. Normally, multiplication has a higher precedence than
addition. The expression "2 + 3 * 4" does the multiplication first and
then the addition, yielding the result of "14". The expression "(2 + 3)
* 4" forces the addition to occur first, yielding the result of "20".

Clustering parentheses work in the same fashion. The pattern {cats?}
will apply the "?" quantifier to the letter "s", matching either "cat"
or "cats". The pattern {(cats)?} will apply the "?" quantifier to the
entire pattern within the parentheses, matching "cats" or the empty
string.
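
The difference between {cats?} and {(cats)?} is easiest to see when the pattern has to cover the whole string. The sketch below adds ^ and $ anchors for that purpose (an assumption of this example, not part of the patterns discussed above) and tries three test strings.

```perl
use strict;
use warnings;

my %result;
for my $s ("cat", "cats", "") {
    # "?" applies to the single letter "s".
    my $letter = ($s =~ m{^cats?$})   ? "yes" : "no";
    # "?" applies to the whole cluster "cats".
    my $group  = ($s =~ m{^(cats)?$}) ? "yes" : "no";
    $result{$s} = "$letter/$group";
    printf "%-7s cats?=%-3s (cats)?=%s\n", "'$s'", $letter, $group;
}
```

With the anchors in place, "cat" should match only the first pattern, "cats" should match both, and the empty string should match only the second.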

20.5.1 $1, $2, $3, etc Capturing parentheses

Clustering parentheses will also capture the part of the string that
matched the pattern within the parentheses. The captured values are
accessible through some "magical" variables called $1, $2, $3, ... Each
left parenthesis increments the number used to access the captured
string. The left parentheses are counted from left to right as they
occur within the pattern, starting at 1.

my $test="Firstname: John Lastname: Smith";
############################################
#                   $1              $2
$test=~m{Firstname: (\w+) Lastname: (\w+)};
my $first = $1;
my $last = $2;
print "Hello, $first $last\n";

> Hello, John Smith

Because capturing takes a little extra time to store the captured result
into the $1, $2, ... variables, sometimes you just want to cluster without
the overhead of capturing. In the example below, we want to cluster
"day|bye" so that the alternation symbol "|" will go with "day" or
"bye". Without the clustering parentheses, the pattern would match
"goodday" or "bye", rather than "goodday" or "goodbye". The pattern
contains capturing parentheses around the entire pattern, so we do not
need to capture the "day|bye" part of the pattern; therefore we use
cluster-only parentheses.

if($str =~ m{(good(?:day|bye))})
    {warn "group matched, captured '$1'";}

Cluster-only parentheses don't capture the enclosed pattern, and they
don't count when determining which magic variable, $1, $2, $3 ..., will
contain the values from the capturing parentheses.

my $test = 'goodday John';
##########################################
#             $1                $2
if($test =~ m{(good(?:day|bye)) (\w+)})
    { print "You said $1 to $2\n"; }

> You said goodday to John

20.5.2 Capturing parentheses not capturing

If a regular expression containing capturing parentheses does not match
the string, the magic variables $1, $2, $3, etc will retain whatever
PREVIOUS value they had from any PREVIOUS regular expression. This means
that you MUST check to make sure the regular expression matches BEFORE
you use the $1, $2, $3, etc variables.

In the example below, the second regular expression does not match,
therefore $1 retains its old value of 'be'. Instead of printing
something like "Name is Horatio", or printing "Name is" and complaining
about an undefined value, perl keeps the old value for $1 and prints
"Name is 'be'".

my $string1 = 'To be, or not to be';
$string1 =~ m{not to (\w+)}; # matches, $1='be'
warn "The question is to $1";

my $string2 = 'that is the question';
$string2 =~ m{I knew him once, (\w+)}; # no match
warn "Name is '$1'";
# no match, so $1 retains its old value 'be'

> The question is to be at ./script.pl line 7.
> Name is 'be' at ./script.pl line 11.
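
One way to follow the "check before you use $1" rule is to read the capture only inside the branch where the match succeeded. The sketch below reuses the two strings from the example above with a single pattern; the @found array is a name made up for this illustration.

```perl
use strict;
use warnings;

my @inputs = ('To be, or not to be', 'that is the question');
my @found;

for my $s (@inputs) {
    if ($s =~ m{not to (\w+)}) {
        push @found, $1;     # safe: $1 is read only after a successful match
    } else {
        push @found, undef;  # no match, so record "nothing" instead of a stale $1
    }
}

print defined($_) ? "captured '$_'\n" : "no match\n" for @found;
```

The second string does not match, so it is reported as "no match" rather than silently picking up 'be' from the first string.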

20.6 Character Classes

The "." metacharacter will match any single character (usually not
"\n"). This is roughly equivalent to a character class that includes
every possible character. You can easily define smaller character
classes of your own using the square brackets []. Whatever characters
are listed within the square brackets are part of that character class.
Perl will then match any one character within that class.

[aeiouAEIOU]    any vowel
[0123456789]    any digit
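
As a quick sketch of a character class in use, the code below collects every vowel in a word. It relies on the /g modifier in list context to return all matches; the word itself is an arbitrary choice for this example.

```perl
use strict;
use warnings;

my $word = "Education";   # arbitrary sample word

# In list context with /g, the match returns every substring that matched.
my @vowels = $word =~ m{[aeiouAEIOU]}g;
my $count  = @vowels;

print "vowels: @vowels (count $count)\n";
```

Each element of @vowels is a single character, because a character class always matches exactly one character.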

20.6.1 Metacharacters Within Character Classes

Within the square brackets used to define a character class, all
previously defined metacharacters cease to act as metacharacters and are
interpreted as simple literal characters. Character classes have their
own special metacharacters.

\       (backslash) removes the special meaning of the next character

-       (hyphen) Indicates a consecutive character range, inclusively.
        [a-f] indicates the letters a,b,c,d,e,f.
        Character ranges are based on ASCII numeric values.

^       If it is the first character of the class, then this indicates
        the class is any character EXCEPT the ones in the square brackets.
        Warning: [^aeiou] means anything but a lower case vowel. This
        is not the same as "any consonant". The class [^aeiou] will
        match punctuation, numbers, and unicode characters.
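
The rules for metacharacters inside a class can be tried out with a few one-line matches. The strings below are invented for the example, and the /g modifier in list context is used to collect every matching character.

```perl
use strict;
use warnings;

# "-" between two characters defines a range: digits and the letters a-f.
my @hex = "c0ffee-XYZ" =~ m{[0-9a-f]}g;

# A leading "^" negates the class: everything except lower case vowels,
# which includes punctuation.
my @not_vowel = "a-e!" =~ m{[^aeiou]}g;

# "." and "+" lose their special meaning inside the brackets.
my @literal = "a.b+c" =~ m{[.+]}g;

print "hex: @hex\n";
print "not vowel: @not_vowel\n";
print "literal: @literal\n";
```

Note that the hyphen in "c0ffee-XYZ" is not matched by [0-9a-f], since inside that class the hyphens only mark ranges.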

20.7 Shortcut Character Classes

Perl has shortcut character classes for some of the more common classes.

shortcut   class            description

\d         [0-9]            any *d*igit
\D         [^0-9]           any NON-digit
\s         [ \t\n\r\f]      any white*s*pace
\S         [^ \t\n\r\f]     any NON-whitespace
\w         [a-zA-Z0-9_]     any *w*ord character (valid perl identifier)
\W         [^a-zA-Z0-9_]    any NON-word character
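
To see the shortcut classes side by side, the sketch below pulls words, digit runs, and whitespace-separated fields out of one made-up log line. The /g modifier and split are standard Perl; the line contents and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical log line containing spaces and a tab.
my $line = "user_42 logged in\tat 09:15";

my @words  = $line =~ m{\w+}g;      # runs of word characters
my @digits = $line =~ m{\d+}g;      # runs of digits
my @fields = split m{\s+}, $line;   # split on runs of whitespace

print "words:  @words\n";
print "digits: @digits\n";
print "fields: ", scalar(@fields), "\n";
```

The colon in "09:15" is not a word character, so \w+ sees "09" and "15" as separate words, while splitting on \s+ keeps "09:15" together as one field.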

20.8 Greedy (Maximal) Quantifiers

Quantifiers are used within regular expressions to indicate how many
times the previous item occurs within the pattern. By default,
quantifiers are "greedy" or "maximal", meaning that they will match as
many characters as possible while still allowing the overall pattern to
match.

*           match zero or more times (match as much as possible)

+           match one or more times (match as much as possible)

?           match zero or one times (match as much as possible)

{count}     match exactly "count" times

{min,}      match at least "min" times (match as much as possible)

{min,max}   match at least "min" and at most "max" times
            (match as much as possible)
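
Greedy matching is easiest to see when a pattern could stop early but does not. In the sketch below (sample strings invented for the example), ".+" runs to the last ">" in the string, a negated character class is used as one way to stop at the first ">", and {2,3} takes three characters when three are available.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy ".+" grabs as much as possible, so the match ends at the LAST ">".
my ($greedy) = $html =~ m{<(.+)>};

# A negated class cannot run past the first ">", so only the first tag is captured.
my ($tag) = $html =~ m{<([^>]+)>};

# {2,3} is greedy too: with four "f"s available it takes the maximum of three.
my ($run) = "ffffast" =~ m{(f{2,3})};

print "greedy: $greedy\n";
print "tag:    $tag\n";
print "run:    $run\n";
```

The first capture spans almost the whole string, which is usually not what is wanted when picking out individual tags.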

20.10 Position Assertions / Position Anchors

Inside a regular expression pattern, some symbols do not translate into
a character or character class. Instead, they translate into a
"position" within the string. If a position anchor occurs within a
pattern, the pattern before and after that anchor must occur at a
certain position within the string.

^       Matches the beginning of the string.
        If the /m (multiline) modifier is present, also matches
        immediately after any "\n" within the string.

$       Matches the end of the string, or just before a "\n" that
        ends the string.
        If the /m (multiline) modifier is present, also matches
        immediately before any "\n".

\A      Match the beginning of string only. Not affected by /m modifier.

\z      Match the end of string only. Not affected by /m modifier.

\Z      Matches the end of the string only, but will also match just
        before a "\n" if that was the last character in string
        (as if the string had been chomp()ed).

\b      word "b"oundary
        A word boundary occurs in four places.
        1) at a transition from a \w character to a \W character
        2) at a transition from a \W character to a \w character
        3) at the beginning of the string, if it starts with a \w character
        4) at the end of the string, if it ends with a \w character

\B      NOT \b

\G      usually used with /g modifier (probably want /c modifier too).
        Indicates the position just after the last /g pattern match
        performed on the string. If this is the first regular expression
        being performed on the string then \G will match the beginning
        of the string. Use the pos() function to get and set the current
        \G position within the string.
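
The anchors above can be compared on a short two-line string. The following is a sketch under the assumption that the string ends with a newline; the sample text and variable names are made up for this example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @without_m = $text =~ m{^(\w+)}g;
# With /m, ^ also matches right after each embedded "\n".
my @with_m    = $text =~ m{^(\w+)}mg;

# $ and \Z tolerate a single trailing "\n"; \z does not.
my $dollar  = ($text =~ m{line$})  ? 1 : 0;
my $big_z   = ($text =~ m{line\Z}) ? 1 : 0;
my $small_z = ($text =~ m{line\z}) ? 1 : 0;

# \G continues from where the previous /g match on the same string stopped.
my $s = "aaabbb";
my $after_a = ($s =~ m{\Ga+}gc) ? pos($s) : -1;
my $after_b = ($s =~ m{\Gb+}gc) ? pos($s) : -1;

print "without /m: @without_m\n";
print "with /m:    @with_m\n";
print "\$=$dollar \\Z=$big_z \\z=$small_z\n";
print "pos after a+: $after_a, after b+: $after_b\n";
```

The /c modifier keeps the \G position from being reset if one of the matches fails, which is why it is commonly paired with /g in this style of step-by-step matching.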

20.10.1 The \b Anchor

Use the \b anchor when you want to match a whole word pattern but not
part of a word. This example matches "jump" but not "jumprope":

my $test1='He can jump very high.';
if($test1=~m{\bjump\b})
    { print "test1 matches\n"; }
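
To check the "jumprope" side of that claim as well, the sketch below runs the same \b pattern over three sentences. Only the first sentence comes from the example above; the other two, and the @matched array, are invented for illustration.

```perl
use strict;
use warnings;

my @tests = (
    'He can jump very high.',    # "jump" as a whole word
    'She bought a jumprope.',    # "jump" followed by more word characters
    'Ready, jump, go.',          # "jump" followed by punctuation
);

my @matched;
for my $t (@tests) {
    # \b on both sides requires "jump" to stand alone as a word.
    push @matched, ($t =~ m{\bjump\b}) ? 1 : 0;
}

print "matched: @matched\n";
```

The second sentence should not match, because the "p" in "jumprope" is followed by another \w character and so there is no word boundary after "jump".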