Linguistics 408/508 |
Fall 2003 |
Hammond |
Handout 11
Overview
Regular expressions
- This is probably the best-known aspect of Perl
and the most useful for linguists. (If you've had
some computational linguistics, it might be useful
to know that each regular expression describes a
regular language and can be recognized by a finite
state automaton, but you don't need the technical
details to make use of regular expressions.)
- Basically, a regular expression is a restricted
way to characterize some set of strings. For
example, the Perl regular expression /ab/
characterizes any string where the character a
immediately precedes the character b. The regular
expression /a[bc]/ characterizes any string with
the character sequence ab or ac. Finally, the
regular expression /ab*c/ characterizes any string
where the character a precedes c with any number
of instances of the character b (including none)
intervening.
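To make the three patterns concrete, here is a minimal sketch that tries each one against a few strings. The test strings and variable names are made up for illustration, and the sketch uses the =~ match operator inside Perl's built-in grep function to keep only the strings that match.

```perl
use strict;
use warnings;

# Hypothetical test strings, chosen only for illustration.
my @strings = ('cab', 'ac', 'abbbc', 'xyz');

# Keep the strings that each pattern matches.
my @has_ab     = grep { $_ =~ /ab/ }    @strings;  # a immediately before b
my @has_ab_ac  = grep { $_ =~ /a[bc]/ } @strings;  # ab or ac
my @has_a_bs_c = grep { $_ =~ /ab*c/ }  @strings;  # a, zero or more b, then c

print 'ab:    ', "@has_ab\n";      # cab abbbc
print 'a[bc]: ', "@has_ab_ac\n";   # cab ac abbbc
print 'ab*c:  ', "@has_a_bs_c\n";  # ac abbbc
```

Note that ac matches /ab*c/ because "any number" of b's includes zero, while cab does not, because no c follows the a.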
- The syntax for checking a string against a
regular expression is as follows:
somestring =~ /regex/. If the regular expression
in the slashes (regex) matches somestring, this
will evaluate as true (example). Be careful to
note that you use the special regular expression
operator =~ here, rather than the assignment
operator =; this is really easy to mess up.
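A minimal sketch of a match test in a condition, assuming a made-up variable $word. The comment at the end describes what goes wrong if = is typed in place of =~; that line is deliberately left commented out.

```perl
use strict;
use warnings;

my $word = 'linguistics';   # hypothetical example string

# The match evaluates as true or false, so it can drive an if.
if ($word =~ /ist/) {
    print "'$word' contains ist\n";
}

# Store the outcome of two matches as 1 or 0 for inspection.
my $found   = ($word =~ /ist/) ? 1 : 0;   # pattern is present
my $missing = ($word =~ /xyz/) ? 1 : 0;   # pattern is absent

print "found=$found missing=$missing\n";

# Pitfall: with = instead of =~, Perl matches the pattern against
# the default variable $_ and then assigns the result to $word,
# overwriting the string you meant to test.
# $word = /ist/;
```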
- You can also make substitutions or replacements
using regular expressions with the s/// operator.
The syntax is as follows:
$somevariable =~ s/regex/somestring/. The first
match of the regular expression in $somevariable
is replaced with the string given between the
second and third slashes (example). To replace
every instance of the pattern, we add the g (for
"global") flag: s/regex/somestring/g (example).
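The difference between a plain substitution and a global one can be sketched as follows. The string banana and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Two copies of the same hypothetical string.
my $first_only = 'banana';
my $every_one  = 'banana';

# Without g, only the first match of "an" is replaced.
$first_only =~ s/an/AN/;

# With g, every match of "an" is replaced.
$every_one =~ s/an/AN/g;

print "$first_only\n";   # bANana
print "$every_one\n";    # bANANa
```

The substitution changes the variable in place, which is why each version here starts from its own copy of the string.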
- There are lots and lots of special symbols that
can be used in Perl regular expressions. (You can
see them all with perldoc perlre.) I'll just
mention a couple here. First, a period . stands
for any single character (except, by default, a
newline). Thus /a..b/ matches any four-character
sequence where the first character is a and the
last character is b.
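A small sketch of the period in action, using made-up candidate strings. Because the pattern is not tied to the start or end of the string, the four-character sequence may occur anywhere inside a longer string.

```perl
use strict;
use warnings;

# Hypothetical candidate strings for illustration.
my @candidates = ('axyb', 'ab', 'xxaccbxx', 'axb');

# Keep strings containing a, then any two characters, then b.
my @matched = grep { $_ =~ /a..b/ } @candidates;

print "@matched\n";   # axyb xxaccbxx
```

Here ab and axb fail because they have too few characters between the a and the b.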
- Another useful pair of symbols is ^ and $, which
match the beginning and end of a string
respectively. For example, /^a.*b$/ matches any
string where the very first character is a and
the very last character is b.
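To see what the anchors add, this sketch compares the anchored pattern with an unanchored version of the same pattern. The candidate strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical candidate strings for illustration.
my @candidates = ('ab', 'a long club', 'cab', 'abc');

# Anchored: the string must begin with a and end with b.
my @anchored = grep { $_ =~ /^a.*b$/ } @candidates;

# Unanchored: an a followed somewhere later by a b is enough.
my @unanchored = grep { $_ =~ /a.*b/ } @candidates;

print join(', ', @anchored), "\n";     # ab, a long club
print scalar(@unanchored), "\n";       # 4
```

The strings cab and abc contain an a followed by a b, so the unanchored pattern accepts them, but the anchors reject them because the a is not first or the b is not last.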
- Finally, with all these special interpretations
of characters, we need some way to search for
those characters themselves: put a backslash in
front of them. For example, /\.\*/ will match the
literal character sequence .*.
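A minimal sketch of the effect of the backslashes, assuming two made-up strings: one that really contains a period followed by an asterisk, and one that does not.

```perl
use strict;
use warnings;

my $with_symbols = 'the pattern a.*b';   # contains a literal .*
my $plain        = 'axxb';               # contains no . or *

# Escaped: matches only a real period followed by a real asterisk.
my $literal_hit = ($with_symbols =~ /\.\*/) ? 1 : 0;
my $literal_miss = ($plain =~ /\.\*/) ? 1 : 0;

# Unescaped: "any character, any number of times" matches anything.
my $wildcard_hit = ($plain =~ /.*/) ? 1 : 0;

print "$literal_hit $literal_miss $wildcard_hit\n";   # 1 0 1
```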
- To play with regular expressions some, let's
look at a very simple Perl version of the Unix
utility grep. (The other special regular
expression symbols can be seen with
perldoc perlre.)