- Today
- Practical Concerns
- As linguists, much of what we want to do with a programming
language is pattern-matching over texts/corpora.
- Find all words that begin with k and end with a vowel.
- Find all lines that have a k followed by a vowel followed
by the number 7, but not containing the letter u.
- Find all three-syllable words.
- Find all adjectives ending in -ic.
- Find all plural nouns preceded by the in
questions.
- Regular expression searches are the most popular, powerful,
and easiest tool to use.
- Regular expressions can't do everything, but in conjunction a
full programming language, you can do almost anything
with them.
- Theory
- Regular expressions are part of a mathematical theory of
how to describe formal languages, i.e. sets of strings.
- Regular languages are one class of formal languages. These
are easy to express and easy to implement efficiently.
- A formal language is a set of strings defined over a finite
alphabet.
- A regular language is defined with some combination of three
operations: concatenation, union, and Kleene star.
- Any letter from the finite alphabet is a regular
expression, e.g. a, c.
- Concatenation. Any two regular expressions in
sequence comprise a regular expression e.g. aa,
ab, ba, etc. This is recursive, e.g.
(ab)a, b(ca), a((cd)(ab)), etc.
- Union. Any regular expression or ("|") some
other regular expression comprise a regular expression, e.g.
a|b = {a,b}, (ad)|b = {ad,b},
a(d|b) = {ad,ab}, etc.
- Kleene star. Any number of repetitions of some
regular expression ("*") comprise a regular expression, e.g.
a* = {ε, a, aa, aaa, ...},
ab*c* = {a, ab, ac, abbb, abccc, ...},
a(bc)* = {a, abc, abcbc, abcbcbc, ...},
a(b|c)* = {a, ab, ac, abb, abc, acb, acc, ...}, etc.
- All regular expressions are ultimately defined in terms of these
basic operations.
- Haskell Syntax
- To understand Haskell regular expressions, we must first
understand polymorphic functions, functions that can return
multiple different types. The function
read
, for
example, takes a String
argument and can return
different types, e.g. Int
, Double
,
Bool
.
- The context usually determines the return type, but we can
force a particular type by using a special syntax. If you type
an expression like
read "1"
at the
ghci
prompt, you'll get an error. However, if you
type 2 + (read "1")
instead, the context will
tell read
to return a number. You can force
read
to return an Int
with no context
with the following syntax: read "1"::Int
.
- Notice that this use of double-colon is similar to, but not
identical to, its use in type declarations.
- We can do pretty much everything we want with regular
expressions with just one infix function
=~
from
the Text.Regex.Posix
package. You can use it in
your programs by including this at the beginning of the file:
import Text.Regex.Posix ((=~))
.
- The basic syntax is: String
=~
Pattern. There
are at least six different return type options. Let's
consider these options in the context of this call:
"abcd" =~ ".(b|d)"
. We know what the union expression
means; the dot matches any single character. This pattern does
match the string in two ways. Try each of these:
"abcd" =~ ".(b|d)"
"abcd" =~ ".(b|d)"::Bool
"abcd" =~ ".(b|d)"::String
"abcd" =~ ".(b|d)"::Int
"abcd" =~ ".(b|d)"::[String]
"abcd" =~ ".(b|d)"::(String,String,String)
"abcd" =~ ".(b|d)"::(Int,Int)
- What does this do?
import Text.Regex.Posix ((=~))
import System.IO (hFlush,stdout)
main = do putStr "Enter a word: "
hFlush stdout
w <- getLine
putStr "Enter a pattern: "
hFlush stdout
p <- getLine
putStrLn (if w =~ p then "yes" else "nope")
- What does this do?
import Text.Regex.Posix ((=~))
import System.Environment (getArgs)
main = do as <- getArgs
let p = as!!0
let fn = as!!1
f <- readFile fn
putStrLn (show (f =~ p::Int))
- Regular Expression Syntax
- There are a number of different ways to encode and
extend regular expression syntax. The
Text.Regex.Posix
package uses POSIX
syntax for regular expressions. Here
is a Haskell-agnostic description of that syntax.
- There are a number of additional regular expression
widgets.
- Parentheses can be used to show scope, e.g.
a(b|c)
is different from
(ab)|c
, and ab*c
is different
from (ab)*c
.
- Square brackets can be used to express a disjunction
of any number of single characters, e.g.
a[bcd]e
is equivalent to a(b|c|d)e
.
More examples: [abcdefghijkl]a*[def]
,
([de]|g)abc*
, [abcde]*f[ghij]*
, etc.
- If the first character in the square brackets is caret,
e.g.
[^abc]
, then the disjunction is negative. In
this case, any character except a, b, or
c. For example, [^aeiou]
denotes a
consonant letter (of English).
- There is a whole set of special symbols that stand for
sets of characters that can be used
only in square bracket expressions. The expression
[:lower:]
refers
to all lower-case letters. Thus [[:lower:]A]
refers to all lower-case letters plus A. Other
useful expressions are [:alpha:]
for alphabetic
characters or [:space:]
for space or tab. The
page cited above (repeated here)
gives others.
- Character classes can also be represented with square
brackets and hyphens for obvious cases, e.g.
[a-m]
, [A-Z]
,
[0-9]
, etc.
- There are special symbols that match the beginning and
end of a string. We can distinguish them as follows.
The expression
ab
matches any string that contains
ab
; ^ab
requires ab
be
at the beginning of the string. Distinguish this from something
like [^a]b
which matches any string where a b
is preceded by something other than an a. An
expression like ab$
requires the match to be at
the end of the string. Finally, ^ab$
requires
the whole string match.
- Period
.
matches a single character. For
example, a..b
matches any 4-letter sequence that
begins with a and ends with b.
- We have some tweaks for Kleene star:
+
and
?
. Thus ab*c
matches zero or more
instances of b; ab+c
matches one or more
instances; and ab?c
matches zero or one
instances.
- Types Review
- Practice
- Regular expression syntax:
- What's the difference between
^ab
,
[^a]b
and [a^]b
?
- Write a regular expression that matches a 4-segment
word followed by a 2-segment word.
- Write a regular expression that matches any
word of English that ends with a voiceless stop.
- Write a regular expression that matches any word
preceded immediately by a definite article in English.
- Haskell syntax:
- Write a program that will count the number of English
conjunctions in a file.
- Write a program that will count the number of instances
of adjacent vowel letters in a file.
- Write a program that will return a list of all
proper nouns that occur in a file.
- Write a function that will find a pattern in a string and
replace the matched substring with another string.
- Write a function that will find a pattern in a string and
replace the matched substring with another string however many
times it occurs.