Linguistics 408/508 | Fall 2003 | Hammond | Handout 13
Overview
- Questions from last time
- split()
- Hashes
- Corpus work
Background
- Some of the examples today require some sort of
text file to work with. Download the Brown
Corpus to your machine by
right-clicking on this link.
- Don't forget regular expressions...
Tokenizing
- We've already seen the regular expression
functions: // for matching and s/// for replacing.
Both of those are applied to a named string with
the matching operator =~.
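As a quick reminder, here is a minimal sketch of both operations used with =~. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# A made-up sample string.
my $sentence = "The cat sat on the mat";

# // asks whether the pattern occurs in the string bound with =~.
my $has_cat = 0;
if ($sentence =~ /cat/) {
    $has_cat = 1;
    print "Found 'cat'\n";
}

# s/// replaces the first match, changing the string in place,
# so we work on a copy here to keep the original intact.
my $changed = $sentence;
$changed =~ s/cat/dog/;
print "$changed\n";
```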
- There is yet another important regular expression
function: split(/regex/,somestring). This
command takes a string and breaks it into a list
of items, where the items are separated by the
pattern designated by the regular expression.
The pattern can be more complex than a single
character, e.g. a run of whitespace or punctuation.
Notice, however, that split() does not take the
matching operator =~. It is called like an ordinary
function, and the list it returns is stored with the
assignment operator =, e.g. @words = split(/ /,$line).
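The following sketch shows a simple split and two more complex ones. The sample strings and variable names are assumptions chosen for illustration; note the two spaces between "quick" and "brown".

```perl
use strict;
use warnings;

# A made-up line with a double space in the middle.
my $line = "the quick  brown fox";

# Simple split: break at every single space.
# The double space produces one empty item, so this list has 5 items.
my @simple = split(/ /, $line);

# More complex split: treat any run of whitespace as one separator.
# This list has 4 items.
my @words = split(/\s+/, $line);

# Splitting on runs of whitespace and punctuation together.
# Trailing empty items are dropped, so this list has 3 items.
my @items = split(/[\s,.;]+/, "one, two; three.");

print join("|", @simple), "\n";
print join("|", @words), "\n";
print join("|", @items), "\n";
```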
Hashes
- One of the most useful data structures that we
have yet to consider is the hashtable, or hash.
Hashes are rather similar to arrays: like arrays,
they represent a set of related variables. What's
different is that the individual elements of a
hash are named, rather than indexed by number.
The entire set is marked with %, e.g. %myhash.
Individual hash members are marked with $, with
the hash index, or name, in curly braces, e.g.
$myhash{thekey}.
- For example, if we put three elements in an array
@myarray, we refer to the individual elements by
numerical indices, e.g. $myarray[0], $myarray[1],
and $myarray[2]. If we do the same thing with a
hash %myhash, then we must specify the names of
the individual hash elements, for example tom,
dick, and harry, giving $myhash{tom},
$myhash{dick}, and $myhash{harry}. As with an
array, we can put any scalar value we want (a
number or a string, for instance) in these hash
elements.
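Here is a minimal sketch contrasting the two. The values stored are made up; only the names @myarray, %myhash, tom, dick, and harry come from the text above.

```perl
use strict;
use warnings;

# An array: elements are picked out by numerical index.
my @myarray = ("linguist", "phonologist", "syntactician");
print "$myarray[0]\n";

# A hash: elements are picked out by name (key).
my %myhash;
$myhash{tom}   = "linguist";
$myhash{dick}  = 42;            # values can be numbers or strings
$myhash{harry} = "syntactician";
print "$myhash{tom}\n";

# A hash element can be updated just like any other scalar variable.
$myhash{dick} = $myhash{dick} + 1;
print "$myhash{dick}\n";
```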
- The names/indices for a hash are referred to as
its keys. These can be obtained from a hash as a
list with the command keys(), e.g. keys(%myhash).
The order in which keys() returns the keys is
determined by how Perl stores the hash internally
and should not be relied on, but it is often
useful to examine the key/value pairs in
alphabetical order (by keys). This can be done
easily by applying the sort() command to the list
of keys, e.g. sort(keys(%myhash)).
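A small sketch of keys() and sort() together. The hash contents are made up, and the => notation is just another way of writing the comma between each key and its value.

```perl
use strict;
use warnings;

# A made-up hash with three key/value pairs.
my %myhash = (tom => 3, dick => 1, harry => 2);

# keys() returns all the keys, in no particular order.
my @unsorted = keys(%myhash);

# sort() puts the list of keys into alphabetical order.
my @sorted = sort(keys(%myhash));

# Print each key with its value, alphabetically by key.
foreach my $key (@sorted) {
    print "$key: $myhash{$key}\n";
}
```

By default sort() compares strings character by character, so capitalized keys come before lowercase ones.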
Concordancing
- One of the most useful things for linguists to do
with a hash is to make a concordance. A
concordance is a list of the words in some text
along with the number of times each word occurs.
This can be constructed in Perl with the
following steps.
- Open a file (and close
it!).
- Read the lines of the file one by
one (chomping each).
- Split each line into words.
- Use each word as the key to a
hash.
- Go through the words of each line
one by one.
- Each word should be a key to a
hash; the number of times that
word occurs in the file is its
value. As you examine each word,
add one to the current count for
that word, or create a new hash
element with a count of one if
the word has not been seen before.
- Print out the keys and counts in
alphabetical order.
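The steps above can be sketched as follows. This is one possible arrangement, not the only one: the subroutine name count_words, the script name concordance.pl, and the idea of giving the file name on the command line are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Build a hash of word => count for the named file.
# (count_words is a hypothetical name chosen for this sketch.)
sub count_words {
    my ($filename) = @_;
    my %count;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    while (my $line = <$fh>) {
        chomp($line);
        # Split the line into words at runs of whitespace.
        foreach my $word (split(/\s+/, $line)) {
            next if $word eq '';    # a leading space yields one empty item
            if (exists $count{$word}) {
                $count{$word} = $count{$word} + 1;   # seen before: add one
            } else {
                $count{$word} = 1;                   # first occurrence
            }
        }
    }
    close($fh);
    return %count;
}

# Usage (hypothetical names): perl concordance.pl somefile.txt
if (@ARGV) {
    my %count = count_words($ARGV[0]);
    # Print the words and their counts in alphabetical order.
    foreach my $word (sort(keys(%count))) {
        print "$word\t$count{$word}\n";
    }
}
```

The if/else in the inner loop mirrors the step as worded above. In practice, the single statement $count{$word}++ does the same job, since Perl creates the element on first use.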
- We can add various bells and
whistles too:
- Normalization. Do we want to
distinguish words based on case?
If not, use the lc() function to
lowercase each word.
- How do we undo
hyphenation?
- Can we compute some rudimentary
statistics? How do we find all
the hapax legomena (words that
occur only once)? How do we
find the most frequent word?
(The former is easy; the latter
is more difficult.)
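Here is one hedged sketch of three of these additions: case normalization, hapax legomena, and the most frequent word. The small %count hash is made up and stands in for a real concordance, and the variable names are assumptions.

```perl
use strict;
use warnings;

# A made-up count hash standing in for a real concordance.
my %count = (the => 5, cat => 1, sat => 2, mat => 1);

# Normalization: lowercase a word before using it as a key,
# so that "The" and "the" are counted together.
my $key = lc("The");
$count{$key} = $count{$key} + 1;    # 'the' is now 6

# Hapax legomena: the words whose count is exactly one.
my @hapaxes;
foreach my $word (sort(keys(%count))) {
    if ($count{$word} == 1) {
        push(@hapaxes, $word);
    }
}

# Most frequent word: go through every key, remembering
# the word with the highest count seen so far.
my $best_word  = '';
my $best_count = 0;
foreach my $word (keys(%count)) {
    if ($count{$word} > $best_count) {
        $best_word  = $word;
        $best_count = $count{$word};
    }
}

print "Hapaxes: @hapaxes\n";
print "Most frequent: $best_word ($best_count)\n";
```

If two words tie for the highest count, this version keeps whichever one keys() happens to return first.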