Linguistics 408/508 | Fall 2003 | Hammond | Handout 14
Overview
- Questions from last time
- Corpus work (continued)
  - Download the newdic file if you haven't already.
  - Download the Brown corpus file if you haven't already.
Concordancing
- One of the most useful things a linguist can do with a hash is to make a concordance. A concordance is a list of the words in some text along with the number of times each word occurs. One can be built in Perl with the following steps.
  - Open a file (and close it!).
  - Read the lines of the file one by one (chomping each).
  - Split each line into words.
  - Use each word as the key to a hash.
  - Go through the words of each line one by one.
  - Each word should be a key in the hash; the number of times that word occurs in the file is its value. As you examine each word, either add one to the current count for that word or create a new hash element with a count of one.
  - Print out the keys and counts in alphabetical order.
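The steps above can be sketched roughly as follows. This is only one possible solution, not necessarily the one intended for the course. The sub name count_words and the sample text are made up for illustration, and the text is read from a string (an in-memory file) so that the script runs without any external file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a word => count hash from an open filehandle.
# (count_words is a name invented for this sketch.)
sub count_words {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;                          # remove the trailing newline
        # Splitting on ' ' breaks at whitespace and skips leading whitespace.
        foreach my $word (split ' ', $line) {
            $count{$word}++;                  # a new word starts at 1
        }
    }
    return %count;
}

# Sample text held in a string. To read a real file, open it by name
# instead, e.g. open(my $fh, '<', $filename) or die "Can't open: $!";
my $text = "the cat saw the dog\nthe dog ran\n";
open(my $fh, '<', \$text) or die "Can't open sample text: $!";
my %count = count_words($fh);
close($fh);

# Print the keys and counts in alphabetical order.
foreach my $word (sort keys %count) {
    print "$word\t$count{$word}\n";
}
```

Note that this version treats "The" and "the", or "dog" and "dog.", as different words, since it splits on whitespace only.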
- We can add various bells and whistles too:
  - Normalization. Do we want to distinguish words based on case? Use the lc() function if not.
  - How do we deal with hyphenation?
  - Can we compute some rudimentary statistics?
  - How do we find all the hapax legomena?
  - How do we find the most frequent word?
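A few of these extras can be sketched as below. The choices made here are assumptions rather than the expected answers: case is folded with lc(), punctuation is stripped only from the edges of each word (so a hyphenated form such as "well-known" counts as one word), and ties for the most frequent word are broken alphabetically. The sub name count_normalized and the sample sentences are invented for the example.

```perl
use strict;
use warnings;

# Count words in a list of lines after lowercasing them and stripping
# punctuation from both edges. (count_normalized is a made-up name.)
sub count_normalized {
    my @lines = @_;
    my %count;
    foreach my $line (@lines) {
        foreach my $raw (split ' ', $line) {
            my $word = lc($raw);              # "The" and "the" become one word
            $word =~ s/^[[:punct:]]+//;       # strip leading punctuation
            $word =~ s/[[:punct:]]+$//;       # strip trailing punctuation
            next if $word eq '';              # the token was only punctuation
            $count{$word}++;
        }
    }
    return %count;
}

my %count = count_normalized("The cat saw the dog.",
                             "The well-known dog ran!");

# Rudimentary statistics: tokens (running words) and types (distinct words).
my $tokens = 0;
$tokens += $_ foreach values %count;
my $types = scalar keys %count;

# Hapax legomena: words that occur exactly once.
my @hapax = grep { $count{$_} == 1 } sort keys %count;

# Most frequent word: sort by descending count, then alphabetically.
my ($top) = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

print "tokens: $tokens, types: $types\n";
print "hapax legomena: @hapax\n";
print "most frequent: $top ($count{$top})\n";
```

A different hyphenation policy, such as splitting "well-known" into two words, would only change the splitting step; the counting and the statistics stay the same.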
Using Newdic
- How many two-syllable words are there?
- How many three-syllable verbs?
- How many words begin with the orthographic string <di>?
- What is the largest medial consonant cluster?
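The layout of the newdic file is not described here, so the sketch below does not read it. Instead it works on a short made-up list of spellings, standing in for whatever field of the file holds the word, and shows one way to approach the last two questions with regular expressions. It treats the letters a, e, i, o, u as vowels and everything else (including y) as a consonant, which is a simplification; a real answer about consonant clusters would more likely use the pronunciation field than the spelling. The syllable and part-of-speech questions depend on the file's own fields and are not attempted.

```perl
use strict;
use warnings;

# Toy spellings, invented for this sketch (not taken from newdic).
my @words = qw(dinner abstract digit instruct window dive);

# Words that begin with the orthographic string <di>.
my @di_words = grep { /^di/ } @words;
print "Words beginning with di: ", scalar(@di_words), "\n";

# Longest word-medial run of consonant letters: the run must have a
# vowel letter immediately before it and immediately after it.
my $longest = '';
foreach my $word (@words) {
    while ($word =~ /(?<=[aeiou])([^aeiou]+)(?=[aeiou])/g) {
        $longest = $1 if length($1) > length($longest);
    }
}
print "Longest medial cluster: $longest\n";
```

When two clusters have the same length, this version keeps the first one it finds.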
Using the Brown corpus
- How many words are in the text?
- How many distinct words are in the text?
- How many determiners are in the text?
- How many times is a determiner immediately followed by a number (on the same line or on two sequential lines)?
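The counts above can be sketched as follows, under assumptions that may not match the course's copy of the corpus: each token is written as word/tag, the tags at and dt mark determiners, and cd marks a cardinal number. Adjust these to whatever the file actually uses. The sub name brown_counts and the two sample lines are invented. The point worth noticing is that the previous tag is remembered outside the loop over lines, so a determiner at the end of one line followed by a number at the start of the next is still counted.

```perl
use strict;
use warnings;

# Assumed tag names; change these to match the corpus file in use.
my %is_det  = map { $_ => 1 } qw(at dt);
my $num_tag = 'cd';

# Returns (tokens, distinct words, determiners, determiner+number pairs).
# (brown_counts is a made-up name for this sketch.)
sub brown_counts {
    my @lines = @_;
    my ($tokens, $dets, $det_num) = (0, 0, 0);
    my %types;
    my $prev_tag = '';        # kept across lines on purpose
    foreach my $line (@lines) {
        foreach my $item (split ' ', $line) {
            # Split at the last slash, so a word containing "/" keeps it.
            next unless $item =~ m{^(.+)/([^/]+)$};
            my ($word, $tag) = (lc $1, lc $2);
            $tokens++;
            $types{$word}++;
            $dets++    if $is_det{$tag};
            $det_num++ if $is_det{$prev_tag} and $tag eq $num_tag;
            $prev_tag = $tag;
        }
    }
    return ($tokens, scalar(keys %types), $dets, $det_num);
}

# Made-up sample: one determiner+number pair within a line, one across lines.
my @sample = (
    "The/at two/cd dogs/nns saw/vbd the/at",
    "three/cd cats/nns",
);
my ($tokens, $types, $dets, $det_num) = brown_counts(@sample);
print "words: $tokens\ndistinct words: $types\n";
print "determiners: $dets\ndeterminer + number: $det_num\n";
```

If the corpus tags punctuation as separate tokens, those tokens would need to be skipped before counting words.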
Using them together
- How many uninflected verbs occur in the Brown corpus?
- How many verbs occur with direct objects in the text?