- Questions from last time
- Corpus work (continued)
- Download the newdic
file if you haven't already.
- Download the Brown
corpus file if you haven't already.
- One of the most useful things for linguists to do
with a hash is to make a concordance. A
concordance is a list of the words in some text
along with the number of times each word occurs.
This can be constructed in Perl by following
these steps.
- Open a file (and close it when you are done).
- Read the lines of the file one by
one (chomping each).
- Split each line into words.
- Use each word as the key to a hash.
- Go through the words of each line
one by one.
- Each word should be a key to a
hash; the number of times that
word occurs in the file is its
value. As you examine each word,
add one to the current count for
that word, or create a new hash
element with a count of one.
- Print out the keys and counts in
- Click here if
you get stuck.
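The steps above can be sketched roughly as follows. This is a minimal sketch rather than the course's own solution: the subroutine name `build_concordance` is made up for illustration, words are assumed to be separated by whitespace, and the output order (alphabetical by key) is an assumption.

```perl
use strict;
use warnings;

# Build a word => count hash from the lines of an open filehandle.
# (build_concordance is a hypothetical name used only for this sketch.)
sub build_concordance {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;                        # drop the trailing newline
        for my $word (split ' ', $line) {   # split on runs of whitespace
            $count{$word}++;                # a new key starts at 1
        }
    }
    return \%count;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $file = $ARGV[0];
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    my $count = build_concordance($fh);
    close $fh;

    # Assumed output order: alphabetical by word.
    print "$_\t$count->{$_}\n" for sort keys %$count;
}
```

Saved under a name of your choosing, it would be run as `perl yourscript.pl sometext.txt`. The `++` operator covers both cases in the last step: a missing key is treated as zero and created with a count of one.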
- We can add various bells and whistles too:
- Normalization. Do we want to
distinguish words based on case?
Use a case-folding function (such as Perl's built-in lc) if not.
- How do we deal with
- Can we compute some rudimentary
- How do we find all
the hapax legomena?
- How do we find the most frequent words?
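A few of these refinements might look like the sketch below. All three subroutine names are hypothetical. It assumes that case is folded with `lc`, that punctuation is simply trimmed from the edges of each word (one possible policy, not necessarily the intended one), and that ties in frequency are broken alphabetically.

```perl
use strict;
use warnings;

# Count words after lowercasing and trimming punctuation from the edges.
sub count_normalized {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        for my $word (split ' ', lc $line) {
            $word =~ s/^\W+|\W+$//g;    # assumed policy: strip edge punctuation
            next unless length $word;   # skip tokens that were all punctuation
            $count{$word}++;
        }
    }
    return \%count;
}

# Hapax legomena: words that occur exactly once, in alphabetical order.
sub hapax_legomena {
    my ($count) = @_;
    my @hapax = grep { $count->{$_} == 1 } sort keys %$count;
    return @hapax;
}

# The $n most frequent words; ties are broken alphabetically.
sub most_frequent {
    my ($count, $n) = @_;
    my @sorted = sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}
```

Sorting the keys by descending count is simple but orders the whole vocabulary. If only the single most frequent word is needed, one pass that keeps track of the running maximum would do.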
Using the newdic file
- How many two-syllable words are there?
- How many three-syllable verbs?
- How many words begin with the orthographic string
- What is the largest medial consonant cluster?
Using the Brown corpus
- How many words are in the text?
- How many distinct words are in the text?
- How many determiners are in the text?
- How many times is a determiner immediately
followed by a number (on the same line or on two consecutive lines)?
Using them together
- How many uninflected verbs occur in the Brown corpus?
- How many verbs occur with direct objects in the Brown corpus?