CS 106 Winter 2016

Assignment 08: Text processing


Question 1 Rhyming dictionary

The wonderful CMUdict dataset contains over a hundred thousand words and names. Each word is accompanied by its pronunciation, in the form of a sequence of "phonemes" (individual sounds) written out in a symbolic form, for example:

MUSTARD  M AH1 S T ER0 D

You don't need to understand all the individual phonemes. All you need to know is that every vowel sound includes a digit that tells you the stress level of the syllable containing that vowel: 0 for unstressed, 1 for primary stress, and 2 for secondary stress. We say "MUS-tard", not "mus-TARD".

Every word has a "rhyming pattern", a sequence of phonemes that will be shared by all words that rhyme. A word's pattern starts at its final stressed vowel and runs to the end of the word. For MUSTARD, that pattern will be AH1 S T ER0 D. Note that CUSTARD, with the pronunciation K AH1 S T ER0 D, will therefore rhyme with MUSTARD.

MUSTARD  M AH1 S T ER0 D
CUSTARD  K AH1 S T ER0 D

For compactness, we can string all the phonemes in the rhyming pattern together into a single "signature". It doesn't really matter how we do this; because it's simpler to program, I suggest running the phonemes together backwards into a single string with no spaces. Thus, AH1 S T ER0 D is stored as the string "DER0TSAH1". Two words rhyme precisely when they have the same signature.

Based on CMUdict, then, we can build a "rhyming dictionary"—a sketch that lets us type in a word, and automatically find all the other words that rhyme with it. We need to build a sequence that associates every word with its rhyming pattern. Given a query word, we look the word up in the list and find the associated pattern. Then we search for all other words that use the same pattern.

If you're comfortable with this preamble, then proceed as follows:

  1. In the starter code, open the sketch RhymingDictionary. Try running it. You can type a word into the text field at the top of the sketch. No matter what you type in, the sketch displays "Word not found", because there's no code there yet to load in the word list and process rhymes. Now read through and understand the code.
  2. Add code to the setup() function to read in the provided file pronunciation.txt. Look for the TODO comment. You'll need to loop over every line in the file. If the line is a comment (i.e., if it starts with ";;;"), ignore it. Otherwise use splitTokens() to turn it into an array. From there you can get the rhyme signature by passing the array to the provided helper function (don't write this function yourself—it's written for you!). Create a new association between the word and its rhyme signature and add that association to the global array in the sketch.
  3. Next, you'll need to find the TODO comment in the function getRhymes() and add code there too. The function is intended to look up all the words that rhyme with the passed-in word, and return them as an array. The words in the array will be added to the user interface for you. This function requires two passes (i.e., two for loops) over the array. First, find the entry for the word you're given, so that you can determine its rhyme signature. Second, find all the words that have that rhyme signature.
  4. You can complete the assignment purely by writing code in the spots labeled TODO, though of course you're welcome to add new global variables and helper functions if they simplify your code.

Submit your work in a sketch titled RhymingDictionary.

Question 2 Word cloud



A simple way to visualize the important words in a large text is to use a "word cloud", in which each word is scaled relative to the frequency with which it occurs. Perhaps the most popular tool for creating word clouds is Wordle. Unfortunately, the algorithm for laying out words in the style of Wordle is too complicated to reproduce in this course. But we can draw simpler arrangements of words by letting the Fisica library drop them into a physical simulation. Fisica will work on our behalf to force words not to overlap, perhaps with a little nudge (see the video above).

You will write a sketch that builds a lo-fi word cloud from a provided corpus of text. The text is in the file bodies.txt, and consists of the bodies of a large collection of emails that Hillary Clinton made public after it was revealed that she used a private email server for a lot of her correspondence. Which words were most important in those emails?

Proceed as follows:

  1. Open the starter code for the WordCloud sketch. It sets up the structure for the sketch, including the Fisica simulation, but leaves most of the text processing for you. When you run it, it will display a few default words to give you a sense ofthe appearance and interaction in the sketch. Read through the code, especially the comments.
  2. There are three places where it is suggested that you add code, all of them in setup(). Your job is to complete those sections of code, borrowing ideas from the comments placed there, and adding variables and helper functions as you wish. In the first section, you must read in and store a provided "stop list", consisting of words that you don't want to see in the cloud (like "the", "a", and so on). In the second section, you must read in the actual corpus (bodies.txt) and count up the number of occurrences of each individual word in it. In the final section you must pull out the most frequent words in the corpus and hand them to Fisica for rendering.
  3. In this sketch you'll be using the class IntDict, a dictionary that maps String keys to int values. You'll use it in two ways: first to store the stop list (by associating each stop list word with the number 1) and then to store the word counts in the corpus (by associating each word with the number of times is occurs in the text). See the Processing documentation for examples. You'll want to look closely at the get(), set(), hasKey() and keyArray() methods, and possibly the add() method.
  4. You must draw at least the top 50 words, or more if you want. You should not try to draw them all! Expect to draw at most 100–200 of them. Each lower-ranked word must be slightly smaller than the word above in the list. The easiest approach is to define a scaling value for the top word. Then, every time you add a word, multiply the scaling value by a percentage, say 0.95. Make sure that you choose the number of words and their sizes so that they all fit comfortably into the sketch window.
  5. Feel free to experiment with fancier rendering styles like fonts or colours. But please stick with Fisica. If you want to experiment with your own layout algorithm we'd still love to see it, but please submit that in a separate sketch with a detailed comment explaining what you did. Any creative extensions to the word cloud rendering might receive bonus marks.

Submit your work in a sketch titled WordCloud.

Submission

Remember to review the Code Style Guide and use Processing's built-in auto format tool. Then review the How To Submit document. At the top of all of your source files, be sure to include a comment with your name and student ID number. When you're ready, zip all of the sketches created above (RhymingDictionary and WordCloud) into a single archive called A08.zip and upload that file to LEARN.