Problem Set 6 : Dictionary

Hello fellow coder! Welcome back to CS50! So, this is the beginning of make your own data structure and optimize your code’s (real-world) running time. What is the problem actually? Well, this problem set is about to make a “Spell-checker”. Like problem set 5, it will comes with the distribution code. How is the spell checker works? Well, we’re going to find out later. This was the hardest problem set that I ever done to be honest.

Now, let’s get started!
Per the instruction open up the terminal window, and execute :

update50

To make sure your appliance is up-to-date.Then, execute :

~/Dropbox

In order to navigate your way to ~/Dropbox directory. Then, execute :

wget http://cdn.cs50.net/2013/fall/psets/6/pset6/pset6.zip

To download a ZIP of this problem set’s disro. Next, execute :

ls

You could see the file called pset6.zip in your ~/Dropbox directory. Then, you’ll want to unzip the file by executing the below :

unzip pset6.zip

Then execute :

ls

To ensure that you have a directory called pset6. If you then want to delete the zip file by execute :

rm -f pset6.zip

To remove the pset6.zip file. Next execute :

cd pset6

Now let’s open what’s inside the pset6 directory by execute :

ls

Then you should the the directory contains the below :

dictionary.c dictionary.h Makefile questions.txt speller.c

Now, let’s get started!

First, let’s be friends with “running time“. On input size of n, an algorithm with a running time of n is asymptotically equivalent in term big O, to an algorithm with a running time of 2n. In the real world, though, the fact of the matter is that the latter feels twice as slow as the former.
The challenge of this problem set is to implement the fastest spell-checker you can! By “fastest”, though, we’re talking actual, real-world, noticeable seconds-none of that asymptotic stuff this time.
Take your time to take a look at dictionary.h on gedit. In this file you should see the cs50’s staff declared a dictionary’s functionality.

dictionary.h
They declared the prototype of four functions, take a note of what each should do. Anyway, you’ll want to implement 4 functions to complete this problem set, and they are : check, load, size and unload. The prototypes inside the dictionary.h are :

// maximum length for a word
// (e.g., pneumonoultramicroscopicsilicovolcanoconiosis)
#define LENGTH 45
/**
 * Returns true if word is in dictionary else false.
 */
bool check(const char* word);

/**
 * Loads dictionary into memory. Returns true if successful else false.
 */
bool load(const char* dictionary);

/**
 * Returns number of words in dictionary if loaded else 0 if not yet loaded.
 */
unsigned int size(void);

/**
 * Unloads dictionary from memory. Returns true if successful else false.
 */
bool unload(void);

#endif // DICTIONARY_H

Okay, next open up speller.c with gedit. And try to understand it. You won’t need to change anything in this file.

/****************************************************************************
 * speller.c
 *
 * Computer Science 50
 * Problem Set 6
 *
 * Implements a spell-checker.
 ***************************************************************************/

#include <ctype.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

#include "dictionary.h"
#undef calculate
#undef getrusage

// default dictionary
#define DICTIONARY "/home/cs50/pset6/dictionaries/large"

// prototype
double calculate(const struct rusage* b, const struct rusage* a);

int main(int argc, char* argv[])
{
    // check for correct number of args
    if (argc != 2 && argc != 3)
    {
        printf("Usage: speller [dictionary] text\n");
        return 1;
    }

    // structs for timing data
    struct rusage before, after;

    // benchmarks
    double ti_load = 0.0, ti_check = 0.0, ti_size = 0.0, ti_unload = 0.0;

    // determine dictionary to use
    char* dictionary = (argc == 3) ? argv[1] : DICTIONARY;

    // load dictionary
    getrusage(RUSAGE_SELF, &before);
    bool loaded = load(dictionary);
    getrusage(RUSAGE_SELF, &after);

    // abort if dictionary not loaded
    if (!loaded)
    {
        printf("Could not load %s.\n", dictionary);
        return 1;
    }

    // calculate time to load dictionary
    ti_load = calculate(&before, &after);

    // try to open text
    char* text = (argc == 3) ? argv[2] : argv[1];
    FILE* fp = fopen(text, "r");
    if (fp == NULL)
    {
        printf("Could not open %s.\n", text);
        unload();
        return 1;
    }

    // prepare to report misspellings
    printf("\nMISSPELLED WORDS\n\n");

    // prepare to spell-check
    int index = 0, misspellings = 0, words = 0;
    char word[LENGTH+1];

    // spell-check each word in text
    for (int c = fgetc(fp); c != EOF; c = fgetc(fp))
    {
        // allow only alphabetical characters and apostrophes
        if (isalpha(c) || (c == '\'' && index > 0))
        {
            // append character to word
            word[index] = c;
            index++;

            // ignore alphabetical strings too long to be words
            if (index > LENGTH)
            {
                // consume remainder of alphabetical string
                while ((c = fgetc(fp)) != EOF && isalpha(c));

                // prepare for new word
                index = 0;
            }
        }

        // ignore words with numbers (like MS Word can)
        else if (isdigit(c))
        {
            // consume remainder of alphanumeric string
            while ((c = fgetc(fp)) != EOF && isalnum(c));

            // prepare for new word
            index = 0;
        }

        // we must have found a whole word
        else if (index > 0)
        {
            // terminate current word
            word[index] = '';

            // update counter
            words++;

            // check word's spelling
            getrusage(RUSAGE_SELF, &before);
            bool misspelled = !check(word);
            getrusage(RUSAGE_SELF, &after);

            // update benchmark
            ti_check += calculate(&before, &after);

            // print word if misspelled
            if (misspelled)
            {
                printf("%s\n", word);
                misspellings++;
            }

            // prepare for next word
            index = 0;
        }
    }

    // check whether there was an error
    if (ferror(fp))
    {
        fclose(fp);
        printf("Error reading %s.\n", text);
        unload();
        return 1;
    }

    // close text
    fclose(fp);

    // determine dictionary's size
    getrusage(RUSAGE_SELF, &before);
    unsigned int n = size();
    getrusage(RUSAGE_SELF, &after);

    // calculate time to determine dictionary's size
    ti_size = calculate(&before, &after);

    // unload dictionary
    getrusage(RUSAGE_SELF, &before);
    bool unloaded = unload();
    getrusage(RUSAGE_SELF, &after);

    // abort if dictionary not unloaded
    if (!unloaded)
    {
        printf("Could not unload %s.\n", dictionary);
        return 1;
    }

    // calculate time to unload dictionary
    ti_unload = calculate(&before, &after);

    // report benchmarks
    printf("\nWORDS MISSPELLED:     %d\n", misspellings);
    printf("WORDS IN DICTIONARY:  %d\n", n);
    printf("WORDS IN TEXT:        %d\n", words);
    printf("TIME IN load:         %.2f\n", ti_load);
    printf("TIME IN check:        %.2f\n", ti_check);
    printf("TIME IN size:         %.2f\n", ti_size);
    printf("TIME IN unload:       %.2f\n", ti_unload);
    printf("TIME IN TOTAL:        %.2f\n\n", 
     ti_load + ti_check + ti_size + ti_unload);

    // that's all folks
    return 0;
}

/**
 * Returns number of seconds between b and a.
 */
double calculate(const struct rusage* b, const struct rusage* a)
{
    if (b == NULL || a == NULL)
    {
        return 0.0;
    }
    else
    {
        return ((((a->ru_utime.tv_sec * 1000000 + a->ru_utime.tv_usec) -
                 (b->ru_utime.tv_sec * 1000000 + b->ru_utime.tv_usec)) +
                ((a->ru_stime.tv_sec * 1000000 + a->ru_stime.tv_usec) -
                 (b->ru_stime.tv_sec * 1000000 + b->ru_stime.tv_usec)))
                / 1000000.0);
    }
}

Alright, next challenge is to implement load, check, size and unload as efficiently as possible. Now, open up dictionary.c on gedit. But, before you dive in, be sure to read the specifications.

OK, I assume you have read the specs, now let’s get you started! What do we need to do in dictionary.c ? In the dictionary.c, we have the helper function LOAD, which is loads the dictionary. Then, function CHECK, to checks if a given word is in dictionary. Function SIZE, returns the number of words in the dictionary. And function UNLOAD to frees the dictionary from memory .

Implement LOAD
A spell-checker will call LOAD function on the dictionary text file. For each word in the dictionary text file, LOAD function will store those words in the dictionary data structure. You can use either HASHTABLE or TRIE. But before we’re about dealing with linked list aswell, since the linked list is the buckets. In linked list, each node has a value, as well as a pointer to the next node. Dealing with linked list, is very important do not lose any links. And the last node points to NULL. How do we define it in C?

Linked list :

typedef struct node
{
   char word[LENGTH+1];
   struct node * next;
}
node;

And there we have linked list. It just one linked list. But hashtable is an whole array in linked list. We have the same structure like before. But if we wanna go with an actual hashtable, we’re gonna make node pointer array , here for example size of 500.

typedef struct node
{
 char word[LENGTH+1];
 struct node* next;
}
node;
node* hashtable[500]

Then, you’ll want to malloc the node for every new word that you have.

node* new_node = malloc(sizeof(node));
fscanf(file, "%s", new_node → word)

Where new_node -> word is the destination of that word.

Now, we’ll want to hash that word. That’s why we gonna call the HASH Function. Well, hash function takes a string and returns an index, in this case the index has to be less than the number of buckets. And the hash function has to be deterministic, that means, the same value needs to map to the same bucket every time. So, new_node → word has the word from dictionary. Then, hashing new_node → word will give us the index of the bucket of the hashtable. And it will insert that in the specific linked list.

Now, how about TRIE ?
Well, in TRIE function every node contains an array of node* s. One for every letter in alphabet + ”\’. And each element in the array points to another node. The structure is:

typedef struct node
{
   bool is_word;
   struct node* children[27];
}
node;
node* root;

OK now, let’s implement the LOAD function with TRIE in the pseudocode below :

for every dictionary word, iterate thru the trie
each element in children corresponds to a different letter.
check the value at children[i]
  if NULL , malloc the node, have children[i] point to it
  if not NULL, move to the new node and continue
if at the end of word, set is_word to true

Implement CHECK

Speller.c will call CHECK function on every word in the inputted text file and printing all miss-spelled words. CHECK function must operate case-insensitivity. And assume the strings will only alphabetical and/ or apostrophes.

Let’s look at how you might CHECK with HASHTABLE :

if the word exist, it can be found in the hashtable
which bucket would the word be in?
  hashtable[hash(word)]
search in that linked list by using strcmp

Now, let’s look at how you might CHECK with TRIE :

for each letter in input word
  go to corresponding element in children
  if NULL, word is misspelled
  if not NULL, move to next letter
  once at the end of input file
  check if is_word is true

Implement SIZE
Speller.c will also call SIZE function to determine of words in the dictionary. Just returns number of words in dictionary if loaded else 0 if not yet loaded.

Implement UNLOAD
And the last, speller.c will call UNLOAD function to free up memory.
Here’s how you might UNLOAD with hashtable :

for every element in hashtable
for every node in the linked list, you’ll want to free that node.

Freeing linked list :

node* cursor = head;
while(cursor != NULL)
 {
    node *temporary = cursor;
    cursor = cursor->next;
    free(temporary);
 }

Then, how about using TRIE ?

unload from bottom to top
travel to the lowest possible node
 free all pointers in children
 backtrack upwards, freeing all elements in each children array until you hit root node
recursion will coming handy

To make sure you’ve properly freed everything that you’ve malloced you can use VALGRIND. Running valgrind, will run your program how many bytes of memory you’re using and make sure that you’ve freed them all .

valgrind --leak-check=full ./speller ~cs50/pset6/texts/austinpowers.txt

Once you’ve done with dictionary.c, then compile and run your program as always. Then, check how fast your “spell-checker” works  by compare it with the staff’s. If it is done, you’ll want to check your correctness with check50.

Be sure to answer the following questions inside the questions.txt on gedit.

0. What is pneumonoultramicroscopicsilicovolcanoconiosis ?
1. According to its man page, what does getrusage do?
2. Per that same man page, how many members are in a variable of type struct rusage ?
3. Why do you think we pass before and after by reference (instead of by value) to calculate , even though we’re not changing their contents?
4. Explain as precisely as possible, in a paragraph or more, how main goes about reading words from a file. In other words, convince us that you indeed understand how that function’s for loop works.
5. Why do you think we used fgetc to read each word’s characters one at a time rather than use fscanf with a format string like “%s” to read whole words at a time? Put another way, what problems might arise by relying on fscanf alone?
6. Why do you think we declared the parameters for check and load as const ?

This was Problem Set 6.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s