CS 370 Lab #6: Hash (Symbol) Tables

  1. Learn hash table basics. Here is a sample hash function.
    int hash(char *s, int len)
    {
    unsigned int i;	/* unsigned, so that overflow wraps around */
    int j;
       /*
        * Compute the hash value for the string based on a scaled sum
        *  of its first ten characters, plus its length.
        */
       i = 0;
       j = len;
       if (j > 10)		/* limit scan to first ten characters */
          j = 10;
       while (j-- > 0) {
          i += *s++ & 0xFF;	/* add unsigned version of next char */
          i *= 37;		/* scale total by a nice prime number */
          }
       i += len;			/* add the (untruncated) string length */
       /* mask off the sign bit so the result is never negative; INT_MAX is in <limits.h> */
    return (int)(i & INT_MAX);
    }
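
    To see the arithmetic concretely, here is a minimal sketch that repeats the computation above with an unsigned running total (an assumption made here so that overflow wraps around instead of being undefined). The names hash_sketch(), bucket_of(), and NBUCKETS are hypothetical and are not part of the lab code; 43 is the table size used in step 2.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define NBUCKETS 43   /* hypothetical name; same size as the table in step 2 */

/* Variant of the sample hash that keeps its total in an unsigned int. */
int hash_sketch(const char *s, int len)
{
   unsigned int total = 0;
   int j = (len > 10) ? 10 : len;   /* scan at most ten characters */

   assert(s != NULL && len >= 0);
   while (j-- > 0) {
      total += (unsigned char)*s++;  /* add next character as 0..255 */
      total *= 37;                   /* scale total by a small prime */
   }
   total += (unsigned int)len;       /* add the untruncated length */
   return (int)(total & INT_MAX);    /* clear the sign bit */
}

/* Hypothetical helper: the bucket a NUL-terminated lexeme lands in. */
int bucket_of(const char *s)
{
   return hash_sketch(s, (int)strlen(s)) % NBUCKETS;
}
```

    Working "ab" by hand: (0 + 97) * 37 = 3589, then (3589 + 98) * 37 = 136419, and adding the length 2 gives 136421, so bucket_of("ab") is 136421 % 43 = 25. Because only the first ten characters are scanned, two identifiers of equal length that differ only after the tenth character always land in the same bucket.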
    
  2. Learn and understand "open hashing" (also called separate chaining), which handles an arbitrary number of entries. Open hashing declares each "bucket" to be a linked list.
       struct hashtableentry {
          char *lexeme;    /* actual string matched, unique in the hash table */
          union lexval {
             int category; /* when the entry is an identifier or reserved word */
             int ival;     /* when the entry is an integer literal */
             float rval;   /* when the entry is a real number literal */
             char *sval;   /* when the entry is a string literal */
             } val;
          struct hashtableentry *next;
       };
       typedef struct hashtableentry *hashtable[43]; /* 43 buckets; pick a larger prime for a bigger table */
    
    Sample code for hash table lookups and/or inserts follows. This function looks up a string in the hash table. If the item is found, the shared copy of the lexeme string is returned and, when l is non-NULL, the union pointed at by l is filled in from the entry. If the item is NOT found, it is inserted into the table using a strdup()'ed copy of the string, which is returned. I have done most of the changes needed to make this function calculate the lexval at insert-time; you need to fix it for string constants.
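       #include <stdlib.h>  /* malloc(), atoi(), atof() */
       #include <string.h>  /* strcmp(), strdup() */
       /* INTEGER, REAL, and SCONST must also be defined, e.g. by your token header */

       hashtable myhashtable;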
    
       char *lookup_insert(char *s, union lexval *l)
       {
       int i = hash(s, (int)strlen(s)) % 43; /* compute which bucket to use */
       struct hashtableentry *e = myhashtable[i]; /* select that bucket */
    
       /*
        * From here on out, the hash table works like a linked list.
        * This while loop performs a lookup.
        */
       while (e) {
          if (!strcmp(s,e->lexeme)) {
             if (l)
                *l = e->val;
             return e->lexeme;
             }
          e = e->next;
          }
    
       /*
        * insert
        */
       e = malloc(sizeof (struct hashtableentry));
       if (e == NULL) return NULL;
       e->lexeme = strdup(s);
       if (e->lexeme == NULL) { free(e); return NULL; }
       if (l == NULL)
          e->val.category = 0; /* no attribute supplied (see step 3) */
       else switch(l->category) {
       case INTEGER: /* calculate ival */
          e->val.ival = atoi(s); break;
       case REAL:    /* calculate rval */
          e->val.rval = atof(s); break;
       case SCONST: /* calculate sval; add your code here */
          e->val.sval = e->lexeme /* WRONG!!! */ ; break;
       default:
          e->val = *l;
          }
       e->next = myhashtable[i];
       myhashtable[i] = e;
       return e->lexeme;
       }
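
    The payoff of returning the shared copy is that every occurrence of a lexeme gets the same pointer, so later phases can compare lexemes by address. The sketch below is a stripped-down illustration of that idea, not the lab's lookup_insert(): the entry has no union lexval, and the names struct entry, simple_hash(), and intern() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 43

struct entry {               /* simplified: lexeme only, no union lexval */
   char *lexeme;
   struct entry *next;
};

static struct entry *table[NBUCKETS];   /* static storage: buckets start out NULL */

/* Stand-in hash for this sketch; any function of the string would do. */
static unsigned int simple_hash(const char *s)
{
   unsigned int h = 0;
   while (*s)
      h = h * 37 + (unsigned char)*s++;
   return h;
}

/* Return the table's shared copy of s, inserting a copy on first sight. */
const char *intern(const char *s)
{
   unsigned int b;
   struct entry *e;

   assert(s != NULL);
   b = simple_hash(s) % NBUCKETS;
   for (e = table[b]; e != NULL; e = e->next)   /* walk the bucket's chain */
      if (strcmp(s, e->lexeme) == 0)
         return e->lexeme;                      /* found: share it */

   e = malloc(sizeof *e);
   if (e == NULL) return NULL;
   e->lexeme = malloc(strlen(s) + 1);
   if (e->lexeme == NULL) { free(e); return NULL; }
   strcpy(e->lexeme, s);
   e->next = table[b];                          /* push onto front of chain */
   table[b] = e;
   return e->lexeme;
}
```

    If intern() is called twice with two different buffers that both hold "count", both calls return the same address, and that address is neither of the caller's buffers.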
    
  3. Save the above code in a file lextable.c; debug and modify it as needed. Add lextable.c (and lextable.o) to your compiler and makefile. Modify your lexical analyzer to call this function for each token in order to save a copy of the lexeme, in place of malloc/strcpy or strdup. You can get away with passing NULL for the union lexval pointer l in this step.
  4. Use the union lexval pointer so that your compiler computes the lexical attributes for a given lexeme only once, just as it stores only one copy of the string itself. You pass in a union lexval pointer that holds the integer category; lookup_insert() stores the corresponding lexical attribute value.
  5. (Optional/extra credit) If you "pre-populate" this table, inserting an entry for each of the reserved words with a non-null lexval pointer that contains its corresponding integer category, you can remove all the reserved words from your flex.l file and just use the IDENT regular expression to match them. When you call lookup_insert() later on with a reserved word, you pass in a union lexval pointer saying "I think this is an IDENT", but for reserved words, lookup_insert() overwrites the union it points to with the reserved word's category number, to use instead of IDENT.
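
    The pre-population idea in step 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the lab's code: each entry stores a plain int category instead of a union lexval, the token codes in the enum are made-up values (yours come from your parser), and lookup_insert_cat(), init_reserved(), and classify() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 43

enum { IDENT = 258, IF, WHILE, RETURN };   /* hypothetical token codes */

struct entry {
   char *lexeme;
   int category;            /* simplified stand-in for union lexval */
   struct entry *next;
};

static struct entry *table[NBUCKETS];

static unsigned int simple_hash(const char *s)
{
   unsigned int h = 0;
   while (*s)
      h = h * 37 + (unsigned char)*s++;
   return h;
}

/* On a hit, overwrite *category with the stored one; on a miss, store *category. */
const char *lookup_insert_cat(const char *s, int *category)
{
   unsigned int b;
   struct entry *e;

   assert(s != NULL && category != NULL);
   b = simple_hash(s) % NBUCKETS;
   for (e = table[b]; e != NULL; e = e->next)
      if (strcmp(s, e->lexeme) == 0) {
         *category = e->category;   /* hit: the stored category wins */
         return e->lexeme;
      }

   e = malloc(sizeof *e);
   if (e == NULL) return NULL;
   e->lexeme = malloc(strlen(s) + 1);
   if (e->lexeme == NULL) { free(e); return NULL; }
   strcpy(e->lexeme, s);
   e->category = *category;         /* miss: remember the caller's category */
   e->next = table[b];
   table[b] = e;
   return e->lexeme;
}

/* Insert every reserved word once, before any scanning happens. */
int init_reserved(void)
{
   static const struct { const char *word; int category; } reserved[] = {
      { "if", IF }, { "while", WHILE }, { "return", RETURN }
   };
   size_t k;

   for (k = 0; k < sizeof reserved / sizeof reserved[0]; k++) {
      int cat = reserved[k].category;
      if (lookup_insert_cat(reserved[k].word, &cat) == NULL)
         return -1;
   }
   return 0;
}

/* What the action for the IDENT rule would do with the matched text. */
int classify(const char *text)
{
   int cat = IDENT;                 /* "I think this is an IDENT" */
   if (lookup_insert_cat(text, &cat) == NULL)
      return -1;
   return cat;
}
```

    After init_reserved() has run, classify("while") yields WHILE, while classify("whilst") yields IDENT. The order matters: if the scanner sees a reserved word before the table is pre-populated, the word is stored as an IDENT and stays one.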