CS 370 Assignment 2: A Lexical Analyzer

Due: Monday February 13, 1:00pm

In this assignment you will write a lexical analyzer for the "370-C" language, either by hand in C or with flex(1). Your lexical analyzer should be compatible with (or written in) lex(1); that is, you should write a function yylex() that returns an integer code for each token. Note: watch this page on the web for additional clarifications, which will be posted as needed based on your questions.

Language Details

"370-C" is defined as the ANSI C language, with modifications listed below. The primary reference is "The C Programming Language" by Kernighan and Ritchie. You may use any C language reference to find the lists of reserved words and operators you need.

The 370-C language has:

For the lexical analyzer, you should try to recognize the whole C language, and print errors for those parts that you don't have to implement. You are strongly urged to develop a set of regular expressions (and if you code in C, the corresponding finite automata) to describe the relevant token types, before you start coding!
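
As an illustration of going from regular expressions to a hand-coded automaton, here is a minimal sketch covering just two token types. The function name match_prefix and the two category codes are assumptions made for this example, not part of the assignment, and the integer pattern handles decimal digits only.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical category codes, chosen only for this sketch. */
#define IDENTIFIER 260
#define INTCONST   261

/*
 * Regular expressions being recognized:
 *   identifier:        [A-Za-z_][A-Za-z0-9_]*
 *   integer constant:  [0-9]+
 * Returns the category of the token that starts at s and stores its
 * length in *len. Returns 0 (with length 0) if neither pattern matches.
 */
int match_prefix(const char *s, size_t *len)
{
   size_t i = 0;

   if (isalpha((unsigned char)s[0]) || s[0] == '_') {
      /* state: inside an identifier; stay while letters, digits, '_' */
      while (isalnum((unsigned char)s[i]) || s[i] == '_')
         i++;
      *len = i;
      return IDENTIFIER;
   }
   if (isdigit((unsigned char)s[0])) {
      /* state: inside a decimal integer; stay while digits */
      while (isdigit((unsigned char)s[i]))
         i++;
      *len = i;
      return INTCONST;
   }
   *len = 0;
   return 0;
}
```

Each if branch corresponds to a start-state transition, and each while loop is the self-loop on the accepting state. A full scanner would add states for operators, string constants, comments, and whitespace in the same way.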

Lexical Attributes

In your yylex(), compute attributes for each token, and store them in a global variable named yytoken. You should use the following token type:

struct token {
   int category;   /* the integer code returned by yylex */
   char *text;     /* the actual string (lexeme) matched */
   int lineno;     /* the line number on which the token occurs */
   char *filename; /* the source file in which the token occurs */
   int ival;       /* if you had an integer constant, store its value here */
   char *sval;     /* if you had a string constant, malloc space and store */
   };              /*    the string (less quotes and after escapes) here */
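
The sval comment ("less quotes and after escapes") can be made concrete with a small sketch. The helper name string_value is hypothetical, and it translates only \n and \t specially; any other escaped character (including \\ and \") is kept as the character itself. A complete version would also need octal and hexadecimal escapes.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Given the lexeme of a string constant (including its quotes), return a
 * malloc'ed copy without the quotes and with simple escapes translated.
 * Returns NULL if the lexeme is not quoted or if malloc fails.
 */
char *string_value(const char *lexeme)
{
   size_t len = strlen(lexeme);
   size_t i, j = 0;
   char *out;

   if (len < 2 || lexeme[0] != '"' || lexeme[len - 1] != '"')
      return NULL;
   out = malloc(len - 1);        /* at most len-2 characters plus '\0' */
   if (out == NULL)
      return NULL;
   for (i = 1; i < len - 1; i++) {
      char c = lexeme[i];
      if (c == '\\' && i + 1 < len - 1) {
         c = lexeme[++i];        /* look at the escaped character */
         switch (c) {
         case 'n': c = '\n'; break;
         case 't': c = '\t'; break;
         default:  break;        /* \\ and \" map to themselves */
         }
      }
      out[j++] = c;
   }
   out[j] = '\0';
   return out;
}
```

The result is what would be stored in the sval field, while the text field keeps the lexeme exactly as it appeared in the source.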

In the last homework (#1) it was OK to have one global variable of this type and overwrite it on each call to yylex(). In this homework, your main() procedure should instead copy each token into a separate chunk of memory and build a linked list of all the token structs. In the next assignment, we will insert all these tokens into a giant (syntax) tree.

Example linked list structure:

   struct tokenlist {
      struct token *t;
      struct tokenlist *next;
      };
Use the malloc() function to allocate chunks of memory for structs token and tokenlist.
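
One possible way to copy a token and append it to the list is sketched below. The helper names copy_string, copy_token, and append_token are assumptions made for this example, and sval is declared as char * here because it holds string data. The copy is "deep": the strings are duplicated too, so a later call to yylex() cannot overwrite them.

```c
#include <stdlib.h>
#include <string.h>

struct token {
   int category;
   char *text;
   int lineno;
   char *filename;
   int ival;
   char *sval;
};

struct tokenlist {
   struct token *t;
   struct tokenlist *next;
};

/* Duplicate a string into freshly malloc'ed memory; NULL stays NULL. */
static char *copy_string(const char *s)
{
   char *p;

   if (s == NULL)
      return NULL;
   p = malloc(strlen(s) + 1);
   if (p != NULL)
      strcpy(p, s);
   return p;
}

/* Return a malloc'ed deep copy of *src, or NULL if malloc fails. */
struct token *copy_token(const struct token *src)
{
   struct token *t = malloc(sizeof *t);

   if (t == NULL)
      return NULL;
   *t = *src;                                /* copy the scalar fields */
   t->text = copy_string(src->text);         /* then duplicate strings */
   t->filename = copy_string(src->filename);
   t->sval = copy_string(src->sval);
   return t;
}

/*
 * Append t at the end of the list. *head and *tail are both NULL for an
 * empty list. Returns 0 on success, -1 if malloc fails.
 */
int append_token(struct tokenlist **head, struct tokenlist **tail,
                 struct token *t)
{
   struct tokenlist *node = malloc(sizeof *node);

   if (node == NULL)
      return -1;
   node->t = t;
   node->next = NULL;
   if (*tail == NULL)
      *head = node;
   else
      (*tail)->next = node;
   *tail = node;
   return 0;
}
```

Keeping a tail pointer makes each append constant-time and preserves source order, which is convenient when the list is printed or later handed to a parser.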

yylex() and main()

Your yylex() should return a unique integer > 257 for each reserved word and for each other token category (identifier, integer literal constant, string literal constant, addition operator, etc.). Numbers > 257 are required for compatibility with the YACC parser generator tool. For each such number, you must #define a symbol, as in
#define IDENTIFIER 260
This is required for the sake of readability. Your yylex() should return -1 when it hits end of file.
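
A minimal sketch of how reserved words might be mapped to their own codes follows. IDENTIFIER 260 follows the #define shown above; the other codes, the short word list, and the function name word_category are made up for illustration, and a real table would cover every C reserved word.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical codes; each symbol gets its own integer > 257. */
#define IDENTIFIER 260
#define INT        261
#define VOID       262
#define RETURN     263
#define WHILE      264

/* Table of reserved words (deliberately incomplete in this sketch). */
static const struct {
   const char *word;
   int code;
} reserved[] = {
   { "int", INT },
   { "void", VOID },
   { "return", RETURN },
   { "while", WHILE },
};

/*
 * Given a lexeme that matched the identifier pattern, return the code of
 * the reserved word it spells, or IDENTIFIER if it is not reserved.
 */
int word_category(const char *lexeme)
{
   size_t i;

   for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
      if (strcmp(lexeme, reserved[i].word) == 0)
         return reserved[i].code;
   return IDENTIFIER;
}
```

Scanning a word with the identifier pattern first and then looking it up keeps the automaton small; the alternative is a separate pattern per reserved word, which flex handles well.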

In this assignment, your program should be organized the same way as in the last assignment. There should be (at least) two separately compiled .c files and a makefile. The yylex() function will be called by a main() procedure in a loop, similar to the last assignment. For each token, main() should write out a line containing the token category (an integer > 257) and the token's lexical attributes.
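
The calling loop might look roughly like the sketch below. The yylex() here is only a stand-in that replays three canned tokens so the example is self-contained; yytoken is assumed to be a struct (not a pointer), and dump_tokens is a hypothetical name for what would be the body of main().

```c
#include <stddef.h>
#include <stdio.h>

struct token {
   int category;
   char *text;
   int lineno;
   char *filename;
   int ival;
   char *sval;
};

struct token yytoken;   /* global filled in by each yylex() call */

/* Stand-in for a real scanner: replays fixed tokens, then returns -1. */
int yylex(void)
{
   static const struct token canned[] = {
      { 262, "void", 1, "tst.c", 0, NULL },
      { 271, "main", 1, "tst.c", 0, NULL },
      { 290, "(",    1, "tst.c", 0, NULL },
   };
   static size_t next = 0;

   if (next >= sizeof canned / sizeof canned[0])
      return -1;                       /* end of file */
   yytoken = canned[next++];
   return yytoken.category;
}

/* Call yylex() until end of file, writing one line per token. */
int dump_tokens(FILE *out)
{
   int category;
   int count = 0;

   while ((category = yylex()) != -1) {
      fprintf(out, "%d\t\t%s\t\t%d\t\t%s\n", category,
              yytoken.text, yytoken.lineno, yytoken.filename);
      count++;
   }
   return count;                       /* number of tokens seen */
}
```

In the real program the loop body would also copy yytoken into fresh memory and link it into the token list before the next call to yylex().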

Turn in...

Turn in both a paper copy to Dr. J in class and an electronic copy to Sudarshan via the class "turnin" page, http://www.cs.nmsu.edu/~jeffery/turnin.html. If you add any new source files (such as a .l file), be sure to add them to the set of files that you turn in.

Example

For the input file

void main() {
   printf("%d", 10+2);
   }

your output should look something like:

Category	Text		Lineno		Filename	Ival/Sval
-------------------------------------------------------------------------
262		void		1		tst.c
271		main		1		tst.c
290		(		1		tst.c
291		)		1		tst.c
292		{		1		tst.c
271		printf		2		tst.c
290		(		2		tst.c
273		"%d"		2		tst.c		%d
288		,		2		tst.c
272		10		2		tst.c		0000000A
300		+		2		tst.c
272		2		2		tst.c		00000002
291		)		2		tst.c
260		;		2		tst.c
293		}		3		tst.c
Note that main and printf belong to the same category (identifier), and that integer values are printed in hexadecimal via printf("%08X", i), which "shows all the bits" once you get used to it.
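
A small sketch of the hexadecimal formatting, assuming a hypothetical helper named format_ival. It uses the uppercase conversion %08X, which produces digits like the 0000000A in the sample output; a lowercase %08x would print 0000000a instead. The cast is there because the %X conversion expects an unsigned int.

```c
#include <stdio.h>

/*
 * Write ival into buf as zero-padded hexadecimal, at least 8 digits wide.
 * Returns the number of characters snprintf would have written.
 */
int format_ival(char *buf, size_t size, int ival)
{
   return snprintf(buf, size, "%08X", (unsigned int)ival);
}
```

For example, calling format_ival(buf, sizeof buf, 2) with a 16-byte buffer leaves the string 00000002 in buf.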