CS 210: Programming Languages Lecture Notes

lecture #1

Welcome to CS 210; here is our Syllabus

The Computer Science Assistance Center (CSAC), located in the JEB floor "2R" area, has tutors available during most business hours, Monday through Friday. Most likely you will need help in this course; get to know who works in the CSAC and which tutors know which languages.

History and Overview of Programming Languages

Why Programming Languages

This course is central to most of computer science.
Definition of "programming language"
a human-readable textual or graphic means of specifying the behavior of a computer.
Programming languages have a short history
~60 years
The purpose of a programming language
allow a human and a computer to communicate
Humans are bad at machine language:
Computers are bad at natural language:
Time flies like an arrow.
So we use a language human and computer can both handle:
procedure main()
   w := open("binary", "g", "fg=green", "bg=black")
   every i := 1 to 12 do {
      GotoRC(w, i, 1); writes(w, randbits(80))
      }
   WriteImage(w, "binary.gif")
end

procedure randbits(n)
   if n = 0 then return ""
   else return ((?2)-1) || randbits(n-1)
end

Even if humans could do machine language very well, it is still better to write programs in a programming language.

Auxiliary reasons to use a programming language:
so that the program can be moved to new computers easily
natural (human) language ambiguity
Computers would either guess, or take us too literally and do the wrong thing, or be asking us constantly to restate the instructions more precisely.

At any rate, programming of computers started with machine language, and programming languages are characterized by how close, or how far, they are from the computers' hardware capabilities and instructions. Higher level languages can be more concise, more readable, more portable, less subject to human error, and easier to debug than lower-level languages. As computers get faster and software demands increase, the push for languages to become ever higher level is slow but inevitable.

Turing vs. Sapir

The first thing you learn in studying the formal mathematics of computational machines is that all computer languages are equivalent, because they all express computations that can be mapped down onto a Turing Machine, and from there, into any of the other languages. So who cares what language we use, right? This is from the point of view of the computer, and it should be taken with a grain of salt, but I believe it is true that the computer does not in fact care which language you use to write applications.

On the other hand, the Sapir-Whorf hypothesis suggests to us that improving the programming language notation in use will not cause just a first-order difference in programming productivity; it causes a second-order difference in allowing new types of applications to be envisioned and undertaken. This is from the human side of the human-computer relationship.

From a practical standpoint, we study programming languages in order to learn more tools that are good for different types of jobs. An expert programmer knows and uses many different programming languages, and can learn new languages easily when new programming tasks create the need. The kinds of solutions offered in some programming languages suggest approaches to problem solving that are usable in any language, but might not occur to you if you only know one language.

The Ideal programming language is an executable pseudocode that perfectly captures the desired program behavior in terms of software designs and requirements. The two nearly insurmountable problems with this goal are that (a) attempts to create such a language have been notoriously inefficient, and (b) no design notation fits all different types of programs.

A Brief History of Programming Languages

There have been a few major conferences on the History of Programming Languages. By the second one, the consensus was that "the field of programming languages is dead", because "all the important ideas in languages have been discovered". Shortly after this report from the 2nd History of Programming Languages (HOPL II) conference, Java swept the computing world clean, and major languages have been invented since then. It is conceivable that the opposite is true, and the field of programming languages is still in its infancy.

There are well over 1000 major (i.e. publicly available and at one point used for real applications) programming languages. Far fewer than half are still "alive" by most standards. Programming languages mostly have lifespans like pet cats and small dogs. Any language can expect to be obsoleted by advances in technology within a decade or at most two, and requires some justification for its continued existence after that. Nevertheless some dead languages are still in wide use and might be considered "undead", so long as people have businesses or governments that are depending on them.

History of Programming Languages, cont'd.

Languages evolved very approximately thus:
machine code, assembler
instruction sets vary enormously in size, complexity, and capabilities. Difficult for humans.

Basic unit of computation is the machine word, often used as a number.
"high-level" languages. imperative paradigm.

Entire human-readable arithmetic expressions can be written on a single line. Flowcharts widely used to assuage the chaos entailed by "goto"-based program control flow.
functional paradigm and alternatives. interpretive. user-friendlier. slow.

Entire functions, or other complex computations, can be written in a line or two in some of these languages. More important are advances such as automatic recycling of memory, and the ability to modify or construct new code while the program is running. But for some folks, they may have fatal flaws.
Algol, C, Pascal, PL/1
"structured" languages solve/eliminate the "goto" control flow problem. Imperative paradigm; "goto"s considered harmful.

The mainstream of the 1970's. Emphasis on fast execution, and protecting programmers from themselves and each other. Programs tend to become unmaintainable as they grow bigger.
Ada, Modula-2, C++
"modular" systems programming languages. data abstraction.

Improvements in scalability to go along with the fact that you have to write a zillion lines to do anything.
SmallTalk, Prolog; Icon, Perl
"Pure" versions of object-oriented, functional, and declarative paradigms; rapid-prototyping and scripting languages.

Extreme power, often within specific problem domains.
Visual Basic, Python, Java, C#, Ruby, PHP, ...
GUI-oriented and web languages. mix-friendly languages.

The learning curve may be more in the programming environment.

OK, now it's your turn: What languages should be on this list? What new languages are "hot"?

Programming Language Buzzwords

"low level", "high level", and "very high level"
"low" (machine code level) vs. "high" (anything above machine level) is ubiquitous but inadequate
machine readable vs. human readable
certainly humans have difficulty reading binary codes, but machines find reading human language text vexing as well
data abstraction vs. control abstraction
really, I might prefer data vs. code as my counterpoints
kinds of data abstractions
basic/atomic/scalar vs. structural/composite
"first class" value
an entity in a programming language that can be computed/constructed at runtime, assigned to a variable, passed in or returned out of a subroutine.
kinds of control abstractions
many variants on selection, looping, subroutines
syntax and semantics
meat and potatoes of language comparison and use
translation models
compilation, interpretation, source/target/implementation languages

Googling for History

Here are some highlights from the history of programming languages; google them and see if they give clean answers or raise more questions (for exam purposes):

lecture #2

Reading Assignment

Read all this lecture notes material down to where it starts talking about the Lisp language. Then read (and do) the assigned Common Lisp tutorials.

Paradigms and Languages

Several paradigms, or "schools of thought", have been promulgated regarding how best to program computers.

The dominant imperative paradigm has been gradually refined over time. It basically states that to program a computer, you give it instructions in terms it understands. It is also known as the "procedural" paradigm: a program is a set of procedures/functions, and you write new "instructions" by defining procedures. Since the underlying machine works this way, this is the default paradigm, and the one that all other paradigms reduce themselves to in order to execute.

Functional and object-oriented paradigms are arguably special cases of imperative programming. In functional programming you give the computer instructions in clean, mathematical formulas that it understands. In object-oriented programming, you give the computer instructions by defining new data types and instructions that operate on those types.

Declarative programming is a polar opposite of imperative programming, introduced in many different application contexts. In declarative programming, you specify what computation is required, without specifying how the computer is to perform that computation. The logic programming paradigm is arguably a special case of declarative programming.

Languages are implemented by compilers or interpreters. There are many implementation techniques that fall somewhere in between.

Pure vs. Impure; Multi-paradigm

Really, when we say a programming language embodies a particular paradigm, we are usually saying what it "mainly" does. Languages can be characterized by evaluating how "pure" is their adherence to their dominant paradigm. Impurity usually means: falling back on imperative paradigm when expedient or necessary. Purity is elegant but often comes at the price of idiocy.

Pure Language Examples
Language Example Commentary

SmallTalk
quadMultiply: i1 and: i2 
    "This method multiplies the given numbers by each other and the result by 4."
    | mul |
    mul := i1 * i2.
    ^mul * 4
Pure OO. Even ints are objects.
classic Lisp
(defun fibonacci (N)
  "Compute the N'th Fibonacci number."
  (if (or (zerop N) (= N 1))
    N
    (+ (fibonacci (- N 1)) (fibonacci (- N 2)))))
Pure functional. No I/O, no assignment statements, etc.
Prolog
perfect(N) :-
    between(1, inf, N), U is N // 2,
    findall(D, (between(1,U,D), N mod D =:= 0), Ds),
    sumlist(Ds, N).
Pure logic. Surprise failures, wild backtracking, nontermination

Different programming paradigms seem ideal for different application domains. What is great for business data processing may be terrible for rocket scientists. A computer scientist should know all the major paradigms well enough to know which paradigm is best for each new project that they come across. One option is to become proficient in several diverse languages.

Another option, sometimes, is to use a language that supports multiple paradigms. These run the risk of being Frankenlanguages. They are more likely to succeed when designed by a genius, and when pragmatic, viewing multi-paradigm as an extension of impurity rather than a theoretical ideal to aspire to.

Example Multi-Paradigm Languages
language example commentary
Leda
relation grandChild(var X, Y : names);
var Z : names;
  begin writeln('test father-father descent'); end;
  grandChild(X,Y) :- father(X,Z), father(Z,Y).
  begin writeln('test father-mother descent'); end;
  grandChild(X,Y) :- father(X,Z), mother(Z,Y).
  begin writeln('test mother-father descent'); end;
  grandChild(X,Y) :- mother(X,Z), father(Z,Y).
  begin writeln('test mother-mother descent'); end;
  grandChild(X,Y) :- mother(X,Z), mother(Z,Y).
Logic paradigm default; imperative when needed
Oz
proc {Insert Key Value TreeIn ?TreeOut}
   case TreeIn
   of nil then TreeOut = tree(Key Value nil nil)
   [] tree(K1 V1 T1 T2) then 
      if Key == K1 then TreeOut = tree(Key Value T1 T2)
      elseif Key < K1 then T in 
         TreeOut = tree(K1 V1 T T2)
         {Insert Key Value T1 T}
      else T in 
         TreeOut = tree(K1 V1 T1 T)
         {Insert Key Value T2 T}
      end
   end
end
Pattern matching seems inspired by FORMAN, which is under-credited.
Icon
#  Generate words
procedure words()
   while line := read() do {
      lineno +:= 1
      write(right(lineno, 6), "  ", line)
      map(line) ? while tab(upto(&letters)) do {
         s := tab(many(&letters))
         if *s >= 3 then suspend s   # skip short words
         }
      }
end
Imperative default, but logic-style programming when the programmer uses certain constructs. Unicon adds OO (along with a lot of I/O capabilities).


At first glance the syntax of a language is its most defining characteristic. Languages differ in terms of how they form expressions (prefix, postfix, infix), what kinds of control structures govern the evaluation of expressions, and how the programmer composes complex operations from built-ins and simpler operations.

Syntax is described formally using a lexicon and a grammar. A lexicon describes the categories of words in the language. A grammar describes how words may be combined to make programs. We use regular expressions and context free grammars to describe these components in formal mathematical terms. We will define these notations in the coming weeks.

Example Regular Expressions:
 ident	[a-z][a-z0-9]*
 intlit  [0-9]+

Example Context Free Grammar:
 E : ident
 E : intlit
 E : E + E
 E : E - E

Many excellent languages have died (or, been severely hampered) simply because their syntax was poorly designed, or too weird. Introducing new syntax is becoming less and less popular. Recent languages such as Java demonstrate that it is possible to add more power to programming languages without turning their syntax inside out.

Syntax starts with lexicon, then expression syntax, and grammar. We are going to study these ideas in some detail in this course; expect to revisit this topic.

A context free grammar notation is sufficient to completely describe many programming languages, but most popular languages are described using a context free grammar plus a small set of cheat rules where surrounding context or semantic rules affect the legal syntax of the language.

Lexical syntax defines the individual words of the language. Often there are a set of "reserved words", a set of operators, a definition of legal variable names, and a definition of legal literal values for numeric and string types.

Expression syntax may be infix, prefix, or postfix, and may include precedence and associativity rules. Some languages are "expression-based", meaning that everything in the language is an expression. This might or might not mean the language is simple to parse without needing a grammar.

Context free grammars are a notion introduced by Chomsky and heavily used in programming languages. It is common to see a variant of BNF notation used to formally specify a grammar as part of a language definition. Context free grammars have terminals, nonterminals, and rewriting rules.

CFG's cannot describe all languages, and some grammars are inherently ambiguous. Consider

1 - 0 - 1
if E1 then if E2 then S1 else S2

The first has two parse trees depending on whether - groups to the left or to the right; the second is the classic "dangling else", where the else can attach to either if.


However much we love to study syntax, it is semantics that really defines the paradigms. Semantics generally includes type system details and an evaluation model. We will come back to it again and again this semester. For now, note that there can be axiomatic semantics, operational semantics, and denotational semantics.

Runtime Systems

Programming Languages' semantics are partly defined by the compiler or interpreter, and partly by the runtime system. A runtime system consists of libraries that implement the language semantics. They range from tiny to gigantic. They may be linked into generated code, linked into an interpreter, or sometimes emitted inline within generated code. They include things ranging from implementing language built-ins that aren't supported directly by hardware, to memory managers and garbage collectors, to thread schedulers, to input/output.

Memory: the Most Important Problem Solved by (the field of) Programming Languages

You can argue that the biggest thing languages have done for us is solve the control flow problem, by eliminating goto statements and all the spaghetti coding that made early programs difficult to debug. But Dr. J's Conjecture #1 is that memory management is a dominant aspect of modern computing. If it is not solved by the language, it will dominate the effort required to develop most programs. Example: memory debugging in C and C++ may occupy 60%+ of time spent getting a working solution. Many C/C++ programs ship with memory bugs.

I/O: the Key to All Power in the (Computing) Universe

Almost all programming languages tend to consider I/O an afterthought.

Dr. J's Conjecture #2: I/O is a dominant aspect of modern computing and of the effort required to develop most programs.

Evidence: dominance of graphics, networking, and storage in modern hardware advances; necessity of I/O in communication of results to humans; proliferation of different computing devices with different I/O capabilities.

Implications: programming language syntax and semantics should promote extensible I/O abstractions as central to their language definitions. Ubiquitous I/O hardware should be supported by language built-ins.

Expansion on the whole "Compilers" vs. "Interpreters" thing

Remind me of your definitions of "compiler" and "interpreter" in the domain of programming languages. What's the difference? Are they mutually exclusive?

Variants on the Compiler

source code to machine code
source code to...simpler source code (Cfront, Unicon)
compiles at runtime, VM-to-native or otherwise
special-purpose / misc
translate source code to hardware, to network messages, ...

Variants on the Interpreter

executes human-readable text, possibly a statement or line at a time
executes "tokenized" source code (array of array of tokens)
executes via tree traversal
executes via software interpretation of a virtual machine instruction set


enscript(1) is a program that converts ASCII text files into postscript. It has some basic options for readable formatting.
enscript --color=1 -C -Ejava -1 -o hello.ps hello.java && ps2pdf hello.ps
produces a PDF.

Lisp Lecture #1

Functional Programming and Lisp

You must unlearn what you have learned. -- Master Yoda
Our first language, Lisp, is one of the oldest languages in common use today. It exemplifies the functional programming paradigm. Although Lisp is an acronym for "LISt Processor", its name is not usually given in all-capitals. Lisp tries to view the entirety of computing in terms of mathematical functions that operate on lists, and it is astonishing how much, and how easily, one can accomplish things with a few simple building blocks.

Lisp was invented by John McCarthy and colleagues at MIT around 1960. It was immediately and tremendously influential, serving as an example of how research in universities helped form computing as we know it, alongside industry R & D.

Lisp was the first interactive language, the first language to come bundled with an IDE, the first language to encourage self-modifying code, the chosen language of the field of artificial intelligence, and owns many other firsts. It was titanically influential in the development of other languages, from SmallTalk (Xerox InterLisp was the environment and culture in which SmallTalk was fostered), to scripting languages such as Python, to later functional languages such as ML and Haskell.

Whole companies were founded on the premise of making hardware-accelerated implementations of Lisp on $30,000 workstations. Large multi-million dollar companies, such as Symbolics and Texas Instruments, built such machines. There were dozens of major Lisp dialects, with similar general syntax and myriads of incompatible variants. Eventually, a standard language called Common Lisp emerged and is still popular today. From a pragmatic standpoint, I am interested also in one other modern dialect, Emacs Lisp.

Lisp is small enough that it has been repeatedly used as a scripting/extension language, not just in Emacs but in major commercial programs such as AutoCAD's AutoLISP.

lecture #3

Functional programming in a nutshell


1. Work your way through Sean Luke's Lisp Quickstart Tutorials 1, 2 and 3 (local mirror: 1, 2, and 3)

2. Skim or read the Common Lisp reference manual (CMU multi-formatted version) as needed in order to support your understanding.

Additional Common Lisp resources, some of which may be more useful for some of you than others; for comparison, you might find these other Lisp manuals interesting:

Lisp Topics to Learn

Lisp language
syntax and semantics
Lisp runtime system
garbage collection, symbol table
Using Lisp
know a lot of particular functions and special forms
Lisp execution behavior
be able to diagram memory

Why we (still) study Lisp

Lisp: language considerations

Atoms and Lists

Lisp has two kinds of values: atoms (numbers, strings, and symbols) and lists.


Both code and data are represented using symbolic expressions, which are parenthesized, and not comma separated. Because code and data are all the same stuff, it is fairly easy to build up some new code on the fly in a data structure, and then execute it.
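Since lists of symbols are ordinary data, a program can assemble one and hand it to the standard eval function. A minimal sketch (the variable name expr is just for illustration):

```lisp
; build code as data, then run it
(setq expr (list '+ 1 2 3))   ; a list that merely looks like code
(print expr)                  ; prints (+ 1 2 3)
(print (eval expr))           ; prints 6 -- now it *is* code
```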

cons cells

The fundamental building block of lists is the cons cell. It has a data payload (car) and a next-cell pointer (cdr).


A list is a null-terminated chain of cons cells. This has a recursive definition: a list is either nil (the empty list), or a cons cell whose cdr is a list.

A collection of cons cells that is not null terminated is not a list. A dot is used to denote cons cells that are not null-terminated, as in

("hello" . "there")

This is called "dotted-pair notation", mainly for the one-cons-cell case, but you can specify a chain of cells with a dot before the final element to indicate the absence of a null termination.
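The distinction can be seen directly at the interpreter; a small sketch, with each value as the read-eval-print loop would display it:

```lisp
; each comment shows the value the interpreter prints
(cons 1 nil)            ; (1)        one cons cell, null-terminated: a list
(cons 1 (cons 2 nil))   ; (1 2)      a chain of two cells: still a list
(cons 1 2)              ; (1 . 2)    no null termination: a dotted pair
(cons 1 (cons 2 3))     ; (1 2 . 3)  a chain whose final cdr is not nil
```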


Lisp interpreters use a read-eval-print loop to interact with the programmer.

Using Lisp

Lisp is normally used interactively.
Normally, once you invoke a Lisp interpreter from the command-line (for example "clisp"), you are sitting at a lisp prompt (for example "[1]> "), typing source code directly at an interpreter.
"Grow" your programs bottom-up
You write one function at a time, often writing and unit-testing helper functions immediately, before using them in larger functions.
Run them as "scripts" when testing complete programs
For example, here is a complete, simple Lisp script that prompts the user and attempts to compute the square of the user's answer. It does not include error checking that might be helpful, but it illustrates Lisp scripts.
(defun square (x) (* x x))
(defun readsquare ()
 (print "gimme an x:")
 (setq x (read))
 (square x))
(print (readsquare))
Stored in some text file (say, "square") and marked as executable (via "chmod u+x square"), this program can be invoked from the command line prompt ("./square", or just "square" if it is located on your PATH). Note that a list of strings holding the command line arguments is available in the symbol *args* if there are any; if not it will be NIL.
Compile them when you are finished.
"clisp" features a bytecode compiler; many common lisps will also feature an optimizing native code compiler.

Note on (quit)

With Lisp interpreters, if you fail to halt or (quit) properly, especially if you ran them from an interesting shell such as a subshell running under emacs, it is possible for your process to be left running after you logout. As far as Dr. J is concerned this is a bug in the operating system, but as pragmatic good citizens, we should make a point of (quit)ting properly and (kill -9)ing our Lisps when we have to.


As noted earlier, Lisp uses a read-eval-print loop, where "eval" means "evaluate an expression to obtain its value". Generally, evaluation goes like this: atoms such as numbers and strings evaluate to themselves; a symbol evaluates to the value stored for it in the symbol table; and a list is evaluated by evaluating its first element to obtain a function, evaluating the remaining elements to obtain arguments, and calling the function on those arguments. Freaky parts: special forms (quote, if, defun, and friends) look like function calls but do not evaluate all of their arguments in the usual way.

The Most Universal Lisp Functions

There are several hundred built-in functions in Lisp. We start with the most universal.
(cons x y)
(car x)
(cdr x)
(+ x y)
Lists are commonly nested inside each other to form trees or other complex structures. It is common to walk through many car's and cdr's to get to the value that is needed. Built-in functions that perform multiple car's and cdr's are an old-school way of accessing elements in deeper structures. (caar x) produces the first element of the first element of x (assuming x is at least two levels deep). (cadr x) produces the 2nd element of the list x. (cdar x) produces the "rest" of the first element of x. (cddr x) produces the remainder of x after its first two elements. This pattern continues and is good for any combination of two to four a's and d's (caaar, cadadr, etc).

The newer way of picking out element i, instead of saying (caddddddddr L), would be (nth i L). In this case, i is 0-based.
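A small sketch of these accessors, assuming a hypothetical two-level list x:

```lisp
(setq x '((a b) c d))   ; a two-level structure
(car x)      ; (A B)  the first element
(caar x)     ; A      the first element of the first element
(cadr x)     ; C      the 2nd element of x
(cdar x)     ; (B)    the "rest" of the first element
(cddr x)     ; (D)    the remainder of x after its first two elements
(nth 2 x)    ; D      0-based element picking
```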


Besides list construction and access/traversal, and numeric computations, one of the favorite categories of Lisp functions is those that return true or false. In Lisp, false is denoted as nil and anything that isn't nil is true. There is also a special reserved symbol, named t, that may be used as a generic "true" value. Many predicate functions follow the original hungarian notation convention of ending their name with a "p" for predicate.

Predicate examples:

(null x):bool		; is x nil?
(atom x):bool		; is x an atom?
(listp x):bool		; is x a list?
(numberp x):bool	; is x a number?
(integerp x):bool	; is x an integer?
(zerop x):bool		; is x the value 0?
(oddp x):bool		; is x odd?
(evenp x):bool		; is x even?
(consp x):bool		; is x a cons cell?
(plusp x):bool		; is x positive?
(minusp x):bool		; is x negative?
(< x y):bool		; is x less than y?
See also several equality-test predicates below


Symbols are atoms that can be used as names for values. The concept of symbols replaces that of variables in ordinary languages. Symbols routinely have characters like - and * in them, unlike in mainstream languages. Several pieces of information may be associated with each symbol in the symbol table.

Symbol Table

The symbol table is an efficient structure for looking up stuff associated with symbols. Besides the name, the evaluation value, and the separate slot for function value, there is more stuff -- at the least, a property list that can be used to associate various attributes with a symbol.

Example symbol table entry
field value
name "x"
value 7
function (lambda (a b c) (+ (* a b) c))
properties ...
??? ...

The Lisp evaluator does an implicit/automatic symbol table lookup anytime a symbol appears during an evaluation. It uses the 3rd slot when the symbol appears in an initial position in a list (function value) and the 2nd slot when the symbol appears on a 2nd or subsequent position. Rules are very different for "special forms". Symbol table entries can also be accessed explicitly by programs, which is how property lists are used.

Lisp generally emphasizes a single global symbol table, rather than a hierarchy of little local symbol tables as used in compilers for mainstream languages.
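A sketch of filling and reading those slots explicitly, mirroring the example entry above; setq, defun, and (setf (get ...)) fill three different slots of the same symbol, and the standard functions symbol-value and get read them back (the property name units is hypothetical):

```lisp
(setq x 7)                         ; fills the value slot of symbol x
(defun x (a b c) (+ (* a b) c))    ; fills the function slot -- a separate slot!
(setf (get 'x 'units) 'meters)     ; adds an attribute to the property list

(print (symbol-value 'x))          ; prints 7
(print (x 2 3 1))                  ; prints 7; call position uses the function slot
(print (get 'x 'units))            ; prints METERS
```

Note that the value slot and the function slot coexist without conflict; which one is used depends on where the symbol appears, as described above.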

Classical LISP Functions

(cons x L): new L	; allocate cons cell
(car L): x		; L->car
(cdr L): L		; L->cdr
(quit)			; quits LISP
(load f)		; load LISP code from filename f
(setf x 5)		; assignment
(+ x 1)			; arithmetic; takes any # of args
lecture #4

Tip for the homework: a student of mine once wrote:

I was working on my homework which received a [low score] due to load errors. I have spent a good deal of time trying to figure out this error but I cannot seem to get it. I used this online lisp debugger and it provided the same error as did using clisp. The error suggests that I am missing a parens at line 89, but I cannot tell where it needs to go.

If you cannot match your parentheses:

Lisp Special Forms

Learn and understand the defun and let special forms. Compare let with setq. Learn the quote special form. In general, to write your own special forms, you write macros. Lisp macros make C/C++ macros look like "wimps".

Reason #1 that I dislike Common Lisp, compared with Emacs Lisp: Common Lisp's while loop is not a special form named "while".

Aside on setq

In the beginning, there was only set:
(set (quote x) 5)
This was done so often folks got tired of typing the "quote" all the time. They invented a macro named setq:
(setq x 5)
And they invented the quote syntax:
(set 'x 5)
Eventually, set gave way to setf. Setf is smart enough to pick out the location denoted by its first argument, and modify the value at that location.
(set 'x '(1 2 3))
(setf (car (cdr x)) 7)   ; x is now (1 7 3)


The defun special form defines a function. It is a list whose first three elements are the symbol defun, the symbol denoting the function name, and a list of arguments. The rest of the defun is a sequence of S-expressions that are evaluated when the function is called. The last expression produces the return value of the function.

The general format

(defun f (x)
   ; code for f given in 1 or more lists
   ; function return value is the value of the last thing evaluated
   )


(defun square (x)
   (* x x))

Common Lisp also has return expressions but they are more involved than in C or C++; look them up in the Common Lisp manual if you need them. Most times you should use the final expression as a return value.

A short summary is:

(return expr)
breaks out of a current loop (doesn't return from the function!)
(return-from f expr)
breaks out of block named f (f can be a function name)


The quote special form is sort of the simplest, no-operation special form, which just passes its argument on without evaluating it. Its syntax, (quote x), is commonly abbreviated with the apostrophe: 'x.
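A few sketches of quote at work (the symbol y here is hypothetical):

```lisp
(setq y 5)
y           ; evaluates to 5 -- the symbol's value
(quote y)   ; evaluates to the symbol Y itself, unevaluated
'y          ; exactly the same as (quote y)
(+ 1 2)     ; evaluates to 3
'(+ 1 2)    ; evaluates to the three-element list (+ 1 2), not to 3
```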


Recursion is central to Lisp, not only for certain mathematical computations, but also for various data structure operations, such as traversing a list. A recursive function is a function that may call itself as part of computing its result. It should always consist of
basis case:
a finishing-up circumstance where it does not need to call itself
recursion (induction) step:
circumstances where it solves the problem by combining a little work with a call to itself that does the "rest" of the work.
(defun factorial (n)
   (if (<= n 1) 1
       (* n (factorial (- n 1)))))

Suppose you didn't have a while special form, could you implement a recursion to execute 10 times?

(defun foo (x) (print x) (if (< x 10) (foo (+ x 1)) x))   ; (foo 1) prints 1 through 10

Compound Expressions

(progn expr1 expr2 ... exprn)
Evaluates each expression in sequence. The value of the whole progn is the value of the final exprn.
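For instance, a progn can bundle a side effect with a final value; the comments show what each part contributes:

```lisp
(progn
   (print "working")   ; evaluated for its side effect only
   (* 6 7))            ; 42: the value of the whole progn
```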

Recursion Tip

What is wrong with
(defun f (x y)
   (if (null x) y
     (f (cdr x))))
It has a basis case, and it has a recursion on x. So what is missing? (Hint: count the arguments in the recursive call.)

Equality Functions

(eq x y)
t if x and y are the same exact object
(eql x y)
t if x and y are eq or if x and y are numerically the same
(equal x y)
t if x and y are eql or if x and y are structurally equivalent
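A sketch of the three tests, using list so that a and b are guaranteed to be distinct copies (the variable names are just for illustration):

```lisp
(setq a (list 1 2))   ; list builds a fresh chain of cons cells each time,
(setq b (list 1 2))   ; so a and b are two distinct objects
(setq c a)            ; c names the very same cells as a
(eq a c)              ; t   -- same exact object
(eq a b)              ; nil -- different objects...
(equal a b)           ; t   -- ...but structurally equivalent
(eql 3 3)             ; t   -- numerically the same
```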

Lisp Function-Writing-O-Rama

Given (defun), the basics (car, cdr, cons), and functions on atoms, you can write almost any computation, generally using recursion where a loop might otherwise be suggested.
(listn n)
Returns a list containing n numbers from 1 to n.
(isprime x)
Returns whether x is a prime number or not.
(numprimes L)
Returns the number of prime numbers in L.
(app L x)
Return a list that is one longer than list L that is a copy of L except with x added onto the end. Use only car, cdr, and cons.
(copy L)
Return a list that is a copy of L. Seldom needed in Lisp since, if you obey proper functional programming style, you can just pass L around where it is needed and not worry about a called function modifying it on you.
(reverse L)
Return a list that is the reverse of L. Use only car, cdr, and cons. You may use (app L x) if you define it successfully above. Note: there is a built-in reverse in Common Lisp; don't use it until you can write your own. You might need to name yours (myreverse L).
(cat L1 L2)
Return a list that is the concatenation of lists L1 and L2.
(squareL L)
Given a list L, return a list whose elements are the squares of corresponding elements in L
(mylength L)
Given a list L, compute its length.
(widest L)
Given a list L, return its longest (sub)list.
(average2 x y)
Compute the average (mean) of x and y.
(average L)
Compute the average (mean) of elements in L. Sum / Length.
(sum L)
Compute the sum (+) of elements in L.
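To show the intended style, here is one possible sketch for the last of these, (sum L); most of the rest follow the same pattern:
(defun sum (L)
   (if (null L) 0
       (+ (car L) (sum (cdr L)))))
(sum '(1 2 3 4))         ; returns 10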

lecture #5 began here

...we looked at a bunch of recursion practice

Where we left off was: if I can reverse the rest of a list, how do I tack the first element onto the end?
(defun myreverse (L)
   (if (null L) nil
       (addend (myreverse (cdr L)) (car L))))

Review of Dr. J's "Zen of Recursion"

If your Lisp function isn't recursive you are doing it wrong.
Sure, you can write C with Lisp syntax, but why would you?*
What is the basis case?
It is usually easy if you look at the function return type (nil, 0, "" etc)
Is the recursive step summative or constructive?
Numeric recursions usually apply some arithmetic to combine current with "rest". List recursions usually cons a result list.
If you can't think how to do your recursion step, write a helper function
For example, consider function (reverse L). Easy to "reverse the rest", hard to place the first element at the end of such a list.
Add a parameter
Recursive helper functions often have more parameters than clean external public API functions.
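For example, (reverse L) becomes easy with an accumulator parameter; this is a sketch, and rev2 is my own helper name:
(defun rev2 (L acc)
   (if (null L) acc
       (rev2 (cdr L) (cons (car L) acc))))
(defun myreverse (L) (rev2 L nil))
(myreverse '(1 2 3))     ; returns (3 2 1)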
Additional tips from past students:

Lisp Formatting

*Lisp can be less readable than C if you don't use newlines and indentation. Compare
(if (and (< x y) (< y z)) (progn (print "aha!") (exit)) (progn (print "OK")))
   (if (and (< x y) (< y z))
       (progn                             ; then
          (print "aha!")
          (exit))
       (progn                             ; else
          (print "OK")))

Let and Let*

The special form named let introduces 1 or more local variables. let* does the same, but introduces them one at a time, so that earlier variables are available in the initializers of later ones.
(let* ((x 1) (y (+ x 1)))
     (print y))
Notice the two opening parentheses after the (let* part. You have an extra set of parentheses to bound a list of two-element (variable value) lists... even if you have only one local variable to declare.
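A concrete sketch of the difference (the values are mine): let* is required below because y's initializer refers to x; with plain let, y's initializer could not see x:
(let* ((x 2) (y (+ x 1)))
   (* x y))              ; returns 6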


A cond is a chain of if statements
(cond (bool1 exprs...)
      (bool2 exprs...)
      (booln exprs...))
Often, the final bool is a "t" (default).
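A small sketch using cond with a default t branch:
(defun sign (n)
   (cond ((< n 0) -1)
         ((> n 0) 1)
         (t 0)))
(sign -7)                ; returns -1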

Common Lisp also has a case special form

(case key
       ((keylist) exprs...)
       ((keylist) exprs...)
        (t exprs...))
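A sketch (the symbols are my own):
(defun day-type (day)
   (case day
      ((sat sun) 'weekend)
      ((mon tue wed thu fri) 'weekday)
      (t 'unknown)))
(day-type 'sun)          ; returns WEEKEND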

Strings and Characters

Lisp Strings are (0-based) arrays of characters.

String Recursion

Consider the following recursive version of the "ascii to integer" function atoi(s). The recursion would read in English as "if we are length 1, return our ASCII-converted numeric value, else recursively convert all the digits but the last one, multiply by 10, and add in the last one".
(defun atoi (s)
   (if (= (length s) 1) (digit-char-p (char s 0))
       (+ (* (atoi (subseq s 0 (- (length s) 1))) 10)
          (digit-char-p (char s (- (length s) 1))))))
Common Lisp has a built-in for this, (parse-integer "-64"), but this version of (atoi s) is a good example of recursing on strings.


Expect to spend a little while with your common lisp manual getting the details right on these. Recursion is simpler. :-)
(dotimes (var n result) exprs)
(dolist (var L result) exprs)
(do ((var start step) (var2 start step) ...)
    (test result)
  exprs)
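Hedged sketches of all three; the do example sums 0 through 4:
(dotimes (i 3) (print i))          ; prints 0, 1, 2; returns nil
(dolist (x '(a b c)) (print x))    ; prints A, B, C; returns nil
(do ((i 0 (+ i 1))
     (sum 0 (+ sum i)))
    ((= i 5) sum))                 ; returns 10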

Some more Lisp functions

Note for apply and mapcar: passing function f into apply or mapcar generally requires that you write 'func (or #'func) instead of just func, since func would be evaluated to obtain its variable value before apply or mapcar ever saw it. apply and mapcar need the "function value", so they need the unevaluated symbol or a function object.
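A sketch of the distinction:
(apply #'max '(3 1 4))       ; returns 4
(mapcar #'evenp '(1 2 3))    ; returns (NIL T NIL)
; (apply max '(3 1 4))       ; error: max would be evaluated as a variable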

lecture #6 began here

HW#1 Status

Common Lisp Data Types (pass 2)

Not just your usual scalar and array-like types; Common Lisp includes symbols, strings, characters, vectors, and more. Learning the whole language is a big task; learn the parts you need on demand, building up your "working set".
I-am-a-symbol		; symbol
"I am a String"		; string
#\a			; character
#(1 2 3)		; vector

Lambda forms

(lambda (x) (* x x)) is an example of a lambda form. It is essentially an anonymous function, usable anywhere a function name would be used. In particular, lambda forms are the technique of choice when a function generates some new code and returns that new function as its return value.

lambda forms are a bit "deep" to understand. The Wikipedia entry for anonymous functions states that they are useful for "temporary" functions that might get created on the fly (say, by an A/I program that generates custom functions for some algorithm), used immediately, and then discarded. They are used in other high-powered mathematics ("lambda calculus", invented by Alonzo Church) and as building blocks for certain computing techniques, such as closures, which you can learn about. To sum up, there are a lot of computing techniques (in Lisp) that create new code on the fly as data, and such new code may not have a natural human name. Lambda forms are useful in such circumstances.
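A mundane but concrete sketch, passing an anonymous function straight to mapcar:
(mapcar (lambda (x) (* x x)) '(1 2 3 4))     ; returns (1 4 9 16)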

There was a basic question in class: how does an anonymous function recurse? A deep theoretical answer no doubt exists. I have a shallow, easier-to-swallow answer. An anonymous function can recurse by defining a local name for itself, and using that name to call itself within itself. The miraculous LABELS special form is a bit like a cross between a LET and a DEFUN... Consider

(funcall
   (lambda (x)
      (labels ((myname (x)
                  (if (<= x 1) 1
                      (+ (myname (- x 2)) (myname (- x 1))))))
         (myname x)))
   3)
Extra credit points if you can come up with a shorter / more understandable anonymous recursive function.

Of course, even though this works, it would be cooler if Lisp had a function-equivalent of the "self" or "this" variable used in OO languages. You know, sort of a this-func such that you could write

(lambda (x) (if (<= x 1) 1 (+ (this-func (- x 2)) (this-func (- x 1)))))
It is hard to imagine that this hasn't been done in some Lisp dialect already, maybe something similar to it has.

lecture #7 began here

Things I Learned While Writing This Handy Sudoku Solver

Formatted Output

(format t "~A" "hello, world")
More typically, like with printf(), the format string is used to stick a value in the middle of a larger string. ~A is interesting since it does the "right" thing with values of several/many Lisp data types. From Gigamonkeys.com:
(format nil "The value is: ~a" 10)           ==> "The value is: 10"
(format nil "The value is: ~a" "foo")        ==> "The value is: foo"
(format nil "The value is: ~a" (list 1 2 3)) ==> "The value is: (1 2 3)"

Random Numbers

Function (random N) gives a random non-negative number from 0 up to and not including N. There are additional parameters; see the manual. Rolling a die would look like:
(1+ (random 6))
Note: casual inspection suggests it uses the same random seed every time, unless you do something like:
(setq *random-state* (make-random-state t))

Homework #2

Let's look at homework #2. HW2's topic is inspired by the UI Vandal TESPA Club, and based on a card game called HearthStone, so if you've never heard of these card games, check out the video starting at about the 2:50 mark.


Applicative Programming

Eval, Apply, and Funcall

(eval expr) calls the Lisp evaluator on its argument. It is an ordinary Lisp function, and lets you execute arbitrarily constructed data as code.
(setf x '(+ 2 2))
x			; returns (+ 2 2)
(eval x)		; returns 4

Earlier we saw (apply f L) calls function f with list L as its parameters. Apply is kind of like (eval (cons f L))

(apply '+ '(1 3 5))	; returns 9
Note: apply does not work with special forms!

(funcall f &rest args...) is like apply, only the args are supplied directly in the normal function call manner.

Optional Parameters

Lisp allows functions to declare optional parameters like this:
(defun subseq (seq startpos &optional endpos) ...)
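A sketch of declaring and calling one; mysub is my own name, not the real subseq:
(defun mysub (seq startpos &optional endpos)
   (if endpos
       (subseq seq startpos endpos)
       (subseq seq startpos)))
(mysub "hello" 1)       ; returns "ello"
(mysub "hello" 1 3)     ; returns "el"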

Keyword Parameters

Keywords are symbols starting with colon (:) as in
   (with-open-file (f "foo" :direction :input) ...)
This is really a user-definable mechanism.

Q: When do you use it?
A: When you have multiple (many) optional parameters

(defun f (x &key (y "because") (z 'zebra))
   ; ... code body uses x, y, and z
   )
allows such calls to f as
(f 1)			; y and z default
(f 2 :y 'not)		; z defaults
(f 3 :z 'zbigniew)	; y defaults
Example: more of with-open-file's keyword parameters

File I/O (in Common Lisp)

By the way, stdout in Common Lisp is named *standard-output*

Programming environment issues

You write Lisp code in a .l or .el file and load it into your Lisp interpreter (or Emacs session!) by executing (load "file.l"); under Emacs that would be (load-file "file.el"). Many Emacs functions are helpful in finding online documentation on Emacs Lisp, including "apropos" and "describe-function". Invoke these interactively by typing M-x (escape, then x), then typing the function name and pressing return.

A word about compilation

I did a naive test of the speedup generated by our GNU Common Lisp's bytecode compiler, to see if it is worth your bothering with. After defining a file fib.l:
(defun fib (n) 
  (if (< n 2) 1 (+ (fib (- n 1)) (fib (- n 2)))))
and executing (load "fib.l"), the call (fib 35) was taking me about 30 seconds, while if I do a (compile-file "fib.l") it generates a bytecode file named fib.fas, and if I do a (load "fib.fas"), the call (fib 35) takes about 5 seconds -- rough ballpark is a factor of 6 speedup on a trivial example. Your mileage may vary.

lecture #9 began here

Land of Lisp Video

This is totally what I was thinking of when I came up with "LispStone".

The function's code body

Suppose you
(defun foo (x) (* x x))
In common Lisp, for arbitrary function foo in the symbol table you can request its "function definition" with #'foo, but that gives you a function value, something that might be native code. On the other hand,
(function-lambda-expression #'foo)
peels out the actual lambda form of the function. In our example above you get something like (in GNU CLISP)
(LAMBDA (X) (DECLARE ...) (BLOCK FOO (* X X)))
...so you could get at the code block with something like
(nth 3 (function-lambda-expression #'foo))
giving an answer of
(BLOCK FOO (* X X))
and then the expression itself with
(nth 2 (nth 3 (function-lambda-expression #'foo)))
giving
(* X X)
Now, how would we destructively change that * to a + ... if only Lisp weren't mathematically pure. If only it were evil...

rplaca, rplacd

Having gone to a lot of trouble to say that proper use of Lisp never modifies any existing variable or structure, but instead relies on pure mathematical functions (if you stick to those principles, for example, your code is thread-safe pretty much for free; some Lisps automatically parallelize your code for you)...

Now it is time to show how to modify an existing cons cell. (rplaca L x) replaces L's car with x, and (rplacd L x) replaces L's cdr with x.

(rplaca (nth 2 (nth 3 (function-lambda-expression #'foo))) '+)
Does it actually modify the semantics of foo?
> (foo 3)
...You bet it does!

I guess you could say: self-modifying code starts with being able to modify code...


Generalizing from Lists, Common Lisp has many data types that all fall under the umbrella of an ordered sequence of elements, on which a large set of built-in sequence functions work.


(make-array dimensions &key :element-type :initial-element :initial-contents ...)
The dimensions parameter is a single integer or a list of integers. Supplying an :element-type allows more efficient implementation of specialized arrays. (aref a &rest subscripts) produces an element of array a, either for evaluation or for assignment:
   (aref a 5)		; a[5]
   (aref m 3 2)		; m[3][2]
   (setf (aref a 5) 10) ; a[5] = 10
Note that &rest indicates a function with a variable number of arguments. There are several other array helper functions:
(array-element-type a)
(array-rank a)
(array-dimension a i)
(array-in-bounds-p a &rest subscripts)

Categories of Sequence Operations

Simple functions
elt, length, reverse, subseq, make-sequence
Concatenate, map, reduce
(reduce #'+ '(1 2 3 4)) --> 10
(map 'list #'+ '(1 2 3) '(4 5 6)) --> (5 7 9)
fill, replace, remove, substitute
some, every, notany, notevery
Search functions


(concatenate 'result-type seq1 seq2 ... seqN)
and many other Lisp functions are polymorphic; they operate on any sequence type (lists, strings, arrays/vectors...). You may find it convenient to wrap them in helper functions.
(defun strcat (s1 s2) (concatenate 'string s1 s2))
(defun lcat (L1 L2) (concatenate 'list L1 L2))
(defun cat (x1 x2)
   (typecase x1
      (string (concatenate 'string x1 x2))
      (list (concatenate 'list x1 x2))))
Given such an awesome piece of Jeffery-wonderfulness, try
 (cat '(1 2 3) '(4 5 6))
 (cat "hello" "there")
 (cat '(1 2 3) "there") 

More map, and mapcar

We already saw map and reduce, but here's a tip: map may be handy in putting your input data (i.e. for a homework assignment) into a convenient format to work with.

(map type f sequences) calls function f once for each element in its sequences.

(map 'list #'- '(1 2 3)) ; returns (-1 -2 -3)
(defun oddc (n)
   (if (oddp n) #\1 #\0))
(map 'string #'oddc '(1 2 3 4)) ; returns "1010"
(map 'list #'string "abcd") ; returns ("a" "b" "c" "d")
(mapcar f L L2 ... Ln) is about the same, but is specific to lists. mapcar has several relatives.

mapcar is kind of like:

   (cons (f (car L) (car L2) ... (car Ln))
         (mapcar f (cdr L) (cdr L2) ... (cdr Ln)))
(mapcar '1+ '(100 200)) ; returns (101 201)
(mapcar '+ '(1 2 3) '(100 200 300)) ; returns (101 202 303)

Search functions

(find item seq)     ; returns leftmost occurrence of item in seq
                    ; good particularly on lists of structures
(position item seq)  ; leftmost position
(count item seq)     ; # of matches
(mismatch seq1 seq2) ; position of first dissimilar elements
(search s1 s2)       ; position of s1 in s2
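Sample calls, as a sketch:
(find 3 '(1 2 3 4))          ; returns 3
(position #\l "hello")       ; returns 2
(count 1 '(1 0 1 1))         ; returns 3
(mismatch "apple" "apply")   ; returns 4
(search "lo" "hello")        ; returns 3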

Pragmatics of Lisp

Thus far we have covered classical Lisp topics. The focus of the final Lisp lecture(s) will be practical solutions to real-world problems in Common Lisp.

Reading Lines into a List

The wrong way:
(setf L ())
(read-em f)
(defun read-em (f)
   (let ((s (read-line f nil nil)))
      (cond ((not (null s))
		(setf L (cons s L))
		(read-em f)))))

The right way:

(defun read-em (f)
   (let ((s (read-line f nil nil)))
      (if (null s) nil (cons s (read-em f)))))

Load Error?

[16]> (load "words.l")
;; Loading file words.l ...
*** - READ: input stream
      #<INPUT BUFFERED FILE-STREAM CHARACTER #P"words.l" @30> ends within an
      object. Last opening parenthesis probably in line 13.
The following restarts are available:
ABORT          :R1      Abort main loop
Break 1 [17]> abort

String Processing

String processing is not Lisp's strong point. On sourceforge you can find a valuable resource on string processing in Common Lisp.

Example helper functions

(defun emptystr (s) (= 0 (length s)))
(defun strcat (s1 s2) (concatenate 'string s1 s2))

Recursing on a String

Last time we identified our basis case on a string (string length 0). To walk through a string with recursion, we might use a "car" and a "cdr" to pick out the first and the rest of the string.
(defun carstr (s) (char s 0) )
(defun cdrstr (s) (subseq s 1))
Now, what does the following recursion on a string do?
(defun woof (s)
   (if (emptystr s) nil
      (cons (carstr s) (woof (cdrstr s)))))

How Many Helper Functions

As I think about it, functional programming style is easier if you are more aggressive in writing more, smaller functions than would be normal in C. A good rule of thumb is, if you've got more than 7 (+/-2) levels deep of nesting in the code body (not counting the one for defun, or the parameters or local variable declarations) you might start to bump into your cognitive limits on grokking code, and perhaps should think about breaking it up into more helper functions. But note that this says "deep" and is not a simple matter of counting parentheses in most cases.

Back to (words s)

Last time we started on a function (words s) that would return a list of "words", given an input text (say, a line read from a file). We got about this far:
;; return a list of words from string s
(defun words (s)
   (if (= 0 (length s)) nil
    ; ... add "else" part here
As a programmer, I tend to want a few more base cases with error checks here, a bunch of "if" conditions. I could combine all these conditions using a ton of "or" clauses, but I might prefer to write them separately so as to be less confusing. I could write a bunch of nested "if" expressions, chaining them along the else branches, but that is also ugly if it gets too deep. Another option would be to learn and use Common Lisp's "cond" expression, which is perhaps a nicer encapsulation of a classic if-else-if-else-if-else-if chain.

One other thing we left off with last time was: what characters make up a "word" and how do we test for them in Common Lisp? Homework 4 is very precise (and simple) in saying a word starts with a-zA-Z0-9 and anything else is a non-word that separates words. You should look for (in Common Lisp) a built-in function, or write your own helper function to test whether a character is a "word character" or not.

(defun word-char (c) (or (alpha-char-p c) (digit-char-p c)))
As we walk along happily recursing down the string, we need to remember several things: the rest of the string still to process, the word we are currently building, and the list of words found so far. The easiest way to remember all these things as we walk along is to pass them as parameters.

Given that we do different things depending on whether we are in a word or not, you can either add extra conditions to track that, or you can write separate recursive functions for when you are in a word at the moment, and when you are not.

;; return a list of words from string s
(defun words (s)
   ; if s is empty
   (if (emptystr s) nil
         ; if first char is a "word char"
	 (if (word-char (char s 0))
            ; then go into "in word" mode
            (in-word (cdrstr s) (string (char s 0)) nil)
          ; else try next char
          (words (cdrstr s)))))
Note the indentation style, and the comment convention. Also note the repeated calls to the same function with the same arguments. A good optimizing compiler will avoid those duplicate calls, but on an interpreter if we cared about performance we might want to only call those once, storing results in local variables.
;; return a list of words from string s
(defun words (s)
   (if (emptystr s) nil
       (let ((c (char s 0)) (d (cdrstr s)))
	 (if (word-char c)
            (in-word d (string c) nil)
          (words d)))))
It remains to be seen how to grab more letters in the current word.
;; already in a word w
(defun in-word (s w L)
   (if (emptystr s) (append L (list w))
       (let ((c (char s 0)) (d (cdrstr s)))
	 (if (word-char c)
           (in-word d (strcat w (string c)) L)
           (notin-word d (append L (list w)))))))
The interesting thing here is that if c is a word-char we add it onto the current word and call recursively, but if c is not, we add the whole word onto the accumulated words list and call our evil twin to process the next character.
;; not in word w
(defun notin-word (s L)
   (if (emptystr s) L
       (let ((c (char s 0)) (d (cdrstr s)))
	 (if (word-char c) (in-word d (string c) L)
            (notin-word d L)))))

lecture #10 began here

Strategically, I am choosing today to address assignment-specific topics, rather than getting bogged down in string processing details where we left off.

Brief Comment on Mismatch

What was I thinking (mismatch s1 s2) was supposed to do last lecture? It does the obvious.

From the Mailbag

How do I read the user's selection?
How about something like:
(defun promptedread (s) (format t "~A: " s) (read))
You might use it once for the command, again for each of the name(s). Example call:
(promptedread "Enter (1) Quit, (2) End Turn, (3) Play, or (4) Attack")
Of course, you would store the return value in a variable, pass it as a parameter, or whatever. Note: (read) won't be enough if card names are multiple Lisp values, for example because they have spaces in them. Consider using (read-line) instead.
How do I find the card by name in my Hand in order to play it?
You are searching through a list, looking for an element whose third field matches a string. Whether we were going to use a built-in search function, or write our own recursion, the fundamental building block is the string match against the third field.
(defun cardmatch (card match) (equal (nth 2 card) match))
Example call:
(cardmatch (car Hand) "++")
In class exercise: write a recursive function that searches through a Hand and returns the position of a match, or nil if it is not found.
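One possible shape for that exercise, as a sketch; cardpos is my own name, and positions here are 0-based:
(defun cardpos (Hand name)
   (cond ((null Hand) nil)
         ((cardmatch (car Hand) name) 0)
         (t (let ((p (cardpos (cdr Hand) name)))
               (if p (+ 1 p) nil)))))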
How do I move a card to my minion list?
Use a variant of built-in (remove) with a smart enough helper function, or use recursion to make a copy of the Hand that is missing the card that is being removed. Use cons to construct a new minion list with a card added to the front, except drop the first two elements (cost and spellp) of the card and add the rest as the new element on the minion list.

lecture #11 began here

Learned the Hard Way

(eql) will do numeric comparisons, individual char comparisons, but not string comparisons:
[1]> (eql "hello" "hello")
NIL

Type Conversions

Because one of you asked:
> (coerce 15/2 'float)
7.5
> (round 7.5)
8 ;
-0.5

Multiple Values

This latter is an example of Common Lisp's "multiple value" functions. Multiple value functions are arguably unnecessary since you can just write a function that returns a list. In the case of (round) they let the built-in have its primary return value and essentially make the additional return values optional.

The typical way to access more than one return value is with a form similar to the (let) special form, called multiple-value-bind:

(multiple-value-bind (x y) (round 7.5) (format t "~A, but note ~A~%" x y))
8, but note -0.5

Hash Tables

You probably need to know Common Lisp's hash table type. You can read more about them at the Common Lisp Cookbook Hashes Page.

(let ((tab (make-hash-table)))
   (if (null (gethash s tab))
       (setf (gethash s tab) 1)
       (setf (gethash s tab) (+ 1 (gethash s tab)))))
As far as walking through your hash table looking at all of them, you can use (maphash #'helperfunc tab) if you write a (helperfunc key value). Or you can use one of several iterator or loop methods given in the cookbook.
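Putting the pieces together, a sketch with my own names:
(let ((tab (make-hash-table)))
   (setf (gethash 'a tab) 1)
   (setf (gethash 'b tab) 2)
   (maphash (lambda (k v) (format t "~A -> ~A~%" k v)) tab))
This prints each key/value pair, in no guaranteed order.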

Warning about Common Lisp Hash Tables

Common Lisp uses "eq" semantics when hashing and looking up keys. Two strings might look identical, but if they were allocated at different times, in different locations in memory, they will not hash to the same place. To hash reliably on strings, you might convert them into symbols using (intern s).
(setq tab (make-hash-table))
(gethash "hi" tab)
(setf (gethash "hi" tab) 1)
(gethash "hi" tab)
(setf (gethash (intern "bye") tab) 2)
(gethash (intern "bye") tab)


From n-a-n-o's cmu common lisp tutorials we note that Common Lisp has a built-in sort function, (sort L f). Caveat: sort is destructive ?! Make a copy of the cons cell chain before you sort it, if you wish to preserve the original.

One more kind of recursion

It is worth mentioning one more important kind of recursion: problems in which you can divide the problem in half each time, solving recursive subproblems of half the size. You will only have to subdivide in half (log n) times. Quick Sort and Binary Search are examples of algorithms that might employ this kind of recursion.
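A sketch of that pattern, a recursive binary search over a sorted vector (bsearch is my own name):
(defun bsearch (v x lo hi)
   (if (> lo hi) nil
       (let ((mid (floor (+ lo hi) 2)))
          (cond ((= (aref v mid) x) mid)
                ((< (aref v mid) x) (bsearch v x (+ mid 1) hi))
                (t (bsearch v x lo (- mid 1)))))))
(bsearch #(1 3 5 7 9) 7 0 4)     ; returns 3
Each call discards half of the remaining range, so at most about (log n) calls are made.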


Consider this practice for the midterm.

lecture #12 began here


Unicon Basics

Variable declaration is optional
This is a compromise between the needs of scripting/prototyping languages, and the need to support larger mainstream software engineering projects.
local, global, and static declarations are recommended in large programs or libraries.
Variables can hold any data type, and reassigned with different types
Like in Lisp, Python etc. But this is very rare in practice.
type(x) returns a string type name ("list", "integer" etc) of x
You can write code that works across multiple types. Heterogeneous, polymorphic awesomeness.
arithmetic is pretty normal
^ is an exponentiation operator. Integers are unlimited precision. Reals are C doubles.
Type conversion is automatic across scalar types
Runtime error when conversion won't work, except in explicit conversion functions, which fail instead.
Strings use double quotes and are escaped using \
indexes are 1-based; they are immutable, atomic; not arrays of char; there is no char type
s[i] := "hello" works
but since strings are immutable, it is really like s := s[1:i] || "hello" || s[i+1:0]
*s is a length operator, repl(s,i) is i concatenations of s
expressions in Icon can fail to produce a result
failure cascades to surrounding expressions
Built-in types include lists, tables, sets, csets, and records.
Arguably simpler to use than Common Lisp's
Classes and packages
Well-suited for large-scale apps
Easy I/O capabilities
2D, 3D, and network programming

Unicon as yet-another-typical-scripting-language

procedure fwords(s)
    t := table(0)
    L := wordsinfile(s)
    while s := pop(L) do t[s] +:= 1
    every k := key(t) do if t[k]=1 then delete(t,k)
    return sort(t)
end
# v1
procedure wordsinfile(s)
    f := open(s)
    L := []
    while line := read(f) do L |||:= wordsinline(line)
    return L
end
procedure wordsinline(s)
   alnum := &letters ++ &digits
   L := []
   s ? while tab(upto(alnum)) do put(L, map(tab(many(alnum))))
   return L
end
Note: there is something terribly bad about v1 of wordsinfile() above. List concatenation ||| allocates a whole new list each time. I have seen this in the wild, i.e. in real programs written by end users. The larger the list gets, the less efficient it is to allocate a copy in order to add a tiny number of elements on to the end. How would you rewrite wordsinfile() to just put the elements on to the end of L instead of list concatenating every time?
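One possible rewrite, as a sketch:
# v2: put each word onto the end of L instead of rebuilding the list
procedure wordsinfile(s)
    f := open(s)
    L := []
    while line := read(f) do
        every put(L, !wordsinline(line))
    return L
end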

Fundamentals of the Goal-Directed Paradigm

Ordinary Languages:
  expression evaluation computes a return value, no matter what
  If you have a problem:
    • return an "error code" or "sentinel value", or
    • raise an exception
  If your expression has multiple answers:
    • compute first, write a loop to get the rest, or
    • compute all, return an array/list/whatever

Goal-Directed Evaluation:
  expression evaluation can succeed or fail
  If you have a problem: the expression simply fails
  If your expression has multiple answers:
    • generate results as needed by the surrounding computation

Fallible Expressions


can't fail:           1
can't succeed:        &fail
test (fallible):      x < 1
depends on operands:  x+1

lecture #13 began here


Generators are simply expressions that logically might produce more than one result. For further reading, see "Generators in Icon", by Griswold, Hanson, and Korb.

Some common generators in Unicon include:

In the realm of string scanning: In addition to chaining all of these (and a few other built-in generators) together, you can create your own generators. We'll show this in a bit.

String Scanning

   s ? expr
evaluates expr in a string scanning environment in which string s is analyzed (terminology: s is the subject string). While in a string scanning environment, string functions all have a default string, and a default position within the string at which they are to operate.

   s ? find(s2)
searches for s2 within s and is a lot like find(s2, s, 1).

You almost never use string scanning if you only have one string function to call, but rather, when you are breaking up a string into pieces with multiple functions. In this case, function tab(i) changes the position to i, and function move(i) moves the position by i characters. tab() and move() return the substring between the start position and where they change it to.

    s ? {
       if write(f, tab(find("//"))) then {
	  move(2) # move past //
          write(&errout, "trimmed comment ", tab(0))
       }
       else write(&errout, "there was no comment")
    }
Built-in scanning functions include:
find(s)
search for a string
upto(c)
search for a position at which any character in set c can be found
match(s)
if current position starts with s, return position after it
any(c)
if current character is in c, return position after it
many(c)
if current position starts with characters in c, return position after them
bal(c1,c2,c3)
like upto(), but only return positions at which the string is "balanced" with respect to c2, c3. Tricky in one respect.
Actually several of these are generators.

lecture #14 began here

More about Generators

a | b
The simplest generator is alternation. Instead of saying
x = 5 | x = 10
you can just say x = (5|10). You might not want to hear it, but: yes, this is shorter and more readable than ordinary programming languages, instead of adding power by being "weirder". Maybe read | as "then" instead of "or". So what does
  (1 | 2) + (x | y)
generate?
i to j
i to j by step
The coolness here is that a traditional language's "for-loop" has been generalized not just into an iterator, but into an expression that can be smoothly blended into any surrounding expression context.
!x
All data structures in the language support the "generate" operator ! to produce their contents. Files generate their contents a line at a time. Consider
   s == !f
find(s), upto(c), and bal(c1,c2,c3)
These classic string pattern matching generators produce indices within a string. They have several optional parameters for string to examine, and start and end positions to consider. They are usually used in a string scanning environment where these parameters may be omitted. Of the three, bal() is seldom used and a bit trickier than the others. It generates positions containing characters in c1 (like upto) balanced with respect to c2 and c3. Note that if *c2 and *c3 are greater than 1, though, it does not distinguish different kinds of parentheses.
seq(), key()
For completeness sake, we list the remaining two "built-in" generators. seq() generates an infinite sequence of integers. key() generates the "keys" of a table or set.
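For instance, a sketch (t here is some existing table, and \ limits a generator's results):
every write(key(t))      # write every key of table t
every write(seq() \ 5)   # write 1 2 3 4 5, limiting the infinite sequence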

User-Defined Generators

Generators are often a convenient way to write dynamic programming solutions.
procedure fib()
   local u1, u2, f, i
   suspend 1|1
   u1 := u2 := 1
   repeat {
      f := u1 + u2
      suspend f
      u1 := u2
      u2 := f
   }
end

lecture #15 began here


Please skim chapters 5-9 and read chapters 10-12 of Programming with Unicon. The rest of the Unicon book is also useful I hope, but you will receive no exam questions from it.

Recursive Generators

Given a record tree(data, ltree, rtree), what does the following procedure do?
procedure walk(t)
   if /t then fail
   else {
      suspend walk(t.ltree | t.rtree)
      return t.data
   }
end
Compare that with a non-generator, conventional "Visitor" design pattern solution:
procedure walk(t, p)
   if /t then fail
   walk(t.ltree, p)
   walk(t.rtree, p)
   p(t.data)
end
What does this procedure do?
procedure leaves(t)
   if /t then fail
   else if /(t.ltree === t.rtree) then
      return t.data
   else {
      suspend leaves(t.ltree | t.rtree)
   }
end

Recursion and Backtracking

Recursive backtracking examples, UT Longhorn-style.

This is a long slide set. You may wish to review additional slides in this slide deck, beyond the set covered in class.

Unicon: highlights of built-in data types

What did we miss?
s1 || s2, s1 == s2, s[i:j]
L1 ||| L2, push(L, x), pop(L), put(L, x), pull(L)
beware using lists (etc.) as keys
sets and csets
S1 ++ S2, S1 ** S2, S1 -- S2
records and classes
constructors, etc.

Unicon: Classes and OOP

Unicon's object oriented programming capabilities include closure based inheritance semantics for multiple inheritance.

Here is a gentle syntax comparison, adapted from Hani Bani-Salameh.
C++:

class Example_Class {
   int x;
   int y;
   Example_Class() {
      x = y = 0;
   }
   ~Example_Class() { }
   int Add() {
      return x + y;
   }
};

Unicon:

class Example_Class(x, y)
   method Add()
      return x + y
   end
initially
   x := y := 0
end

Unicon Tips from the Past

procedures end with end
not { } as in C/C++/Java. Same goes for classes, methods
&& is not an "and" operator
& is an "and" operator
a generator is only as generative as its surrounding expression demands
if it's not driven by "every", it may well stop at its first result
if it's already a generator, ! won't make it more so
rather, it will generally mess it up
Can't just start assigning elements of an empty list
After L:=[], you will find that L[1] does not exist yet. Create with list(n) or put() or push() elements before you try to subscript them.

lecture #16 began here

Unicon: Graphics

Unicon has some of the world's easiest 2D graphics (open() mode "g"), inspired by the TRS-80 Extended Color BASIC graphics, and influenced by the X Window System. X11 (and the classic Mac and Windows 2D APIs) were all inspired heavily by the original Xerox graphics workstations.

The 3D facilities (open() mode "gl") are also pretty darn simple. They are built atop (classic) OpenGL and have grown to emphasize the use of textures over time.

Q: When is "graphics" a programming language concept, and when is it software engineering, operating systems, architecture, or mathematics?
There are many answers. language vs. library. application layer vs. system layer. software vs. hardware. idea vs. implementation. At least in Unicon, one can say: there is a built-in data type. There is semantics in the VM / runtime system happening even when you are not in a graphics function call. Perhaps there should be: control structures.

Main concepts:

  1. window = canvas + context
  2. "attribute=value" strings
  3. pixels, coordinates, colors, fonts
  4. input processing and callback routines
  5. language level (built-in) tries to provide essential features with simplest API possible, relatively complete programmer control
  6. class (library) level features extensive GUI, modern concepts
By way of saying hello, we submit this entry to Brad Myers' "rectangle follows mouse" challenge.
procedure main()
   &window := open("rfm", "g", "fg=blue", "drawop=reverse")
   repeat {
      e := Event()
      case e of {
         &ldrag | &mdrag | &rdrag : {
            FillRectangle(\x, y, 10, 10)   # erase the old rectangle (reversible draw)
            FillRectangle(x := &x, y := &y, 10, 10)
            }
         "q" : exit(0)
         }
      }
end
For the sake of comparison, here is an application to render a simple textured 3D scene.

procedure main() 
   &window :=open("textured.icn","gl","bg=black","size=700,700")

   # Draw the floor of the room 
   WAttrib("texmode=on", "texture=carpet.gif")  
   FillPolygon(-7.0, -0.9, -14.0, -7.0, -7.0, -14.0,
                       7.0, -7.0, -14.0, 7.0, -0.9, -14.0, 3.5, 0.8, -14.0)
   # Draw the right wall
   WAttrib("texture=wall1.gif", "texcoord=0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0") 
   FillPolygon(2.0, 4.0, -8.0, 8.3, 8.0, -16.0, 8.3, -1.2, -16.0, 2.0, 0.4, -8.0)
   # Draw the left wall
   FillPolygon(2.0, 4.0 ,-8.0, -9.0, 8.0, -16.0, -9.0,-1.2,-16.0, 2.0, 0.4, -8.0)
   # Draw a picture
   WAttrib("texture=poster.gif", "texcoord=0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0")
   FillPolygon(1.0, 1.2, -3.0, 1.0, 0.7, -3.0, 1.2, 0.5, -2.6, 1.2, 1.0, -2.6)
   # Draw another picture
   WAttrib("texture=unicorn.gif", "texcoord=1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0")
   FillPolygon(0.8, 2.0, -9.0, -3.0, 1.6, -9.0, 3.0, 3.9,-9.0, 0.8, 4.0, -9.0)
   # Draw the lamp
   Translate(0.7, 0.20, -0.5)
   Fg("emission pale weak yellow")
   Rotate(-5.0, 1.0, 0.0, 0.0)
   Rotate( 5.0, 0.0, 0.0, 1.0)
   DrawCylinder(-0.05, 0.570, -2.0, 0.15, 0.05, 0.17)
   Fg("diffuse grey; emission black")
   Rotate(-5.0, 1.0, 0.0, 0.0)
   Rotate( 6.0, 0.0, 0.0, 1.0)
   DrawCylinder(0.0, 0.0, -2.5, 0.7, 0.035, 0.035)
   Rotate(6.0, 0.0, 0.0, 1.0)
   DrawTorus(-0.02, -0.22, -2.5, 0.03, 0.05)
   # Draw the table 
   WAttrib("texcoord=auto", "texmode=on", "texture=table.gif")
   Rotate(-10.0, 1.0, 0.0,0.0)
   DrawCylinder(0.0, 0.2, -2.0, 0.1, 0.3, 0.3)
   Translate(0.0, -0.09, -1.8)
   Rotate(65.0, 1.0, 0.0, 0.0)
   DrawDisk(0.0, 0.0, 0.0, 0.0, 0.29) 
   WAttrib("texmode=off", "fg=diffuse weak brown")
   Rotate(-20.0, 1.0, 0.0,0.0)
   DrawCylinder(0.0, 0.2, -2.2, 0.3, 0.1, 0.1)
   while (e := Event()) ~== "q" do {
      write(image(e), ": ", &x, ",", &y)
      }
end

lecture #17 began here

Unicon: Networking

Unicon has some of the world's easiest internet client and server facilities. There are basic TCP and UDP protocols accessed via open() mode "n" and "nu", and there are several higher level internet protocols such as HTTP and POP that are accessed via open() mode "m".

Main concepts:

  1. slow, reliable and ordered (TCP) or fast (UDP)
  2. asynchronous, non-blocking I/O and timeouts
  3. dropped connections and widely varying delays
  4. multiplexing and select()
  5. built-in higher level messaging (HTTP, SMTP, etc.)

Discussion of Scoping Rules, Suspend

Backtracking Control vs. Data

An OOP Example

class listable(L, T)
   method insert(k, value)
      /value := k
      T[k] := value
      put(L, value)
   end
   method lookup(k)
      return T[k]
   end
   method gen_in_order()
      suspend !L
   end
initially(defaultvalue)
   L := [ ]
   T := table(defaultvalue)
end
So, this is a table, except it remembers the order in which its elements are inserted. As in Java, without operator overloading we can't make it look exactly like a built-in table...
procedure main(argv)
   LT := listable(0)
   every s := !argv do
      LT.insert(s, LT.lookup(s)+1)
   every x := LT.gen_in_order() do
      write(x)
end
What is wrong with this picture?

Three Pillars of Object Orientation?

Some people have written about three principles of object-orientation, claiming that they are: encapsulation, polymorphism, and inheritance. First, what are they? And second, do they define OOP?

Unicon Inheritance

Inheritance in Unicon has "closure-based" semantics. Instead of a kid being an instance of a parent with some additions, a kid is its own being that pulls in fields and methods via transitive closure (depth-first search) of its superclasses. Closure-based semantics gives the cleanest resolution of multiple inheritance conflicts that I am aware of. Most of the time you do not notice or care.
class fraction(numerator, denominator)
   # methods here
end

class inverse : fraction(denominator)
initially
   numerator := 1
end

class sub : A : B(x)
initially
   x := 0
   self.A.initially()	# calling a parent's method from the overriding subclass method
   self.B.initially()	# self is implicit in most other contexts
end



Unicon: Threads

thread write(1 to 3)
is equivalent to
 spawn( create write(1 to 3) )

The usual problem with a thread is: you aren't waiting for it to be done, and you can't even tell when it finishes. Well, assign it to a variable and you can at least do that much.

mythread := thread write(1 to 3)
wait(mythread)
wait() waits for a thread to be done.

Typically, a thread has some work (data structure) and an id passed into some function. After the thread is finished, the results will have to be incorporated back into the main computation somehow

t1 := thread sumlist(2, [4,5,6])
procedure sumlist(id, L)
   s := 0
   every s +:= !L
   #... can't easily just "return" the value

The classic way threads might communicate is: global variables! But these have race conditions. Alternatives include files or pipes or network connections (all slow), or an extra language feature, but first: how to avoid race conditions.

global mtx
mtx := mutex()
critical mtx: expr
is equivalent to
lock(mtx); expr; unlock(mtx)

Another way to avoid race conditions in Unicon is to use a "mutex'ed" data structure, as in

L := mutex([])

There are also thread-based versions of the activate operator: four or eight of them:

@>     send
@>>    blocking send
<@     receive
<<@    blocking receive

They pass messages through each thread's inbox and outbox queues, which makes for a (weird) model.

There is more to concurrency: condition variables, private channels... this was just your gentle introduction. See UTR14 for more.

A Unicon Thread Story

Real Life intrudes upon our tender classroom...

Discussion of Sort Module

The Icon Program Library sort module handles more exotic sorting needs than those of the built-in sort(). We have an example to consider, but we almost have to get some more core data types and control structures covered in order to appreciate it.

Bits of Icon/Unicon Wisdom

Things I love about Icon and Unicon

Yeah, this list isn't complete...
x1 < y < x2
ranges the way I saw them back in math class
lists and tables
the most convenient data structures building blocks in any language
!L === x and P(!L) and such
the most convenient algorithms building blocks in any language
open() and friends
the most convenient graphics and network I/O in any language

Things I hate about Icon and Unicon

Run-time errors that have &null values because of typos
compiler option -u helps but isn't a cure-all
Run-time errors that have &null values because of surprise failure
if's are needed to check for failure...in a large percent of expressions
Computational accidents because of surprise generators
some things were never meant to be backtracked-into.
the language is slow
from time to time I get help from students interested in fixing this
the IDE is immature
many Bothan spies died to bring you this IDE.

OOP Lessons from the Unicon Class Libraries

The Unicon distribution is basically Icon with an extensively modified VM, plus a uni/ directory that looks like
3d/   guidemos/  iyacc/     Makefile   progs/  ulex/	unidoc/
CVS/  ide/	 lib/	    native/    shell/  unicon/	util/
gui/  ivib/	 makedefs   parser/    udb/    unidep/	xml/
We can't cover all the libraries in a single lecture, but we can learn about objects from some of the highlights.

lecture #18 began here

Where we are At

Flex and Bison

Our next "language" in this course is really two languages that were designed to work together.

Reading Assignment: Flex

Read Sections 3-6 of the Flex manual, Lexical Analysis With Flex.

Regular Expressions

The notation we use to precisely capture all the variations that a given category of token may take is called a "regular expression" (or, less formally, a "pattern"; the word "pattern" is really vague and there are lots of other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. Before you can even talk about "strings", you must first define an alphabet, the set of characters which can appear.
  1. Epsilon (ε) is a regular expression denoting the set containing the empty string
  2. Any letter in the alphabet is also a regular expression denoting the set containing a one-letter string consisting of that letter.
  3. For regular expressions r and s,
             r | s
    is a regular expression denoting the union of r and s
  4. For regular expressions r and s,
             r s
    is a regular expression denoting the set of strings consisting of a member of r followed by a member of s
  5. For regular expression r,
             r*
    is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
  6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions, such as the r+ and r? operators covered below.

lecture #19 began here

Midterm Review

Some Regular Expression Examples

In a previous lecture we saw regular expressions, the preferred notation for specifying patterns of characters that define token categories. The best way to get a feel for regular expressions is to see examples. Note that regular expressions form the basis for pattern matching in many UNIX tools such as grep, awk, perl, etc.

What is the regular expression for each of the different lexical items that appear in C programs? How does this compare with another, possibly simpler programming language such as BASIC?
operators
    BASIC: the characters themselves
    C: operators that are also regular expression operators must be marked with double quotes or backslashes to indicate you mean the character, not the regular expression operator. Note several operators have a common prefix: the lexical analyzer needs to look ahead to tell whether an = is an assignment, or is followed by another =, for example.
reserved words
    BASIC: the concatenation of characters; case insensitive
    C: reserved words are also matched by the regular expression for identifiers, so a disambiguating rule is needed.
identifiers
    BASIC: no _; $ at the ends of some; 2 significant letters!?; case insensitive
    C: [a-zA-Z_][a-zA-Z_0-9]*
numbers
    BASIC: ints and reals, starting with [0-9]+
    C: 0x[0-9a-fA-F]+ etc.
comments
    BASIC: REM.*
    C: C's comments are tricky regexp's
strings
    BASIC: almost ".*"; no escapes
    C: escaped quotes
what else?

lex(1) and flex(1)

These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.

The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.

FILE *yyin;	/* set this variable prior to calling yylex() */
int yylex();	/* call this function once for each token */
char yytext[];	/* yylex() writes the token's lexeme to an array */
                /* note: with flex, I believe extern declarations must read:
                   extern char *yytext; */
int yywrap();   /* called by lex when it hits end-of-file; see below */

The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

   header (definitions)
   %%
   body (rules)
   %%
   helper functions

lecture #20 began here

Lex/Flex Powerpoint

What is a "lexical attribute" ?

A lexical attribute is a piece of information about a token. These typically include:
category an integer code used to check syntax
lexeme actual string contents of the token
line, column, file where the lexeme occurs in source code
value for literals, the binary data they represent
Why lexical attributes matter: in order to pass lexical attributes to the rest of the program, they are stored in an object instance (in C++) or a struct (in C). The fields might look like:
struct token {
   int category;
   char *text;
   int   linenumber;
   char *filename;
   union literal value;
};
The "union literal" at the bottom holds computed binary values of integers, real numbers, and strings. A union holds any one field, and which field (if any) is used for a token would be determined by its category.
union literal {
   int ival;
   double rval;
   char *sval;
};

Flex Header Section

The header consists of C code fragments enclosed in %{ and %} as well as macro definitions consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.
letter		[a-zA-Z]
digit		[0-9]
ident		{letter}({letter}|{digit})*

The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions, this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

" "		{ /* no-op, discard whitespace */ }
{ident}		{ return IDENTIFIER; }
"*"		{ return ASTERISK; }
"."		{ return PERIOD; }
You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals. One helper function you have to write is yywrap(), which is called when lex hits end of file. If you just want lex to quit, have yywrap() return 1. If your yywrap() switches yyin to a different file and you want lex to continue processing, have yywrap() return 0. The lex and flex libraries (-ll or -lfl) provide a default yywrap() that returns 1, and flex has the directive %option noyywrap which allows you to skip writing this function.

A Short Comment on Lexing C Reals

C float and double constants have to have at least one digit, either before or after the required decimal point. This is a pain:
([0-9]+"."[0-9]* | [0-9]*"."[0-9]+) ...
You may be happier with something like:
([0-9]*"."[0-9]*)    { return (strcmp(yytext,".")) ? REAL : PERIOD; }

([0-9]*"."[0-9]*)    { return (strlen(yytext)>1) ? REAL : PERIOD; }

You-all know and love C/C++'s ternary e1 ? e2 : e3 operator, don't ya? It's an if-then-else expression, very slick. Since flex allows more than one regular expression to match, and breaks ties among matches of equal length by using the regular expression that appears first in the specification, perhaps the following is best:

"."                { return PERIOD; }
([0-9]*"."[0-9]*)  { return REAL; }
This is still not complete.
After you add in optional "e" scientific exponent notation, what should it look like?
If present, it is an E followed by an optionally signed integer.
Remember that there are optional suffixes F and L.
E, F, and L are case insensitive (either upper or lower case) in real constants if present.

Lex extended regular expressions

Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:
c         normal characters mean themselves
\c        backslash escapes remove the meaning from most operator characters. Inside character sets and quotes, backslash performs C-style escapes.
"s"       double quotes match the C string given, as itself. This is particularly useful for multi-byte operators and may be more readable than using backslash multiple times.
[s]       this character set operator matches any one character among those in s.
[^s]      a negated set matches any one character not among those in s.
.         the dot operator matches any one character except newline: [^\n]
r*        match r 0 or more times.
r+        match r 1 or more times.
r?        match r 0 or 1 time.
r{m,n}    match r between m and n times.
r1r2      concatenation: match r1 followed by r2
r1|r2     alternation: match r1 or r2
(r)       simple parentheses specify precedence but do not match anything
(?o:r), (?-o:r), (?o1-o2:r)
          parentheses followed by a question mark trigger (or, if preceded by a hyphen, suppress) various options when interpreting the regular expression:
          i   case-insensitivity
          s   interpret dot (.) to mean any character including \n
          x   ignore whitespace and (C) comments
          #   a real Flex comment. Looks like (?# ... )
          This is some of the most awful and embarrassing language design I have ever seen in a production tool. Enjoy.
r1/r2     lookahead: match r1 only when r2 follows, without consuming r2
^r        match r only when it occurs at the beginning of a line
r$        match r only when it occurs at the end of a line

lecture #21 began here

Flex Manpage Examplefest

To read a UNIX "man page", or manual page, you type "man command" where command is the UNIX program or library function you need information on. Read the man page for man to learn more advanced uses ("man man").

It turns out the flex man page is intended to be pretty complete, enough so that we can draw our examples from it. Perhaps what you should figure out from these examples is that flex is actually... flexible. The first several examples use flex as a filter from standard input to standard output.

lecture #22 began here

Toy compiler example

What is similar here to your HW assignment? What must be different?
  /* scanner for a toy Pascal-like language */

  %{
  /* need this for the call to atof() below */
  #include <math.h>
  %}

  DIGIT    [0-9]
  ID       [a-z][a-z0-9]*

  %%

  {DIGIT}+    {
     printf("An integer: %s (%d)\n", yytext,
            atoi( yytext ) );
     }

  {DIGIT}+"."{DIGIT}*        {
     printf( "A float: %s (%g)\n", yytext,
             atof( yytext ) );
     }

  if|then|begin|end|procedure|function        {
     printf( "A keyword: %s\n", yytext );
     }

  {ID}        printf( "An identifier: %s\n", yytext );

  "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

  "{"[^}\n]*"}"     /* eat up one-line comments */

  [ \t\n]+          /* eat up whitespace */

  .           printf( "Unrecognized character: %s\n", yytext );

  %%

  int main(int argc, char **argv )
  {
     ++argv, --argc;  /* skip over program name */
     if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
     else
        yyin = stdin;

     yylex();
     return 0;
  }

Consider how yyin is used in the preceding toy compiler example, if you have not already done so. You may need to do something similar.

lecture #23 began here

Warning: Flex is Idiosyncratic!

Flex is a declarative language. The declarative paradigm is the highest-level paradigm, so why is it so difficult to debug?

Examples of past student consultations:

Doctor J, my program is sick:
IDENT	[a-zA-Z_]+		/* this is an ident */
C comments are allowed some places in Lex/Flex, but I guess not all. This one causes a cryptic error message where the macro is used.
Doctor J, my program won't do the regular expression I wrote:
[ \t\n]+		{ /* skip whitespace*/ }
^[ ]*[a-zA-Z_]+		{ return IDENT; }
If the newline and whitespace are consumed by one big grab, the newline won't still be sitting around in the input buffer to match against ^ in this ident rule.

Point: a language can be declarative, but if it is cryptic and/or gives poor error diagnostics, much of the claimed benefits of declarative paradigm are lost.

Warning: Flex can be Arbitrary and Capricious!

Perhaps because of a desire for brevity, the lex family of tools makes one of the same fatal and idiotic mistakes as Python and FORTRAN: using whitespace as a significant part of the syntax! Consider when %{ and %} are needed in
No errors, but fails to declare num_lines and num_chars unless you add whitespace to the front or use %{ ... %}
Gives cryptic flex syntax errors unless you add whitespace to the front or use %{ ... %}
The proper way to include C code in a Flex header.

Matching C-style Comments

Will the following work for matching C comments? A student e-mail proposed:
[ \t]*"/*".*"*/"[ \t]*\n
What parts of this are good? Are there any flaws that you can identify?

The use of square-bracket character sets in Flex

A student once sent me an example regular expression for comments that read:
   COMMENT [/*][[^*/]*[*]*]]*[*/]
This is actually trying to be much smarter than the previous example. One problem here is that square brackets are not parentheses: they do not nest, and they do not support concatenation or other regular expression operators. They mean exactly "match any one of these characters", or with ^, "match any one character that is not one of these characters". Note also that you can't use ^ as a "not" operator outside of square brackets: you can't write the expression for "stuff that isn't */" by saying (^ "*/")

Does your assignment this semester need to detect anything similar to C style comments? If so, you should find or invent a working regular expression that is better than the "easy, wrong" one. Many different solutions are available around the Internet and in books on lex and yacc, but let's see what we can do. On a midterm exam, I am likely to ask you not for this regular expression, but for a regular expression that matches some pattern of comparable complexity.

Danger Will Robinson:

/\* ... \*/
legal in classic regular expressions, not so in Flex which uses / as a lookahead operator! Feel free to try
\/\* ... \*\/

But I prefer double-quoting over all those slashes. A famous non-solution:

and another, pathologically bad attempt:

Flex End-of-file semantics

yylex() returns integers. From the Flex manual, it returns 0 at end of file. HW#1 NOTE: originally the HW#1 spec said to return -1 on end of file. To do that, you would write a regular expression like
<<EOF>>		{ return -1; }
This would be compatible with C language tradition of using -1 to indicate EOF in functions such as fgetc(). However, I changed the main.c spec to say it would continue to ask for words/tokens as long as it is getting positive values returned, and it will not matter whether your yylex() function returns 0 or -1 to indicate end of file. Still, you should know about this EOF thing in case I make you do multiple files (and use yywrap()) later on.

Flex "States" (Start Conditions)

Section 10 of the Flex Manual discusses start conditions, which allow you to specify a set of states and apply different regular expressions in those different states. State names are declared in the header section on lines beginning with %s or %x. %s states will also allow generic regular expressions while in that state. %x states will only fire regular expressions that are explicitly designated as being for that state.

There is effectively an implicit global variable that remembers what state you are in. That variable is set using a macro named BEGIN() in the C code body, in response to seeing some regular expression that you want to indicate the start of a state.

ALL your regular expressions in the main section may optionally specify via <sc> what start condition(s) they belong to.

Chomsky Hierarchy

lecture #24 began here

HW#4 makefile and main.c fixes

Extended Flex Demo

Let's pretend we are doing HW#4 for a bit. In particular, let's try doing as much as is needed for this program: wh.icn.
procedure main()
   i := 1
   while i <= 3 do

Lexical Structure of Languages

A vast majority of languages can be studied lexically and found to have the same kinds of token categories seen earlier: operators, reserved words, identifiers, and literal constants such as numbers and strings.

In addition, almost all languages will have separators/whitespace that occur between tokens, and comments.

As you may have seen from homeworks 1-2, regular expressions can't always handle real world lexical specifications. FORTRAN, for example, has lexical challenges such as having no reserved words. Consider the line

DO 99 I = 1.10
FORTRAN doesn't use spaces as separators. The keyword DO isn't a keyword, unless you change the period to a comma, in which case we can't be doing an assignment to a variable named "DO99I" any more...

How many of you used "states" (a.k.a. "start conditions")? What online resources for flex have you found? Googling "lex manual" or "flex manual" gives great results.

lecture #25 began here

Reference solution to HW#3

There were lots of ways to do a solution.

Extra Credit Unicon

Since Lisp had two assignments and Unicon only had one, some folks have asked for extra Unicon work, either for extra credit, or for your own reasons. I am willing to entertain proposals, and it is true that I am looking for Unicon talent. Such an exercise should not be undertaken at the expense of any current or future 210 homework, but may be awarded extra credit proportional to its size and features.

Syntax Analysis

Lexical analysis was about what words occur in a given language. Syntax analysis is about how words combine. In natural language this would be about "phrases" and "sentences"; in a programming language it is how to express meaningful computations. If you could make up any three improvements to C++ syntax, what would they be? Some syntax is a lot more powerful or more readable for humans than others, so syntax design actually matters. And some syntax is a lot harder for the machine to parse. The next language (Bison/YACC) is all about syntax analysis. But first, some broader thoughts.

Some Comments on Language Design

Language Design Criteria

"(programming) language design is compiler construction" - Wirth

Syntax design considerations

Context Free Grammars

A context free grammar G has a set of terminal symbols (the tokens), a set of nonterminal symbols, a designated start symbol, and a set of production rules. A context free grammar can be used to generate strings in the corresponding language as follows:
let X = the start symbol s
while there is some nonterminal Y in X do
   apply any one production rule using Y, e.g. Y -> ω, replacing that Y in X with ω
When X consists only of terminal symbols, it is a string of the language denoted by the grammar. Each iteration of the loop is a derivation step. If an iteration has several nonterminals to choose from at some point, the rules of derivation would allow any of these to be applied. In practice, parsing algorithms tend to always choose the leftmost nonterminal, or the rightmost nonterminal, resulting in strings that are leftmost derivations or rightmost derivations.

lecture #26 began here

HW#4 Q&A

Your text says 0 is an octal integer constant, but the example implies it is a decimal constant. What gives?
Since 0 octal is the same number as 0 integer, it doesn't matter, but for the sake of consistency, I have changed the example to indicate 0 is an octal (code 210) not a decimal (code 208).
What should we return for all those other syntax values besides ANY? Is IDENTIFIER fine?

Context Free Grammar Examples

OK, so how much of the C language grammar can we come up with in class today? Start with expressions, work on up to statements, and work there up to entire functions, and programs.

YACC (and Bison)

YACC ("yet another compiler compiler") is a popular tool which originated at AT&T Bell Labs.
The folks that gave us C, UNIX, and the transistor.
YACC takes a context free grammar as input, and generates a parser as output.
Writes out C code. Handles a subset of all possible CFG's
YACC's success spawned a whole family of tools
Many independent implementations (AT&T yacc, Berkeley yacc, GNU Bison) for C and most other popular languages.

YACC files end in .y and take the form

   declarations
   %%
   grammar
   %%
   C helper functions

The declarations section defines the terminal symbols (tokens) and nonterminal symbols. The most useful declarations are:
%token a
declares terminal symbol a; YACC can generate a set of #define's that map these symbols onto integers, in a y.tab.h file. Note: don't #include your y.tab.h file from your grammar .y file, YACC generates the same definitions and declarations directly in the .c file, and including the .tab.h file will cause duplication errors.
%start A
specifies the start symbol for the grammar (defaults to nonterminal on left side of the first production rule).

The grammar gives the production rules, interspersed with program code fragments called semantic actions that let the programmer do what's desired when the grammar productions are reduced. They follow the syntax

A : body ;
Where body is a sequence of 0 or more terminals, nonterminals, or semantic actions (code, in curly braces) separated by spaces. As a notational convenience, multiple production rules may be grouped together using the vertical bar (|).

rttgram.y example

A Little Peek Behind Lex and Yacc Magic

Why? Because you should never trust a declarative language unless you trust its underlying math.

lecture #27 began here

Reading Assignment

Read Bison Manual chapters 1-4 and 6, and skim chapter 5.


Grammar Ambiguity

In normal English, ambiguity refers to a situation where the meaning is unclear, but in context free grammars, ambiguity refers to an unfortunate property of some grammars that there is more than one way to derive some input, starting from the start symbol. Often it is necessary or desirable to modify the grammar rules to eliminate the ambiguity.

The simplest possible ambiguous CFG:

S -> x
S -> x
Maybe you wouldn't write that, but it is pretty easy to do it accidentally:
S -> A | B
A -> w | x
B -> x | y
In this grammar, if the input is "x", the grammar says it is legal. But what is it, an A or a B?

Conflicts in Shift-Reduce Parsing

"Conflicts" occur when an ambiguity in the grammar creates a situation where the parser does not know which step to perform at a given point during parsing. There are two kinds of conflicts that occur.
a shift reduce conflict occurs when the grammar indicates that different successful parses might occur with either a shift or a reduce at a given point during parsing. The vast majority of situations where this conflict occurs can be correctly resolved by shifting.
a reduce reduce conflict occurs when the parser has two or more handles at the same time on the top of the stack. Whatever choice the parser makes is just as likely to be wrong as not. In this case it is usually best to rewrite the grammar to eliminate the conflict, possibly by factoring.
Example shift reduce conflict:
S->if E then S
S->if E then S else S

Consider the sample input

if E then if E then S1 else S2
In many languages, nested "if" statements produce a situation where an "else" clause could legally belong to either "if". The usual rule attaches the else to the nearest (i.e. inner) if statement. This corresponds to choosing to shift the "else" on as part of the current (inner) if-statement being parsed, instead of finishing up that "if" with a reduce, and using the else for the earlier if which was unfinished and saved previously on the stack.

Example reduce-reduce conflict:

(1)	S -> id LP plist RP
(2)	S -> E GETS E
(3)	plist -> plist, p
(4)	plist -> p
(5)	p -> id
(6)	E -> id LP elist RP
(7)	E -> id
(8)	elist -> elist, E
(9)	elist -> E
By the time the stack holds ...id LP id
the parser will not know which rule to use to reduce the id: (5) or (7).
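One hedged way to factor this particular grammar is to delete the p/plist rules entirely and reuse elist in both places, deferring the distinction to a semantic check (that in rule (1) each list element is a plain id). The reduce decision then depends only on the lookahead after RP, which LALR(1) can handle:

```
(1) S -> id LP elist RP
(2) S -> E GETS E
(6) E -> id LP elist RP
(7) E -> id
(8) elist -> elist , E
(9) elist -> E
```

This is a sketch of one option, not the only one; the point is that the parser no longer has to decide between p and E while the id is still on top of the stack.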

YACC error handling and recovery

Improving YACC's Error Reporting

Writing your own yyerror(s) overrides the default error message, which usually just says "syntax error", "parse error", or "stack overflow".

You can easily add information in your own yyerror() function, for example GCC emits messages that look like:

goof.c:1: parse error before '}' token
using a yyerror function that looks like
void yyerror(char *s)
{
   fprintf(stderr, "%s:%d: %s before '%s' token\n",
	   yyfilename, yylineno, s, yytext);
}

lecture 28

Yacc/Bison syntax error reporting, cont'd

Instead of just saying "syntax error", you can use the error recovery mechanism to produce better messages. For example:
lbrace : LBRACE | { error_code=MISSING_LBRACE; } error ;
Where LBRACE is an expected token '{'.
This assigns a global variable error_code to pass parse information to yyerror().

Another related option is to call yyerror() explicitly with a better message string, and tell the parser to recover explicitly:

package_declaration: PACKAGE_TK error
	{ yyerror("Missing name"); yyerrok; } ;

Using error recovery to perform better error reporting runs against conventional wisdom that you should use error tokens very sparingly. What information from the parser determined we had an error in the first place? Can we use that information to produce a better error message?

Getting Flex and Bison to Talk

The main way that Flex and Bison communicate is by the parser calling yylex() once for each terminal symbol in the input sequence. The terminal symbol is indicated by the integer values returned by function yylex().
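A minimal sketch of the scanner side of this handshake, assuming the .y file declares a NUM token and was run through "bison -d" to produce calc.tab.h (the file names here are placeholders, not from any particular assignment):

```
%{
#include <stdlib.h>
#include "calc.tab.h"   /* token codes shared with the parser */
%}
%%
[0-9]+      { yylval = atoi(yytext); return NUM; }
[-+*/^()]   { return yytext[0]; }   /* single-char tokens: return the char code */
\n          { return '\n'; }
[ \t]       { /* skip whitespace */ }
```

Each return value is one terminal symbol handed back to yyparse(); the integer codes for NUM etc. come from the bison-generated header, which is why both tools must agree on it.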

An extended example of this functioning can be built by expanding the earlier Toy compiler example Flex file for a subset of Pascal so that it talks to a similar toy Bison grammar.

lecture 29

This was a nice lecture on Flex and Bison with a hands-on end-to-end example consisting of a lexer and parser for a subset of English language dates. The main difference between this and your homework, structurally, was the placement of main() in dates.y instead of a separate .c file. The example is incomplete; what refinements are needed?

lecture 30

Getting Lex and Yacc to Talk ... More

In addition, YACC uses a global variable named yylval, of type YYSTYPE, to collect lexical information from the scanner. Whatever is in this variable each time yylex() returns to the parser is copied over onto the top of a parser data structure called the "value stack" when the token is shifted onto the parse stack.

The YACC Value Stack

yacc/bison: The Calc Demo

The first of these files includes a full handwritten yylex() in C, which the second file would replace via flex. A "token" must be returned for a newline character if one wishes the calculator to calculate at that point.

Here is another "calc" grammar example, from [Louden]:

E : E '+' T | E '-' T | T ;
T : T '*' G | T '/' G | G ;
G : F '^' G | F ;
F : N | '(' E ')' ;
N : D N | D ;
D : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

Question: how would you extend this grammar to do exponentiation (using '^')? How would you encode its right-associativity? (The G productions above show one answer: the right recursion in G : F '^' G is what makes '^' right-associative.) Question: how would you modify this grammar to compute the values using the value stack? Especially, how would you compute $$ values for nonterminals D and N?
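A hedged sketch of one answer to the value-stack question: attach a $$ action to each production, and flip N to left recursion so a multi-digit number can be accumulated one digit at a time. The details (integer values, use of pow from math.h in the declarations section) are one possible choice, not the only one:

```
E : E '+' T { $$ = $1 + $3; } | E '-' T { $$ = $1 - $3; } | T ;
T : T '*' G { $$ = $1 * $3; } | T '/' G { $$ = $1 / $3; } | G ;
G : F '^' G { $$ = (int)pow($1, $3); } | F ;
F : N | '(' E ')' { $$ = $2; } ;
N : N D { $$ = $1 * 10 + $2; } | D ;    /* left recursion makes $$ easy */
D : '0' { $$ = 0; } | '1' { $$ = 1; }   /* ... and so on through '9' */ ;
```

With the original right-recursive N : D N, computing $$ would require knowing how many digits $2 contains; the left-recursive form sidesteps that entirely.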

Note: C version of yylex() around 32 lines; Flex version 17 lines.

Using the Value Stack for More Than Just Integers

You can either declare that struct token may appear in the %union, and put a mixture of struct node and struct token on the value stack, or you can allocate a "leaf" tree node, and point it at your struct token. Or you can use a tree type that allows tokens to include their lexical information directly in the tree nodes. If you have more than one %union type possible, be prepared to see type conflicts and to declare the types of all your nonterminals.

Getting all this straight takes some time; you can plan on it. Your best bet is to draw pictures of how you want the trees to look, and then make the code match the pictures. No pictures == "Dr. J will ask to see your pictures and not be able to help if you can't describe your trees."

Declaring value stack types for terminal and nonterminal symbols

Unless you are going to use the default (integer) value stack, you will have to declare the types of the elements on the value stack. Actually, you do this by declaring which union member is to be used for each terminal and nonterminal in the grammar.

Example: in a .y file we could add a %union declaration to the header section with a union member named treenode:

%union {
  nodeptr treenode;
}
This will produce a compile error if you haven't declared a nodeptr type using a typedef, but that is another story. To declare that a nonterminal uses this union member, write something like:
%type <treenode> function_definition
Terminal symbols use %token to perform the corresponding declaration. If you had a second %union member (say struct token *tokenptr) you might write:
%token <tokenptr> SEMICOL

Comments from (Old) Student Office-Hour Visits

Debugging a Bison Program

The power of lex and yacc (flex and bison) is that they are declarative: you don't have to supply the algorithm by which they work, you can treat it as if it is magic. Good luck debugging magic. Good luck using gdb to try and step through the generated parser. If "bison --verbose" generates enough information for you to debug your problem, great. If not, your best hope is to go into the .tab.c file that Bison generates, and turn on YYDEBUG and then assign yydebug=1. If you do, you will get a runtime trace of the shifts and the reduces. Between that and a trace of every token returned by yylex(), you can figure out what is going on, or get help with it.

An Inconvenient Truth about YACC and Bison

Did we mention that the parsing algorithm used by YACC and Bison (LALR) can only handle a subset of all legal context free grammars?

Hand-simulating an LR parser

Suppose we simulate the "calc" parser on an example input. It uses the following algorithm. The details are sort of beyond the scope of this class; what you are supposed to get out of this is some intuition.
ip = first symbol of input
repeat {
   s = state on top of parse stack
   a = *ip
   case action[s,a] of {
      SHIFT s': { push(a); push(s'); advance ip }
      REDUCE A -> β: {
         pop 2*|β| symbols; s' = new state on top
         push A
         push goto(s', A)
         }
      ACCEPT: return 0 /* success */
      ERROR: { error("syntax error", s, a); halt }
      }
   }

LR Parsing Cliffhanger.

OK, here comes some sample input! The grammar is:
E : E '+' T | E '-' T | T ;
T : T '*' G | T '/' G | G ;
G : F '^' G | F ;
F : NUM | '(' E ')' ;
What we are really missing in order to actually simulate a shift-reduce parse of this are the parse tables and how they are calculated -- this is covered thoroughly in a number of compiler writing textbooks. By the way, LR parsing (the magic that YACC does) is neither the only nor the most human-friendly of parsing methods.

discussion of parsing "(213*11^5)-8"

One thing left implicit in the previous lecture was that lexical analysis and parsing are usually interleaved -- it is not as if the whole array of tokens has been constructed before parsing. Rather, yyparse() calls yylex() once every time it shifts, and lexical analysis is performed gradually. This might mix CPU operations and I/O operations in an attractive balance, but in practice the I/O has to be heavily buffered to get good performance. You can at least figure that you are starting with an array of characters.

Now, let's see that parse again. The array of char looks like "(213*11^5)-8".
The parse stack is empty, and yyparse() calls yylex() to read the first token.

Parse stack      current token   remaining input
(empty)          '('             213*11^5)-8

Shift or reduce ? -- shift. Note that you could reduce, even in this empty stack case, if the grammar had a production rule where there was some optional thing at the start.

Parse stack      current token   remaining input
'('              NUM (213)       *11^5)-8

Shift or reduce ? -- shift. Can't reduce '('.

Parse stack      current token   remaining input
'(' NUM          '*'             11^5)-8
Shift or reduce ?? Before we can shift a '*' onto the stack, we have to have an T. We don't have one, we have to reduce. What can we reduce? We can reduce NUM to an F.

Parse stack      current token   remaining input
'(' F            '*'             11^5)-8
Shift or reduce ?? We still have to have a T and don't, so reduce again (F to G, and then G to T).

Parse stack      current token   remaining input
'(' T            '*'             11^5)-8
Shift or reduce ?? Shift the '*'

Parse stack      current token   remaining input
'(' T '*'        NUM (11)        ^5)-8
Shift or reduce ??

Extended Discussion of Parse Trees and Tree Traversals

Scope Rules

  1. Local overrides global.
  2. If you have classes, and member functions, where do they fit?
  3. If you don't have to declare variables, are they local, or global, or class?
  4. By the way, there exists dynamic scope versus static scope.
global x

class C ( x, y)

method g()

method f()
   (let x




Semantics, as you may recall, is the study of what something means.


It is tempting to use the heavily overloaded term attributes when talking about semantic properties that a compiler or interpreter would know about a name in order to apply its meaning in terms of code. When we talk about lexical analysis we have lexical attributes; when we talk about syntax we have syntactic attributes (which can build on or make use of lexical attributes); and when we talk about semantics, we have semantic attributes (which can build on or make use of lexical and syntactic attributes). Cheesy example:
double f(int n)
In order for any code elsewhere in the program to use f correctly, it had better know what attributes? So for example, if the input included somewhere later in the program
    x = f('\007');
The compiler can check whether this call to f() makes sense. It can check that the # of parameters is correct, generate code that promotes the character parameter to an integer, check that the variable x is compatible with return type double, and generate code for any conversion that is required in assigning a double to x.
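The same checks happen in Java, which (like C) promotes a char argument to int and handles the int-to-double conversion at the call site. A small hedged sketch (the class and method names are invented for illustration):

```java
public class Promote {
    // Same shape as the prototype above: takes int, returns double.
    static double f(int n) { return n / 2.0; }

    public static void main(String[] args) {
        // '\007' is a char; the compiler checks the call against f's
        // signature, promotes the char (code 7) to int, and the double
        // result is assigned to x with no further conversion needed.
        double x = f('\007');
        System.out.println(x);  // prints 3.5
    }
}
```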

Environment and State

Environment maps source code names onto storage addresses (at compile time), while state maps storage addresses into values (at runtime). Environment relies on binding rules and is used in code generation; state operations are loads/stores into memory, as well as allocations and deallocations. Environment is concerned with scope rules, state is concerned with things like the lifetimes of variables.

Scopes and Bindings

Variables may be declared explicitly or implicitly in some languages

Scope rules for each language determine how to go from names to declarations.

Each use of a variable name must be associated with a declaration. This is generally done via a symbol table. In most compiled languages it happens at compile time, but interpreters will build and maintain a symbol table while the program runs.

A few comments about Nested Blocks

Different languages vary as to how they do nesting of blocks and variable declarations. Semantics has to map names to addresses, and it can be confusing, especially when the same name is "live" with different memory locations at the same time ... in different scopes.
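A small hedged Java illustration of one name being "live" with two storage locations at once (the class and method names are invented):

```java
public class Shadow {
    static int x = 1;            // class-scope x: one storage location

    static int demo() {
        int x = 2;               // local x shadows the field: a second
                                 // storage location for the same name
        x = x + 1;               // unqualified x refers to the local one
        return x + Shadow.x;     // 3 + 1: both locations are still live
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 4
    }
}
```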

Symbol Tables

Symbol tables are used to resolve names within name spaces. Symbol tables are generally organized hierarchically according to the scope rules of the language. Although initially concerned with simply storing the names of the various symbols that are visible in each scope, symbol tables take on additional roles in the remaining phases of the compiler. In semantic analysis, they store type information. And for code generation, they store memory addresses and sizes of variables.
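The hierarchical organization can be sketched in a few lines of Java: each scope is a map plus a pointer to its enclosing scope, and lookup walks outward. This is a minimal illustration (names invented), not a production design:

```java
import java.util.HashMap;

class SymTab {
    final HashMap<String,String> names = new HashMap<>();
    final SymTab parent;                  // enclosing scope, or null
    SymTab(SymTab parent) { this.parent = parent; }
    void insert(String name, String info) { names.put(name, info); }
    String lookup(String name) {
        // innermost scope wins; walk outward until found or out of scopes
        for (SymTab s = this; s != null; s = s.parent)
            if (s.names.containsKey(name)) return s.names.get(name);
        return null;
    }
}

public class SymTabDemo {
    public static void main(String[] args) {
        SymTab global = new SymTab(null);
        global.insert("x", "global int");
        SymTab local = new SymTab(global);
        local.insert("x", "local double");      // shadows the global x
        System.out.println(local.lookup("x"));  // prints local double
        System.out.println(global.lookup("x")); // prints global int
    }
}
```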

Runtime Memory Regions

Operating systems vary in terms of how they organize program memory for runtime execution, but a typical scheme looks like this:

code
static data
heap (may grow up, from the low end of the address space)
stack (grows down, from the high end)

The code section is usually read-only, and shared among multiple instances of a program. Dynamic loading may introduce multiple code regions, which may not be contiguous, and some of them may be shared by different programs. The static data area may consist of two sections, one for "initialized data", and one section for uninitialized (i.e. all zero's at the beginning). Some OS'es place the heap at the very end of the address space, with a big hole so either the stack or the heap may grow arbitrarily large. Other OS'es fix the stack size and place the heap above the stack and grow it down.

Much CPU architecture has included sophisticated support for making the stack as fast as possible, and more generally, for making repeated and sequential memory accesses as fast as possible. This ideally fits C and Pascal (i.e. traditional "structured" imperative programming) and performs pathologically poorly on Lisp (functional) and OOP languages that exhibit poor locality of reference, exaggerating the already extreme speed differences between medium-level languages and very high level languages. Hardware that eschews caches in favor of "more cores" is not as biased.

Allocation and Variable Lifetimes

Since around 80% of the time spent debugging programs written in systems programming languages is spent debugging memory management problems, and since around 67% of total software development costs are spent in debugging and software maintenance, it can be argued that understanding memory allocation and variable lifetimes is the single most important thing for you to master as you move past the "novice" level of programming skill.

Activation Records

Activation records organize the stack, one record per method/function call.
return value
previous frame pointer (FP)
saved registers
FP-->saved PC
At any given instant, the live activation records form a chain and follow a stack discipline. Over the lifetime of the program, this information (if saved) would form a gigantic tree. If you recorded execution up to the current point, you would have a big tree whose rightmost edge is the chain of live activation records, and whose other nodes are an execution history of prior calls.

Garbage Collection

Automatic storage management plays a prominent role in most modern languages; it is one of the single most important features that makes programming easier.

The basic problem in garbage collection: given a piece of memory, are there any pointers to it? (And if so, where exactly are all of them, please?) Approaches:

Supplemental Comments on Imperative Programming

Imperative programming is programming a computer by means of explicit instructions. Assembler language uses imperative programming, as do C, C++, and most other popular languages.

One way to think of imperative programming is that it is any programming in which the programmer determines the control flow of execution. This might be using goto's or loops and conditionals or function calls. It contrasts with declarative programming, where the programmer specifies what the program ought to do, but does not determine the control flow.

Def: a program is structured if the flow of control through the program is evident from the syntactic structure of the program text. "evident" means single-entry/single-exit.

Common constructs in imperative programming include:

Assertions, invariants, preconditions, and postconditions

The problem with imperative programming is: you know you told the computer to do something, but how do you know that you told it to do what you want? In particular, people write code that behaves differently than they intend all the time. We reason about program correctness by inserting logical assertions into our code; these may be annotations or actual checks at runtime to verify that expected conditions are true. Curly brackets {expr} are often used to enclose assertions, especially among former Pascal programmers; another common convention is assert(expr), which is a macro available in many C compilers.

A precondition is an assertion before a statement executes, that defines the expected state. It defines requirements that must be true in order for the statement to do what it intends. A postcondition is an assertion after a statement executes that describes what the statement has caused to become true. An invariant is an assertion of things that do not change during the execution of a statement. An invariant is particularly useful with loop statements.

while x >= y do
   { x >= y if we get here }
   x := x - y
suppose {x >= 0 and y > 0} is true. Then we can further say { x >= y > 0} inside the loop. After the assignment, a different assertion holds:
{ x >= 0 and y > 0}
while  x >= y do
   { y > 0 and x >= y }
   x := x - y
   { x >= 0 and y > 0 }
While these kinds of assertions can allow you to prove certain things about program behavior, they only allow you to prove that program behavior corresponds to requirements if requirements are defined in terms of formal logic. There is a certain difficulty in scaling up this approach to handle real-world software systems and requirements, but there is certainly a great need for every technique that helps programmers write correct programs.
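The loop above can be checked at runtime with Java's assert statement. A minimal sketch (the method name and assertion messages are invented; the logic computes a remainder by repeated subtraction, mirroring the Pascal-style loop):

```java
public class Remainder {
    static int remainder(int x, int y) {
        assert x >= 0 && y > 0 : "precondition";
        while (x >= y) {
            assert y > 0 && x >= y : "loop invariant";
            x = x - y;
            assert x >= 0 && y > 0 : "postcondition of loop body";
        }
        assert 0 <= x && x < y : "postcondition: x is the remainder";
        return x;
    }

    public static void main(String[] args) {
        System.out.println(remainder(17, 5));  // prints 2
    }
}
```

(Run with "java -ea Remainder"; Java assertion checking is disabled by default.)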

lecture #31

Q&A on HW#5

I get no results for these op-codes: rcv, rcvbk, rswap, snd, sndbk, static, quit, fquit, tally, apply, acset, areal, astr, aglobal, astatic, agoto, amark, noop, operror, copyd, trapret, and trapfail.
You were advised that if you can't find any operand requirements for a given opcode, you can assume it takes no operands. I can go further and say that some of these are guaranteed never to appear in a .u file, because they are generated by the VM at runtime: acset, areal, astr, aglobal, astatic, agoto, amark. Some of the others might never appear in a .u file, but might be generated by the linker.
Do I really have to create 16 more integer codes for the "synt" operand values besides ANY?
Correct functioning dictates that "synt" be followed by an operand, and that that operand must be one of the 17 syntax types. Whether you do that with separate integer codes, a single common integer code that generalizes from ANY, or a common code such as IDENTIFIER which you subsequently check, is up to you. Different options move the work into the lexical analyzer, the parser, or semantic analysis during or after the parse.
For the op-code trace when I used grep I could only find it being used a couple times and it was always after the op-code keywd. But keywd seems to be followed by random identifiers.
keywd is not followed by random identifiers, it is followed by keyword names. It should be handled like synt, with a fixed set of possible operands, enforced either lexically or in syntax or in a semantic check.


One popular representative modern object-oriented language is Java.

Reading Assignment

Some Java Slides

Compiling and Running Java Locally on Wormulon

Add the following to your ~/.profile, and/or your ~/.bashrc file. They specify the sizes of Java's heap memory region. By default Java asks for a size that fails on wormulon!
alias java="java -Xmx512m -Xms512m"
alias javac="javac -J-Xmx512m"
You may have to "source" the file you place them in for the current shell session to see the aliases, but subsequent logins will pick them up automatically.

Once you have your aliases set up, compile with "javac hello.java" and run with "java hello".

Example #0

We looked at a hello.java that was specially tailored to remind you of features you would need for homework 1: random numbers from java.util and the command line arguments passed into main().

lecture #32

Things to Learn About Java Today

Java is an Almost-SmallTalk?

A few languages (mainly SmallTalk) have chosen to be "pure OO", meaning that everything down to basic integers and characters is an object. Most languages don't go that far -- Java for example has built-in types like "int" and constructs like arrays, but then very quickly you are forced to use system classes, and encouraged to organize your own code with classes.

So, it isn't about whether you will use classes a lot in Java, like it would be in C++. It is: how are you going to map your application domain onto a set of (built-in system, or new written-by-you) classes? For many problems, this is a natural fit, but for other problems it is silly and awkward.

When to OOP?

When you use a language where OOP is optional, go OOP under two (2) circumstances:
  1. your application domain maps naturally onto a set of classes, or
  2. your problem is so large that you will have trouble wrapping your brain around the whole thing.
In other words: OOP becomes more and more useful as your program size grows.

An Example of Bad OOP in Java

A Lisp HW in Java
Sure you can use Java to write recursive Lisp functions. But if your class is a set of unrelated functions that do not share state, it is pretty bad OOP.

Java Concepts (and APIs) to Learn Today

IO: the next steps

lecture 33

Exception Basics

Object Orientation: Language-Centric Viewpoint

Y'all have programmed in an object-oriented language such as C++ for a while now; what does it mean to think object-orientedly?

As a young computer scientist, I read and believed that object-orientation consisted of:

encapsulation + polymorphism + inheritance
Each of these terms is important to this course.
encapsulation
closely related to information hiding, this is the idea that access to a set of related data can be protected and controlled, so as to avoid bugs and ensure consistency between different bits of data. This concept has been mathematically expressed in the notion of an Abstract Data Type (ADT), which is a set of values and a set of rules (operations) for manipulating those values. In programming languages, it is provided by a class or module construct.
polymorphism
Literally meaning "many shapes" or more loosely "shape changing", this idea is that if you write an algorithm in terms of a set of abstract operations, that algorithm can work on different data types. It occurs in some languages as templates (C++), generics (Ada), interfaces (Java), by passing functions as parameters (C), or simply going with a flexible, dynamic type system (Lisp).
inheritance
By analogy to biological inheritance of traits or genes, inheritance is when you define a class in terms of an existing class.


Encapsulation

Write functions (a la functional programming) around collections of related data. By convention or language construct, hide/protect that (private) data behind a set of public interface functions.

This is the single most important principle of OOP. It is more than just saying "class" a few times in each program. It is usually well-supported in any OO language. The potential abuse comes from the encumbrance of too much required syntax which distracts programmers from the actual problems they need to solve.

Algorithms written to use an encapsulated object and access it only via its interface functions will not mind if you totally rewrite its innards to fix it, make it faster, etc.


Polymorphism

Algorithms written to use an encapsulated object and access it only via its interface functions will not mind if you totally substitute other types of objects, including unrelated objects that implement the same interface.

Dynamic OOP languages usually support this well. Static OOP languages usually support polymorphism somewhat awkwardly, as is the case of C++ templates.


Inheritance

The major difference between OO languages and other languages with strong information-hiding encapsulation is inheritance. Inheritance can mean: starting with generic code, and augmenting it gradually with special cases and extra details. There is abstract vs. concrete inheritance, and parent-centric vs. child-centric inheritance. There is multiple inheritance.

The above concepts are important and useful. They are what object-oriented programming languages typically try to directly support. However, they do not tell the whole story, and programmers who stop there often write bad OO code.

Object-oriented Thinking: Design-centric Viewpoint

The best way to think object-orientedly is to think of the computer program as modeling some application domain. The model of the application domain is the heart of the software design for any program that you write, so the best way to think object-orientedly is from a software engineering perspective, constructing the pieces that the customer needs in order for this program to solve their problems.

CS 210 Java Example: Hamurabi

A previous semester's CS 210 homework assignment was to use Java to write the classic resource simulation program called Hammurabi, with local extensions described below.

Hammurabi in a Nutshell

Hammurabi, the Babylonian king, is a tyrant who wants to grow his population to the largest possible size in order to be the most powerful ruler on earth. In ancient mesopotamia there is a lot of fertile land due to the annual flooding, but there are no defendable borders and the only safety lies in numbers (of spears). To make more people, you have to grow more food, which means you have to plant more land, which takes more seed grain. And by the way, the harvest yield varies from year to year, ranging from 0 to enormous. But the more grain you store, the higher percentage of stored grain is lost each year (rats, corruption, whatever).

Students were asked to modify an existing Java program, filling in the missing code to report on current population and grain and land holdings, and then ask Hamurabi each year:

Hamurabi: the Java Code

Sample code at http://www.roseindia.net/java/java-tips/oop/q-hammurabi/q-pr-hammurabi-1.shtml was given as a starting point; its open source source files were locally copied at

lecture 34

End of Semester Planning

What to Learn About Java from the Hamurabi Code

There is some substantially interesting code there. What Java can we learn from it?
Code by delta (Δ refers to change)
Whether you call it extension, modification, generalization, or filling in the blanks, lots of Java programs are written by modifying existing classes. Sometimes that means writing subclasses. How much inheritance have you done so far in your programming?
Object creation and method invocation
Have you gotten the basic OO syntax of Java yet? Is it any different from C++ so far? if so, how so?
Wrapper Classes
Java deals with its impurity by providing wrappers for non-class builtin types. Java programmers should know the basics of Integer, Double, Float, Short, Long, Character, Boolean, Void, and Byte. Start with the parse*() methods, e.g. Integer.parseInt(s)
Did we say "No preprocessor"?
Constant names get awkward:
private final static int POUND_DEFINE_WAS_SO_COOL = 1;
Getters and setters = lame-o-OO
But I guess setters are the ones that really bug me. And I can live with them so long as they are controlled.
Know how to (use) "swing"?
javax.swing is a graphical user interface library. Many Java applications are written using this class library, unless they are applets, or are written in JOGL or something like that.
Graphical interface
In order to run swing programs, you almost have to either install and run Java on a local computer, or run on Linux machines in the lab. It is possible to run swing and other graphic programs on wormulon, but only if you install an "X Window server" program on your local machine, and have an SSH connection that does "X11 port forwarding". And that can be slow, especially if you are not on campus. Avoid using wormulon this way unless you have good reason.
Who/what is JOptionPane?
Minimally you should know its showInputDialog() and showMessageDialog() methods.
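The wrapper classes mentioned above can be exercised in a few lines. A tiny hedged demo (class name invented):

```java
public class ParseDemo {
    public static void main(String[] args) {
        int n = Integer.parseInt("42");        // String -> int
        double d = Double.parseDouble("2.5");  // String -> double
        Integer boxed = Integer.valueOf(n);    // wrap the int in an object
        System.out.println(n + d + boxed);     // prints 86.5
    }
}
```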

Java Tips from the Past

Don't use an object instance to invoke a static method.
It would be more object-oriented to not use static methods at all, but if you must use a static method, it is CLASS.mystaticmethod(), not instance.mystaticmethod()
Do use templated collection typenames in constructors (after "new")
ArrayList<String> names = new ArrayList<String>();

Using a Class to Make "Swing" Optional

When I compiled and tried to run the hamurabi from roseindia.net on wormulon, I originally got:
> java hamurabi
Exception in thread "main" java.awt.HeadlessException: 
No X11 DISPLAY variable was set, but this program performed an operation which requires it.
... long java runtime exception stack trace ...
If no X11 were available, what would a person do? Options include:
  1. Rewrite the game code to just use the console, skip the GUI dialogs.
  2. Run locally, instead of running on the machine where we turn code in.
  3. Modify the game to ask whether a GUI is available, and use the console when no GUI will work.
Option #3 has more options.
  1. Try and detect whether graphics are present, without using them, in order to avoid the exception in the example.
  2. Just go ahead and try to use graphics, and if they fail, handle the exception and enable the fallback.
At first I checked if the DISPLAY environment variable was set; if it isn't, then we should use the console:
if (System.getenv("DISPLAY") == null) // ... use console
but that is not exactly portable -- on MS Windows no DISPLAY is needed. So a better solution is to use an exception handler to catch that fatal error we saw earlier, and revert to console IO:
	use_swing = true;
	try {
	    JOptionPane.showMessageDialog(null,
					  "Minister says we are swinging");
	} catch (Exception e) {
	    System.out.println("Minister says we are using the console.");
	    use_swing = false;
	}

Using Exceptions in OO Design

Last time we saw a try...catch statement that allows Java to gracefully recover from a runtime error and fall back to using the console when Swing is not available. Where to put this code?

At this point, our object-oriented version of Hammurabi looks like the following picture:

About Inheritance

OOP experts will tell you that there are different kinds of inheritance: abstract inheritance and concrete inheritance.
abstract inheritance
inheritance of a public interface, which is to say, a set of methods with matching/compatible signatures. Abstract inheritance is exactly that (sub)part of inheritance necessary for polymorphism to work. This is the kind of inheritance that says "if it looks like a duck, and walks like a duck, and quacks like a duck, it is a duck".
A signature
a function's prototype information: its name, the number and types of its parameters, and its return type
concrete inheritance
concrete inheritance consists of inheriting actual code. This is the kind of inheritance that says "a mallard is a kind of duck with the following additional traits and behavior". While you might be thinking and writing code about mallards right now, the more code you manage to place in the duck class, or possibly a bird class above it, instead of the mallard class, the more "code sharing" you will see if you have many different kinds of ducks or other kinds of birds later on.


Java has an explicit construct for abstract inheritance: Interfaces. From the Java Tutorials we see:
interface Bicycle {
    void changeCadence(int newValue);    //  wheel revolutions/minute
    void changeGear(int newValue);
    void speedUp(int increment);
    void applyBrakes(int decrement);
}
This contains no code. All it enables is that various classes can now be declared to implement the interface as follows:
class ACMEBicycle implements Bicycle {
    // remainder of this class
    // implemented as before
}
This lets you write code that takes parameters of type Bicycle. Such code will be inherently polymorphic, working with any class that implements the Bicycle interface.
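A minimal self-contained sketch of that polymorphism, restating a tiny Bicycle interface so the example compiles on its own (the getSpeed() method and class names here are invented for the demo):

```java
interface Bicycle {
    void speedUp(int increment);
    int getSpeed();
}

class ACMEBicycle implements Bicycle {
    private int speed = 0;
    public void speedUp(int increment) { speed += increment; }
    public int getSpeed() { return speed; }
}

public class RideDemo {
    // Written against the interface, so any implementing class works here.
    static void warmUp(Bicycle b) { b.speedUp(5); }

    public static void main(String[] args) {
        Bicycle b = new ACMEBicycle();
        warmUp(b);
        System.out.println(b.getSpeed());  // prints 5
    }
}
```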

Concrete Inheritance

Java has a limited, simple form of concrete inheritance. Suppose you have a nice generic bicycle class implemented:
public class Bicycle {
    public int cadence, gear, speed;
    public Bicycle(int startCadence, int startSpeed, int startGear) {
        gear = startGear; cadence = startCadence; speed = startSpeed; }
    public void setCadence(int newValue) {  cadence = newValue; }
    public void setGear(int newValue)    {  gear = newValue;    }
    public void applyBrake(int decrement) { speed -= decrement; }
    public void speedUp(int increment)    { speed += increment; }
}
For any number of customized, specialty bicycles, you might want to start by saying "they behave just like a regular bike, except ..." and then give some changes. In Java you declare such a subclass with the extends reserved word:
public class MountainBike extends Bicycle {
    public int seatHeight; // subclass adds one field
    // overrides constructor, calls superclass constructor
    public MountainBike(int startHeight, int startCadence,
                        int startSpeed,  int startGear) {
        super(startCadence, startSpeed, startGear);
        seatHeight = startHeight;
    }
    public void setHeight(int newValue) {    // subclass adds one method
        seatHeight = newValue;
    }
}

Extra Credit? Turn it in on cscheckin as hwec* (hwec.zip or whatever).

Two ways to check whether your Bicycle is a mountain bike

  1. MountainBike mb = (MountainBike)b;    // a cast: throws ClassCastException if b is not one
  2. if (b instanceof MountainBike) ...    // a test: quietly yields true or false
But note that usually if you were going to say:
if (b instanceof MountainBike) b.doMountainyStuff()
else if (b instanceof RacingBike) b.doRacingStuff()
you'd be more object-oriented, and more efficient, to be defining a method doStuff and having each class override it, so you can just say b.doStuff().
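A minimal sketch of that style, using made-up class names, where dynamic dispatch replaces the instanceof chain:

```java
// Hypothetical sketch: each subclass overrides doStuff(), so the
// caller just says b.doStuff() and dispatch picks the right method.
class Bike {
    void doStuff() { System.out.println("generic riding"); }
}

class MountainBikeSketch extends Bike {
    void doStuff() { System.out.println("mountainy stuff"); }
}

class RacingBikeSketch extends Bike {
    void doStuff() { System.out.println("racing stuff"); }
}

public class Dispatch {
    public static void main(String[] args) {
        Bike b = new MountainBikeSketch();
        b.doStuff();   // no instanceof chain needed
    }
}
```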

Arrays Example

Have you seen this syntax enough to be familiar with it yet?
int[] anArray;
anArray = new int[10];
Note: an array's size is permanently decided at construction time! If you want a growable array, look to class Vector.
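A quick sketch of a growable Vector for contrast (the Grow class name is made up):

```java
import java.util.Vector;

// minimal sketch: a Vector grows as elements are added,
// unlike a plain array whose size is fixed at construction
public class Grow {
    public static void main(String[] args) {
        Vector<Integer> v = new Vector<Integer>();
        for (int i = 0; i < 5; i++)
            v.add(i * 10);             // the Vector grows as needed
        System.out.println(v.size());  // 5
        System.out.println(v.get(3));  // 30
    }
}
```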

Also, be sure you can recognize (and write) code like:

int[] anArray = {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000};
Arrays actually are objects, of a special built-in kind, and they have (at least) one field: anArray.length gives the array's size.

Strings versus arrays of char

Strings really are not arrays of char. Consider this example:
public class hello {
   public static void main(String[] args){
     String s = "Niagara. O roar again!"; 
     char c = s[9];        // compile error: [] does not work on a String
     System.out.println("10th char of "+s+" is "+c);
   }
}
You have to say s.charAt(9) instead of s[9].

lecture 35

More on the Java String class

Be sure you know at least this much:
static method String.valueOf(x)
overloaded 9 times, produces string representation of x
static method String.format(formatstr, objs...)
returns a formatted string, a la printf
s.indexOf(c) and s1.indexOf(s2), lastIndexOf
similar to strchr, strstr
s1.compareTo(s2) and s1.compareToIgnoreCase(s2)
+ and s1.concat(s2)
s.matches(String regex)
Note: Java was arguably the first major language to be Unicode-based. How does this impact the string type?
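A short sketch exercising several of these methods (expected results shown in the comments):

```java
public class StringDemo {
    public static void main(String[] args) {
        String s = "Niagara. O roar again!";
        System.out.println(String.valueOf(3.14));        // 3.14
        System.out.println(String.format("%05d", 42));   // 00042
        System.out.println(s.indexOf("roar"));           // 11
        System.out.println("abc".compareTo("abd") < 0);  // true
        System.out.println(s.matches(".*roar.*"));       // true
    }
}
```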

Java Trails Commentary

Do the required online reading of the Trails Covering the Basics! Be sure you know about:
Know what /** */ comments are for, and be able to give examples.
JavaBeans: this component technology seems to be famous or important. For what?
What are applets, and how do I write one?
What is NetBeans good for?
Java's byte vs. char types
What is the difference? What's with those '\uffff'-style char literals?


javadoc

Who it is for: large scale software system builders.

What it does: write out a collection of webpages to help "navigate" your Java class libraries.

Big success, inspired numerous copycats!!

Writing Doc Comments [from Oracle documentation]

A doc comment is written in HTML and must precede a class, field, constructor or method declaration. It is made up of two parts -- a description followed by block tags. In this example, the block tags are @param, @return, and @see.
/**
 * Returns an Image object that can then be painted on the screen. 
 * The url argument must specify an absolute {@link URL}. The name
 * argument is a specifier that is relative to the url argument. 
 * <p>
 * This method always returns immediately, whether or not the
 * image exists. When this applet attempts to draw the image on
 * the screen, the data will be loaded. The graphics primitives
 * that draw the image will incrementally paint on the screen.
 *
 * @param  url  an absolute URL giving the base location of the image
 * @param  name the location of the image, relative to the url argument
 * @return      the image at the specified URL
 * @see         Image
 */
public Image getImage(URL url, String name) {
    try {
        return getImage(new URL(url, name));
    } catch (MalformedURLException e) {
        return null;
    }
}

printf / Math

Note the %n, which may write out \n, \r, or \r\n depending on which platform you are on. The Math class methods are static; the System.out methods are not.
public class BasicMathDemo {
    public static void main(String[] args) {
        double a = -191.635, b = 43.74;
        int c = 16, d = 45;
        double degrees = 45.0, radians = Math.toRadians(degrees);

        System.out.printf("The absolute value of %.3f is %.3f%n", 
                          a, Math.abs(a));

        System.out.printf("The ceiling of %.2f is %.0f%n", 
                          b, Math.ceil(b));

        System.out.format("The cosine of %.1f degrees is %.4f%n",
                          degrees, Math.cos(radians));
    }
}

To get at the Math static functions without having to say "Math." all the time, use "import static":
import static java.lang.Math.*;
public class BMD {
   public static void main(String[] args) {
      System.out.printf("Hello, world %.3f%n", ceil(3.14159));
   }
}
Note however from stackoverflow: If you overuse the static import feature, it can make your program unreadable and unmaintainable.

More on Exceptions

Three kinds:
checked exceptions
probably recoverable. catch-or-specify required.
errors
you can catch it, but you probably can't recover. problem outside the app.
runtime exceptions
you can catch it, but you probably can't recover. problem inside the app, i.e. a bug that needs to be fixed.
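A tiny sketch of the distinction (the Kinds class name is made up): a runtime exception needs no catch-or-specify, while a checked exception like IOException must be caught or declared.

```java
// minimal sketch: RuntimeException is unchecked, so the compiler
// does not force catch-or-specify; we catch it anyway to show it.
public class Kinds {
    public static void main(String[] args) {
        try {
            throw new RuntimeException("a bug");
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
        // By contrast, "throw new java.io.IOException()" here would not
        // compile without a catch clause or a "throws" declaration.
    }
}
```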

Observation Regarding Exceptions

try {
    out = new PrintWriter(new FileWriter("OutFile.txt"));
    for (int i = 0; i < SIZE; i++) {
        out.println("Value at: " + i + " = " + list.get(i));
    }
} catch (FileNotFoundException e) {
    System.err.println("FileNotFoundException: " + e.getMessage());
    throw new SampleException(e);
} catch (IOException e) {
    System.err.println("Caught IOException: " + e.getMessage());
}
By the way, if you don't handle an exception (no "catch"), you can still use a try { } block to document that you know an exception may occur there. Also, a finally clause will execute at the end of a try block whether an exception is handled or not.
static String readFirstLineFromFileWithFinallyBlock(String path)
throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(path));
    try {
        return br.readLine();
    } finally {
        if (br != null) br.close();
    }
}

JAR files

The Java archive (JAR) file format bundles multiple files (usually .class files) into a single archive. JAR files are really ZIP files, but the jar command-line program uses commands similar to the classic UNIX tar(1) command.

Unlike C/C++, Java does not have a "linker" that resolves symbols at "link time" to produce an executable. Symbols are resolved at "load time" which is generally the first time that a class is needed/used, often during program startup/initialization. This can mean that Java programs are slower to start than native code executables, but it does provide a certain flexibility.

Since Java does not have a linker, JAR files are the closest approximation that it has: a Jar archive can bundle a collection of .class files as one big file that can be run directly by the java VM (using the -jar option). To build a JAR that will run as a program, you specify the options "cfe", the name of the class whose main() function to use at startup, and the set of class files:

jar cfe foo.jar foo foo.class bar.class baz.class
java -jar foo.jar
The options cfe stand for "create" a "file" with an "entrypoint".

Separate Compilation and Make

You might have seen the world-famous and ultra-fabulous "make" tool already. If you already know it, awesome. In any case, "make" is an example of the declarative programming paradigm.

Consider this example makefile:

hello.jar: hello.class
	jar cfe hello.jar hello hello.class

run: hello.jar
	java -jar hello.jar

hello.class: hello.java
	javac hello.java
What it defines are build rules for building a set of files, and a dependency graph of files that combine to form a whole program.



Threads

A thread is a computation, with a set of CPU registers and an execution stack on which to evaluate expressions, call methods, etc.

In Java, threads can be created for any Runnable class, which must implement a public void method named run().

public class HelloRunnable implements Runnable {
    public void run() {
        System.out.println("Hello from a thread!");
    }
    public static void main(String args[]) throws InterruptedException {
        Thread t;
        HelloRunnable r = new HelloRunnable();
        (t = new Thread(r)).start();
        // can use r to "talk" to the child thread via class variables...
    }
}

Easy Synchronization

Synchronization means: forcing concurrent threads to take turns, and wait for each other to finish. Imagine trying to talk at the same time as someone you are with.
    // only one thread at a time may be inside any synchronized
    // method of this object; others wait their turn
    public synchronized void increment() { count++; }


Threads are in the same address space, so they can "talk" by just storing values in variables that each other can see. Examples would be static variables, and class fields in instances that both threads know about (how would both threads know about an instance???).

The main kicker is to avoid race conditions, where two threads get inconsistent information by writing to the same variable at the same time. How to avoid that? Synchronization.
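A minimal sketch of avoiding a race with synchronized (the SyncDemo name and counter are made up here): two threads hammer a shared counter, and because increment() is synchronized the final count comes out exact.

```java
// hypothetical sketch: without "synchronized", the two threads'
// read-modify-write cycles could interleave and lose updates
public class SyncDemo {
    static int count = 0;
    static synchronized void increment() { count++; }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) increment();
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();       // wait for both to finish
        System.out.println(count);  // 200000, every time
    }
}
```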


The Classpath

The -cp command line argument (to java) or the CLASSPATH environment variable specifies a list of directories and/or .jar files in which to search for user class files. In large/complex Java applications, it is often Very difficult to keep this straight.


Collections

Compared with more dynamic languages, Java has to spend a fair amount of work to provide full compile-time type safety and reasonable polymorphism. The organization of its "collections framework" reflects that challenge. It uses generic ("template") classes a lot to allow types like "collection of X", but is not great at handling "collection of mixed stuff" code. You can declare an ArrayList containing Object elements...
There is a whole hierarchy of collection interfaces that the algorithms are written against.
A set of reusable data structures
Searching, sorting, etc.
Per the Oracle docs:

Typical is to declare via:

 abstracttype<elem> var = new concretetype<elem>(...);
The actual Collection base interface mainly defines size(), isEmpty(), contains(o), iterator(), plus the ability to convert to/from other collections and/or arrays. They usually also have add(o) and remove() operation(s) of some kind.
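A small sketch of that declaration idiom and the basic operations (List/ArrayList chosen arbitrarily):

```java
import java.util.*;

// following the idiom: abstracttype<elem> var = new concretetype<elem>(...)
public class CollDemo {
    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        words.add("duck");
        words.add("mallard");
        System.out.println(words.size());            // 2
        System.out.println(words.contains("duck"));  // true
        words.remove("duck");
        System.out.println(words.isEmpty());         // false
    }
}
```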


Iterators

Iterable classes have an iterator() method that returns an Iterator object that keeps track of where it is in the original object and lets you walk through its elements. Mainly, Iterators provide a next() method to get the next element, and a hasNext() to say whether any elements remain.

I now have it on good authority that iterators can be used aggressively to implement full Unicon-style generators and goal-directed evaluation; they are just more long-winded and cumbersome to write.
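A minimal sketch of walking a collection by hand with its Iterator:

```java
import java.util.*;

// hasNext()/next() walk the list element by element
public class IterDemo {
    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3);
        Iterator<Integer> it = nums.iterator();
        int sum = 0;
        while (it.hasNext())
            sum += it.next();
        System.out.println(sum);  // 6
    }
}
```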


Ordered collections (Lists) know how to: sort, shuffle, reverse, rotate, swap, replaceAll, fill, copy, binarySearch... kind of obviously related to Lisp lists, but with several implementations available that have different performance strengths and weaknesses.
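A quick sketch of a few of those operations, via the static methods of the Collections utility class:

```java
import java.util.*;

public class SortDemo {
    public static void main(String[] args) {
        List<String> l =
            new ArrayList<String>(Arrays.asList("pear", "apple", "plum"));
        Collections.sort(l);
        System.out.println(l);                      // [apple, pear, plum]
        // binarySearch requires a sorted list
        System.out.println(Collections.binarySearch(l, "pear"));  // 1
        Collections.reverse(l);
        System.out.println(l.get(0));               // plum
    }
}
```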


Maps

Hash tables are one of the most important types in any "high level" language.

Notice that in order to initialize this "word frequency counter", you first do a m.get(), and if it is null you start the count at 1. Otherwise, you increment the count.

import java.util.*;
public class Freq {
    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        // Initialize frequency table from command line
        for (String a : args) {
            Integer freq = m.get(a);
            m.put(a, (freq == null) ? 1 : freq + 1);
        }
        System.out.println(m.size() + " distinct words:");
        System.out.println(m);
    }
}


"to look inside oneself" -- really in programming languages, it is the ability of an object to describe itself at runtime. C++ has the concept of "runtime type information" which is similar. In Java, any object can be asked its getClass() method, which returns a Class object that can cough up its fields, methods, etc. Consider the following example from http://www.cs.grinnell.edu/~rebelsky/Courses/CS223/2004F/Handouts/introspection.html
public static void summarize(Object o) throws Exception {
    Class c = o.getClass();
    System.out.println("Class: " + c.getName());
    Method[] methods = c.getMethods();
    System.out.println("  Methods: ");
    for (int i = 0; i < methods.length; i++) {
      System.out.print("    " + methods[i].toString());
      if (methods[i].getDeclaringClass() != c)
        System.out.println(" (inherited from " +
          methods[i].getDeclaringClass().getName() + ")");
      else
        System.out.println();
    }
} // summarize(Object)


JavaBeans

Just so you all have heard a bit about them, JavaBeans are reusable software components. They are just classes that follow a few conventions.
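A hypothetical sketch of those conventions: a Serializable class with a public no-argument constructor and get/set accessor pairs for each property (the WheelBean name and its spokes property are made up for illustration):

```java
import java.io.Serializable;

// JavaBean conventions: Serializable, a public no-arg constructor,
// private fields, and getX()/setX() accessor pairs defining properties
public class WheelBean implements Serializable {
    private int spokes;

    public WheelBean() { }                       // public no-arg constructor
    public int getSpokes() { return spokes; }    // property "spokes"
    public void setSpokes(int n) { spokes = n; }

    public static void main(String[] args) {
        WheelBean w = new WheelBean();
        w.setSpokes(32);
        System.out.println(w.getSpokes());  // 32
    }
}
```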


Applets

An Applet is a Java program that will run in a web browser.
import javax.swing.JApplet;
import javax.swing.SwingUtilities;
import javax.swing.JLabel;

public class HelloWorld extends JApplet {
    //Called when this applet is loaded into the browser.
    public void init() {
        //Execute a job on the event-dispatching thread: creating this applet's GUI.
        try {
            SwingUtilities.invokeAndWait(new Runnable() {
                public void run() {
                    JLabel lbl = new JLabel("Hello World");
                    add(lbl);
                }
            });
        } catch (Exception e) {
            System.err.println("createGUI didn't complete successfully");
        }
    }
}
In addition to the init() method, many applets will have start() and stop() methods to do any additional computation (such as launching/killing threads) other than responding to GUI clicks.

To deploy an applet, compile the code and package it as a JAR file. Then in your web page you write

<applet code=AppletClassName.class
        width=width height=height>
</applet>

lecture 36

Final Exam Review

Review language paradigms
Know what imperative, functional, declarative, object-oriented, and goal-directed languages are about.