CS472/572
Evolutionary Computation

Project 2
Symbolic Regression

The symbolic regression problem requires finding a function, y = g(x), that fits a set of (x,y) data points (the training set) with minimal error. The set of points comes from a target function, y = f(x) (see figure below). Thus, we would like to find a g(x) that is equal to f(x). For this project you get to choose the target function f(x) and the training set of points. You should use enough training points to reasonably outline the target function (e.g. 3 training points is not enough to learn a sin function). However, the more training points you use, the slower the GP will run.

It is possible for a GP to find a function that perfectly matches the training points, but that is completely different from the target function. I.e. g(x) = f(x) only at the training points. This is referred to as overfitting and is not desirable. Instead we want a solution that is close to the target function, that is we want the GP to generalize from the training set to the target function.

When the GP is done running you should make sure the GP did generalize, that the evolved function is close to the target function. A simple way to do this is to plot the evolved function and the target function to see how closely the resemble each other. Alternatively, you can pick a different set of (x,y) pairs from the target function and see how well the evolved function matches those pairs. This is referred to as a test set. If the evolved function is close to the target function the error should be low on the test set.

Often the fitness function for symbolic regression problems is the sum of the absolute value of the error at each tst point. Alternatively, the fitness is the square root of the sum ofthe errors squared. The function set consists of mathematical operators such as +, -, *, and /. It also may help to add a conditional statement such as an if. The terminal set usually consists of the variable x and constants. You should chose a target function that is not too easy to find for the function set. For example, if the target function is sin(x), the function set should not include sin.

Remember the closure requirement for function sets: a function must be able to gracefully handle all possible inputs. For example, the division function must be able to handle being passed a zero divisor.

There are several standard variations that can be used with this problem. The points in the training set can be changed each generation. Note that the target function does not change, only which points from the target function are used. This can increase generalization, because it is much harder to overfit a constantly changing training set.

Noise can be added to the training points. That is each (x,y) pair is calculated and then a small random value is added to the y value. This randomness is the noise and it reflects real world problems where the information is not always perfect.

This symbolic regression problem corresponds to many real world problems in which we want to find a function to describe a limited set of actual data. The function may predict future data.