# Introduction to GP¶

$Owing \,to \,this \,struggle \,for \,life,$
$any \,variation, \,however \,slight \,and \,from \,whatever \,cause \,proceeding,$
$if \,it \,be \,in \,any \,degree \,profitable \,to \,an \,individual \,of \,any \,species,$
$in \,its \,infinitely \,complex \,relations \,to \,other \,organic \,beings \,and \,to \,external \,nature,$
$will \,tend \,to \,the \,preservation \,of \,that \,individual,$
$and \,will \,generally \,be \,inherited \,by \,its \,offspring.$
$- \,Charles \,Darwin, \,On \,the \,Origin \,of \,Species \,(1859)$

gplearn extends the scikit-learn machine learning library to perform Genetic Programming (GP) with symbolic regression.

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.

Genetic programming is capable of taking a series of totally random programs, untrained and unaware of any given target function you might have had in mind, and making them breed, mutate and evolve their way towards the truth.

Think of genetic programming as a stochastic optimization process. Every time an initial population is conceived, and with every selection and evolution step in the process, random individuals from the current generation are selected to undergo random changes in order to enter the next. You can control this randomness by using the random_state parameter of the estimator.

So you’re skeptical. I hope so. Read on and discover the ways that the fittest programs in the population interact with one another to yield an even better generation.

## Representation¶

As mentioned already, GP seeks to find a mathematical formula to represent a relationship. Let’s use an arbitrary relationship as an example for the different ways that this could be written. Say we have two variables X0 and X1 that interact as follows to define a dependent variable y:

$y = X_0^{2} - 3 \times X_1 + 0.5$

This could be re-written as:

$y = X_0 \times X_0 - 3 \times X_1 + 0.5$

Or as a LISP symbolic expression (S-expression) representation which uses prefix-notation, and happens to be very common in GP, as:

$y = (+ (- (\times X_0 X_0) (\times 3 X_1)) 0.5)$

Or, since we’re working in python here, let’s express this as a numpy formula:

y = np.add(np.subtract(np.multiply(X0, X0), np.multiply(3., X1)), 0.5)


In each of these representations, we have a mix of variables, constants and functions. In this case we have the functions addition, subtraction, and multiplication. We also have the variables $$X_0$$ and $$X_1$$ and constants 3.0 and 0.5. Collectively, the variables and constants are known as terminals. Combined with the functions, the collection of available variables, constants and functions are known as the primitive set.

We could also represent this formula as a syntax tree, where the functions are interior nodes, shown in dark blue, and the variables and constants make up the leaves (or terminal nodes), shown in light blue:

Now you can see that the formula can be interpreted in a recursive manner. If we start with the left-hand leaves, we multiply $$X_0$$ and $$X_0$$ and that portion of the formula is evaluated by the subtraction operation (once the $$X_1 \times 3.0$$ portion is also evaluated). The result of the subtraction is then evaluated by the addition operation as we work up the syntax tree.

Importantly for GP the $$X_0 \times X_0$$ sub-expression, or sub-tree, can be replaced by any other valid expression that evaluates to a numerical answer, even if that is a constant. That sub-expression, and any larger one such as everything below the subtraction function, all reside adjacent to one another in the list-style representation, making replacement of these segments simple to do programatically.

A function has a property known as its arity. Arity, in a python functional sense, refers to the number of arguments that the function takes. In the cases above, all of the functions require two arguments, and thus have an arity of two. But other functions such as np.abs(), only require a single argument, and have an arity of 1.

Since we know the arity of all the available functions in our function set, we can actually simplify the S-expression and remove all of the parentheses:

$y = + - \times X_0 X_0 \times 3 X_1 0.5$

This could then be evaluated recursively, starting from the left and holding onto a stack which keeps track of how much cumulative arity needs to be satisfied by the terminal nodes.

Under the hood, gplearn’s representation is similar to this, and uses Python lists to store the functions and terminals. Constants are represented by floating point numbers, variables by integers and functions by a custom Function object.

In gplearn, the available function set is controlled by an arguments that is set when initializing an estimator. The default set is the arithmetic operators: addition, subtraction, division and multiplication. But you can also add in some transformers, comparison functions or trigonometric identities that are all built-in. These strings are put into the function_set argument to include them in your programs.

• ‘sub’ : subtraction, arity=2.
• ‘mul’ : multiplication, arity=2.
• ‘div’ : division, arity=2.
• ‘sqrt’ : square root, arity=1.
• ‘log’ : log, arity=1.
• ‘abs’ : absolute value, arity=1.
• ‘neg’ : negative, arity=1.
• ‘inv’ : inverse, arity=1.
• ‘max’ : maximum, arity=2.
• ‘min’ : minimum, arity=2.
• ‘sin’ : sine (radians), arity=1.
• ‘cos’ : cosine (radians), arity=1.
• ‘tan’ : tangent (radians), arity=1.

You should chose whether these functions are valid for your program.

You can also set up your own functions by using the functions.make_function() factory function which will create a gp-compatible function node that can be incorporated into your programs. See advanced use here.

## Fitness¶

Now that we can represent a formula as an executable program, we need to determine how well it performs. In a throwback to Darwin, in GP this measure is called a program’s fitness. If you have used machine learning before, you may be more familiar with terms such as “score”, “error” or “loss”. It’s basically the same thing, and as with those other machine learning terms, in GP we have to know whether the metric needs to be maximized or minimized in order to be able to select the best program in a group.

In gplearn, several metrics are available by setting the metric parameter.

For the SymbolicRegressor several common error metrics are available and the evolution process seeks to minimize them. The default is the magnitude of the error, ‘mean absolute error’. Other metrics available are:

• ‘mse’ for mean squared error, and
• ‘rmse’ for root mean squared error.

For the SymbolicTransformer, where indirect optimization is sought, the metrics are based on correlation between the program’s output and the target, these are maximized by the evolution process:

• ‘pearson’, for Pearson’s product-moment correlation coefficient (the default), and
• ‘spearman’ for Spearman’s rank-order correlation coefficient.

You can also set up your own fitness measures by using the fitness.make_fitness() factory function which will create a gp-compatible fitness function that can be used to evaluate your programs. See advanced use here.

Evaluating the fitness of all the programs in a population is probably the most expensive part of GP. In gplearn, you can parallelize this computation by using the n_jobs parameter to choose how many cores should work on it at once. If your dataset is small, the overhead of splitting the work over several cores is probably more than the benefit of the reduced work per core. This is because the work is parallelized per generation, so use this only if your dataset is large and the fitness calculation takes a long time.

## Closure¶

We have already discussed that the measure of a program’s fitness is through some function that evaluates the program’s predictions compared to some ground truth. But with functions like division in the function set, what happens if your denominator happens to be close to zero? In the case of zero division, or near-zero division in a computer program, the result happens to be an infinite quantity. So there goes your error for the entire test set, even if all other fitness samples were evaluated almost perfectly!

Thus, a critical component of rugged GP becomes apparent, we need to protect against such cases for functions that might break for certain arguments. Functions like division must be modified to be able to accept any input argument that could turn up to return a valid number at evaluation so that nodes higher up the tree can successfully evaluate their output.

In gplearn, several protected functions are used:

• division, if the denominator lies between -0.001 and 0.001, returns 1.0.
• square root returns the square root of the absolute value of the argument.
• log returns the square root of the absolute value of the argument, or for very small values less than 0.001, it returns 0.0.
• inverse, if the argument lies between -0.001 and 0.001, returns 0.0.

In this way, no matter the layout of the input data or structure of the evolved program, a valid numerical output can be guaranteed, even if we must sacrifice some interpretability to get there.

If you define your own functions, you will need to guard for this as well. The functions.make_function() factory function will perform some basic checks on your function to ensure it will guard against the most common invalid operations with negative or near-zero operations.

## Sufficiency¶

Another requirement of a successful GP run is called sufficiency. Basically, can this problem be solved to an adequate level with the functions and variables available.

For toy symbolic regression tasks like that solved in example 1, this is easy to ascertain. But in real life, things are less easy to quantify. It may be that there is a good solution lurking in that multi-dimensional space, but there were insufficient generations evolved, or bad luck turned the evolution process in the wrong direction. It may also be possible that no good relationship can be found through symbolic combinations of your variables.

In application, try to set the constant range to a value that will be helpful to get close to the target. For example, if you are trying to regress on a target with values from 500 – 1000 using variables in a range of 0 – 1, a constant of 0.5 is unlikely to help, and the “best” solution is probably just going to be large amounts of irrelevant additions to try and get close to the lower bound. Similarly, standardizing or scaling your variables and targets can make the problem much easier to learn in some cases.

If you are using the trigonometric functions, make sure to convert any degree angles into radians as well. If you don’t have any angles to convert, consider if these functions are useful for your problem, though a seasonal relationship might be discoverable if there is a temporal element in the data.

If you think that the problem requires a very large formula to approach a solution, start with a larger program depth. And if your dataset has a lot of variables, perhaps the “full” initialization method makes more sense to kick start the initial population with bigger programs that encompass more of the data than “grow” might yield.

## Initialization¶

When starting a GP run, the first generation is blissfully unaware that there is any fitness function that needs to be maximized. These initial naive programs are a totally random mix of the available functions and variables. But the user might know a little bit more about the problem before hand and give the evolution process a kick in the right direction in terms of the complexity of the problem at hand. Probably the biggest aspect that goes into this decision is the number of features in your dataset.

The first parameter to look at is the init_depth of the programs in the initial population. This controls the range of program sizes to initialize in the first generation (after that it’s up to evolution). init_depth accepts a tuple of two integers. When generating the initial population, a random maximum depth is chosen within this range for each individual, and the program is grown to satisfy this requirement. The default range of 2 – 6 is generally a good starting point, but if your dataset has a lot of variables, you may wish to make larger programs at first, if only to have more of them included in the initial population.

Next, you should consider population_size. This controls the number of programs generated in both the initial population and every generation following it. If you have very few variables, and have a limited function set, a smaller population size may suffice. If you have a lot of variables, or expect a very large program is required you may want to start with larger programs. More likely, the number of programs you wish to maintain will be constrained by the amount of time you want to spend evaluating them.

Finally, you need to decide on the init_method appropriate for your data. For all options, the root node is a function to avoid having degenerative programs representing only a single variable or constant in the initial population.

For the ‘grow’ method, nodes are chosen at random from both functions and terminals, allowing for smaller trees than init_depth allows. This tends to grow asymmetrical trees as terminals can be chosen before the max depth is reached. If your dataset has a lot of variables, this will likely result in much smaller programs that the init_depth range requests. Similarly, if you have very few variables and have chosen a lot of function sets, you will likely see programs approaching the maximum depth range in the population.

The ‘full’ method chooses nodes from the function set until the max depth is reached, and then terminals are chosen. This tends to grow ‘bushy’ symmetrical trees.

The default is the ‘half and half’ method. Program trees are grown through a 50/50 mix of ‘full’ and ‘grow’, making for a mix of tree shapes in the initial population. When combined with init_method='half and half' this yields the well-known ‘ramped half and half’ initialization method which seeds the population with lots of programs of different sizes and shapes, leading to a diverse mix of representations.

## Selection¶

Now that we have a population of programs, we need to decide which ones will get to evolve into the next generation. In gplearn this is done through tournaments. From the population, a smaller subset is selected at random to compete, the size of which is controlled by the tournament_size parameter. The fittest individual in this subset is then selected to move on to the next generation.

Having a large tournament size will generally find fitter programs more quickly and the evolution process will tend to converge to a solution in less time. A smaller tournament size will likely maintain more diversity in the population as more programs are given a chance to evolve and the population may find a better solution at the expense of taking longer. This is known as selection pressure, and your choice here may be governed by the computation time.

## Evolution¶

As discussed in the selection section, we use the fitness measure to find the fittest individual in the tournament to survive. But this individual does not just graduate unaltered to the next generation, genetic operations are performed on them. Several common genetic operations are supported by gplearn.

Crossover

Crossover is the principle method of mixing genetic material between individuals and is controlled by the p_crossover parameter. Unlike other genetic operations, it requires two tournaments to be run in order to find a parent and a donor.

Crossover takes the winner of a tournament and selects a random subtree from it to be replaced. A second tournament is performed to find a donor. The donor also has a subtree selected at random and this is inserted into the original parent to form an offspring in the next generation.

Subtree Mutation

Subtree mutation is one of the more aggressive mutation operators and is controlled by the p_subtree_mutation parameter. The reason it is more aggressive is that more genetic material can be replaced by totally naive random components. This can reintroduce lost functions and operators into the population to maintain diversity.

Subtree mutation takes the winner of a tournament and selects a random subtree from it to be replaced. A donor subtree is generated at random and this is inserted into the original parent to form an offspring in the next generation.

Hoist Mutation

Hoist mutation is a bloat-fighting mutation operation. It is controlled by the p_hoist_mutation parameter and solely removes genetic material from tournament winners.

Hoist mutation takes the winner of a tournament and selects a random subtree from it. A random subtree of that subtree is then selected and this is ‘hoisted’ into the original subtrees location to form an offspring in the next generation.

Point Mutation

Point mutation is probably the most common form of mutation operations in genetic programming. It can reintroduce lost functions and operators into the population to maintain diversity.

Point mutation takes the winner of a tournament and selects random nodes from it to be replaced. Terminals are replaced by other terminals and functions are replaced by other functions that require the same number of arguments as the original node. The resulting tree forms an offspring in the next generation.

Functions and terminals are randomly chosen for replacement as controlled by the p_point_replace parameter which guides the average amount of replacement to perform.

Reproduction

Should the sum of the above genetic operations’ probabilities, as set by their independent probabilities, be less than one, the balance of genetic operations shall fall back on reproduction. That is, a tournament winner is cloned and enters the next generation unmodified.

## Termination¶

There are two ways that the evolution process will stop. The first is that the maximum number of generations, controlled by the parameter generations, is reached. The second way is that at least one program in the population has a fitness that exceeds the parameter stopping_criteria, which defaults to being a perfect score. You may need to do a couple of test runs to determine what metric is possible if you are working with real-life data in order to set this value appropriately.

## Bloat¶

A program’s size can be measured in two ways: its depth and length. The depth of a program is the maximum distance from its root node to the furthest leaf node. A degenerative program with only a single value, ie y = X0, has a depth of zero. The length of a program is simply the number of elements in the formula, or the count of the number of nodes.

An interesting phenomena is often encountered in GP where the population sizes grow larger and larger with no significant improvement in fitness. This is known as bloat and leads to longer and longer computation times with little benefit to the solution.

Bloat can be fought in gplearn in several ways. The principle weapon is using a penalized fitness measure during selection where the fitness of an individual is made worse the larger it is. In this way, should there be two programs with identical fitness competing in a tournament, the smaller program will be selected and the larger one discarded. The parsimony_coefficient parameter controls this penalty and may need to be experimented with to get good performance. Too large a penalty and your smallest programs will tend to be selected regardless of their actual performance on the data, too small and bloat will continue unabated. The final winner of the evolution process is still chosen based on the unpenalized fitness, otherwise known as its raw fitness.

A recent paper introduced the covariant parsimony method which can be used by setting parsimony_coefficient='auto'. This method adapts the penalty depending on the relationship between program fitness and size in the population and will change from generation to generation.

Another method to fight bloat is by using genetic operations that make programs smaller. gplearn has hoist mutation which removes parts of programs during evolution. It can be controlled by the p_hoist_mutation parameter.

Finally, you could increase the amount of subsampling performed on your data to get more diverse looks at individual programs from smaller portions of the data. max_samples controls this rate and defaults to no subsampling. As a bonus, if you choose to subsample, you also get to see the “out of bag” fitness of the best program in the verbose reporter (activated by setting verbose=1). Hopefully this is pretty close to the in-sample fitness that is also reported.

## Transformer¶

The SymbolicTransformer works slightly differently to the SymbolicRegressor. While the regressor seeks to minimize the error between the programs’ outputs and the target variable based on an error metric, the transformer seeks an indirect relationship that can then be exploited by a second estimator. Essentially, this is automated feature engineering and can create powerful non-linear interactions that may be difficult to discover in conventional methods.

Where the regressor looks to minimize the direct error, the transformer looks to maximize the correlation between the predicted value and the target. This is done through either the Pearson product-moment correlation coefficient (the default) or the Spearman rank-order correlation coefficient. In both cases the absolute value of the correlation is maximized in order to accept strongly negatively correlated programs.

The Spearman correlation is appropriate if your next estimator is going to be tree-based estimator such as a Random Forest or Gradient Boosting Machine. If you plan to send the new transformed variables into a linear model, it is probably better to stick with the default Pearson correlation.

The SymbolicTransformer looks at the final generation of the evolution and picks the best programs to evaluate. The number of programs it will look at is controlled by the hall_of_fame parameter.

From the hall of fame, it will then whittle down the best programs to the least correlated amongst them as controlled by the n_components parameter. You may have the top two programs being almost identical, so this step removes that issue. The correlation between individuals within the hall of fame uses the same correlation method, Pearson or Spearman, as used by the evolution process.

Convinced?