API reference

Symbolic Regressor

class gplearn.genetic.SymbolicRegressor(*, population_size=1000, generations=20, tournament_size=20, stopping_criteria=0.0, const_range=(-1.0, 1.0), init_depth=(2, 6), init_method='half and half', function_set=('add', 'sub', 'mul', 'div'), metric='mean absolute error', parsimony_coefficient=0.001, p_crossover=0.9, p_subtree_mutation=0.01, p_hoist_mutation=0.01, p_point_mutation=0.01, p_point_replace=0.05, max_samples=1.0, feature_names=None, warm_start=False, low_memory=False, n_jobs=1, verbose=0, random_state=None)[source]

A Genetic Programming symbolic regressor.

A symbolic regressor is an estimator that begins by building a population of naive random formulas to represent a relationship. The formulas are represented as tree-like structures with mathematical functions being recursively applied to variables and constants. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations such as crossover, mutation or reproduction.
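A minimal usage sketch (illustrative, not part of the formal reference): evolve a program for a toy target and inspect the winning formula. The data, target and hyperparameters below are arbitrary choices; _program is the fitted estimator's internal attribute holding the best program.

    import numpy as np
    from gplearn.genetic import SymbolicRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-1, 1, (200, 2))
    y = X[:, 0] ** 2 - X[:, 1] + 0.5   # hypothetical ground-truth relationship

    est = SymbolicRegressor(population_size=1000, generations=10,
                            stopping_criteria=0.01, random_state=0)
    est.fit(X, y)
    print(est._program)        # string form of the best evolved program
    print(est.predict(X[:5]))  # predictions from that program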

Parameters
population_size : integer, optional (default=1000)

The number of programs in each generation.

generations : integer, optional (default=20)

The number of generations to evolve.

tournament_size : integer, optional (default=20)

The number of programs that will compete to become part of the next generation.

stopping_criteria : float, optional (default=0.0)

The metric value required to stop evolution early.

const_range : tuple of two floats, or None, optional (default=(-1., 1.))

The range of constants to include in the formulas. If None then no constants will be included in the candidate programs.

init_depth : tuple of two ints, optional (default=(2, 6))

The range of tree depths for the initial population of naive formulas. Individual trees will randomly choose a maximum depth from this range. When combined with init_method=’half and half’ this yields the well-known ‘ramped half and half’ initialization method.

init_method : str, optional (default=’half and half’)
  • ‘grow’ : Nodes are chosen at random from both functions and terminals, allowing for smaller trees than init_depth allows. Tends to grow asymmetrical trees.

  • ‘full’ : Functions are chosen until the init_depth is reached, and then terminals are selected. Tends to grow ‘bushy’ trees.

  • ‘half and half’ : Trees are grown through a 50/50 mix of ‘full’ and ‘grow’, making for a mix of tree shapes in the initial population.

function_set : iterable, optional (default=(‘add’, ‘sub’, ‘mul’, ‘div’))

The functions to use when building and evolving programs. This iterable can include strings naming the built-in functions outlined below, or your own functions built using the make_function factory from the functions module.

Available individual functions are:

  • ‘add’ : addition, arity=2.

  • ‘sub’ : subtraction, arity=2.

  • ‘mul’ : multiplication, arity=2.

  • ‘div’ : protected division where a near-zero denominator returns 1., arity=2.

  • ‘sqrt’ : protected square root where the absolute value of the argument is used, arity=1.

  • ‘log’ : protected log where the absolute value of the argument is used and a near-zero argument returns 0., arity=1.

  • ‘abs’ : absolute value, arity=1.

  • ‘neg’ : negative, arity=1.

  • ‘inv’ : protected inverse where a near-zero argument returns 0., arity=1.

  • ‘max’ : maximum, arity=2.

  • ‘min’ : minimum, arity=2.

  • ‘sin’ : sine (radians), arity=1.

  • ‘cos’ : cosine (radians), arity=1.

  • ‘tan’ : tangent (radians), arity=1.

metric : str, optional (default=’mean absolute error’)

The name of the raw fitness metric. Available options include:

  • ‘mean absolute error’.

  • ‘mse’ for mean squared error.

  • ‘rmse’ for root mean squared error.

  • ‘pearson’ for Pearson’s product-moment correlation coefficient.

  • ‘spearman’ for Spearman’s rank-order correlation coefficient.

Note that ‘pearson’ and ‘spearman’ will not directly predict the target but could be useful as value-added features in a second-step estimator. This approach allows the user to generate one engineered feature at a time; using the SymbolicTransformer instead would allow creation of multiple features at once.

parsimony_coefficient : float or “auto”, optional (default=0.001)

This constant penalizes large programs by adjusting their fitness to be less favorable for selection. Larger values penalize programs more heavily, which can control the phenomenon known as ‘bloat’. Bloat occurs when evolution increases the size of programs without a significant increase in fitness, which is costly for computation time and makes for a less understandable final result. This parameter may need to be tuned over successive runs.

If “auto” the parsimony coefficient is recalculated for each generation using c = Cov(l, f) / Var(l), where Cov(l, f) is the covariance between program size l and program fitness f in the population, and Var(l) is the variance of program sizes.
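In NumPy terms, that calculation looks roughly like the sketch below (the degrees-of-freedom conventions used in the library's internals may differ):

    import numpy as np

    lengths = np.array([12., 7., 30., 18., 9.])         # hypothetical program sizes l
    fitness = np.array([0.80, 0.90, 0.40, 0.60, 0.85])  # hypothetical raw fitnesses f

    c = np.cov(lengths, fitness)[0, 1] / np.var(lengths)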

p_crossover : float, optional (default=0.9)

The probability of performing crossover on a tournament winner. Crossover takes the winner of a tournament and selects a random subtree from it to be replaced. A second tournament is performed to find a donor. The donor also has a subtree selected at random and this is inserted into the original parent to form an offspring in the next generation.

p_subtree_mutation : float, optional (default=0.01)

The probability of performing subtree mutation on a tournament winner. Subtree mutation takes the winner of a tournament and selects a random subtree from it to be replaced. A donor subtree is generated at random and this is inserted into the original parent to form an offspring in the next generation.

p_hoist_mutation : float, optional (default=0.01)

The probability of performing hoist mutation on a tournament winner. Hoist mutation takes the winner of a tournament and selects a random subtree from it. A random subtree of that subtree is then selected and this is ‘hoisted’ into the original subtree’s location to form an offspring in the next generation. This method helps to control bloat.

p_point_mutation : float, optional (default=0.01)

The probability of performing point mutation on a tournament winner. Point mutation takes the winner of a tournament and selects random nodes from it to be replaced. Terminals are replaced by other terminals and functions are replaced by other functions that require the same number of arguments as the original node. The resulting tree forms an offspring in the next generation.

Note: The above genetic operation probabilities must sum to less than one. The balance of probability is assigned to ‘reproduction’, where a tournament winner is cloned and enters the next generation unmodified.
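A quick sanity check of that constraint with the default settings (a sketch, not library code):

    p_crossover, p_subtree, p_hoist, p_point = 0.9, 0.01, 0.01, 0.01
    p_reproduction = 1.0 - (p_crossover + p_subtree + p_hoist + p_point)
    assert p_reproduction > 0.0   # the remaining ~0.07 goes to reproduction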

p_point_replace : float, optional (default=0.05)

For point mutation only, the probability that any given node will be mutated.

max_samples : float, optional (default=1.0)

The fraction of samples to draw from X to evaluate each program on.

feature_names : list, optional (default=None)

Optional list of feature names, used purely for representations in the print operation or export_graphviz. If None, then X0, X1, etc. will be used for representations.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more generations to the evolution; otherwise, just fit a new evolution.

low_memory : bool, optional (default=False)

When set to True, only the current generation is retained. Parent information is discarded. For very large populations or runs with many generations, this can result in substantial memory use reduction.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for fit. If -1, then the number of jobs is set to the number of cores.

verbose : int, optional (default=0)

Controls the verbosity of the evolution building process.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

References

1. J. Koza, “Genetic Programming”, 1992.

2. R. Poli, et al., “A Field Guide to Genetic Programming”, 2008.

Attributes
run_details_ : dict

Details of the evolution process, recorded per generation (see the sketch following this list). Includes the following elements:

  • ‘generation’ : The generation index.

  • ‘average_length’ : The average program length of the generation.

  • ‘average_fitness’ : The average program fitness of the generation.

  • ‘best_length’ : The length of the best program in the generation.

  • ‘best_fitness’ : The fitness of the best program in the generation.

  • ‘best_oob_fitness’ : The out of bag fitness of the best program in the generation (requires max_samples < 1.0).

  • ‘generation_time’ : The time it took for the generation to evolve.
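Since each element maps to a per-generation list, the dict can be tabulated directly. A sketch, assuming pandas is available and est is a fitted estimator such as the one above:

    import pandas as pd

    details = pd.DataFrame(est.run_details_)   # one row per generation
    print(details[['generation', 'best_fitness', 'best_length']])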

fit(X, y, sample_weight=None)

Fit the Genetic Program according to X, y.

Parameters
X : array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

sample_weight : array-like, shape = [n_samples], optional

Weights applied to individual samples.

Returns
self : object

Returns self.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

predict(X)[source]

Perform regression on test vectors X.

Parameters
X : array-like, shape = [n_samples, n_features]

Input vectors, where n_samples is the number of samples and n_features is the number of features.

Returns
y : array, shape = [n_samples]

Predicted values for X.

score(X, y, sample_weight=None)

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an \(R^2\) score of 0.0.
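Spelled out in NumPy terms, the definition above reads as follows (a sketch, not the scikit-learn implementation):

    import numpy as np

    def r2_sketch(y_true, y_pred):
        u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
        v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
        return 1.0 - u / v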

Parameters
X : array-like of shape (n_samples, n_features)

Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead, with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True values for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

\(R^2\) of self.predict(X) w.r.t. y.

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with the default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

Symbolic Classifier

class gplearn.genetic.SymbolicClassifier(*, population_size=1000, generations=20, tournament_size=20, stopping_criteria=0.0, const_range=(-1.0, 1.0), init_depth=(2, 6), init_method='half and half', function_set=('add', 'sub', 'mul', 'div'), transformer='sigmoid', metric='log loss', parsimony_coefficient=0.001, p_crossover=0.9, p_subtree_mutation=0.01, p_hoist_mutation=0.01, p_point_mutation=0.01, p_point_replace=0.05, max_samples=1.0, class_weight=None, feature_names=None, warm_start=False, low_memory=False, n_jobs=1, verbose=0, random_state=None)[source]

A Genetic Programming symbolic classifier.

A symbolic classifier is an estimator that begins by building a population of naive random formulas to represent a relationship. The formulas are represented as tree-like structures with mathematical functions being recursively applied to variables and constants. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations such as crossover, mutation or reproduction.
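A minimal usage sketch (illustrative; the dataset and hyperparameters are arbitrary choices, not prescribed by this reference):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from gplearn.genetic import SymbolicClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SymbolicClassifier(parsimony_coefficient=0.01, random_state=1)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))      # mean accuracy
    print(clf.predict_proba(X_test[:3]))  # one row per sample, one column per class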

Parameters
population_size : integer, optional (default=1000)

The number of programs in each generation.

generations : integer, optional (default=20)

The number of generations to evolve.

tournament_size : integer, optional (default=20)

The number of programs that will compete to become part of the next generation.

stopping_criteria : float, optional (default=0.0)

The metric value required to stop evolution early.

const_range : tuple of two floats, or None, optional (default=(-1., 1.))

The range of constants to include in the formulas. If None then no constants will be included in the candidate programs.

init_depth : tuple of two ints, optional (default=(2, 6))

The range of tree depths for the initial population of naive formulas. Individual trees will randomly choose a maximum depth from this range. When combined with init_method=’half and half’ this yields the well-known ‘ramped half and half’ initialization method.

init_method : str, optional (default=’half and half’)
  • ‘grow’ : Nodes are chosen at random from both functions and terminals, allowing for smaller trees than init_depth allows. Tends to grow asymmetrical trees.

  • ‘full’ : Functions are chosen until the init_depth is reached, and then terminals are selected. Tends to grow ‘bushy’ trees.

  • ‘half and half’ : Trees are grown through a 50/50 mix of ‘full’ and ‘grow’, making for a mix of tree shapes in the initial population.

function_set : iterable, optional (default=(‘add’, ‘sub’, ‘mul’, ‘div’))

The functions to use when building and evolving programs. This iterable can include strings naming the built-in functions outlined below, or your own functions built using the make_function factory from the functions module.

Available individual functions are:

  • ‘add’ : addition, arity=2.

  • ‘sub’ : subtraction, arity=2.

  • ‘mul’ : multiplication, arity=2.

  • ‘div’ : protected division where a near-zero denominator returns 1., arity=2.

  • ‘sqrt’ : protected square root where the absolute value of the argument is used, arity=1.

  • ‘log’ : protected log where the absolute value of the argument is used and a near-zero argument returns 0., arity=1.

  • ‘abs’ : absolute value, arity=1.

  • ‘neg’ : negative, arity=1.

  • ‘inv’ : protected inverse where a near-zero argument returns 0., arity=1.

  • ‘max’ : maximum, arity=2.

  • ‘min’ : minimum, arity=2.

  • ‘sin’ : sine (radians), arity=1.

  • ‘cos’ : cosine (radians), arity=1.

  • ‘tan’ : tangent (radians), arity=1.

transformer : str, optional (default=’sigmoid’)

The name of the function through which the raw decision function is passed. This function will transform the raw decision function into probabilities of each class.

The transformer can also be replaced by your own function, built using the make_function factory from the functions module.

metric : str, optional (default=’log loss’)

The name of the raw fitness metric. Available options include:

  • ‘log loss’, also known as binary cross-entropy loss.

parsimony_coefficient : float or “auto”, optional (default=0.001)

This constant penalizes large programs by adjusting their fitness to be less favorable for selection. Larger values penalize programs more heavily, which can control the phenomenon known as ‘bloat’. Bloat occurs when evolution increases the size of programs without a significant increase in fitness, which is costly for computation time and makes for a less understandable final result. This parameter may need to be tuned over successive runs.

If “auto” the parsimony coefficient is recalculated for each generation using c = Cov(l, f) / Var(l), where Cov(l, f) is the covariance between program size l and program fitness f in the population, and Var(l) is the variance of program sizes.

p_crossover : float, optional (default=0.9)

The probability of performing crossover on a tournament winner. Crossover takes the winner of a tournament and selects a random subtree from it to be replaced. A second tournament is performed to find a donor. The donor also has a subtree selected at random and this is inserted into the original parent to form an offspring in the next generation.

p_subtree_mutation : float, optional (default=0.01)

The probability of performing subtree mutation on a tournament winner. Subtree mutation takes the winner of a tournament and selects a random subtree from it to be replaced. A donor subtree is generated at random and this is inserted into the original parent to form an offspring in the next generation.

p_hoist_mutation : float, optional (default=0.01)

The probability of performing hoist mutation on a tournament winner. Hoist mutation takes the winner of a tournament and selects a random subtree from it. A random subtree of that subtree is then selected and this is ‘hoisted’ into the original subtree’s location to form an offspring in the next generation. This method helps to control bloat.

p_point_mutation : float, optional (default=0.01)

The probability of performing point mutation on a tournament winner. Point mutation takes the winner of a tournament and selects random nodes from it to be replaced. Terminals are replaced by other terminals and functions are replaced by other functions that require the same number of arguments as the original node. The resulting tree forms an offspring in the next generation.

Note: The above genetic operation probabilities must sum to less than one. The balance of probability is assigned to ‘reproduction’, where a tournament winner is cloned and enters the next generation unmodified.

p_point_replace : float, optional (default=0.05)

For point mutation only, the probability that any given node will be mutated.

max_samples : float, optional (default=1.0)

The fraction of samples to draw from X to evaluate each program on.

class_weight : dict, ‘balanced’ or None, optional (default=None)

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
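The formula is easy to verify by hand; a sketch with hypothetical labels:

    import numpy as np

    y = np.array([0, 0, 0, 1])   # hypothetical labels: class 1 is rare
    weights = len(y) / (len(np.unique(y)) * np.bincount(y))
    # array([0.6667, 2.0]): the rare class is up-weighted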

feature_names : list, optional (default=None)

Optional list of feature names, used purely for representations in the print operation or export_graphviz. If None, then X0, X1, etc. will be used for representations.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more generations to the evolution; otherwise, just fit a new evolution.

low_memory : bool, optional (default=False)

When set to True, only the current generation is retained. Parent information is discarded. For very large populations or runs with many generations, this can result in substantial memory use reduction.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for fit. If -1, then the number of jobs is set to the number of cores.

verbose : int, optional (default=0)

Controls the verbosity of the evolution building process.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

References

1. J. Koza, “Genetic Programming”, 1992.

2. R. Poli, et al., “A Field Guide to Genetic Programming”, 2008.

Attributes
run_details_ : dict

Details of the evolution process. Includes the following elements:

  • ‘generation’ : The generation index.

  • ‘average_length’ : The average program length of the generation.

  • ‘average_fitness’ : The average program fitness of the generation.

  • ‘best_length’ : The length of the best program in the generation.

  • ‘best_fitness’ : The fitness of the best program in the generation.

  • ‘best_oob_fitness’ : The out of bag fitness of the best program in the generation (requires max_samples < 1.0).

  • ‘generation_time’ : The time it took for the generation to evolve.

fit(X, y, sample_weight=None)

Fit the Genetic Program according to X, y.

Parameters
X : array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

sample_weight : array-like, shape = [n_samples], optional

Weights applied to individual samples.

Returns
self : object

Returns self.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

predict(X)[source]

Predict classes on test vectors X.

Parameters
X : array-like, shape = [n_samples, n_features]

Input vectors, where n_samples is the number of samples and n_features is the number of features.

Returns
y : array, shape = [n_samples]

The predicted classes of the input samples.

predict_proba(X)[source]

Predict probabilities on test vectors X.

Parameters
X : array-like, shape = [n_samples, n_features]

Input vectors, where n_samples is the number of samples and n_features is the number of features.

Returns
proba : array, shape = [n_samples, n_classes]

The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters
X : array-like of shape (n_samples, n_features)

Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)

True labels for X.

sample_weight : array-like of shape (n_samples,), default=None

Sample weights.

Returns
score : float

Mean accuracy of self.predict(X) w.r.t. y.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

Symbolic Transformer

class gplearn.genetic.SymbolicTransformer(*, population_size=1000, hall_of_fame=100, n_components=10, generations=20, tournament_size=20, stopping_criteria=1.0, const_range=(-1.0, 1.0), init_depth=(2, 6), init_method='half and half', function_set=('add', 'sub', 'mul', 'div'), metric='pearson', parsimony_coefficient=0.001, p_crossover=0.9, p_subtree_mutation=0.01, p_hoist_mutation=0.01, p_point_mutation=0.01, p_point_replace=0.05, max_samples=1.0, feature_names=None, warm_start=False, low_memory=False, n_jobs=1, verbose=0, random_state=None)[source]

A Genetic Programming symbolic transformer.

A symbolic transformer is a supervised transformer that begins by building a population of naive random formulas to represent a relationship. The formulas are represented as tree-like structures with mathematical functions being recursively applied to variables and constants. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations such as crossover, mutation or reproduction. The final population is searched for the fittest individuals with the least correlation to one another.
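A minimal usage sketch (illustrative; the dataset, downstream model and settings are arbitrary choices): generate engineered features and stack them beside the originals for a linear model.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from gplearn.genetic import SymbolicTransformer

    X, y = load_diabetes(return_X_y=True)

    gp = SymbolicTransformer(n_components=10, generations=10, random_state=0)
    gp_features = gp.fit_transform(X, y)  # shape (n_samples, 10)

    new_X = np.hstack((X, gp_features))   # engineered features beside the originals
    print(Ridge().fit(new_X, y).score(new_X, y))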

Parameters
population_size : integer, optional (default=1000)

The number of programs in each generation.

hall_of_fame : integer, or None, optional (default=100)

The number of fittest programs to consider when searching for the least-correlated individuals for the n_components. If None, the entire final generation will be used.

n_components : integer, or None, optional (default=10)

The number of best programs to return after searching the hall_of_fame for the least-correlated individuals. If None, the entire hall_of_fame will be used.

generations : integer, optional (default=20)

The number of generations to evolve.

tournament_size : integer, optional (default=20)

The number of programs that will compete to become part of the next generation.

stopping_criteria : float, optional (default=1.0)

The metric value required to stop evolution early.

const_range : tuple of two floats, or None, optional (default=(-1., 1.))

The range of constants to include in the formulas. If None then no constants will be included in the candidate programs.

init_depth : tuple of two ints, optional (default=(2, 6))

The range of tree depths for the initial population of naive formulas. Individual trees will randomly choose a maximum depth from this range. When combined with init_method=’half and half’ this yields the well-known ‘ramped half and half’ initialization method.

init_method : str, optional (default=’half and half’)
  • ‘grow’ : Nodes are chosen at random from both functions and terminals, allowing for smaller trees than init_depth allows. Tends to grow asymmetrical trees.

  • ‘full’ : Functions are chosen until the init_depth is reached, and then terminals are selected. Tends to grow ‘bushy’ trees.

  • ‘half and half’ : Trees are grown through a 50/50 mix of ‘full’ and ‘grow’, making for a mix of tree shapes in the initial population.

function_set : iterable, optional (default=(‘add’, ‘sub’, ‘mul’, ‘div’))

The functions to use when building and evolving programs. This iterable can include strings naming the built-in functions outlined below, or your own functions built using the make_function factory from the functions module.

Available individual functions are:

  • ‘add’ : addition, arity=2.

  • ‘sub’ : subtraction, arity=2.

  • ‘mul’ : multiplication, arity=2.

  • ‘div’ : protected division where a near-zero denominator returns 1., arity=2.

  • ‘sqrt’ : protected square root where the absolute value of the argument is used, arity=1.

  • ‘log’ : protected log where the absolute value of the argument is used and a near-zero argument returns 0., arity=1.

  • ‘abs’ : absolute value, arity=1.

  • ‘neg’ : negative, arity=1.

  • ‘inv’ : protected inverse where a near-zero argument returns 0., arity=1.

  • ‘max’ : maximum, arity=2.

  • ‘min’ : minimum, arity=2.

  • ‘sin’ : sine (radians), arity=1.

  • ‘cos’ : cosine (radians), arity=1.

  • ‘tan’ : tangent (radians), arity=1.

metric : str, optional (default=’pearson’)

The name of the raw fitness metric. Available options include:

  • ‘pearson’ for Pearson’s product-moment correlation coefficient.

  • ‘spearman’ for Spearman’s rank-order correlation coefficient.

parsimony_coefficient : float or “auto”, optional (default=0.001)

This constant penalizes large programs by adjusting their fitness to be less favorable for selection. Larger values penalize programs more heavily, which can control the phenomenon known as ‘bloat’. Bloat occurs when evolution increases the size of programs without a significant increase in fitness, which is costly for computation time and makes for a less understandable final result. This parameter may need to be tuned over successive runs.

If “auto” the parsimony coefficient is recalculated for each generation using c = Cov(l, f) / Var(l), where Cov(l, f) is the covariance between program size l and program fitness f in the population, and Var(l) is the variance of program sizes.

p_crossover : float, optional (default=0.9)

The probability of performing crossover on a tournament winner. Crossover takes the winner of a tournament and selects a random subtree from it to be replaced. A second tournament is performed to find a donor. The donor also has a subtree selected at random and this is inserted into the original parent to form an offspring in the next generation.

p_subtree_mutation : float, optional (default=0.01)

The probability of performing subtree mutation on a tournament winner. Subtree mutation takes the winner of a tournament and selects a random subtree from it to be replaced. A donor subtree is generated at random and this is inserted into the original parent to form an offspring in the next generation.

p_hoist_mutation : float, optional (default=0.01)

The probability of performing hoist mutation on a tournament winner. Hoist mutation takes the winner of a tournament and selects a random subtree from it. A random subtree of that subtree is then selected and this is ‘hoisted’ into the original subtree’s location to form an offspring in the next generation. This method helps to control bloat.

p_point_mutation : float, optional (default=0.01)

The probability of performing point mutation on a tournament winner. Point mutation takes the winner of a tournament and selects random nodes from it to be replaced. Terminals are replaced by other terminals and functions are replaced by other functions that require the same number of arguments as the original node. The resulting tree forms an offspring in the next generation.

Note: The above genetic operation probabilities must sum to less than one. The balance of probability is assigned to ‘reproduction’, where a tournament winner is cloned and enters the next generation unmodified.

p_point_replace : float, optional (default=0.05)

For point mutation only, the probability that any given node will be mutated.

max_samples : float, optional (default=1.0)

The fraction of samples to draw from X to evaluate each program on.

feature_names : list, optional (default=None)

Optional list of feature names, used purely for representations in the print operation or export_graphviz. If None, then X0, X1, etc. will be used for representations.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more generations to the evolution; otherwise, just fit a new evolution.

low_memory : bool, optional (default=False)

When set to True, only the current generation is retained. Parent information is discarded. For very large populations or runs with many generations, this can result in substantial memory use reduction.

n_jobs : integer, optional (default=1)

The number of jobs to run in parallel for fit. If -1, then the number of jobs is set to the number of cores.

verbose : int, optional (default=0)

Controls the verbosity of the evolution building process.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

References

1. J. Koza, “Genetic Programming”, 1992.

2. R. Poli, et al., “A Field Guide to Genetic Programming”, 2008.

Attributes
run_details_ : dict

Details of the evolution process. Includes the following elements:

  • ‘generation’ : The generation index.

  • ‘average_length’ : The average program length of the generation.

  • ‘average_fitness’ : The average program fitness of the generation.

  • ‘best_length’ : The length of the best program in the generation.

  • ‘best_fitness’ : The fitness of the best program in the generation.

  • ‘best_oob_fitness’ : The out of bag fitness of the best program in the generation (requires max_samples < 1.0).

  • ‘generation_time’ : The time it took for the generation to evolve.

fit(X, y, sample_weight=None)

Fit the Genetic Program according to X, y.

Parameters
X : array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

sample_weight : array-like, shape = [n_samples], optional

Weights applied to individual samples.

Returns
self : object

Returns self.

fit_transform(X, y, sample_weight=None)[source]

Fit to data, then transform it.

Parameters
X : array-like, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]

Target values.

sample_weight : array-like, shape = [n_samples], optional

Weights applied to individual samples.

Returns
X_new : array-like, shape = [n_samples, n_components]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params : dict

Estimator parameters.

Returns
self : estimator instance

Estimator instance.

transform(X)[source]

Transform X according to the fitted transformer.

Parameters
X : array-like, shape = [n_samples, n_features]

Input vectors, where n_samples is the number of samples and n_features is the number of features.

Returns
X_new : array-like, shape = [n_samples, n_components]

Transformed array.

User-Defined Functions

gplearn.functions.make_function(*, function, name, arity, wrap=True)[source]

Make a function node, a representation of a mathematical relationship.

This factory function creates a function node, one of the core nodes in any program. The resulting object can be called with NumPy vectorized arguments and returns a vector based on a mathematical relationship.

Parameters
function : callable

A function with signature function(x1, *args) that returns a NumPy array of the same shape as its arguments.

name : str

The name for the function as it should be represented in the program and its visualizations.

arity : int

The number of arguments that the function takes.

wrap : bool, optional (default=True)

When running in parallel, pickling of custom functions is not supported by Python’s default pickler. This option will wrap the function using cloudpickle, allowing you to pickle your solution, but the evolution may run slightly more slowly. If you are running single-threaded in an interactive Python session or have no need to save the model, set to False for faster runs.
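A sketch of the factory in use: a protected exponential whose clipping scheme is an assumption of this example, not something gplearn prescribes.

    import numpy as np
    from gplearn.functions import make_function

    def _protected_exp(x):
        # clip to avoid overflow; the bound of 100 is an arbitrary choice
        return np.exp(np.clip(x, -100., 100.))

    exp_fn = make_function(function=_protected_exp, name='exp', arity=1)

    # The resulting node can be mixed with the built-in names, e.g.:
    # SymbolicRegressor(function_set=('add', 'sub', 'mul', 'div', exp_fn), ...)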

User-Defined Fitness Metrics

gplearn.fitness.make_fitness(*, function, greater_is_better, wrap=True)[source]

Make a fitness measure, a metric scoring the quality of a program’s fit.

This factory function creates a fitness measure object which measures the quality of a program’s fit and thus its likelihood to undergo genetic operations into the next generation. The resulting object can be called with NumPy vectorized arguments and returns a floating point score quantifying the quality of the program’s representation of the true relationship.

Parameters
function : callable

A function with signature function(y, y_pred, sample_weight) that returns a floating point number, where y is the input target vector, y_pred is the predicted values from the genetic program, and sample_weight is the sample weight vector.

greater_is_better : bool

Whether a higher value from function indicates a better fit. In general this would be False for metrics indicating the magnitude of the error, and True for metrics indicating the quality of fit.

wrap : bool, optional (default=True)

When running in parallel, pickling of custom metrics is not supported by Python’s default pickler. This option will wrap the function using cloudpickle, allowing you to pickle your solution, but the evolution may run slightly more slowly. If you are running single-threaded in an interactive Python session or have no need to save the model, set to False for faster runs.
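A sketch of the factory in use: a mean absolute percentage error metric, where lower is better and so greater_is_better=False (the metric choice is illustrative).

    import numpy as np
    from gplearn.fitness import make_fitness

    def _mape(y, y_pred, w):
        # weighted mean absolute percentage error; assumes y contains no zeros
        diffs = np.abs((y - y_pred) / y)
        return 100. * np.average(diffs, weights=w)

    mape = make_fitness(function=_mape, greater_is_better=False)

    # Usable wherever a metric is accepted, e.g.:
    # SymbolicRegressor(metric=mape, ...)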