Introduction to Genetic Programming

Introduction to Genetic Programming

Genetic programming addresses the problem of automatic programming, namely, the problem of how to enable a computer to do useful things without instructing it, step by step, on how to do it. (John Koza in Genetic Programming: An Introduction, p. vii)

Genetic Programming is a type of Machine Learning that builds programs via a simulated form of Darwin's evolution by natural selection. Starting with a code base I got from Toby Segaran's excellent book Collective Intelligence (which might be one of the best and briefest books on Machine Learning I've seen) I will layout how Genetic Programming works and then perform a couple of experiments to see if I can improve the starting code base. I hope to go further in future posts.

The actual code I used can be found in this Python Notebook on Google Colaboratory and you can even tweak and run the code yourself.

Darwin's Program

In Genetic Programming we have a hidden function we want to automatically write a program to solve. All of life is a function, so this might be anything: predicting chances of a heart attack, find the best strategy playing a game, etc. For our purposes, we're going to be using a simple hidden function (taken from Segaran's book):

X^2 + 2Y + 3X + 5

What we'll do is write a function to randomly feed various X and Y parameters into that function and get back the correct solution. These results will all be stored in a table. That table will look something like this:

Hidden Function Table

Of course in reality it will have a lot more rows. But the goal of our Genetic Programming algorithm is to build up a population of code that, over generations, gets better and better at predicting the correct result using the data in that table. With some luck, the Genetic Programming algorithm will find the actual hidden function and perfectly predict the results.

Our Toy Programming Language

For this experiment, we're going to stick with a very simple programming model that is made up of the following functions: add, multiply, if, >, subtract. In addition, it will be able to specify (for each of these functions) a parameter or a constant. While this is hardly a complete programming language, it is sufficient to find our hidden function and it should give you a pretty good idea how genetic programming works.

Programming Trees

Now you might be wondering how in the world can we evolve a program? If you're asking this, you probably have in mind typing a program into a text editor or Visual Studio and then randomly throwing statements together. Of course, if you did this, there would be primarily just bad syntax. To avoid this problem, we're going to use a trick where the program has to exist in a tree. So imagine a program that looks like this:

Programming Tree

This programming tree is equivalent to our hidden function and would perfectly predict each result without error.

Building a Population

So how do we actually evolve toward a tree like that? We start with a population of randomly generated trees. Most of them will be terrible — they'll wildly mispredict the results in the table. But "terrible" is still useful information. We score each program by running it against the data and measuring how far off its predictions are. This score is called the fitness of the program. Low error = high fitness.

Now here's where Darwin comes in.

We take the fittest programs and use them to breed the next generation. There are two main operations for doing this:

Mutation — Take a program tree and randomly change one of its nodes. Maybe swap a multiply for an add, or replace a constant with a parameter. This is the equivalent of a random genetic mutation.

Crossover — Take two program trees and swap a random subtree between them. This is the equivalent of sexual reproduction — combining the best features of two successful programs in hopes of producing something even better.

Run enough generations of selection, mutation, and crossover, and the population tends to converge on better and better solutions. Sometimes it finds the exact hidden function. Sometimes it finds a different function that happens to produce the same outputs for all the data in the table (a perfectly valid solution, even if it isn't the "original" one).

The Experiment

I ran the algorithm against our hidden function X^2 + 2Y + 3X + 5 with a population of 500 programs over 100 generations. The results were encouraging. Within a few dozen generations the best program in the population had dramatically reduced its error score, and by the end it had converged on a solution that perfectly predicted every row in the table.

You can see the exact results — and run the code yourself — in the Python Notebook. I'd encourage you to play with the population size and generation count and see how it affects both accuracy and speed.

What's Next

This is just the starting point. A few obvious questions worth exploring in future posts: Can we improve the algorithm's speed by being smarter about how we generate the initial population? Does a larger population always produce better results, or does it just produce slower ones? And perhaps most interestingly — how does Genetic Programming hold up against other Machine Learning approaches on harder problems?

All theories, including this algorithm, are only as good as the problems we test them against. So the next step is to make the problem harder.

SHARE


comments powered by Disqus

Follow Us

Latest Posts

subscribe to our newsletter