Many common applications of predictive analytics, from customer segmentation to medical diagnosis, arise from complex relationships between features (also called variables or characteristics).
Feature selection is the process of finding the most relevant variables for a predictive model. These techniques can be used to identify and remove unneeded, irrelevant and redundant features that do not contribute or decrease the accuracy of the predictive model.
Mathematically, feature selection is formulated as a combinatorial optimization problem. Here the function to optimize is the generalization performance of the predictive model, represented by the error on a selection data set. The design variables are the inclusion (1) or the exclusion (0) of the features.
An exhaustive selection of features would evaluate lots of different combinations (2^{N}, where N is the number of features). This process requires lots of computational work and, if the number of features is big, becomes impracticable. Therefore, we need intelligent methods that allow to perform feature selection in practice.
One of the most advanced algorithms for feature selection is the genetic algorithm. This is a stochastic method for function optimization based on the mechanics of natural genetics and biological evolution. In this article we show how genetic algorithms can be applied to optimize the performance of a predictive model, by selecting the most relevant features.
In nature, the genes of organisms tend to evolve over successive generations to better adapt to the environment. The Genetic Algorithm is an heuristic optimization method inspired by that procedures of natural evolution.
In feature selection, the function to optimize is the generalization performance of a predictive model. More specifically, we want to minimize the error of the model on an independent data set not used to create the model. This function is called the selection error. The design variables are the presence (1) or absence (0) of every possible feature in the model.
Genetic algorithms operate on a population of individuals to produce better and better approximations. At each generation, a new population is created by the process of selecting individuals according to their level of fitness in the problem domain, and recombining them together using operators borrowed from natural genetics. The offspring might also undergo mutation. This process leads to the evolution of populations of individuals that are better suited to their environment than the individuals that they were created from, just as in natural adaptation. A state diagram for the training process with the genetic algorithm is depicted next.
In our case, each individual in the population represents a predictive model. The number of genes is the total number of features in the data set. Genes here are binary values, and represent the inclusion or not of particular features in the model. The number of individuals, or population size, must be chosen for each appplication. Usually, this is set to be 10N, being N the number of features.
Now we are going to describe in detail the operators and the corresponding parameters used by the genetic algorithm.
The fist step is to create and initialize the individuals in the population. As the genetic algorithm is a stochastic optimization method, the genes of the individuals are usually initialized at random.
In order to illustrate this operator, consider a predictive model represented by a neural network with 6 possible features. If we generate a population of 4 individuals, then we will have 4 different neural networks with random features. The next figure illustrates this population.
As we can see, each individual is represented by 6 binary genes. Each positive gen means that the corresponding feature is included in the model.
Once we have generated and initialized the population, we need to assign the fitness to each individual. To evaluate the fitness, we need to train the predictive model with the training data, and then evaluate its selection error with the selection data. Obviously, a high selection error means a low fitness. Those individuals with greater fitness will have a greater probability of being selected for recombination.
The most used method for fitness assignment is known as "Rank based". With this method, the selection errors of all the individuals are sorted. Then, the fitness assigned to each individual only depends on its position in the individuals rank and not on the actual selection error.
The fitness value assigned to each individual with the Rank based method will be:
Here κ is a constant called selective pressure, and its value is fixed between 1 and 2. Greater selective pressure values will make the fittest individuals to have more probability of recombination. The parameter R(i) is the rank of individual i.
Coming back to our example, the following depicts the selection error, the rank and the corresponding fitness of each individual.
Where we have chosen κ=1.5 to calculate the fitness value. Note that, for calculating the population fitness, we have neeed to train 4 different neural networks.
We can plot the above fitness values in a pie chart. Here the area for each individual in the pie is proportional to its fitness. The following picture shows the fitness pie.
As we can see, the fittest individual (#4) is the one who has the biggest area (40%), and the least fitted individual (#1) has the smallest area (10%).
After fitness assignment has been performed, the selection operator chooses the individuals that will recombine for the next generation. The individuals most likely to survive are those more fitted to the environment. Therefore, the selection operator selects the individuals according to their fitness level. The number of selected individuals is N/2, being N the population size.
Elitism selection makes the fittest individuals to survive directly for the next generation. The elitism size controls the number of individuals which are directly selected, and it is usually set to a small value (1,2...).
One of the most used selection methods is roulette wheel, also called stochastic sampling with replacement. This method places all the individuals on a roulette, with areas proportional to their fitness, as we saw above. Then, the roulette is turned and the individuals are selected at random. The corresponding individual is selected for recombination.
The next figure illustrates the selection process for our example. In this case, the individual #4 has been selected by elitism, and the #3 has been selected by roulette wheel. Note that, although the individual #2 has more fitness than the #3, it has not been selected due to the stochastic nature of the genetic algorithm.
Here the number of selected individuals is half of the population size.
Once the selection operator has chosen half of the population, the crossover operator recombines the selected individuals to generate a new population. This operator picks two individuals at random and combines their features to get four offsprings for the new population, until the new population has the same size than the old one.
The uniform crossover method decides whether each of the offspring's features comes from one parent or another.
The next figure illustrates the uniform crossover method for our example. Here we have generated four offspring from two parents. Some features of each offspring correspond to one ancestor, and some other features to the other.
Here the population size remains constant.
The crossover operator can generate offsprings that are very similar to the parents. This might cause a new generation with low diversity. The mutation operator solve this problem by changing the value of some features in the offsprings at random.
In order to decide if a feature will be mutated, we generate a random number between 0 and 1. If this number is lower than a value called the mutation rate, that variable is flipped. The mutation rate is usually chosen to be 1/m, where m is the number of features. With that value for the mutation rate, we will mutate one feature of each individual (statistically).
The next image shows the mutation of one of the offsprings of the new generation. As we can see, the fourth feature of the offspring has been mutated.
At this point, we have the new population.
The whole fitness assignment, selection, recombination and mutation process is repeated until a stopping criterion is satisfied. Each generation is likely to be more adapted to the environment than the old one. The following figure shows the typical behavior of the genetic algorithm. Here the mean selection error of the each population converges to a minimum value. We can also plot the performance of the best individual.
Where the mean of the selection errors of each generation converges.
The solution of this process is the best individual ever. This corresponds to the predictive model with the smallest selection error among all those we have analyzed. In our example, the best individual is the following:
The above predictive model will be the selected one for our application.
As we have seen, feature selection is becoming very important in predictive analytics. Indeed, many data sets contain a large number of features, so we have to select the most useful ones. One of the most advanced methods to do that is the genetic algorithm.
Some advantages of genetic algorithms this method are the following:
And some disadvantages are:
In conclusion, genetic algorithms can select the best subset of variables for our predictive model, but they usually require a lot of computation. Neural Designer implements a more advanced genetic algorithm that the one described in this post. You can find it at the Inputs selection section in the Model selection pane.