By Fernando Gomez, Alberto Quesada, Pedro Ángel Fraile Manzano, Roberto Rodríguez Basarte and Roberto Lopez, Artelnics.
Many typical machine learning applications, from customer targeting to medical diagnosis, arise from complex relationships between features (also called input variables or characteristics).
Feature selection, or input selection, algorithms search for the most relevant inputs to a model.
Feature selection techniques help identify and remove unneeded, irrelevant, and redundant features: variables that do not contribute to, or even decrease, the accuracy of the predictive model.
Mathematically, we formulate input selection as a combinatorial optimization problem.
The objective function is the predictive model's generalization performance, represented by the error term on the selection instances of a data set.
The design variables are the inclusion (1) or the exclusion (0) of the input variables in the neural network.
An exhaustive selection of features would evaluate $2^N$ different combinations, where $N$ is the number of features.
This process requires a great deal of computation and becomes impracticable when the number of features is large. Therefore, we need intelligent methods to select features in practice.
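To make this combinatorial explosion concrete, a few lines of Python show how fast $2^N$ grows:

```python
# Number of feature subsets an exhaustive search would have to train and evaluate.
for n in (5, 10, 20, 30):
    print(f"{n} features -> {2 ** n:,} combinations to train and evaluate")

# 5 features -> 32 combinations to train and evaluate
# 10 features -> 1,024 combinations to train and evaluate
# 20 features -> 1,048,576 combinations to train and evaluate
# 30 features -> 1,073,741,824 combinations to train and evaluate
```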
One of the most advanced algorithms for feature selection is the genetic algorithm.
The genetic algorithm is a stochastic method for function optimization based on natural genetics and biological evolution.
In nature, organisms' genes tend to evolve over successive generations to better adapt to the environment. The genetic algorithm is a heuristic optimization method inspired by the procedures of natural evolution.
Genetic algorithms operate on a population of individuals to produce better and better approximations.
The algorithm creates a new population every generation by selecting individuals according to their fitness level in the problem domain. These individuals are then recombined together using operators borrowed from natural genetics. The offspring might also undergo mutation.
A state diagram for the feature selection process with the genetic algorithm is depicted next.
This process leads to the evolution of populations better suited to their environment than the individuals who originated it.
In our case, each individual in the population represents a neural network.
Genes here are binary values that represent the inclusion or exclusion of particular features in the model. The number of genes is the total number of input variables in the data set.
The number of individuals, or population size, must be chosen for each application. The default value for the population size in Neural Designer is 40; this number must be a multiple of 4.
Next, we describe the operators and the corresponding parameters used by the genetic algorithm in detail.
The first step is to create and initialize the individuals in the population. Since the genetic algorithm is a stochastic optimization method, we randomly initialize the genes of all individuals. This random initialization lets the algorithm explore different potential solutions and avoid getting trapped in suboptimal ones.
To illustrate this operator, consider a predictive model represented by a neural network with 6 possible features. If we generate four individuals, we have four different neural networks with random features.
The next figure illustrates this population.
As we can see, each individual is represented by 6 binary genes. Each positive gene means that the corresponding feature is included in the neural network.
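A minimal sketch of this random initialization, assuming NumPy; the sizes match the four-individual, six-feature example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

population_size = 4   # matches the four individuals in the example (default is 40)
features_number = 6   # number of input variables in the data set

# Each row is one individual: 1 = feature included, 0 = feature excluded.
population = rng.integers(0, 2, size=(population_size, features_number))
print(population)
```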
Another way to activate columns is according to their correlation with the target variable. From these correlations, we create a probability distribution similar to the one used in the selection operator.
With this initialization operator, the variables most likely to be relevant are activated from the start.
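One plausible reading of this operator, sketched in NumPy; the `inputs` and `target` arrays and the scaling of correlations to activation probabilities are our illustrative choices, not a prescribed implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population_size, features_number = 4, 6

# Hypothetical data: `inputs` is (instances x features), `target` is 1-D.
inputs = rng.normal(size=(100, features_number))
target = rng.normal(size=100)

# Absolute correlation between each input column and the target.
correlations = np.array([abs(np.corrcoef(inputs[:, j], target)[0, 1])
                         for j in range(features_number)])

# One plausible scaling: the most correlated column is activated with
# probability 1, the other columns proportionally less often.
activation_probability = correlations / correlations.max()
population = (rng.random((population_size, features_number))
              < activation_probability).astype(int)
```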
After the initialization, we need to assign a fitness value to each individual in the population.
We train each neural network with the training instances and then evaluate their error with the selection instances.
A large selection error means low fitness; individuals with higher fitness, that is, with a lower selection error, are more likely to be selected for recombination.
The most widely used method for fitness assignment is known as rank-based fitness assignment.
With this method, the selection errors of all the individuals are sorted.
Then, the fitness assigned to each individual depends only on its position in this ranking, not on the actual value of the selection error.
The fitness value $\Phi(i)$ assigned to each individual with the rank-based method is:
$$\Phi(i) = R(i), \qquad i = 1,\ldots,N,$$
where $R(i)$ is the rank of individual $i$, with rank 1 assigned to the individual with the smallest selection error. From the assigned fitness, the probability of each individual being selected is
$$P_i = \dfrac{N-(R(i)-1)}{\sum_{j=1}^{N} j},$$
where $N$ is the number of individuals (the population size).
The following table depicts the selection error, the rank, and the corresponding probability of being selected (fitness) of each individual in our example.
| | Selection error | Rank | Probability |
|---|---|---|---|
| Individual 1 | 0.9 | 4 | 0.1 |
| Individual 2 | 0.6 | 2 | 0.3 |
| Individual 3 | 0.7 | 3 | 0.2 |
| Individual 4 | 0.5 | 1 | 0.4 |
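These values can be reproduced with a few lines of NumPy; the only assumption is that rank 1 goes to the smallest selection error, as in the table:

```python
import numpy as np

selection_errors = np.array([0.9, 0.6, 0.7, 0.5])  # individuals 1 to 4
N = len(selection_errors)

# Rank 1 is assigned to the smallest selection error (the fittest individual).
ranks = selection_errors.argsort().argsort() + 1

# P_i = (N - (R(i) - 1)) / sum_{j=1}^{N} j
probabilities = (N - (ranks - 1)) / (N * (N + 1) / 2)

for i, (error, rank, p) in enumerate(zip(selection_errors, ranks, probabilities), start=1):
    print(f"Individual {i}: error={error}, rank={rank}, probability={p:.1f}")
# Individual 1: error=0.9, rank=4, probability=0.1
# Individual 2: error=0.6, rank=2, probability=0.3
# Individual 3: error=0.7, rank=3, probability=0.2
# Individual 4: error=0.5, rank=1, probability=0.4
```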
We can plot the above fitness values in a pie chart. Here the area for each individual in the pie is proportional to its fitness. The following picture shows the fitness pie.
As we can see, the fittest individual (#4) has the largest area (40%). Conversely, the least fit individual (#1) has the smallest area (10%).
After a fitness assignment has been performed, the selection operator chooses the individuals that recombine for the next generation.
The individuals most likely to survive are those best fitted to the environment. Therefore, the selection operator selects individuals according to their fitness level. The number of selected individuals is N/2, where N is the population size.
Elitist selection makes the fittest individuals survive directly to the crossover, which helps maintain the quality of the population. The elitism size controls the number of directly selected individuals. To achieve good convergence of the genetic algorithm, a recommended value is half the number of selected individuals, that is, N/4.
One of the most used selection methods is the roulette wheel. This method places all the individuals on a roulette wheel, with areas proportional to their fitness, as we saw above. The wheel is then spun, and wherever it stops, the corresponding individual is selected for recombination.
The following figure illustrates the selection process for our example.
In this case, neural network #4 has been selected by elitism, and #3 has been selected by the roulette wheel. Note that, although individual #2 has higher fitness than #3, it was not selected due to the stochastic nature of the genetic algorithm.
Here the number of selected individuals is half of the population size.
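A sketch of this two-stage selection in NumPy, combining elitism with the roulette wheel; `select_indices` is our illustrative helper, not a library function:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def select_indices(probabilities, elitism_size):
    """Return the indices of the N/2 individuals chosen for recombination."""
    N = len(probabilities)
    selected_number = N // 2

    # Elitism: the fittest individuals survive directly to the crossover.
    elite = np.argsort(probabilities)[::-1][:elitism_size]

    # Roulette wheel: fill the remaining slots at random, with probability
    # proportional to fitness, excluding the individuals already chosen.
    remaining = np.setdiff1d(np.arange(N), elite)
    wheel_p = probabilities[remaining] / probabilities[remaining].sum()
    wheel = rng.choice(remaining, size=selected_number - elitism_size,
                       replace=False, p=wheel_p)
    return np.concatenate([elite, wheel])

# Probabilities from the table above; elitism size N/4 = 1.
probabilities = np.array([0.1, 0.3, 0.2, 0.4])
print(select_indices(probabilities, elitism_size=1))  # e.g. [3 1]
```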
Once the selection operator has chosen half of the population, the crossover operator recombines the selected individuals to generate a new population.
This operator picks two individuals at random from those previously selected and combines their features to produce offspring for the new population, repeating the process until the new population reaches the size of the old one.
The uniform crossover method decides whether each of the offspring's features comes from one parent or the other.
The following figure illustrates the uniform crossover method for our example.
Here we have generated four offspring from two parents. Some features of each neural network correspond to one ancestor and some other features to the other.
In this case, the parents are directly replaced by the offspring. In this way, the population size remains constant.
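A minimal sketch of uniform crossover in NumPy; the parent chromosomes below are illustrative, not the exact values from the figure:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def uniform_crossover(parent_1, parent_2, offspring_number):
    """Copy each offspring gene from parent_1 or parent_2 with equal probability."""
    offspring = []
    for _ in range(offspring_number):
        mask = rng.integers(0, 2, size=parent_1.shape).astype(bool)
        offspring.append(np.where(mask, parent_1, parent_2))
    return np.array(offspring)

# Two selected parents with six features each; the four offspring replace
# the old population, so its size stays constant.
parent_1 = np.array([1, 0, 1, 1, 0, 1])
parent_2 = np.array([0, 1, 1, 0, 1, 0])
print(uniform_crossover(parent_1, parent_2, offspring_number=4))
```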
The crossover operator can generate offspring that are very similar to the parents. This might cause a new generation with low diversity.
The mutation operator solves this problem by changing the value of some features in the offspring at random.
To decide if a feature is mutated, we generate a random number between 0 and 1. If the random number is lower than a value called the mutation rate, the operator flips that variable.
A standard value for the mutation rate is $\dfrac{1}{m}$, where $m$ is the number of features. With that value, on average one feature of each individual is mutated.
The following image shows the mutation of one offspring of the new generation.
As we can see, the fourth input of the neural network has been mutated.
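The mutation operator can be sketched in a few lines of NumPy; the population values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mutate(population, mutation_rate):
    """Flip each gene (0 -> 1, 1 -> 0) with probability `mutation_rate`."""
    flips = rng.random(population.shape) < mutation_rate
    return np.where(flips, 1 - population, population)

# With m = 6 features, a mutation rate of 1/6 flips on average
# one gene per individual.
population = np.array([[1, 0, 1, 1, 0, 1],
                       [0, 1, 1, 0, 1, 0],
                       [1, 1, 0, 1, 0, 0],
                       [0, 0, 1, 1, 1, 1]])
print(mutate(population, mutation_rate=1/6))
```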
At this point, we have a new population.
The whole fitness assignment, selection, recombination, and mutation process is repeated until a stopping criterion is satisfied. Each generation is likely to be more adapted to the environment than the old one.
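Putting the operators together, the overall loop might look like the following schematic sketch. It reuses the `select_indices`, `uniform_crossover`, and `mutate` helpers sketched above; `train_and_evaluate` is a hypothetical callback that trains a neural network on the included features and returns its selection error, and the stopping criterion is simplified to a fixed number of generations:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def genetic_feature_selection(train_and_evaluate, features_number,
                              population_size=40, generations_number=100):
    """Schematic genetic algorithm loop for input selection."""
    # Initialization: random binary chromosomes.
    population = rng.integers(0, 2, size=(population_size, features_number))
    best_genes, best_error = None, np.inf

    for _ in range(generations_number):
        # Fitness assignment: selection error of the network trained on each chromosome.
        errors = np.array([train_and_evaluate(genes) for genes in population])

        # The solution is the best individual found across all generations.
        if errors.min() < best_error:
            best_error = errors.min()
            best_genes = population[errors.argmin()].copy()

        # Rank-based selection probabilities (rank 1 = smallest error).
        ranks = errors.argsort().argsort() + 1
        probabilities = ((population_size - (ranks - 1))
                         / (population_size * (population_size + 1) / 2))

        # Selection, crossover, and mutation, as sketched above.
        selected = select_indices(probabilities, elitism_size=population_size // 4)
        offspring = []
        while len(offspring) < population_size:
            i, j = rng.choice(selected, size=2, replace=False)
            offspring.extend(uniform_crossover(population[i], population[j], 2))
        population = mutate(np.array(offspring[:population_size]),
                            mutation_rate=1 / features_number)

    return best_genes, best_error
```

The returned `best_genes` is a binary mask over the input variables; applying it to the data set gives the inputs of the final neural network.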
This chart shows the typical behavior of the mean selection error and the best selection error through the generations. The example used is the Breast Cancer Mortality example available on the blog.
As we can see, the mean selection error at each generation converges to a minimum value.
The solution of this process is the best individual found across all generations, that is, the neural network with the smallest selection error among all those we have analyzed.
In our example, the best individual is the following:
The above neural network is the selected one for our application.
Input selection is becoming a critical topic in machine learning. Many data sets contain a large number of features, so we need to select the most useful ones to be included in the neural network.
One of the most advanced methods to do that is the genetic algorithm.
Some advantages of genetic algorithms are the following:
- They explore the space of feature subsets intelligently, without evaluating all $2^N$ combinations.
- Their stochastic operators help them avoid getting trapped in suboptimal solutions.
- They can handle data sets with very large numbers of features.
And some disadvantages are:
- They require training and evaluating many neural networks, which is computationally expensive.
- Because of their stochastic nature, different runs can select different subsets of features.
In conclusion, genetic algorithms can select the best subset of our model's variables, but they usually require much computation.
Neural Designer implements a more advanced genetic algorithm than the one described in this post. You can find it in the input selection section of the model selection panel.
In this section, we will present the results of our test with the genetic algorithm for feature selection on two datasets with a high number of features.
The objective of this test was to reduce the number of inputs and to improve the performance of the model.
For our study, we selected two cancer datasets, one for breast cancer and one for lung cancer, both with a large number of variables: 689 for breast cancer and 11,749 for lung cancer.
The table shows the results of our test for each dataset.
| Dataset | Number of original features | Number of selected features | Original selection error | Final selection error |
|---|---|---|---|---|
| Breast cancer | 689 | 96 | 1.63 | 0.25 |
| Lung cancer | 11,749 | 213 | 1.68 | 0.0939 |
This genetic algorithm has been applied to two cancer datasets, and the results show that the algorithm can significantly reduce the number of variables used while improving the performance of the original model.
The results are particularly noteworthy for the lung cancer dataset, where the number of selected features was reduced from 11,749 to just 213, together with a drop in the selection error from 1.68 to 0.0939.
We conclude that our genetic algorithm is a valuable tool for simplifying the modeling process: it reduces the number of inputs and improves model performance. This can be especially useful when the number of input variables is high or when efficient data processing is required. In summary, our experiment demonstrates that the genetic algorithm can produce more effective and efficient models.