Many common applications of predictive modelling, from customer segmentation to medical diagnosis,
involve complex interactions among many types of variables. A software tool that untangles
these factors efficiently would be a notable advance, likely to disrupt the existing predictive analytics market.
However, analysing many internal and external variables is complicated. Data scientists may work with enormous data sets, and they need methods that can select the variables that are relevant. Model selection algorithms are computationally expensive, so performance is a major drawback.
We define the selection error as the error of a neural network on new data.
It measures the model's ability to predict the result for a new case.
Two frequent problems in the design of a neural network are underfitting and overfitting. The best generalization is achieved by using a model whose complexity is the most appropriate to produce an adequate fit of the data. Underfitting is an increase in the selection error caused by a model that is too simple, whereas overfitting is an increase in the selection error caused by a model that is too complex.
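The effect can be reproduced on a toy problem. The following is a minimal sketch, not Neural Designer code: it assumes polynomial regression as a stand-in for a neural network, with the polynomial degree playing the role of the model order. The synthetic sine data, the noise level and all names (`make_data`, `training_errors`, `selection_errors`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine wave (synthetic, for illustration only)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

# Training data fits the model; selection data is held out as "new" data.
x_train, y_train = make_data(30)
x_sel, y_sel = make_data(30)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given samples."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

orders = list(range(1, 10))
training_errors = []
selection_errors = []
for order in orders:
    # Least-squares polynomial fit of the given order on the training data.
    coeffs = np.polyfit(x_train, y_train, order)
    training_errors.append(mse(coeffs, x_train, y_train))
    selection_errors.append(mse(coeffs, x_sel, y_sel))

# The order with the lowest selection error has the most appropriate complexity.
best_order = orders[int(np.argmin(selection_errors))]

for order, tr, se in zip(orders, training_errors, selection_errors):
    print(f"order {order}: training error {tr:.4f}, selection error {se:.4f}")
print("best order by selection error:", best_order)
```

The training error can only go down as the order grows, because each model contains the previous one. The selection error is the quantity to watch: it is high for a model that is too simple and typically rises again once the model starts fitting noise.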
The next figure illustrates the training and selection errors as a function of the order of a neural network.
Input selection is a method to improve the quality of the predictions.
It consists of extracting the subset of inputs that have the most influence on a particular physical, biological, social or other process.
Neural Designer implements various algorithms that allow data scientists to find the optimal subset of variables. These are growing inputs, pruning inputs and the genetic algorithm.
Growing and pruning methods calculate the correlation of every input with every output in the neural network. The growing inputs method starts with the most correlated input and keeps adding well-correlated variables until the selection error starts increasing. The pruning inputs algorithm, on the other hand, starts with all the variables of the data set and then removes the inputs with little correlation with the outputs.
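The growing inputs idea can be sketched in a few lines. This is an illustration under stated assumptions rather than Neural Designer's implementation: a linear least-squares model stands in for the neural network, the data are synthetic (only inputs 0 and 3 drive the output), and the helper `selection_error` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data set: 6 candidate inputs, only inputs 0 and 3 matter.
n_samples, n_inputs = 400, 6
X = rng.normal(size=(n_samples, n_inputs))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(0.0, 0.1, n_samples)

X_train, y_train = X[:200], y[:200]
X_sel, y_sel = X[200:], y[200:]

def selection_error(columns):
    """Fit a linear model on the chosen columns; return its error on held-out data."""
    A_train = np.column_stack([X_train[:, columns], np.ones(len(X_train))])
    A_sel = np.column_stack([X_sel[:, columns], np.ones(len(X_sel))])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_sel @ weights - y_sel) ** 2))

# Rank inputs by absolute correlation with the output, most correlated first.
correlations = np.array(
    [abs(np.corrcoef(X_train[:, j], y_train)[0, 1]) for j in range(n_inputs)]
)
ranking = [int(j) for j in np.argsort(-correlations)]

# Growing inputs: start with the best input, add the next most correlated
# one, and stop as soon as the selection error no longer decreases.
selected = [ranking[0]]
best_error = selection_error(selected)
for candidate in ranking[1:]:
    trial_error = selection_error(selected + [candidate])
    if trial_error >= best_error:
        break
    selected.append(candidate)
    best_error = trial_error

print("selected inputs:", selected)
print("selection error:", round(best_error, 4))
```

A pruning variant would walk the same ranking from the other end, starting with all inputs and dropping the least correlated ones; the stopping rule for that direction is not specified here, so it is left out of the sketch.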
A different class of input selection method is the genetic algorithm. This is a stochastic method based on the mechanics of natural genetics and biological evolution. The implemented genetic algorithm includes several methods for fitness assignment, selection, crossover and mutation.
The genetic algorithm starts with a population of different subsets of variables. In every generation, the fitness of every individual in the population is computed as the selection error for that subset of inputs. Then the method evolves the population by selecting some individuals to generate the new population, performing a crossover with the selected individuals and mutating the offspring generated in the crossover. The next figure shows a simplified flow diagram of the genetic algorithm.
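The loop described above can be sketched as follows. This is a minimal illustration, not the algorithm as implemented in Neural Designer. Several choices are assumptions made for the example: individuals are boolean masks over the inputs, selection is a two-way tournament, crossover is uniform, mutation flips bits at a fixed rate, the best individual is carried over unchanged (elitism), and a linear least-squares model stands in for the neural network. The data are synthetic, with only inputs 0 and 3 influencing the output.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data set: 6 candidate inputs, only inputs 0 and 3 matter.
n_samples, n_inputs = 400, 6
X = rng.normal(size=(n_samples, n_inputs))
y = 3.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(0.0, 0.1, n_samples)
X_train, y_train, X_sel, y_sel = X[:200], y[:200], X[200:], y[200:]

def selection_error(mask):
    """Fitness: held-out error of a linear model using the inputs where mask is True."""
    A_train = np.column_stack([X_train[:, mask], np.ones(len(X_train))])
    A_sel = np.column_stack([X_sel[:, mask], np.ones(len(X_sel))])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_sel @ weights - y_sel) ** 2))

def tournament(population, fitness):
    """Pick two random individuals and return the one with the lower error."""
    i, j = rng.integers(0, len(population), size=2)
    return population[i] if fitness[i] < fitness[j] else population[j]

population_size, generations, mutation_rate = 20, 30, 0.1

# Initial population: random subsets of inputs encoded as boolean masks.
population = rng.random((population_size, n_inputs)) < 0.5
history = []  # best selection error seen in each generation

for _ in range(generations):
    fitness = np.array([selection_error(ind) for ind in population])
    history.append(float(fitness.min()))

    # Elitism (an assumption of this sketch): keep the best individual as is.
    offspring = [population[int(np.argmin(fitness))].copy()]
    while len(offspring) < population_size:
        parent_a = tournament(population, fitness)
        parent_b = tournament(population, fitness)
        # Uniform crossover: each gene comes from either parent with equal odds.
        take_from_a = rng.random(n_inputs) < 0.5
        child = np.where(take_from_a, parent_a, parent_b)
        # Mutation: flip each gene with probability mutation_rate.
        flips = rng.random(n_inputs) < mutation_rate
        offspring.append(np.logical_xor(child, flips))
    population = np.array(offspring)

fitness = np.array([selection_error(ind) for ind in population])
history.append(float(fitness.min()))
best_mask = population[int(np.argmin(fitness))]
best_error = float(fitness.min())

print("selected inputs:", np.flatnonzero(best_mask).tolist())
print("selection error:", round(best_error, 4))
```

Because the fitness of every individual requires training a model, the cost grows with the population size times the number of generations, which is why this family of methods is expensive compared with the correlation-based ones.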