Model selection algorithms in predictive analytics

By Fernando Gómez, Artelnics.

Many common applications in predictive modelling, from customer segmentation to medical diagnosis, involve complex interactions between many types of variables. A software tool that untangles these factors efficiently would be a significant novelty, likely to disrupt the existing predictive analytics market.

Analysing many internal and external variables is difficult, however. Data scientists often work with enormous data sets, and they need methods that can select the variables that are actually relevant. Model selection algorithms are also computationally expensive, so performance is a major drawback.

Model selection picture

The predictive analytics tool Neural Designer provides innovative algorithms to automate the model selection process. In this post, we explain some basic ideas about model selection and the algorithms implemented in this software.

Model selection is applied to find the topology of a neural network that minimizes the error on new data. There are two ways to obtain an optimal topology: order selection and inputs selection. Order selection algorithms find the optimal number of neurons in the neural network. Inputs selection algorithms find the optimal subset of inputs.

Order selection

We define the selection error as the error of a neural network on new data. It measures the ability of the model to predict the result for a new case.
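To make the definition concrete, here is a minimal sketch of how a selection error can be measured. Everything in it is an assumption made for illustration: the helper names `split_instances` and `mean_squared_error`, the choice of mean squared error as the error measure, and the constant predictor that stands in for a trained neural network. It is not Neural Designer's implementation.

```python
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[float, float]  # (input, target)


def split_instances(
    instances: Sequence[Instance], selection_fraction: float = 0.25
) -> Tuple[List[Instance], List[Instance]]:
    """Hold out the last fraction of the instances as selection data."""
    n_selection = max(1, int(len(instances) * selection_fraction))
    return list(instances[:-n_selection]), list(instances[-n_selection:])


def mean_squared_error(
    model: Callable[[float], float], instances: Sequence[Instance]
) -> float:
    """Average squared difference between predictions and targets."""
    return sum((model(x) - y) ** 2 for x, y in instances) / len(instances)


# Toy data set: the target is twice the input.
data = [(float(x), 2.0 * x) for x in range(8)]
training, selection = split_instances(data)

# Stand-in for a trained model: a constant fitted on the training data only.
training_mean = sum(y for _, y in training) / len(training)


def model(x: float) -> float:
    return training_mean


training_error = mean_squared_error(model, training)
selection_error = mean_squared_error(model, selection)
```

The model never sees the selection instances while it is fitted, so the selection error estimates how it behaves on new cases. In practice the instances would usually be shuffled before splitting; the ordered split here only keeps the example deterministic.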

Two frequent problems in the design of a neural network are underfitting and overfitting. The best generalization is achieved by a model whose complexity is the most appropriate for fitting the data. Underfitting is the increase in selection error caused by a model that is too simple, whereas overfitting is the increase in selection error caused by a model that is too complex.

The next figure illustrates the training and selection errors as a function of the order of a neural network.

Errors plot

Neural Designer automatically finds the optimal order of a predictive model. It implements two different algorithms to perform this task: incremental order and simulated annealing.

Incremental order is the simplest order selection algorithm. This method starts with a minimum order and increases the size of the last hidden layer of neurons until the optimal order is reached. Finally, the algorithm returns the neural network with the optimal order obtained.
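The idea can be sketched in a few lines of Python. This is an illustrative sketch, not Neural Designer's code: `selection_error` is a hypothetical callback that would train a network with the given number of hidden neurons and return its error on held-out data, and the `max_failures` stopping rule is our own assumption about when the optimal order counts as reached.

```python
from typing import Callable, Tuple


def incremental_order(
    selection_error: Callable[[int], float],
    min_order: int = 1,
    max_order: int = 10,
    max_failures: int = 2,
) -> Tuple[int, float]:
    """Grow the order from min_order and return the best order found."""
    best_order = min_order
    best_error = selection_error(min_order)
    failures = 0
    for order in range(min_order + 1, max_order + 1):
        error = selection_error(order)
        if error < best_error:
            best_order, best_error = order, error
            failures = 0
        else:
            # Assumed stopping rule: give up after a few orders in a row
            # that do not improve the selection error.
            failures += 1
            if failures >= max_failures:
                break
    return best_order, best_error


def toy_selection_error(order: int) -> float:
    # Toy U-shaped curve with its minimum at order 4 (illustrative only).
    return (order - 4) ** 2 + 1.0


best_order, best_error = incremental_order(toy_selection_error)
```

With the toy curve, the search stops at order 6 after two consecutive failures and returns order 4. Allowing more than one failure makes the search less sensitive to small fluctuations in the selection error.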

Simulated annealing is a metaheuristic optimization method inspired by the annealing process in metallurgy. The algorithm starts with a random order and changes that value probabilistically in each iteration, until a stopping criterion is reached.
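A minimal sketch of this scheme follows. The details are assumptions chosen for illustration: the neighbour move of plus or minus one neuron, the Metropolis acceptance rule, the geometric cooling schedule, and the minimum temperature used as the stopping criterion. As before, `selection_error` is a hypothetical callback that would train and evaluate a network of the given order.

```python
import math
import random
from typing import Callable, Tuple


def simulated_annealing_order(
    selection_error: Callable[[int], float],
    min_order: int = 1,
    max_order: int = 10,
    initial_temperature: float = 1.0,
    cooling_rate: float = 0.9,
    min_temperature: float = 1e-3,
    seed: int = 0,
) -> Tuple[int, float]:
    """Search for a good order by accepting worse moves with decreasing probability."""
    rng = random.Random(seed)
    current = rng.randint(min_order, max_order)  # random starting order
    current_error = selection_error(current)
    best, best_error = current, current_error
    temperature = initial_temperature
    while temperature > min_temperature:  # assumed stopping criterion
        # Propose a neighbouring order, clamped to the allowed range.
        step = rng.choice([-1, 1])
        candidate = min(max_order, max(min_order, current + step))
        candidate_error = selection_error(candidate)
        delta = candidate_error - current_error
        # Always accept improvements; accept worse orders with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_error = candidate, candidate_error
        if current_error < best_error:
            best, best_error = current, current_error
        temperature *= cooling_rate  # geometric cooling
    return best, best_error


def toy_selection_error(order: int) -> float:
    # Toy U-shaped curve with its minimum at order 4 (illustrative only).
    return (order - 4) ** 2 + 1.0


best_order, best_error = simulated_annealing_order(toy_selection_error)
```

Early on, when the temperature is high, the search can move to worse orders and so escape local minima of the selection error. As the temperature drops, it behaves more and more like a plain descent.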

Inputs selection

Inputs selection is a method to improve the quality of the predictions. It consists of extracting the subset of inputs that has the most influence on a particular process, whether physical, biological, social or otherwise.

Neural Designer implements various algorithms that allow data scientists to find those optimal variables. These are growing inputs, pruning inputs and the genetic algorithm.

Growing and pruning methods calculate the correlation of every input with every output in the neural network. The growing inputs method starts with the most correlated input and keeps adding well-correlated variables until the selection error starts increasing. Conversely, the pruning inputs algorithm starts with all the variables of the data set and then removes the inputs that have little correlation with the outputs.
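The growing inputs idea can be sketched as follows, for a single output. This is a simplified illustration under our own assumptions: the Pearson coefficient is used as the correlation measure, and `selection_error` is a hypothetical callback that would train a network on the given subset of inputs and return its error on held-out data. The data and the toy error function are made up.

```python
import math
from typing import Callable, Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant variable carries no linear information
    return cov / math.sqrt(var_x * var_y)


def growing_inputs(
    inputs: Dict[str, Sequence[float]],
    target: Sequence[float],
    selection_error: Callable[[List[str]], float],
) -> List[str]:
    """Add inputs in order of correlation until the selection error stops improving."""
    ranked = sorted(
        inputs, key=lambda name: abs(pearson(inputs[name], target)), reverse=True
    )
    selected = [ranked[0]]  # start with the most correlated input
    best_error = selection_error(selected)
    for name in ranked[1:]:
        error = selection_error(selected + [name])
        if error >= best_error:
            break  # the selection error no longer decreases
        selected.append(name)
        best_error = error
    return selected


target = [1.0, 2.0, 3.0, 4.0, 5.0]
inputs = {
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],
    "b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
}


def toy_selection_error(subset: List[str]) -> float:
    # Illustrative only: each missing useful input costs 1.0 and
    # every input used adds a small complexity penalty.
    useful = {"a", "b"}
    return len(useful - set(subset)) + 0.1 * len(subset)


selected = growing_inputs(inputs, target, toy_selection_error)
```

Pruning inputs mirrors this loop: it would start from all the inputs and drop the least correlated ones for as long as the selection error does not get worse.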

A different class of inputs selection method is the genetic algorithm. This is a stochastic method based on the mechanics of natural genetics and biological evolution. The implemented genetic algorithm includes several methods for the fitness assignment, selection, crossover and mutation operators.

The genetic algorithm starts with a population of different subsets of variables. In every generation, the fitness of every individual in the population is computed as the selection error for that subset of inputs. Then, the method evolves the population by selecting some individuals to generate the new population, performing a crossover among the selected individuals and mutating the offspring generated in the crossover. The next figure shows a simplified flow diagram of the genetic algorithm.
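The loop described above can be sketched in Python as follows. The concrete operators are assumptions picked for brevity (keeping the better half as parents, one-point crossover, bit-flip mutation); the source only says that several methods are available for each operator. `selection_error` is again a hypothetical callback that would train a network on the inputs marked `True` and return its error on held-out data.

```python
import random
from typing import Callable, List, Tuple

Individual = Tuple[bool, ...]  # True means the input is used


def genetic_inputs_selection(
    selection_error: Callable[[Individual], float],
    n_inputs: int,
    population_size: int = 10,
    generations: int = 20,
    mutation_rate: float = 0.1,
    seed: int = 0,
) -> Tuple[Individual, float]:
    """Evolve subsets of inputs; needs n_inputs >= 2 and population_size >= 4."""
    rng = random.Random(seed)
    # Initial population: random subsets of inputs.
    population: List[Individual] = [
        tuple(rng.random() < 0.5 for _ in range(n_inputs))
        for _ in range(population_size)
    ]
    best = min(population, key=selection_error)
    best_error = selection_error(best)
    for _ in range(generations):
        # Fitness assignment: rank by selection error (lower is better).
        ranked = sorted(population, key=selection_error)
        # Selection: the better half of the population become parents.
        parents = ranked[: population_size // 2]
        offspring: List[Individual] = []
        while len(offspring) < population_size:
            # Crossover: one-point recombination of two distinct parents.
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_inputs - 1)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = tuple(
                (not gene) if rng.random() < mutation_rate else gene
                for gene in child
            )
            offspring.append(child)
        population = offspring
        # Remember the best subset seen in any generation.
        candidate = min(population, key=selection_error)
        candidate_error = selection_error(candidate)
        if candidate_error < best_error:
            best, best_error = candidate, candidate_error
    return best, best_error


def toy_selection_error(individual: Individual) -> float:
    # Illustrative only: pretend inputs 0 and 2 are the relevant ones.
    relevant = (True, False, True, False)
    return float(sum(gene != wanted for gene, wanted in zip(individual, relevant)))


best_subset, best_error = genetic_inputs_selection(toy_selection_error, n_inputs=4)
```

In a real run every fitness evaluation means training a neural network, which is why this method is the most expensive of the three. A practical implementation would cache the error of subsets it has already evaluated and reject the empty subset.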

Genetic algorithm

In summary, Neural Designer includes an advanced model selection framework capable of handling very complex data sets. This framework offers high added value to data scientists, providing them with results that were previously unachievable.