Introduction to neural networks
By Roberto Lopez, Artelnics.
Machine learning is a branch of artificial intelligence which attempts to model high-level abstractions of data by using complex architectures performing multiple transformations. Neural networks are one of the most important techniques for machine learning, and they are recognized to provide some of the best results in predictive analytics applications. This tutorial describes the most important concepts related to neural networks.
Neural networks provide a general machine learning framework for solving many types of applications. The most important ones are those of discovering relationships, recognizing patterns, predicting trends or finding associations within a data set. These are called approximation, classification, forecasting and association, respectively.
The following figure depicts an activity diagram for the solution approach of predictive analytics applications with neural networks.
As we can see, there are 7 important concepts in the application of neural networks:
- Data set
- Neural network
- Loss index
- Training strategy
- Model selection
- Testing analysis
- Model deployment
The data set contains the information for creating the model. It comprises a data matrix in which columns represent variables and rows represent instances.
The best known example is the iris data set, which is listed next.
Here the number of rows is 150, while the number of columns is 5.
Variables in a data set can be of three types:
- The input variables will be the independent variables in the model.
- The target variables will be the dependent variables in the model.
- The unused variables will neither be used as inputs nor as targets.
In the iris data set, the inputs are the sepal length, the sepal width, the petal length and the petal width.
The target is the corresponding iris flower. This is a nominal variable, and it must be transformed to numerical variables as follows:
Therefore, the number of inputs in the iris flowers data set is 4 (sepal length, sepal width, petal length, petal width) and the number of targets is 3 (setosa, versicolor, virginica).
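The transformation of the nominal target into three numerical targets can be sketched as a one-of-N (one-hot) encoding. This is only an illustration: the helper name `encode_species` and the column order are assumptions made for the example.

```python
# Hypothetical helper: one-hot encoding of the iris species.
SPECIES = ["setosa", "versicolor", "virginica"]  # assumed column order

def encode_species(name):
    """Return a list with 1 in the position of the given species and 0 elsewhere."""
    if name not in SPECIES:
        raise ValueError(f"unknown species: {name}")
    return [1 if name == s else 0 for s in SPECIES]

print(encode_species("versicolor"))  # [0, 1, 0]
```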
It is rarely useful to have a neural network that simply memorizes a set of data. Typically, you want the neural network to perform accurately on new data, that is, to be able to generalize. In this way, instances can be:
- Training instances, which are used to construct different models.
- Selection instances, which are used for choosing the predictive model with best generalization properties.
- Testing instances, which are used to validate the functioning of the model.
- Unused instances, which are not used at all.
Usually, we will split a data set as follows: 60% of the instances for training, 20% for selection and 20% for testing. We can split the instances at random or sequentially.
If we split the iris data set into a training, selection and testing subsets with the above rates, we obtain the following subsets:
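A minimal sketch of such a split is given next. The function name `split_indices`, its arguments and the fixed seed are assumptions made for the example; with 150 instances and the rates above, the subsets contain 90, 30 and 30 instances.

```python
import random

def split_indices(n_instances, rates=(0.6, 0.2, 0.2), shuffle=True, seed=0):
    """Return (training, selection, testing) lists of instance indices."""
    indices = list(range(n_instances))
    if shuffle:
        random.Random(seed).shuffle(indices)      # random split
    # With shuffle=False the original order is kept (sequential split).
    n_training = round(rates[0] * n_instances)
    n_selection = round(rates[1] * n_instances)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]  # the remainder goes to testing
    return training, selection, testing

training, selection, testing = split_indices(150)
print(len(training), len(selection), len(testing))  # 90 30 30
```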
3. Missing values
A data set can also contain missing values, which are elements that are not present. Missing values are usually denoted by a label in the data set. Some common missing-value labels are "Unknown", "?" or "NaN" (not a number).
There are two methods for dealing with missing values:
- Unuse those instances with missing data.
- Replace the missing values by the mean of the corresponding variable.
The Unuse method is recommended when the number of instances is big and the number of missing values is small. If that is not the case, it might be better to use the Mean method.
The iris flowers data set does not have missing values.
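For a data set that does contain missing values, the two methods can be sketched as follows. The function names and the small table of strings are made up for the illustration, and the sketch assumes that every column has at least one value present.

```python
MISSING_LABELS = {"Unknown", "?", "NaN"}  # labels mentioned in the text

def unuse_missing(rows):
    """Unuse method: keep only the instances without missing values."""
    return [row for row in rows if not any(v in MISSING_LABELS for v in row)]

def replace_by_mean(rows):
    """Mean method: replace each missing value by the mean of its variable."""
    n_columns = len(rows[0])
    means = []
    for j in range(n_columns):
        present = [float(row[j]) for row in rows if row[j] not in MISSING_LABELS]
        means.append(sum(present) / len(present))
    return [[means[j] if row[j] in MISSING_LABELS else float(row[j])
             for j in range(n_columns)] for row in rows]

rows = [["5.0", "3.0"], ["?", "4.0"], ["7.0", "NaN"]]  # made-up raw data
print(unuse_missing(rows))    # [['5.0', '3.0']]
print(replace_by_mean(rows))  # [[5.0, 3.0], [6.0, 4.0], [7.0, 3.5]]
```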
The neural network defines the predictive model as a multidimensional function containing adjustable parameters. Most neural networks, even biological neural networks, exhibit a layered structure. Therefore, layers are the basis to determine the architecture of a neural network.
The most important element of a Neural Designer neural network is the multilayer perceptron. Many practical applications, however, require extensions to the multilayer perceptron. That composition of layers results in a very good function approximator, which can be used for a variety of purposes.
Therefore, the outputs from the neural network depend on the inputs to it and on the different parameters within the neural network. Recall that neural networks have very good approximation properties.
For instance, an approximation problem might require a multilayer perceptron with scaling and unscaling layers, see the next figure.
Here we identify three basic elements, the scaling layer, the multilayer perceptron and the unscaling layer, which together transform a set of inputs into a set of outputs: given a set of input signals, the neural network computes a set of output signals.
The inputs are the independent variables in the model. The inputs in the neural network are the same as the inputs in the data set.
Some basic information related to the input and output variables of a neural network includes the name, description and units of those variables. That information will be used to avoid errors such as interchanging the role of the variables, misunderstanding the significance of a variable or using the wrong system of units.
In practice it is always convenient to scale the inputs so that all of them are of order zero. In this way, if all the neural parameters are of order zero, the outputs will also be of order zero. On the other hand, scaled outputs must be unscaled in order to produce the original units. Here we identify the elements which transform a set of inputs into a set of scaled inputs:
- A set of inputs statistics.
- A scaling method.
In the context of neural networks, the scaling function can be thought of as an additional layer connected to the input layer of the multilayer perceptron. The number of scaling neurons is the number of inputs, and the connectivity of that layer is not total, but one-to-one. The next figure shows a scaling layer.
The scaling layer contains some basic statistics on the inputs. They include the mean, standard deviation, minimum and maximum values.
Two scaling methods widely used in practice are the minimum-maximum method and the mean-standard deviation method. Both methods are linear and, in general, produce similar results. The minimum-maximum method processes unscaled inputs in any range to produce scaled inputs which fall between -1 and 1. The mean-standard deviation method scales the inputs so that they have mean 0 and standard deviation 1.
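Both scaling methods can be written in a few lines. The sketch below assumes one input variable at a time and uses the population standard deviation; the function names and the sample values are made up for the illustration.

```python
import math

def column_statistics(values):
    """Minimum, maximum, mean and (population) standard deviation of one input."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"min": min(values), "max": max(values),
            "mean": mean, "std": math.sqrt(variance)}

def scale_minimum_maximum(x, stats):
    """Map [min, max] linearly onto [-1, 1]."""
    return 2.0 * (x - stats["min"]) / (stats["max"] - stats["min"]) - 1.0

def scale_mean_standard_deviation(x, stats):
    """Shift and rescale so that the variable has mean 0 and standard deviation 1."""
    return (x - stats["mean"]) / stats["std"]

values = [4.0, 5.0, 6.0, 7.0, 8.0]  # made-up values of one input
stats = column_statistics(values)
print([scale_minimum_maximum(x, stats) for x in values])
# [-1.0, -0.5, 0.0, 0.5, 1.0]
```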
A multilayer perceptron is built up by organizing layers of perceptrons in a network architecture. Here we identify the elements which transform a set of inputs into a set of outputs:
- A network architecture.
- A set of parameters.
- The layers activation functions.
The architecture of a network refers to the number of layers, their arrangement and connectivity. The characteristic network architecture here is the so-called feed-forward architecture. In a feed-forward neural network, layers are grouped into a sequence, so that neurons in any layer are connected only to neurons in the next layer. Any architecture can be symbolized as a directed and labelled graph, where nodes represent neurons and edges represent connections among neurons. An edge label represents the parameter of the neuron into which the flow goes. Thus, a neural network typically consists of a set of sensory nodes which constitute the input layer, one or more hidden layers of neurons and a set of neurons which constitute the output layer. The input layer consists of external inputs and is not a layer of neurons; the hidden layers contain neurons; and the output layer is also composed of neurons.
The parameters of a layer involve all the biases and synaptic weights of the perceptrons composing that layer. The number of parameters in a layer is therefore equal to the size of the layer multiplied by one plus the number of inputs to that layer.
The layer combination function transforms the set of input values to produce a set of combination or net input values, by computing the combination of each perceptron.
The layer activation function defines the layer output in terms of its combination, by computing the activation of each perceptron.
Therefore, a layer computes a set of output values as a function of the input values to it. The output is built by composing the layer activation function with the layer combination function. The outputs depend on the inputs, but also on the parameters.
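The composition of combination and activation can be sketched for a single layer as follows. The function names, the hyperbolic tangent default and the numbers are assumptions chosen for the example.

```python
import math

def layer_combination(inputs, biases, weights):
    """Net input of each perceptron: its bias plus the weighted sum of the inputs."""
    return [b + sum(w * x for w, x in zip(row, inputs))
            for b, row in zip(biases, weights)]

def layer_activation(combinations, function=math.tanh):
    """Apply the activation function to each combination value."""
    return [function(c) for c in combinations]

def layer_outputs(inputs, biases, weights, function=math.tanh):
    """Layer output: the activation composed with the combination."""
    return layer_activation(layer_combination(inputs, biases, weights), function)

# Layer with 2 perceptrons and 3 inputs; the numbers are arbitrary.
biases = [0.0, 1.0]
weights = [[1.0, -1.0, 0.5],   # synaptic weights of perceptron 1
           [0.0, 2.0, 0.0]]    # synaptic weights of perceptron 2
print(layer_combination([1.0, 1.0, 2.0], biases, weights))  # [1.0, 3.0]
print(layer_outputs([1.0, 1.0, 2.0], biases, weights))      # tanh of each value
```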
The parameters of a multilayer perceptron involve the parameters of each perceptron in the network architecture. The number of parameters is the sum of the number of parameters in each layer. All the parameters in a multilayer perceptron are usually arranged in a vector. The norm of the parameters vector will provide a metric for measuring the complexity of a multilayer perceptron.
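The parameter count follows directly from the rule for one layer. In the sketch below, the architecture of 4 inputs, 6 hidden perceptrons and 3 outputs is an assumed example for the iris problem, and the Euclidean norm is one possible choice of norm.

```python
import math

def layer_parameters_number(inputs_number, layer_size):
    """Each perceptron has one bias plus one synaptic weight per input."""
    return layer_size * (1 + inputs_number)

def multilayer_perceptron_parameters_number(architecture):
    """architecture = [inputs number, hidden layer sizes..., outputs number]."""
    return sum(layer_parameters_number(n_in, n_out)
               for n_in, n_out in zip(architecture[:-1], architecture[1:]))

def parameters_norm(parameters):
    """Euclidean norm of the parameters vector."""
    return math.sqrt(sum(p * p for p in parameters))

print(multilayer_perceptron_parameters_number([4, 6, 3]))  # 6*(1+4) + 3*(1+6) = 51
print(parameters_norm([3.0, 4.0]))                         # 5.0
```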
The activation function of each layer determines the type of function that the multilayer perceptron represents. Hyperbolic tangent hidden layers and a linear output layer are a usual choice for approximation. A logistic activation function in all layers is commonly used in classification.
Communication proceeds layer by layer from the input layer via the hidden layers up to the output layer. The states of the output neurons represent the result of the computation. In this way, in a feed-forward neural network, the output of each neuron is a function of the inputs. Thus, given an input to such a neural network, the activations of all neurons in the output layer can be computed in a deterministic pass.
Neural Designer implements a deep architecture with an arbitrary number of perceptron layers. Most of the time, two layers of perceptrons will be enough to represent the data set. For very complex data sets, deeper architectures with three, four, or more layers of perceptrons might be required.
Also, scaled outputs from a multilayer perceptron are to be unscaled in order to produce the original units. In the context of neural networks, the unscaling function can be interpreted as an unscaling layer connected to the outputs of the multilayer perceptron. The elements which transform a set of scaled outputs into a set of outputs are:
- The outputs statistics.
- The scaling method.
The unscaling function can be seen as an additional layer connected to the output layer of the multilayer perceptron. The number of unscaling neurons is the number of outputs, and the connectivity of that layer is not total, but one-to-one.
An unscaling layer contains some basic statistics on the outputs. They include the mean, standard deviation, minimum and maximum values.
Two unscaling methods widely used in practice are the minimum-maximum and the mean-standard deviation methods. Both are linear methods, and the results that they produce are very similar. The minimum-maximum method takes scaled outputs ranging from -1 to 1 in order to produce outputs in the original range of the variables. The mean-standard deviation method takes scaled outputs with mean 0 and standard deviation 1 in order to produce outputs with the original means and standard deviations of the variables.
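Unscaling simply inverts the corresponding scaling function. A short sketch, with made-up function names and statistics:

```python
def unscale_minimum_maximum(y_scaled, minimum, maximum):
    """Map [-1, 1] linearly back onto [minimum, maximum]."""
    return 0.5 * (y_scaled + 1.0) * (maximum - minimum) + minimum

def unscale_mean_standard_deviation(y_scaled, mean, standard_deviation):
    """Restore the original mean and standard deviation."""
    return mean + standard_deviation * y_scaled

print(unscale_minimum_maximum(-0.5, 4.0, 8.0))          # 5.0
print(unscale_mean_standard_deviation(2.0, 10.0, 3.0))  # 16.0
```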
The main task of the bounding layer is to limit the outputs to a given range, such as a percentage (0-100). This is useful in many cases. For example, in an approximation problem the neural network might represent a function that can take negative values. If we are studying wine quality, a negative value makes no sense, so the bounding layer will limit the function. In classification problems, the bounding layer is not used because classification uses a sigmoidal function, which is bounded by definition.
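In its simplest form, bounding amounts to clamping each output between a lower and an upper bound. The function name and the default bounds below are assumptions for the example.

```python
def bound(output, lower=0.0, upper=100.0):
    """Clamp an output to [lower, upper]; the default bounds model a percentage."""
    return max(lower, min(upper, output))

print([bound(v) for v in [-3.2, 42.0, 117.5]])  # [0.0, 42.0, 100.0]
```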
The probabilistic layer takes an output to produce a new output whose elements can be interpreted as probabilities. In this way, the probabilistic outputs will always fall in the range [0, 1], and the sum of all will always be 1. This form of post-processing is often used in classification problems. A probabilistic layer is defined by:
- A probabilistic method.
In the context of neural networks, the probabilistic output function can be interpreted as an additional layer connected to the output layer of the multilayer perceptron. Therefore, the size of the probabilistic layer must be the number of outputs. Note that the probabilistic layer has total connectivity, and that it does not contain any parameter.
There are several probabilistic output methods. Two of the most popular are the competitive method and the softmax method. The competitive method assigns a probability of one to the output with the greatest value, and a probability of zero to the rest of the outputs. Note that this function is not differentiable. Therefore, neither the Jacobian nor the Hessian of the competitive function can be computed. The softmax function is a continuous probabilistic function, which guarantees that the outputs always fall in the range [0, 1] and that their sum is always 1.
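The two methods can be sketched as follows. The tie-breaking rule in `competitive` (the first maximum wins) and the subtraction of the maximum in `softmax` for numerical stability are choices made for this example.

```python
import math

def competitive(outputs):
    """Probability 1 for the greatest output and 0 for the rest (first one wins ties)."""
    winner = outputs.index(max(outputs))
    return [1.0 if i == winner else 0.0 for i in range(len(outputs))]

def softmax(outputs):
    """Exponentials normalized to sum 1; the maximum is subtracted for stability."""
    shift = max(outputs)
    exponentials = [math.exp(o - shift) for o in outputs]
    total = sum(exponentials)
    return [e / total for e in exponentials]

outputs = [0.2, 1.5, -0.3]  # arbitrary outputs of the last perceptron layer
print(competitive(outputs))  # [0.0, 1.0, 0.0]
print(softmax(outputs))      # three values in [0, 1] that sum to 1
```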
The loss index plays an important role in the use of a neural network. It defines the task the neural network is required to do and provides a measure of the quality of the representation that the neural network is required to learn. The choice of a suitable loss index depends on the particular application.
In general, the loss index will depend on the function represented by the neural network. On the other hand, when dealing with approximation or classification applications, it will be measured on the particular data set.
The learning problem for neural networks can then be stated as finding a neural network function for which the loss index takes on a minimum value.
The loss index can be visualized as a hyper-surface with the parameters as coordinates, see the next figure.
In this way, the learning problem for neural networks, formulated in terms of the minimization of the loss index, can be reduced to the optimization of the loss function.
A loss index in Neural Designer is composed of two different terms:
- Error term.
- Regularization term.
The error is the most important term in the loss expression. It defines the task that the neural network is required to accomplish.
For data modelling applications, such as approximation or classification, the sum squared error is the reference error term. It is the sum, over all the training instances in the data set, of the squared errors between the outputs from the neural network and the targets in the data set.
The mean squared error calculates the average squared error between the outputs from the neural network and the targets in the data set.
The root mean squared error takes the square root of the mean squared error between the outputs from the neural network and the targets in the data set.
The normalized squared error divides the squared error between the outputs from the neural network and the targets in the data set by a normalization coefficient. If the normalized squared error has a value of unity then the neural network is predicting the data 'in the mean', while a value of zero means perfect prediction of the data. This can be considered the default error term when solving approximation or classification problems.
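The error terms described so far can be sketched for a single output variable as follows. The normalization coefficient used here, the sum of squared deviations of the targets from their mean, is an assumption; it is chosen because it gives a value of one when the network always outputs the mean of the targets.

```python
import math

def sum_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))

def mean_squared_error(outputs, targets):
    return sum_squared_error(outputs, targets) / len(targets)

def root_mean_squared_error(outputs, targets):
    return math.sqrt(mean_squared_error(outputs, targets))

def normalized_squared_error(outputs, targets):
    # Assumed normalization coefficient: squared deviations of the targets from their mean.
    mean = sum(targets) / len(targets)
    coefficient = sum((t - mean) ** 2 for t in targets)
    return sum_squared_error(outputs, targets) / coefficient

targets = [1.0, 2.0, 3.0, 4.0]
outputs = [1.0, 2.0, 4.0, 5.0]   # made-up network outputs
print(sum_squared_error(outputs, targets))            # 2.0
print(mean_squared_error(outputs, targets))           # 0.5
print(normalized_squared_error([2.5] * 4, targets))   # 1.0 (prediction "in the mean")
```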
The weighted squared error is used in binary classification applications with unbalanced targets, i.e., when the numbers of positives and negatives are very different. It gives a different weight to errors belonging to positive and negative instances.
One of the potential difficulties of the normalized squared error is that it can receive too large a contribution from points which have large errors. If there are long tails on the distribution, then the solution can be dominated by a very small number of points which have particularly large errors. On such occasions, in order to achieve good generalization, it is preferable to choose a more suitable error method. The Minkowski error is the sum, over the training instances, of the absolute difference between the outputs and the targets raised to an exponent which can vary between 1 and 2. That exponent is called the Minkowski parameter, and a default value for it could be 1.5.
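The effect of the Minkowski parameter on an outlier can be seen in a small sketch. The function name and the data are made up for the illustration.

```python
def minkowski_error(outputs, targets, minkowski_parameter=1.5):
    """Sum of absolute errors raised to the Minkowski parameter (between 1 and 2)."""
    if not 1.0 <= minkowski_parameter <= 2.0:
        raise ValueError("the Minkowski parameter is expected in [1, 2]")
    return sum(abs(o - t) ** minkowski_parameter
               for o, t in zip(outputs, targets))

targets = [0.0, 0.0, 0.0]
outputs = [0.1, 0.1, 4.0]   # the last instance is an outlier
print(minkowski_error(outputs, targets, 2.0))  # the outlier contributes 16
print(minkowski_error(outputs, targets, 1.5))  # the outlier contributes 8
```

With an exponent of 2 the outlier contributes 16 to the error, while with 1.5 it contributes only 8, so its influence on the solution is reduced.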
For all the error methods that we have seen above, the gradient can be found analytically using the so called back-propagation algorithm.
A problem is called well-posed if its solution meets existence, uniqueness and stability. A solution is said to be stable when small changes in the independent variable lead to small changes in the dependent variable. Otherwise the problem is said to be ill-posed.
An approach for ill-posed problems is to control the effective complexity of the neural network. This can be achieved by adding a regularization term to the loss index. Approximation or classification problems with noisy data sets are applications in which regularization can be useful.
One of the simplest forms of regularization term consists of the norm of the neural parameters vector. Adding that term to the error will cause the neural network to have smaller weights and biases, and this will force its response to be smoother. This regularization term is weighted by a parameter, which must be greater than zero. A default value for that weight might be 0.01. If the solution is too smooth, the weight must be decreased. Conversely, if the solution oscillates too much, the weight must be increased.
The gradient for the regularization term from above can be computed in a straightforward manner.
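A minimal sketch of a loss index with this regularization term and its gradient is shown next. It assumes the Euclidean norm of the parameters; the function names are made up for the example.

```python
import math

def regularized_loss(error, parameters, regularization_weight=0.01):
    """Loss index = error term + weight * Euclidean norm of the parameters."""
    norm = math.sqrt(sum(p * p for p in parameters))
    return error + regularization_weight * norm

def regularization_gradient(parameters, regularization_weight=0.01):
    """Gradient of weight * norm(parameters), which is weight * parameters / norm."""
    norm = math.sqrt(sum(p * p for p in parameters))
    if norm == 0.0:
        # The norm is not differentiable at the origin; return zeros by convention.
        return [0.0 for _ in parameters]
    return [regularization_weight * p / norm for p in parameters]

parameters = [3.0, -4.0]                     # norm is 5
print(regularized_loss(0.5, parameters))     # approximately 0.5 + 0.01 * 5
print(regularization_gradient(parameters))   # approximately [0.006, -0.008]
```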
The procedure used to carry out the learning process is called the training (or learning) strategy. The training strategy is applied to the neural network in order to obtain the minimum possible loss. The type of training is determined by the way in which the adjustment of the parameters in the neural network takes place.
Although the loss index is multidimensional, one-dimensional optimization methods are of great importance. Indeed, one-dimensional optimization algorithms are very often used inside multidimensional optimization algorithms.
A function is said to have a relative or local minimum at some point if the function is greater everywhere else within some neighbourhood of that point. Similarly, a point is called a relative or local maximum if the function is smaller everywhere else within some neighbourhood of that point. The function is said to have a global or absolute minimum at some point if the function is greater everywhere else within the whole domain. Similarly, a point will be a global maximum if the function is smaller everywhere else within the whole domain. Finding a global optimum is, in general, a very difficult problem.
On the other hand, the tasks of maximization and minimization are trivially related to each other, since maximization of a function is equivalent to minimization of its negative, and vice versa.
In this regard, a one-dimensional optimization problem is one in which the argument which minimizes the loss index is to be found. The necessary condition states that if the directional loss index has a relative optimum at some point, and if the derivative exists as a finite number at that point, then the derivative is zero there. The condition for the optimum to be a minimum is that the second derivative is greater than zero, and vice versa.
The most elementary approach for one-dimensional optimization problems is to use a fixed step size or training rate. More sophisticated algorithms which are widely used are the golden section method and Brent's method. Both of the latter algorithms begin by bracketing a minimum.
The golden section method brackets that minimum until the distance between the two outer points in the bracket is less than a defined tolerance.
Brent's method performs a parabolic interpolation until the distance between the two outer points defining the parabola is less than a tolerance.
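As an illustration of the first of these two algorithms, a textbook golden section search is sketched next. It assumes that the initial bracket [a, b] already contains a single minimum; the function name, the toy loss and the tolerance are choices made for the example.

```python
import math

def golden_section(f, a, b, tolerance=1e-6):
    """Shrink the bracket [a, b] around a minimum of f until it is narrower than tolerance.

    Assumes f is unimodal on [a, b], i.e. the bracket contains a single minimum.
    """
    inverse_phi = (math.sqrt(5.0) - 1.0) / 2.0   # about 0.618
    c = b - inverse_phi * (b - a)
    d = a + inverse_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tolerance:
        if fc < fd:            # the minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inverse_phi * (b - a)
            fc = f(c)
        else:                  # the minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inverse_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Toy directional loss with its minimum at a training rate of 2.
rate = golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
print(abs(rate - 2.0) < 1e-5)  # True
```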
As shown above, the learning problem for neural networks is reduced to searching for a parameter vector at which the loss index takes a maximum or a minimum value. The concepts of relative or local and absolute or global optima for the multidimensional case apply in the same way as for the one-dimensional case. The tasks of maximization and minimization are also trivially related here. The necessary condition states that if the neural network is at a minimum of the loss index, then the gradient is the zero vector.
The loss index is, in general, a nonlinear function of the parameters. As a consequence, it is not possible to find closed-form training algorithms for the minima. Instead, we consider a search through the parameter space consisting of a succession of steps, or epochs. At each epoch, the loss will decrease by adjusting the neural network parameters. The change of parameters between two epochs is called the parameters increment. In this way, to train a neural network we start with some parameters vector (often chosen at random) and we generate a sequence of parameter vectors, so that the loss index is reduced at each iteration of the algorithm.
The training algorithm stops when a specified condition is satisfied. Some stopping criteria commonly used are:
- The parameters increment norm is less than a minimum value.
- The loss improvement in one epoch is less than a set value.
- Loss has been minimized to a goal value.
- The norm of the loss index gradient falls below a goal.
- A maximum number of epochs is reached.
- A maximum amount of computing time has been exceeded.
A stopping criterion of a different nature is early stopping. This method is used in ill-posed problems in order to control the effective complexity of the neural network. Early stopping is a very common practice in neural networks and often produces good solutions to ill-posed problems. The figure below is a state diagram of the training procedure, showing states and transitions in the training process of a neural network.
The training process is determined by the way in which the adjustment of the parameters in the neural network takes place. There are many different training algorithms, which have a variety of different computation and storage requirements. Moreover, there is no training algorithm best suited to all applications.
Training algorithms might require information from the loss function only, the gradient vector of the loss function or the Hessian matrix of the loss function. These methods, in turn, can perform either global or local optimization.
The simplest training algorithm is gradient descent. With this method, the neural parameters are updated in the direction of the negative gradient of the loss index.
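A minimal sketch of gradient descent with a fixed training rate, combined with the stopping criteria listed earlier (except computing time), could look as follows. The function signature, the default values and the toy quadratic loss are assumptions made for the example.

```python
import math

def gradient_descent(loss, gradient, parameters, training_rate=0.1,
                     minimum_increment_norm=1e-9, minimum_loss_decrease=1e-12,
                     loss_goal=1e-10, gradient_norm_goal=1e-6, maximum_epochs=1000):
    """Fixed-rate gradient descent; returns (parameters, epochs, stopping reason)."""
    def norm(vector):
        return math.sqrt(sum(x * x for x in vector))

    old_loss = loss(parameters)
    for epoch in range(1, maximum_epochs + 1):
        g = gradient(parameters)
        if norm(g) < gradient_norm_goal:
            return parameters, epoch - 1, "gradient norm goal"
        increment = [-training_rate * gi for gi in g]   # negative gradient direction
        parameters = [p + d for p, d in zip(parameters, increment)]
        new_loss = loss(parameters)
        if new_loss < loss_goal:
            return parameters, epoch, "loss goal"
        if norm(increment) < minimum_increment_norm:
            return parameters, epoch, "minimum parameters increment norm"
        if old_loss - new_loss < minimum_loss_decrease:
            return parameters, epoch, "minimum loss decrease"
        old_loss = new_loss
    return parameters, maximum_epochs, "maximum number of epochs"

# Toy quadratic loss with its minimum at (1, -2).
def toy_loss(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

def toy_gradient(p):
    return [2.0 * (p[0] - 1.0), 2.0 * (p[1] + 2.0)]

parameters, epochs, reason = gradient_descent(toy_loss, toy_gradient, [0.0, 0.0])
print(parameters, epochs, reason)
```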
In the conjugate gradient algorithm, the search is performed along conjugate directions, which generally produces faster convergence than gradient descent directions.
The quasi-Newton method is based on Newton's method, but does not require calculation of second derivatives. Instead, the quasi-Newton method computes an approximation of the inverse Hessian at each iteration of the algorithm, by only using gradient information.
Another important algorithm is the Levenberg-Marquardt algorithm. It was designed to approach second-order training speed without having to compute the Hessian matrix. The Levenberg-Marquardt algorithm can only be applied when the loss index has the form of a sum of squares (such as the sum squared error, the mean squared error or the normalized squared error).
For very small data sets (10 variables, 1000 instances), the Levenberg-Marquardt algorithm is recommended, due to its high speed. For very big data sets (1000 variables, 1000000 instances), the gradient descent method with fixed training rate is the best choice, since it requires less memory allocation. For intermediate problems, the quasi-Newton method or the conjugate gradient will perform well. The quasi-Newton method is the default training algorithm in Neural Designer.
Model selection is applied to find a neural network with a topology that minimizes the error for new data. There are two ways to obtain an optimal topology: order selection and inputs selection. Order selection algorithms are used to get the optimal number of hidden perceptrons in the neural network. Inputs selection algorithms are responsible for finding the optimal subset of inputs.
Inputs selection is a method to improve the quality of the predictions. It consists of extracting the subset of inputs that have the most influence on a particular physical, biological, social, etc. process.
The inputs selection algorithm stops when a specified condition is satisfied. Some stopping criteria used are:
- The number of iterations in which the selection error increases is more than a maximum value.
- Selection error has been minimized to a goal value.
- A maximum number of epochs is reached.
- A maximum amount of computing time has been exceeded.
Growing and pruning methods calculate the correlation of every input with every output in the neural network. The growing inputs method starts with the most correlated input and keeps adding well correlated variables until the selection error starts increasing. On the other hand, the pruning inputs algorithm starts with all the variables of the data set and then removes the inputs with little correlation with the outputs.
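The growing inputs method can be sketched as follows. The `selection_error` argument is a hypothetical callable that would train a neural network on the given inputs and return its selection error; in the example it is replaced by a table of made-up values.

```python
import math

def correlation(x, y):
    """Pearson linear correlation between two variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    deviation_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    deviation_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (deviation_x * deviation_y)

def growing_inputs(inputs, target, selection_error):
    """Add inputs from most to least correlated until the selection error increases.

    inputs: dict mapping an input name to its list of values.
    selection_error: callable(list of names) -> selection error (hypothetical).
    """
    ranked = sorted(inputs, key=lambda name: abs(correlation(inputs[name], target)),
                    reverse=True)
    selected, best_error = [], float("inf")
    for name in ranked:
        error = selection_error(selected + [name])
        if error > best_error:
            break                      # the selection error started increasing
        selected, best_error = selected + [name], error
    return selected

inputs = {"a": [1.0, 2.0, 3.0, 4.0],
          "b": [1.0, 3.0, 2.0, 4.0],
          "c": [2.0, 1.0, 2.0, 1.0]}
target = [1.1, 1.9, 3.2, 3.9]
# Stand-in selection errors; a real application would train and evaluate a network.
toy_errors = {("a",): 0.30, ("a", "b"): 0.25, ("a", "b", "c"): 0.40}
print(growing_inputs(inputs, target, lambda names: toy_errors[tuple(names)]))  # ['a', 'b']
```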
A different class of inputs selection method is the genetic algorithm. This is a stochastic method based on the mechanics of natural genetics and biological evolution. The genetic algorithm implemented includes several methods to perform fitness assignment, selection, crossover and mutation operators.
We define the error of a neural network for new data as the selection error. It measures the ability of the neural network to predict the result in a new case.
Two frequent problems in the design of a neural network are called underfitting and overfitting. The best generalization is achieved by using a model whose complexity is the most appropriate to produce an adequate fit of the data. In this way underfitting is defined as the effect of a selection error increasing due to a too simple model, whereas overfitting is defined as the effect of a selection error increasing due to a too complex model.
The next figure shows a generic situation of the selection error and the training error of a neural network depending on its order.
The objective is to automate the finding of the optimal order. Neural Designer implements two algorithms to perform this task.
The incremental order method is the simplest order selection algorithm. This method starts with a minimum order and increases the order of the last hidden layer of perceptrons until it reaches a maximum order or another stopping criterion. Finally, the algorithm returns the neural network with the optimal order obtained.
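A sketch of the incremental order idea is given next. The `selection_error` callable, which would train a network of the given order and return its selection error, is hypothetical, and the stopping criterion based on a maximum number of consecutive failures is an assumption made for the example.

```python
def incremental_order(selection_error, minimum_order=1, maximum_order=10,
                      maximum_selection_failures=3):
    """Try increasing orders and return the one with the lowest selection error.

    selection_error: callable(order) -> selection error of a network trained
    with that many perceptrons in the last hidden layer (hypothetical).
    """
    best_order, best_error, failures = None, float("inf"), 0
    for order in range(minimum_order, maximum_order + 1):
        error = selection_error(order)
        if error < best_error:
            best_order, best_error, failures = order, error, 0
        else:
            failures += 1
            if failures >= maximum_selection_failures:
                break          # the selection error keeps getting worse
    return best_order, best_error

# Stand-in U-shaped selection error with its minimum at order 4.
print(incremental_order(lambda order: (order - 4) ** 2 + 1))  # (4, 1)
```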
Simulated annealing is a metaheuristic optimization method. It is based on the annealing process in the metallurgical industry. This algorithm starts from a random order between a maximum order and a minimum order, and it changes to another order probabilistically in each iteration, until the method reaches a stopping criterion.
Threshold selection is applied in binary classification problems. This algorithm modifies the decision threshold of the probabilistic layer in order to obtain better accuracy. Some methods and their meanings are listed next.
F1 score: This method maximizes the harmonic mean of precision and sensitivity. This algorithm does not take the true negatives into account.
Matthews correlation: This method maximizes the correlation between the targets and the outputs.
Youden's index: This algorithm maximizes the probability that the prediction method will make a correct decision as opposed to guessing. This method finds the point of the ROC curve with maximal height above the chance line.
Kappa coefficient: This method maximizes the amount of agreement corrected by the agreement expected by chance.
ROC curve distance optimization: This algorithm finds the threshold value of the point of the ROC curve nearest to the point (0,1).
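Several of these criteria can be computed from the confusion counts obtained at each candidate threshold. The sketch below uses the standard definitions of the F1 score, the Matthews correlation, Youden's index and the ROC distance; the function names, the data and the list of candidate thresholds are assumptions made for the example.

```python
import math

def confusion_counts(targets, outputs, threshold):
    """Count (tp, fp, fn, tn) when outputs >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for target, output in zip(targets, outputs):
        positive = output >= threshold
        if positive and target == 1:
            tp += 1
        elif positive:
            fp += 1
        elif target == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and sensitivity; tn is not used."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def matthews_correlation(tp, fp, fn, tn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

def youden_index(tp, fp, fn, tn):
    """Sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity + specificity - 1.0

def roc_distance(tp, fp, fn, tn):
    """Distance from the ROC point (false positive rate, sensitivity) to (0, 1)."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return math.hypot(false_positive_rate, 1.0 - sensitivity)

def select_threshold(targets, outputs, score, candidates, maximize=True):
    """Return the candidate threshold with the best score (first one wins ties)."""
    choose = max if maximize else min
    return choose(candidates,
                  key=lambda th: score(*confusion_counts(targets, outputs, th)))

targets = [0, 0, 0, 1, 0, 1, 1, 1]
outputs = [0.1, 0.2, 0.3, 0.45, 0.48, 0.6, 0.8, 0.9]   # made-up probabilities
candidates = [0.25, 0.4, 0.5, 0.75]
print(select_threshold(targets, outputs, f1_score, candidates))  # 0.4
```

The ROC distance is a quantity to be minimized, so it would be used with `maximize=False`.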
The purpose of testing is to compare the outputs from the neural network against targets in an independent testing set. If the testing analysis is considered acceptable by a third party other than the designer, then the neural network can move to the production phase. Note that the results of testing depend very much on the problem at hand, and some numbers might be good for one application but bad for another. On the other hand, the testing methods are subject to the project type.
For approximation applications, calculation of the errors on the testing instances is usual. It is also frequent to calculate basic error statistics, and to draw error histograms. Nevertheless, performing a linear regression analysis is the most standard method of testing a neural network for approximation.
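A linear regression analysis between targets and outputs can be sketched as an ordinary least-squares fit. The function name and the data are made up for the illustration; for a perfect prediction the intercept would be 0, the slope 1 and the correlation 1.

```python
import math

def linear_regression_analysis(targets, outputs):
    """Least-squares line outputs = intercept + slope * targets, plus the correlation."""
    n = len(targets)
    mean_t, mean_o = sum(targets) / n, sum(outputs) / n
    s_tt = sum((t - mean_t) ** 2 for t in targets)
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_to = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    slope = s_to / s_tt
    intercept = mean_o - slope * mean_t
    correlation = s_to / math.sqrt(s_tt * s_oo)
    return intercept, slope, correlation

targets = [1.0, 2.0, 3.0, 4.0]
outputs = [1.5, 2.5, 3.5, 4.5]   # made-up network outputs with a constant offset
print(linear_regression_analysis(targets, outputs))  # (0.5, 1.0, 1.0)
```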
For classification applications, the most standard testing method is to calculate the confusion matrix. If the classification problem is binary (the output is true or false), then the classification accuracy, error rate, sensitivity, specificity, positive likelihood and negative likelihood are also computed.
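The binary classification tests can be sketched from the confusion matrix as follows. The positive and negative likelihoods are computed here with the standard likelihood ratio definitions, which is an assumption about the intended formulas; the function name and the data are made up, and the sketch assumes that sensitivity and specificity are both strictly between 0 and 1.

```python
def binary_classification_tests(targets, predictions):
    """Confusion matrix and derived tests for binary targets and predictions (0 or 1)."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "confusion": [[tp, fn], [fp, tn]],   # rows: actual class, columns: predicted class
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed definitions: standard positive and negative likelihood ratios.
        "positive_likelihood": sensitivity / (1.0 - specificity),
        "negative_likelihood": (1.0 - sensitivity) / specificity,
    }

targets = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]   # made-up predictions
results = binary_classification_tests(targets, predictions)
print(results["confusion"])  # [[3, 1], [1, 3]]
print(results["accuracy"])   # 0.75
```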
The concept of deployment in predictive data mining refers to the application of a model for prediction to new data. Building a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process.
The model deployment task is used to solve our problem, and it presents different results such as output values or directional output plots. One example of a model deployment task is writing the expression of the model.