Neural Networks Tutorial
6. Testing analysis

The purpose of testing is to compare the outputs from the neural network against the targets in an independent set (the testing instances). Note that the testing methods depend on the project type (approximation or classification).

If all the testing metrics are satisfactory, then the neural network can move to the so-called deployment phase. Note also that the results of testing depend very much on the problem at hand: some numbers might be good for one application but bad for another.

The following testing methods are described for each project type.

6.1. Approximation testing methods

Testing errors

To test a neural network, you may use any of the errors described in the loss index page as metrics.

The most important errors for measuring the accuracy of a neural network include the mean squared error, the root mean squared error, and the normalized squared error.

All these errors are measured on the testing instances of the data set.
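As a minimal sketch (not any particular library's API), the following Python snippet computes these errors for hypothetical outputs and targets arrays; all names here are illustrative only.

    import numpy as np

    # Hypothetical outputs and targets for the testing instances.
    outputs = np.array([2.1, 3.9, 5.2, 7.8])
    targets = np.array([2.0, 4.0, 5.0, 8.0])

    errors = outputs - targets

    mse = np.mean(errors ** 2)    # mean squared error
    rmse = np.sqrt(mse)           # root mean squared error
    # Normalized squared error: squared error relative to the spread of the targets.
    nse = np.sum(errors ** 2) / np.sum((targets - targets.mean()) ** 2)

    print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  NSE={nse:.4f}")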

Error statistics

The error statistics comprise the minimums, maximums, means, and standard deviations of the errors between the neural network outputs and the targets for the testing instances in the data set.

The following table contains the basic statistics on the absolute and percentage error data when predicting the electricity generated by a combined cycle power plant.

                  Minimum   Maximum   Mean    Standard deviation
Absolute error    0.03      18.13     4.18    5.27
Percentage error  0.05%     29.46%    6.79%   8.57%
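A sketch of how such statistics can be computed, assuming hypothetical prediction and measurement arrays (the values below are made up for illustration):

    import numpy as np

    # Hypothetical predictions and measurements of the generated power (MW).
    outputs = np.array([430.2, 447.9, 485.1, 466.3])
    targets = np.array([432.1, 443.6, 480.4, 470.2])

    absolute_errors = np.abs(outputs - targets)
    percentage_errors = 100 * absolute_errors / np.abs(targets)

    for name, e in (("Absolute", absolute_errors), ("Percentage", percentage_errors)):
        print(f"{name} error: min={e.min():.2f} max={e.max():.2f} "
              f"mean={e.mean():.2f} std={e.std():.2f}")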

Error histogram

Error histograms show how the errors from the neural network on the testing instances are distributed. In general, a normal distribution centered at 0 for each output variable is expected here.

The next figure illustrates the histogram of the errors made by a neural network when predicting the residuary resistance of sailing yachts. Here the number of bins is 10.
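A minimal Python sketch of such a histogram, using synthetic errors in place of the real outputs minus targets:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic errors standing in for outputs - targets on the testing instances.
    errors = np.random.normal(loc=0.0, scale=1.0, size=300)

    plt.hist(errors, bins=10)   # 10 bins, as in the figure above
    plt.xlabel("Error")
    plt.ylabel("Frequency")
    plt.show()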

Maximal errors

It is very useful to see which testing instances yield the maximum errors, since they can reveal deficiencies in the model.

The following table illustrates the maximal errors of a neural network for modeling the adhesive strength of nanoparticles.

Rank  Index  Error   Data
1     30     9.146   shear_rate: 75, particle_diameter: 4.89, particles_adhering: 42.43
2     35     9.121   shear_rate: 75, particle_diameter: 6.59, particles_adhering: 35.77
3     11     6.851   shear_rate: 50, particle_diameter: 4.89, particles_adhering: 52.76

As we can see, the instances with a shear rate of 75 seem to yield the largest errors.
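Ranking the testing instances by error is straightforward; here is an illustrative sketch with hypothetical arrays:

    import numpy as np

    outputs = np.array([40.1, 52.9, 46.2, 30.0])   # hypothetical predictions
    targets = np.array([42.4, 52.8, 35.8, 29.5])   # hypothetical targets

    errors = np.abs(outputs - targets)
    ranking = np.argsort(errors)[::-1]             # largest error first

    for rank, index in enumerate(ranking[:3], start=1):
        print(f"Rank {rank}: instance {index}, error {errors[index]:.3f}")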

Linear regression analysis

Linear regression analysis is the standard method for testing the performance of a model in approximation applications.

Here the neural network outputs and the corresponding data set targets for the testing instances are plotted.

This analysis leads to three parameters for each output variable: a, b, and R2. The first two, a and b, correspond to the y-intercept and the slope of the best linear regression relating scaled outputs and targets. The third, R2, is the correlation coefficient between the scaled outputs and the targets.

For a perfect fit (outputs exactly equal to targets), the slope would be 1, and the y-intercept would be 0. If the correlation coefficient is equal to 1, then there is perfect correlation between the outputs from the neural network and the targets in the testing subset.

The following figure is a plot of the linear regression analysis for predicting the noise generated by airfoil blades. As we can see, all the predicted values are very similar to the real ones.

The parameters obtained from this analysis are correlation = 0.952, intercept = 13.9 and slope = 0.89. Of these, the correlation coefficient is the most important. As this value is very close to 1, we can say that the neural network predicts the noise very well.
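These parameters can be obtained with an ordinary least-squares fit; a sketch using scipy with made-up data:

    import numpy as np
    from scipy.stats import linregress

    targets = np.array([120.1, 125.6, 130.2, 118.4, 127.9])   # hypothetical data
    outputs = np.array([121.0, 124.2, 128.8, 119.9, 126.5])

    result = linregress(targets, outputs)   # regression of outputs on targets
    print(f"intercept a = {result.intercept:.3f}")
    print(f"slope b     = {result.slope:.3f}")
    print(f"correlation = {result.rvalue:.3f}")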

6.2. Classification testing methods

Confusion matrix

In the confusion matrix, the rows represent the target classes in the data set and the columns the corresponding output classes from the neural network.

The diagonal cells in each table show the number of cases that were correctly classified, and the off-diagonal cells show the misclassified cases.

For binary classification, positive means identified and negative means rejected. Therefore, 4 different cases are possible: true positives, false positives, true negatives, and false negatives.

Note that the output from the neural network is, in general, a probability. Therefore, the decision threshold determines the classification. The default decision threshold is 0.5: an output above it is classified as positive, and an output below it as negative.

For the case of two classes the confusion matrix takes the following form:

               Predicted positive   Predicted negative
Real positive  true_positives       false_negatives
Real negative  false_positives      true_negatives

The following example is the confusion matrix for a neural network that assesses the risk of default of credit card clients.

               Predicted positive   Predicted negative
Real positive  745 (12.4%)          535 (8.92%)
Real negative  893 (14.9%)          3827 (63.8%)
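A confusion matrix can be built directly from the output probabilities and the decision threshold; a minimal sketch with hypothetical arrays:

    import numpy as np

    probabilities = np.array([0.9, 0.3, 0.6, 0.1, 0.8])   # hypothetical outputs
    actual = np.array([1, 0, 1, 0, 0])                    # hypothetical targets

    predicted = (probabilities >= 0.5).astype(int)        # default decision threshold

    true_positives = np.sum((actual == 1) & (predicted == 1))
    false_negatives = np.sum((actual == 1) & (predicted == 0))
    false_positives = np.sum((actual == 0) & (predicted == 1))
    true_negatives = np.sum((actual == 0) & (predicted == 0))

    print(true_positives, false_negatives)   # first row: real positives
    print(false_positives, true_negatives)   # second row: real negatives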

For multiple classification, the confusion matrix can be represented as follows:

               Predicted class 1   ···   Predicted class N
Real class 1   #                   ···   #
···            ···                 ···   ···
Real class N   #                   ···   #

The following example is the confusion matrix for the classification of iris flowers from sepal and petal dimensions.

                 Predicted setosa   Predicted versicolor   Predicted virginica
Real setosa      10 (33.3%)         0                      0
Real versicolor  0                  11 (36.7%)             0
Real virginica   0                  1 (3.33%)              8 (26.7%)

As we can see, all testing instances are correctly classified, except one, which is an iris virginica that has been classified as an iris versicolor.

Binary classification tests

For binary classification, there is a set of standard parameters for testing the performance of a neural network, all derived from the confusion matrix: the classification accuracy, the error rate, the sensitivity, and the specificity.

For the confusion matrix for assessing the risk of default of credit card clients that we saw above, these parameters show that the model classifies the negatives very well (specificity = 81.0%), but it is not so accurate for the positives (sensitivity = 58.2%).
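The following sketch reproduces these tests in Python from the confusion matrix values above; the accuracy and error rate lines use the standard definitions and are added here for illustration:

    # Confusion matrix values from the credit card example above.
    tp, fn, fp, tn = 745, 535, 893, 3827

    accuracy = (tp + tn) / (tp + fn + fp + tn)
    error_rate = (fp + fn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)    # ratio of positives correctly identified
    specificity = tn / (tn + fp)    # ratio of negatives correctly identified

    print(f"accuracy={accuracy:.3f} error_rate={error_rate:.3f} "
          f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")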

ROC curve

A standard method for testing a neural network in binary classification applications is to plot a ROC (Receiver Operating Characteristic) curve.

The ROC curve plots the false positive rate (or 1 - specificity) on the X axis and the true positive rate (or sensitivity) on the Y axis for different values of the decision threshold.

An example of ROC curve for predicting customer churn in banks is the following.

The capacity of discrimination is measured by calculating the area under the curve (AUC). A random classifier would give AUC = 0.5, and a perfect classifier AUC = 1.

For the above example, the area under curve is AUC = XXX.
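A minimal sketch of how an ROC curve and its AUC can be computed by sweeping the decision threshold, with made-up scores and targets:

    import numpy as np

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])  # hypothetical outputs
    actual = np.array([1,   1,   0,   1,   0,    1,   0,   0])

    # Sweep the decision threshold from high to low and record (FPR, TPR) points.
    thresholds = np.sort(np.unique(np.concatenate(([0.0, 1.0], scores))))[::-1]
    fpr, tpr = [], []
    for t in thresholds:
        predicted = scores >= t
        tpr.append(np.sum(predicted & (actual == 1)) / np.sum(actual == 1))
        fpr.append(np.sum(predicted & (actual == 0)) / np.sum(actual == 0))

    auc = np.trapz(tpr, fpr)   # area under the ROC curve
    print(f"AUC = {auc:.3f}")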

Output histograms

The histogram of the outputs shows how the outputs for the testing instances are distributed.

The following chart shows the histogram for the output conversion. The abscissa represents the centers of the bins, and the ordinate their corresponding frequencies. The minimum frequency is 61, which corresponds to the bin with center 0.537405. The maximum frequency is 146, which corresponds to the bin with center 0.47973.
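Bin centers and frequencies like those above can be obtained as follows; the outputs array here is synthetic:

    import numpy as np

    outputs = np.random.rand(1000)   # stand-in for the outputs on the testing instances

    frequencies, edges = np.histogram(outputs, bins=10)
    centers = (edges[:-1] + edges[1:]) / 2   # abscissa: the centers of the bins

    for c, f in zip(centers, frequencies):
        print(f"bin center {c:.3f}: frequency {f}")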

Cumulative gain

The cumulative gain analysis is a visual aid that shows the advantage of using a predictive model as opposed to selecting instances at random.

It consists of two elements: the cumulative gain lines of the model and a baseline representing a random classifier.

The next chart shows the results of the analysis in this case. The blue line represents the positive cumulative gain, the red line represents the negative cumulative gain and the grey line represents the cumulative gain for a random classifier.

For example, if the values of the positive and negative cumulative gain for a percentage of population of 0.5 are 0.8 and 0.2, this means that after studying 50% of the population we have already found 80% of all the positives and 20% of all the negatives, while with a random classifier we would have found 50% of each. The maximum gain score is used as a measure of the quality of the model.
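A sketch of how the cumulative gains can be computed, sorting hypothetical testing instances by their scores:

    import numpy as np

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])  # hypothetical outputs
    actual = np.array([1,   1,   0,   1,   1,   0,   0,   0])

    order = np.argsort(scores)[::-1]     # study the most likely positives first
    sorted_actual = actual[order]

    population = np.arange(1, len(actual) + 1) / len(actual)
    positive_gain = np.cumsum(sorted_actual) / np.sum(actual)
    negative_gain = np.cumsum(1 - sorted_actual) / np.sum(1 - actual)

    print(population)
    print(positive_gain)    # fraction of positives found so far
    print(negative_gain)    # fraction of negatives found so far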

Lift chart

This method provides a visual aid to evaluate the performance of a predictive model. It consists of a lift curve and a baseline. The lift curve represents the ratio between the positive events found using the model and those found without it, while the baseline represents randomness, i.e., not using a model.

The lift chart is depicted next. The x-axis displays the percentage of instances studied. The y-axis displays the ratio between the results predicted by the model and the results without the model.

For example, let us suppose that we are trying to find the positive instances of a data set. If we try to find them randomly, after studying 50% of the instances we discover 50% of the positive instances. Now, let us suppose that we first study those instances that are more likely to be positive according to the score they have received from the classifier, and that after studying 50% of them we have found 80% of the positive instances. Then, by using the model, after studying 50% of the instances we have discovered 1.6 times as many positive instances as by looking for them randomly. This advantage is shown in the lift chart.
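The lift curve is simply the positive cumulative gain divided by the fraction of the population studied; a self-contained sketch with made-up gain values:

    import numpy as np

    population = np.array([0.25, 0.50, 0.75, 1.00])     # fraction of instances studied
    positive_gain = np.array([0.45, 0.80, 0.95, 1.00])  # fraction of positives found

    lift = positive_gain / population   # e.g. 0.80 / 0.50 = 1.6 at half the population
    print(lift)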

Conversion rates

Conversion rates measure the percentage of cases that perform a desired action. This value can be optimized by acting directly on the client or by a better choice of the potential consumers.

The next chart shows two rates. The first column represents the conversion rate of the whole data set. The second one represents the conversion rate among the predicted positives of the model.
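These two rates can be computed as below, assuming hypothetical target and prediction arrays:

    import numpy as np

    actual = np.array([1, 0, 1, 0, 0, 1, 0, 0])      # 1 = the client converted
    predicted = np.array([1, 0, 1, 1, 0, 0, 0, 0])   # model's predicted positives

    data_set_rate = actual.mean()                 # conversion rate of the whole data set
    model_rate = actual[predicted == 1].mean()    # rate among the predicted positives

    print(f"data set: {data_set_rate:.1%}  model: {model_rate:.1%}")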

Profit chart

This testing method shows the difference between the profits obtained by choosing instances at random and those obtained using the model, as a function of the instance ratio.

The next chart has three lines: the grey one represents the profits from a random choice of the instances, the blue line is the evolution of the profits using the model, and the black one separates the benefits from the losses of the process.

The values of the previous plot are displayed in this table. The maximum profit is the value of the greatest benefit obtained with the model, and the instance ratio is the percentage of instances used to obtain that benefit.
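A rough sketch of such a profit calculation; the unit profit and unit cost are assumptions made up for this example, not values from the tutorial:

    import numpy as np

    # Assumed economics (not from the tutorial): profit per positive found,
    # cost per instance studied.
    unit_profit, unit_cost = 10.0, 2.0

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])
    actual = np.array([1,   1,   0,   1,   1,   0,   0,   0])

    sorted_actual = actual[np.argsort(scores)[::-1]]
    instances = np.arange(1, len(actual) + 1)

    model_profit = unit_profit * np.cumsum(sorted_actual) - unit_cost * instances
    random_profit = unit_profit * instances * actual.mean() - unit_cost * instances

    best = np.argmax(model_profit)
    print(f"maximum profit {model_profit[best]:.1f} "
          f"at instance ratio {instances[best] / len(actual):.2f}")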

Misclassified instances

When performing a classification task, it is important to know which of the instances are misclassified. This method shows which actual positive instances are predicted as negative (false negatives), and which actual negative instances are predicted as positive (false positives).

The following table shows the instances which are positive and are predicted as negative.

Instance id Instance data
XXX XXX
XXX XXX
XXX XXX

The following table shows the instances which are negative and are predicted as positive.

Instance Id Instance data
XXX XXX
XXX XXX
XXX XXX
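Finding the indices of these instances is a simple filtering operation; a sketch with hypothetical arrays:

    import numpy as np

    actual = np.array([1, 0, 1, 0, 1])      # hypothetical targets
    predicted = np.array([0, 0, 1, 1, 1])   # hypothetical predictions

    false_negatives = np.where((actual == 1) & (predicted == 0))[0]
    false_positives = np.where((actual == 0) & (predicted == 1))[0]

    print("positives predicted as negative (false negatives):", false_negatives)
    print("negatives predicted as positive (false positives):", false_positives)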