The purpose of testing is to compare the outputs from the neural network against targets in an independent set (the testing instances). Note that the testing methods are subject to the project type (approximation or classification).
If all the testing metrics are considered satisfactory, the neural network can move to the so-called deployment phase. Note also that the results of testing depend very much on the problem at hand: some numbers might be good for one application but bad for another.
The following sections describe the available testing methods.
To test a neural network, you may use any of the errors described in the loss index page as metrics.
The most important errors for measuring the accuracy of a neural network are:
The error statistics measure the minimums, maximums, means and standard deviations of the errors between the neural network outputs and the targets of the testing instances in the data set.
The following table contains the basic statistics on the absolute and percentage error data when predicting the electricity generated by a combined cycle power plant.
| | Minimum | Maximum | Mean | Deviation |
|---|---|---|---|---|
| Absolute error | 0.03 | 18.13 | 4.18 | 5.27 |
| Percentage error | 0.05% | 29.46% | 6.79% | 8.57% |
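As a minimal sketch of how these statistics could be computed with NumPy (the target and output values below are illustrative, not taken from the power plant data set):

```python
import numpy as np

# Illustrative targets and neural network outputs for the testing instances
targets = np.array([450.0, 460.0, 470.0, 455.0])
outputs = np.array([452.0, 455.0, 475.0, 440.0])

absolute_errors = np.abs(outputs - targets)
percentage_errors = 100.0 * absolute_errors / np.abs(targets)

# Basic statistics of the absolute errors
statistics = {
    "minimum": absolute_errors.min(),
    "maximum": absolute_errors.max(),
    "mean": absolute_errors.mean(),
    "deviation": absolute_errors.std(),
}
```

The same four statistics would be computed over `percentage_errors` to fill the second row of the table.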
Error histograms show how the errors from the neural network on the testing instances are distributed. In general, a normal distribution centered at 0 for each output variable is expected here.
The next figure illustrates the histogram of errors made by a neural network when predicting the residuary resistance of sailing yachts. Here the number of bins is 10.
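As a sketch, such an error histogram could be built with NumPy as follows (the error values are made up):

```python
import numpy as np

# Illustrative errors (output minus target) on the testing instances
errors = np.array([-2.1, -1.4, -0.3, 0.0, 0.2, 0.5, 0.9, 1.3, -0.7, 0.1])

# Split the error range into 10 bins and count the errors falling in each
frequencies, bin_edges = np.histogram(errors, bins=10)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0
```

Plotting `frequencies` against `bin_centers` yields the histogram; a roughly bell-shaped profile centered at 0 is the desired outcome.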
It is very useful to see which testing instances provide the maximum errors, to alert for deficiencies in the model.
The following table illustrates the maximal errors of a neural network for modeling the adhesive strength of nanoparticles.
| Rank | Index | Error | Data |
|---|---|---|---|
| 1 | 30 | 9.146 | shear_rate: 75, particle_diameter: 4.89, particles_adhering: 42.43 |
| 2 | 35 | 9.121 | shear_rate: 75, particle_diameter: 6.59, particles_adhering: 35.77 |
| 3 | 11 | 6.851 | shear_rate: 50, particle_diameter: 4.89, particles_adhering: 52.76 |
As we can see, instances with a shear rate of 75 seem to yield the biggest errors.
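Ranking the instances by error is straightforward. A minimal sketch (the error values are illustrative):

```python
import numpy as np

# Illustrative testing errors, one per instance
errors = np.array([1.2, 0.4, 6.851, 2.0, 9.146, 9.121])

top = 3
ranked_indices = np.argsort(errors)[::-1][:top]   # instances with the largest errors
for rank, index in enumerate(ranked_indices, start=1):
    print(rank, index, errors[index])
```

The data values of each ranked instance can then be looked up in the data set to inspect them, as in the table above.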
Linear regression analysis is the most standard method to test the performance of a model in approximation applications.
Here the neural network outputs and the corresponding data set targets for the testing instances are plotted.
This analysis leads to 3 parameters for each output variable:
The first two parameters, a and b, correspond to the y-intercept and the slope of the best linear regression relating scaled outputs and targets. The third parameter, R2, is the correlation coefficient between the scaled outputs and the targets.
For a perfect fit (outputs exactly equal to targets), the slope would be 1, and the y-intercept would be 0. If the correlation coefficient is equal to 1, then there is perfect correlation between the outputs from the neural network and the targets in the testing subset.
The following figure is a plot of the linear regression analysis for predicting the noise generated by airfoil blades. As we can see, all the predicted values are very similar to the real ones.
The parameters obtained from this analysis are correlation = 0.952, intercept = 13.9 and slope = 0.89. Among them, the correlation coefficient is the most important parameter. As this value is very close to 1, we can say that the neural network is predicting the noise very well.
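The three parameters can be obtained with an ordinary least-squares fit of outputs against targets. A minimal sketch with NumPy (illustrative values):

```python
import numpy as np

# Illustrative targets and neural network outputs on the testing instances
targets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
outputs = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Least-squares line outputs = slope * targets + intercept
slope, intercept = np.polyfit(targets, outputs, deg=1)

# Correlation coefficient between outputs and targets
correlation = np.corrcoef(targets, outputs)[0, 1]
```

Values of `slope` near 1, `intercept` near 0 and `correlation` near 1 indicate a good fit.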
In the confusion matrix, the rows represent the target classes in the data set and the columns the corresponding output classes from the neural network.
The diagonal cells in each table show the number of cases that were correctly classified, and the off-diagonal cells show the misclassified cases.
For binary classification, positive means identified and negative means rejected. Therefore, 4 different cases are possible:
Note that the output from the neural network is, in general, a probability. Therefore, the decision threshold determines the classification. The default decision threshold is 0.5. An output above is classified as positive and an output below is classified as negative.
For the case of two classes the confusion matrix takes the following form:
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | true_positives | false_negatives |
| Real negative | false_positives | true_negatives |
The following example is the confusion matrix for a neural network that assesses the risk of default of credit card clients.
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 745 (12.4%) | 535 (8.92%) |
| Real negative | 893 (14.9%) | 3287 (63.8%) |
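As a sketch of how a binary confusion matrix could be computed from output probabilities with the default decision threshold of 0.5 (the probabilities and targets below are illustrative):

```python
import numpy as np

# Illustrative output probabilities and binary targets for the testing instances
probabilities = np.array([0.9, 0.3, 0.7, 0.2, 0.6, 0.1])
targets = np.array([1, 0, 1, 1, 0, 0])

threshold = 0.5
predictions = (probabilities >= threshold).astype(int)

true_positives = int(np.sum((predictions == 1) & (targets == 1)))
false_negatives = int(np.sum((predictions == 0) & (targets == 1)))
false_positives = int(np.sum((predictions == 1) & (targets == 0)))
true_negatives = int(np.sum((predictions == 0) & (targets == 0)))
```

Changing `threshold` moves instances between the columns of the matrix, which is the basis of the ROC curve analysis described below.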
For multiple classification, the confusion matrix can be represented as follows:
| | Predicted class 1 | ··· | Predicted class N |
|---|---|---|---|
| Real class 1 | # | # | # |
| ··· | # | # | # |
| Real class N | # | # | # |
The following example is the confusion matrix for the classification of iris flowers from sepal and petal dimensions.
| | Predicted setosa | Predicted versicolor | Predicted virginica |
|---|---|---|---|
| Real setosa | 10 (33.3%) | 0 | 0 |
| Real versicolor | 0 | 11 (36.7%) | 0 |
| Real virginica | 0 | 1 (3.33%) | 8 (26.7%) |
As we can see, all testing instances are correctly classified, except one, which is an iris virginica that has been classified as an iris versicolor.
For binary classification, there is a set of standard parameters for testing the performance of a neural network. These parameters are derived from the confusion matrix:
The following lists the binary classification tests corresponding to the confusion matrix for assessing the risk of default of credit card clients that we saw above:
From these parameters, we can see that the model classifies the negatives very well (specificity = 81.0%), but it is not as accurate for the positives (sensitivity = 58.2%).
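The most common of these tests can be computed directly from the four confusion matrix counts. A minimal sketch (the counts below are illustrative, not the credit card example):

```python
# Illustrative confusion matrix counts
true_positives, false_negatives = 40, 10
false_positives, true_negatives = 5, 45

total = true_positives + false_negatives + false_positives + true_negatives

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)   # true positive rate
specificity = true_negatives / (true_negatives + false_positives)   # true negative rate
precision = true_positives / (true_positives + false_positives)
```

High specificity with low sensitivity, as in the credit card example above, means the model rejects negatives reliably but misses a notable share of positives.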
A standard method for testing a neural network in binary classification applications is to plot a ROC (Receiver Operating Characteristic) curve.
The ROC curve plots the false positives rate (or 1 - specificity) on the X axis and the true positives rate (or sensitivity) on the Y axis for different values of the decision threshold.
An example of ROC curve for diagnosing breast cancer from fine-needle aspirate images is the following.
The capacity of discrimination is measured by calculating the area under the curve (AUC). A random classifier gives AUC = 0.5 and a perfect classifier AUC = 1.
For the above example, the area under the curve is AUC = 0.994. That is a very good value, indicating that the model is predicting almost perfectly.
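As a sketch, the ROC curve and its AUC can be obtained by sweeping the decision threshold and accumulating one (false positive rate, true positive rate) point per threshold (illustrative data):

```python
import numpy as np

# Illustrative output probabilities and binary targets for the testing instances
probabilities = np.array([0.95, 0.85, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
targets = np.array([1, 1, 1, 0, 1, 0, 0, 0])

positives = targets.sum()
negatives = len(targets) - positives

# Sweep the decision threshold from "classify nothing" downwards
thresholds = np.concatenate(([np.inf], np.sort(probabilities)[::-1]))
fpr, tpr = [], []
for t in thresholds:
    predictions = probabilities >= t
    tpr.append(np.sum(predictions & (targets == 1)) / positives)
    fpr.append(np.sum(predictions & (targets == 0)) / negatives)

# Area under the curve by the trapezoidal rule
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
          for i in range(len(fpr) - 1))
```

Plotting `tpr` against `fpr` gives the ROC curve; the closer the curve hugs the top-left corner, the closer `auc` is to 1.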
The histogram of the outputs shows how they are distributed using the testing instances.
The following chart shows the histogram for the output conversion from a model that diagnoses whether an ultrasonic flowmeter is faulty. The abscissa represents the centers of the bins, and the ordinate their corresponding frequencies.
In this case, the possible output values are 0 and 1, corresponding to the two possible states of the meter (faulty or not), with frequencies of 47.0588% and 52.9412%, respectively.
The cumulative gain analysis is a visual aid that shows the advantage of using a predictive model as opposed to choosing instances at random.
It consists of two lines:
The next chart shows the results of the analysis for a model that predicts if a bank client is likely to churn. The blue line represents the positive cumulative gain, the red line represents the negative cumulative gain and the grey line represents the cumulative gain for a random classifier.
In this case, the model shows that by analyzing the 50% of the clients with the highest probability of churn, we would reach 80% of the clients who are actually going to leave the bank.
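The positive cumulative gain is obtained by sorting the instances by their score and accumulating the positives found. A minimal sketch with illustrative scores and labels:

```python
import numpy as np

# Illustrative churn probabilities and actual churn labels (1 = churned)
probabilities = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
targets = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0])

order = np.argsort(probabilities)[::-1]             # highest-scored clients first
gain = np.cumsum(targets[order]) / targets.sum()    # fraction of churners found so far
instances_ratio = np.arange(1, len(targets) + 1) / len(targets)
```

Plotting `gain` against `instances_ratio` gives the positive cumulative gain line; the random-classifier baseline is the diagonal `gain = instances_ratio`.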
This method provides a visual aid to evaluate the performance of a predictive model. It consists of a lift curve and a baseline. The lift curve represents the ratio between the positive events found using the model and those found without it, while the baseline represents randomness, i.e., not using a model.
The x-axis displays the percentage of instances studied. The y-axis displays the ratio between the results predicted by the model and the results without the model. Below is depicted the lift chart of a model that predicts if a person is going to donate blood.
To explain the usefulness of this chart, suppose that we are trying to find the positive instances of a data set. If we look for them randomly, after studying 50% of the instances we will have found 50% of the positive instances.
Now suppose that we first study the instances that are most likely to be positive according to the score they have received from the classifier and, after studying 50% of them, we find 80% of the positive instances.
Therefore, by using the model, after studying 50% of the instances we would discover 1.6 times as many positive instances (80% / 50% = 1.6) as by looking for them randomly. This advantage is what the lift chart represents.
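The lift curve is simply the cumulative gain divided by the fraction of instances studied. A minimal sketch with illustrative scores and labels:

```python
import numpy as np

# Illustrative probabilities and binary labels
probabilities = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
targets = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 0])

order = np.argsort(probabilities)[::-1]           # most-likely-positive first
gain = np.cumsum(targets[order]) / targets.sum()  # fraction of positives found
instances_ratio = np.arange(1, len(targets) + 1) / len(targets)

lift = gain / instances_ratio   # ratio between model and random search
```

Here, after studying 50% of the instances the model has found 80% of the positives, so the lift at that point is 1.6; the baseline is the constant line `lift = 1`.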
Conversion rates measure the percentage of cases that perform a desired action. This value can be optimized by acting directly on the client or by a better choice of potential consumers.
The next chart shows two rates. The first column represents the rates of the data set. The second one represents the ratios for the predicted positives of the model.
This testing method shows the difference between the profits obtained by choosing instances at random and those obtained by using the model, depending on the instance ratio.
Three lines are represented in this chart: the grey one represents the profits from randomly choosing the instances, the blue line is the evolution of the profits when using the model, and the black one separates the benefits from the losses of the process. The chart below represents the profit chart of a model that detects forged banknotes.
The values of the previous plot are displayed below.
The maximum profit is the value of the greatest benefit obtained with the model, 122.7%, and the instance ratio is the percentage of instances used to obtain that benefit, 45%.
When performing a classification problem, it is important to know which instances have been misclassified. This method shows which actual positive instances are predicted as negative (false negatives), and which actual negative instances are predicted as positive (false positives).
This information is displayed in table format, identifying each instance with an instance ID and showing its corresponding data, which is composed of the values of the model variables. Two tables are displayed: one for the positive instances predicted as negative, and another for the negative instances predicted as positive.
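As a sketch, the rows of those two tables can be selected by combining the targets with the thresholded predictions (the IDs, probabilities and targets below are illustrative):

```python
import numpy as np

# Illustrative probabilities, targets, and instance identifiers
probabilities = np.array([0.9, 0.2, 0.7, 0.4])
targets = np.array([1, 1, 0, 0])
instance_ids = np.array([101, 102, 103, 104])

predictions = (probabilities >= 0.5).astype(int)

# Positives predicted as negative, and negatives predicted as positive
false_negative_ids = instance_ids[(targets == 1) & (predictions == 0)]
false_positive_ids = instance_ids[(targets == 0) & (predictions == 1)]
```

The variable values of each selected instance would then be fetched from the data set to fill the corresponding table.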
To learn more about testing analysis, you can read the 6 testing methods for binary classification article on our blog.