# 6 testing methods for binary classification

By Pablo Martin, Artelnics.

Once a predictive model has been trained, we need to evaluate its predictive power on new data that it has not seen before: the testing instances subset. This process determines whether the predictive model is good enough to be moved into the production phase.

The purpose of testing analysis is to compare the responses of the trained predictive model against the correct targets for each instance of the testing set. As these instances have not been used to train the predictive model, the results of this process can be used as a simulation of what would happen in a real-world situation.

Depending on the type of problem that we are analyzing, there are some specific methods that may help us to study in depth the performance of the predictive model. In this case, we focus on the following testing analysis methods for binary classification problems:

1. Confusion matrix.
2. Binary classification tests.
3. Conversion rates.
4. ROC curve.
5. Cumulative gain.
6. Lift chart.

To illustrate them, we use the following testing data set. The column named target indicates whether the corresponding instance is negative (0) or positive (1). The column named output is the score given by the predictive model to each instance; it can be interpreted as the probability that the instance is positive.

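| Instance | Target | Output | Instance | Target | Output |
|---|---|---|---|---|---|
| 1 | 1 | 0.99 | 11 | 1 | 0.41 |
| 2 | 1 | 0.85 | 12 | 0 | 0.40 |
| 3 | 1 | 0.70 | 13 | 0 | 0.28 |
| 4 | 1 | 0.60 | 14 | 0 | 0.27 |
| 5 | 0 | 0.55 | 15 | 0 | 0.26 |
| 6 | 1 | 0.54 | 16 | 0 | 0.25 |
| 7 | 0 | 0.53 | 17 | 0 | 0.24 |
| 8 | 1 | 0.52 | 18 | 0 | 0.23 |
| 9 | 0 | 0.51 | 19 | 0 | 0.20 |
| 10 | 1 | 0.49 | 20 | 0 | 0.10 |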

## 1. Confusion matrix

The confusion matrix is an m × m matrix, where m is the number of classes to be predicted. For binary classification problems, the number of classes is 2, so the confusion matrix has 2 rows and 2 columns.

The rows of the confusion matrix represent the target classes while the columns represent the output classes. The diagonal cells show the number of cases that were correctly classified and the off-diagonal cells show the misclassified cases.

We need to choose a decision threshold τ to label the instances as positives or negatives. If the probability assigned to the instance by the classifier is greater than τ, the instance is labeled as positive; if the probability is lower than τ, it is labeled as negative.

Once all the instances are classified, the predicted results are compared to the actual values of the target variables. This gives us four possibilities:

1. True positives (TP), which are the instances that are positives and are classified as positives.
2. False positives (FP), which are the instances that are negatives and are classified as positives.
3. False negatives (FN), which are the instances that are positives and are classified as negatives.
4. True negatives (TN), which are the instances that are negatives and are classified as negatives.

Back to our example, we choose a decision threshold of 0.5. As a consequence, the number of true positives is 6, the number of false positives is 3, the number of false negatives is 2 and the number of true negatives is 9. This information is arranged in the confusion matrix as follows.

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | 6 (TP) | 2 (FN) |
| Actual negative | 3 (FP) | 9 (TN) |

As we can see, our model correctly classifies most of the cases, but we need to perform more tests to get a complete picture of its ability to classify instances.
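The counting above can be reproduced with a few lines of Python. This is a minimal sketch, not the author's implementation: the function name `confusion_counts` and the variable names are illustrative, and the data are copied from the testing data set shown earlier.

```python
# Targets and model outputs for the 20 testing instances of the example.
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def confusion_counts(targets, outputs, threshold):
    """Return (TP, FP, FN, TN), labelling as positive every score above the threshold."""
    tp = fp = fn = tn = 0
    for target, score in zip(targets, outputs):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and target == 1:
            tp += 1
        elif predicted == 1 and target == 0:
            fp += 1
        elif predicted == 0 and target == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(targets, outputs, threshold=0.5)
print(tp, fp, fn, tn)  # 6 3 2 9
```

Changing `threshold` in the last call shows how the four counts shift as the decision threshold moves.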

## 2. Binary classification tests

There are some important parameters, derived from the confusion matrix, which can be truly helpful to understand the information that it provides.

Some of the most important ones are the classification accuracy (ACC), the error rate (ER), the sensitivity or true positive rate (TPR) and the specificity or true negative rate (TNR). All of them are calculated by using the values TP, TN, FP and FN written in the confusion matrix.

The classification accuracy, which is the ratio of instances correctly classified, can be calculated using the following formula:

$$classification\_accuracy = \frac{TP+TN}{total\_instances}$$

The error rate, which is the ratio of instances misclassified, is given by:

$$error\_rate = \frac{FP+FN}{total\_instances}$$

To calculate the sensitivity, which is the proportion of actual positives that are predicted as positives, we use the following expression:

$$sensitivity = \frac{TP}{positive\_instances}$$

Finally, the specificity, which is the proportion of actual negatives that are predicted as negatives, is calculated as follows:

$$specificity = \frac{TN}{negative\_instances}$$

In our case, the accuracy is 0.75 (75%) and the error rate is 0.25 (25%), so the model correctly labels a high percentage of the instances. The sensitivity is 0.75 (75%), which means that the model has a good capacity to detect the positive instances. Finally, the specificity is 0.75 (75%), which shows that the model correctly labels most of the negative instances.
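As a quick check of these figures, the four formulas can be evaluated directly from the confusion matrix counts of the example. The sketch below is illustrative only; the function name `binary_classification_tests` and the dictionary keys are assumptions made for this example.

```python
def binary_classification_tests(tp, fp, fn, tn):
    """Compute accuracy, error rate, sensitivity and specificity from the four counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,     # correctly classified / all instances
        "error_rate": (fp + fn) / total,   # misclassified / all instances
        "sensitivity": tp / (tp + fn),     # true positives / actual positives
        "specificity": tn / (tn + fp),     # true negatives / actual negatives
    }


# Counts obtained in the example with a decision threshold of 0.5.
tests = binary_classification_tests(tp=6, fp=3, fn=2, tn=9)
for name, value in tests.items():
    print(f"{name}: {value:.2f}")
```

Note that accuracy and error rate always add up to 1, since every instance is either correctly classified or misclassified.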

## 3. Conversion rates

Conversion rates measure the percentage of cases that perform a desired action. This value can be optimized by acting directly on the client or by better selecting the potential customers.

The best way to illustrate this method is to go back to our example. The next chart shows three rates. The first pair of columns shows the original rate of positives and negatives in the data set. In this case, we have 40% of positives and 60% of negatives.

The pair of columns in the middle shows this rate for the instances that were predicted as positives by our model. Within this group, the percentage of positives is 66.7% (6 out of 9) and the percentage of negatives is 33.3%. Lastly, the two columns on the right represent the analogous information for the instances that were predicted as negatives. In this case, we have 81.8% of negatives (9 out of 11) and 18.2% of positives.

Thus, among the instances predicted as positives, our model multiplies the positives rate of the actual data by 1.67 (66.7% / 40%), and among the instances predicted as negatives, it multiplies the negatives rate of the actual data by 1.36 (81.8% / 60%).
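The rates in this section follow directly from the example data and the 0.5 decision threshold. The following sketch recomputes them; the helper `positive_rate` and the variable names are hypothetical and chosen only for illustration.

```python
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]
threshold = 0.5


def positive_rate(labels):
    """Fraction of instances in the group whose target is positive."""
    return sum(labels) / len(labels)


# Split the targets according to the label assigned by the model.
predicted_positives = [t for t, s in zip(targets, outputs) if s > threshold]
predicted_negatives = [t for t, s in zip(targets, outputs) if s <= threshold]

base_positive_rate = positive_rate(targets)                       # 8 / 20
positives_in_predicted_positives = positive_rate(predicted_positives)      # 6 / 9
negatives_in_predicted_negatives = 1 - positive_rate(predicted_negatives)  # 9 / 11

# How much the model improves each rate with respect to the raw data.
positive_ratio = positives_in_predicted_positives / base_positive_rate
negative_ratio = negatives_in_predicted_negatives / (1 - base_positive_rate)
print(f"positives rate multiplied by {positive_ratio:.3f}")
print(f"negatives rate multiplied by {negative_ratio:.3f}")
```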

## 4. ROC curve

The receiver operating characteristic (ROC) curve is one of the most useful testing methods for binary classification problems, since it provides a comprehensive and visually attractive way to summarize the accuracy of predictions.

By varying the value of the decision threshold between 0 and 1, we obtain a set of different classifiers for which we can calculate their specificity and their sensitivity. The points of a ROC curve are the representation of the values of those parameters for each of the values of the decision threshold.

The next chart shows the ROC curve for our example. As we can see, the values of (1-specificity) are plotted on the x-axis, while the y-axis shows the corresponding values of the sensitivity.

For a perfect model, the ROC curve would pass through the upper left corner, which is the point at which both the sensitivity and the specificity take the value 1. As a consequence, the closer the ROC curve comes to the point (0,1), the better the classifier.

The most important parameter that can be obtained from a ROC curve is the area under the curve (AUC), which is used as a measure of the quality of the classifier. For a perfect model, the area under the curve would be 1. For our example, the AUC is 0.90, which shows the good performance of our classifier.
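To make the construction concrete, the sketch below builds the ROC points for the example and integrates them with the trapezoidal rule. It is a minimal illustration under two assumptions: no two instances share the same score (true for this data set), and the function names `roc_points` and `area_under_curve` are invented for this example.

```python
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def roc_points(targets, outputs):
    """Return the (1 - specificity, sensitivity) points, assuming no tied scores."""
    positives = sum(targets)
    negatives = len(targets) - positives
    # Lowering the threshold past each score adds one more predicted positive.
    ranked = sorted(zip(outputs, targets), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, target in ranked:
        if target == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(targets, outputs)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

For this data the sketch gives an AUC of about 0.906, in line with the 0.90 quoted above.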

In addition, we can find the optimal threshold, which is the threshold that best discriminates between the two classes because it jointly maximizes the specificity and the sensitivity. Its value is the threshold corresponding to the point of the ROC curve that is closest to the upper left corner. For our problem, the threshold that best separates the two classes is 0.4.
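One way to locate this threshold is to scan a set of candidate values and keep the one whose ROC point lies closest to (0,1). The sketch below assumes candidate thresholds in steps of 0.1 and a strict "greater than" decision rule; both choices, as well as the name `closest_to_corner`, are assumptions of this example rather than part of the original analysis.

```python
import math

targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def closest_to_corner(targets, outputs, thresholds):
    """Return (threshold, distance) for the ROC point nearest to the corner (0, 1)."""
    positives = sum(targets)
    negatives = len(targets) - positives
    best = None
    for threshold in thresholds:
        tp = sum(1 for t, s in zip(targets, outputs) if s > threshold and t == 1)
        fp = sum(1 for t, s in zip(targets, outputs) if s > threshold and t == 0)
        # Distance from (1 - specificity, sensitivity) to the point (0, 1).
        distance = math.hypot(fp / negatives, 1 - tp / positives)
        if best is None or distance < best[1]:
            best = (threshold, distance)
    return best


candidate_thresholds = [i / 10 for i in range(11)]  # assumed grid: 0.0, 0.1, ..., 1.0
best_threshold, best_distance = closest_to_corner(targets, outputs, candidate_thresholds)
print(best_threshold, best_distance)  # 0.4 0.25
```

At a threshold of 0.4 all 8 positives are detected and only 3 of the 12 negatives are mislabeled, so the ROC point is (0.25, 1).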

## 5. Cumulative gain

While the previous methods evaluated the performance of the model on the whole population, cumulative gain charts evaluate the performance of the classifier on every portion of the data.

This testing method is especially useful in those problems in which the objective is to find the largest number of positive instances while examining the smallest number of cases, such as a marketing campaign.

As we have said before, the classifier labels each case with a score, which is the probability that the current instance is positive. As a consequence, if the model ranks the instances correctly, then by first calling those with the highest scores we obtain more positive responses than by calling at random. Cumulative gain charts show this advantage in a visual way.

The following chart shows the results of the cumulative gain analysis for our example.

The grey line, or base line, in the middle represents the cumulative gain for a random classifier. The blue line is the cumulative gain for our example. Each of its points represents the percentage of positives found for the corresponding ratio of instances on the x-axis. Finally, the negative cumulative gain is also depicted, represented by the red line. It shows the percentage of negatives found for each portion of the data.

If we have a good classifier, the cumulative gain should be above the base line while the negative cumulative gain should be below it, as happens in our example. Supposing that our example was a marketing campaign, the curve shows that by calling only the 60% of the population with the highest scores we would have got all the positive responses.
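The two curves can be tabulated from the example data by sorting the instances by score and counting positives and negatives in each top portion. This is a sketch under the assumption that the population is sampled in steps of 10%; the function name `cumulative_gains` is illustrative.

```python
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def cumulative_gains(targets, outputs, step=10):
    """Map each population percentage to (positive gain, negative gain)."""
    # Rank the targets by decreasing model score.
    ranked = [t for _, t in sorted(zip(outputs, targets), reverse=True)]
    positives = sum(ranked)
    negatives = len(ranked) - positives
    gains = {}
    for percent in range(0, 101, step):
        n_top = len(ranked) * percent // 100      # instances in the top portion
        found = sum(ranked[:n_top])               # positives among them
        gains[percent] = (found / positives, (n_top - found) / negatives)
    return gains


gains = cumulative_gains(targets, outputs)
for percent, (positive_gain, negative_gain) in gains.items():
    print(f"{percent:3d}%  positives: {positive_gain:.3f}  negatives: {negative_gain:.3f}")
```

The row for 60% shows a positive gain of 1, which matches the observation that the top 60% of the population contains every positive instance.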

Once all the curves are plotted, we can calculate the maximum gain score, which is the maximum distance between the cumulative gain and the negative cumulative gain, as well as the point of maximum gain, which is the percentage of the instances for which the score has been reached.

As we can see in the table, for our example, the maximum distance between both lines is reached at 60% of the population and takes the value 0.667.
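The maximum gain score can be found by taking, over all sampled percentages, the largest difference between the positive and the negative cumulative gain. The sketch below again assumes steps of 10% (on a finer grid the maximum could fall at a different point), and the name `maximum_gain` is hypothetical.

```python
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def maximum_gain(targets, outputs, step=10):
    """Return (percent, score): where the gap between both gain curves is largest."""
    ranked = [t for _, t in sorted(zip(outputs, targets), reverse=True)]
    positives = sum(ranked)
    negatives = len(ranked) - positives
    best_percent, best_score = 0, 0.0
    for percent in range(0, 101, step):
        n_top = len(ranked) * percent // 100
        found = sum(ranked[:n_top])
        # Positive cumulative gain minus negative cumulative gain.
        score = found / positives - (n_top - found) / negatives
        if score > best_score:
            best_percent, best_score = percent, score
    return best_percent, best_score


max_percent, max_score = maximum_gain(targets, outputs)
print(f"maximum gain {max_score:.3f} at {max_percent}% of the population")
```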

## 6. Lift chart

The information provided by lift charts is closely related to that provided by the cumulative gain. A lift chart represents the actual lift for each percentage of the population, which is defined as the ratio between the percentage of positive instances found by using the model and the percentage found without using it, that is, by selecting instances at random.

The lift chart for a random classifier is represented by a straight line joining the points (0,1) and (1,1). The lift chart for the current model is constructed by plotting different percentages of the population on the x-axis against the corresponding actual lift on the y-axis. If the lift chart stays above the base line, then the model is better than randomness at every point. Back to our example, the lift chart is shown below.

As we can see, the lift curve always stays above the grey line, reaching its maximum value of 2.5 for the instance ratios of 0.1 and 0.2. This means that the model multiplies the percentage of positives found by 2.5 for the top 10% and 20% of the population.
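The lift values can be checked with a short computation. The sketch below assumes steps of 10% and computes the lift as the rate of positives in the top portion divided by the rate of positives in the whole data set, which is equivalent to dividing the cumulative gain by the population ratio. The function name `lift_curve` is illustrative.

```python
targets = [1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
outputs = [0.99, 0.85, 0.70, 0.60, 0.55, 0.54, 0.53, 0.52, 0.51, 0.49,
           0.41, 0.40, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.20, 0.10]


def lift_curve(targets, outputs, step=10):
    """Map each population percentage (from `step` to 100) to its lift."""
    ranked = [t for _, t in sorted(zip(outputs, targets), reverse=True)]
    total = len(ranked)
    positives = sum(ranked)
    lifts = {}
    for percent in range(step, 101, step):
        n_top = total * percent // 100
        found = sum(ranked[:n_top])
        # (found / n_top) is the positives rate in the top portion;
        # (positives / total) is the positives rate in the whole data set.
        lifts[percent] = (found * total) / (positives * n_top)
    return lifts


lifts = lift_curve(targets, outputs)
for percent, lift in lifts.items():
    print(f"{percent:3d}%  lift: {lift:.3f}")
```

The lift is 2.5 at 10% and 20% and decreases to 1 at 100%, where the model and random selection necessarily find the same positives.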