As we know, it is much more expensive to acquire a new client than to keep an existing one.

For banks, it is very useful to know what leads a client towards the decision to leave the company.

Churn prevention allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible.

In this example, we use customer data from a bank to build a predictive model for the clients that are likely to churn.

This is a classification project, since the variable to be predicted is binary (churn or loyal).

The goal here is to model the probability of churn, conditioned on the customer features.

The data set contains the information needed to create our model. We need to configure three things here:

- Data source.
- Variables.
- Instances.

The data file bank_churn.csv contains 12 features about 10000 clients of the bank.

The features or variables are the following:

- **customer_id**: unused variable.
- **credit_score**: used as input.
- **country**: used as input.
- **gender**: used as input.
- **age**: used as input.
- **tenure**: used as input.
- **balance**: used as input.
- **products_number**: used as input.
- **credit_card**: used as input.
- **active_member**: used as input.
- **estimated_salary**: used as input.
- **churn**: used as target. 1 if the client has left the bank during some period, 0 otherwise.

On the other hand, the instances are split at random into training (60%), selection (20%) and testing (20%) subsets.
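Such a split can be sketched in plain Python. Only the 60/20/20 proportions come from the text; the function name and seed below are illustrative:

```python
import random

def split_instances(n, seed=0):
    """Randomly assign n instance indices to training (60%),
    selection (20%) and testing (20%) subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_training = int(0.6 * n)
    n_selection = int(0.2 * n)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]
    return training, selection, testing

training, selection, testing = split_instances(10000)
print(len(training), len(selection), len(testing))  # 6000 2000 2000
```

The selection subset is held out during training and used to compare different models, while the testing subset is reserved for the final validation.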

Once the variables and instances are configured, we can perform some analytics on the data.

The data distribution tells us the percentages of churn and loyal customers.

In this data set, the percentage of churn customers is about 20%.

The input-target correlations indicate which variables might be causing attrition.

From the above chart, we can see that older customers are more likely to leave the bank.
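As a rough sketch, the correlation between a numeric input and the binary churn target can be computed as a Pearson (point-biserial) correlation. The age values below are made up for illustration; they are not the bank's data:

```python
from statistics import mean, pstdev

def correlation(x, y):
    """Pearson correlation between a numeric input variable and the
    binary churn target (the point-biserial correlation)."""
    mx, my = mean(x), mean(y)
    cov = mean((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

# Toy data: hypothetical ages where older customers churn more often.
age = [25, 32, 40, 51, 58, 63]
churn = [0, 0, 0, 1, 1, 1]
print(correlation(age, churn))  # strongly positive, close to 1
```

A positive correlation with the target, as for age here, flags the variable as a possible driver of attrition.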

The neural network for this application contains:

- A scaling layer.
- Two perceptron layers.
- A probabilistic layer.

This is the default architecture for classification problems.
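A minimal sketch of this layer stack in plain Python, assuming hyperbolic tangent activations in the perceptron layers and a logistic output in the probabilistic layer. The weights and scaling statistics below are illustrative placeholders, not trained values:

```python
import math

def scaling_layer(inputs, means, stds):
    """Scaling layer: standardizes each input variable."""
    return [(x - m) / s for x, m, s in zip(inputs, means, stds)]

def perceptron_layer(inputs, weights, biases):
    """Perceptron layer with hyperbolic tangent activations."""
    return [math.tanh(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]

def probabilistic_layer(inputs, weights, bias):
    """Probabilistic layer: logistic output interpreted as P(churn)."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass with two inputs and made-up parameters.
x = scaling_layer([39.0, 76485.0], means=[38.9, 76486.0], stds=[10.5, 62397.0])
h1 = perceptron_layer(x, weights=[[0.5, -0.3], [0.2, 0.7]], biases=[0.1, -0.2])
h2 = perceptron_layer(h1, weights=[[0.4, 0.6], [-0.5, 0.3]], biases=[0.0, 0.1])
p = probabilistic_layer(h2, weights=[1.2, -0.8], bias=-0.3)
print(0.0 < p < 1.0)  # True
```

The scaling layer keeps inputs on comparable ranges, and the probabilistic layer guarantees that the output can be read as a probability between 0 and 1.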

The training strategy is applied to the neural network to obtain the best possible performance. It is composed of two components:

- A loss index.
- An optimization algorithm.

The selected loss index is the weighted squared error with L2 regularization. The weighted squared error is very useful in applications where the targets are unbalanced. It gives a weight of 3.91 to churn customers and a weight of 1 to loyal customers.
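Note that the 3.91 weight is roughly the ratio of loyal to churn customers (about 80/20). A minimal sketch of this loss follows, assuming squared errors are averaged over instances; normalization conventions vary between tools, so treat the scaling as an assumption:

```python
def weighted_squared_error(targets, outputs, positive_weight=3.91):
    """Weighted squared error: squared errors on churn (target = 1)
    instances count positive_weight times as much as those on loyal
    (target = 0) instances."""
    total = sum((positive_weight if t == 1 else 1.0) * (t - o) ** 2
                for t, o in zip(targets, outputs))
    return total / len(targets)

# One churner predicted poorly (0.2) among three loyal customers.
error = weighted_squared_error([1, 0, 0, 0], [0.2, 0.1, 0.4, 0.0])
print(error)  # 0.6681
```

Without the weighting, a model that always predicts "loyal" would score well on this data set; the weight forces errors on the minority churn class to matter as much as those on the majority class.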

The selected optimization algorithm is the quasi-Newton method.

The following chart shows how the training (blue) and selection (orange) errors decrease with the training epochs.

The final training and selection errors are **training error = 0.609 WSE** and **selection error = 0.614 WSE**, respectively.
In the next section we will try to improve the generalization performance by reducing the selection error.

Order selection is used to find the complexity of the neural network that optimizes the generalization performance. That is, the number of neurons that minimizes the error on the selection instances.

The following chart shows the training and selection errors for each different order.

As the chart shows, the optimal number of neurons is 6, with **selection error = 0.578**.
With more neurons, the selection error begins to grow, even though the training error keeps decreasing.
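The search described above can be sketched as a simple loop over candidate orders. The mock evaluator below stands in for actual network training and is shaped like the chart, with its minimum at 6 neurons:

```python
def order_selection(evaluate, max_order=10):
    """Incremental order selection: evaluate networks with 1..max_order
    hidden neurons and keep the order with the lowest selection error.
    evaluate(order) is assumed to train a network with that many
    neurons and return its selection error."""
    best_order, best_error = None, float("inf")
    for order in range(1, max_order + 1):
        selection_error = evaluate(order)
        if selection_error < best_error:
            best_order, best_error = order, selection_error
    return best_order, best_error

# Mock evaluator: selection error is minimal at 6 neurons.
mock = lambda order: 0.578 + 0.01 * (order - 6) ** 2
best_order, best_error = order_selection(mock)
print(best_order, best_error)  # 6 0.578
```

In practice each call to the evaluator retrains the network, so the loop trades computation for a better generalization estimate.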

Input selection (or feature selection) is used to find the set of inputs that produces the best generalization. The genetic algorithm has been applied here, but it does not reduce the selection error, so we keep all input variables.

The following figure shows the final network architecture for this application.

The next step is to perform an exhaustive testing analysis to validate the predictive capabilities of the neural network.

A good measure of the performance of a binary classification model is the ROC curve.

We are interested in the area under the curve (AUC).
A perfect classifier would have AUC = 1, and a random one would have AUC = 0.5.
Our model has an **AUC = 0.861**, which is a good value.
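The AUC can also be computed directly from the predicted scores, as the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one. A minimal sketch with toy scores:

```python
def roc_auc(targets, scores):
    """Area under the ROC curve, computed as the probability that a
    random positive instance scores higher than a random negative one
    (ties count as half a win)."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

This pairwise formulation is equivalent to integrating the ROC curve and makes clear why 0.5 corresponds to random ranking.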

We can also look at the confusion matrix.
Next, we show the elements of this matrix for a **decision threshold = 0.5**.

| | Predicted positive | Predicted negative |
|---|---|---|
| **Real positive** | 316 (15.8%) | 96 (4.8%) |
| **Real negative** | 325 (16.3%) | 1263 (63.1%) |

From the above confusion matrix, we can calculate the following binary classification tests:

- **Classification accuracy: 78.9%** (ratio of correctly classified samples).
- **Error rate: 21.1%** (ratio of misclassified samples).
- **Sensitivity: 76.7%** (percentage of actual positives classified as positive).
- **Specificity: 79.5%** (percentage of actual negatives classified as negative).
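These four tests follow directly from the confusion matrix counts. A minimal sketch that reproduces the figures above from the matrix elements:

```python
def binary_classification_tests(tp, fn, fp, tn):
    """Standard binary classification tests derived from the
    confusion matrix counts at a fixed decision threshold."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,       # correctly classified
        "error_rate": (fp + fn) / total,     # misclassified
        "sensitivity": tp / (tp + fn),       # true positive rate
        "specificity": tn / (tn + fp),       # true negative rate
    }

tests = binary_classification_tests(tp=316, fn=96, fp=325, tn=1263)
print(tests["accuracy"], tests["sensitivity"], tests["specificity"])
```

Note that sensitivity and specificity depend on the decision threshold; lowering it below 0.5 would catch more churners at the cost of more false alarms.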

Now, we can simulate the performance of a retention campaign. For that, we use the cumulative gain chart.

The above chart tells us that, if we contact the 25% of customers with the highest probability of churn, we reach 75% of the customers who will actually leave the bank.
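A point on the cumulative gain chart can be computed by ranking customers by predicted churn probability. The scores and targets below are a toy example, not the bank data:

```python
def cumulative_gain(targets, scores, fraction):
    """Share of all churners reached when contacting the given
    fraction of customers, ranked by predicted churn probability."""
    ranked = [t for _, t in sorted(zip(scores, targets), reverse=True)]
    contacted = int(fraction * len(ranked))
    return sum(ranked[:contacted]) / sum(ranked)

# Toy example: 4 of 8 customers churn; the model ranks 3 of them
# in the top half, so contacting 50% reaches 75% of churners.
scores  = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
targets = [1,    1,    0,    1,    0,    1,    0,    0]
print(cumulative_gain(targets, scores, 0.5))  # 0.75
```

A random ranking would reach churners in proportion to the fraction contacted, so the gap between the curve and the diagonal measures the value the model adds to the campaign.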

Once we have tested the churn model, we can use it to evaluate the probability of churn of our customers.

For instance, consider a customer with the following features:

- credit_score: 650
- country: France
- gender: Female
- age: 39
- tenure: 5
- balance: 76485
- products_number: 2
- credit_card: Yes
- active_member: No
- estimated_salary: 100000

The probability of churn for that customer is 38%.

Finally, the file bank_churn.py contains the mathematical expression of the neural network in the Python programming language. This file can be embedded in the bank's CRM to facilitate the work of the Retention Department.