Machine learning examples

Predict customer churn in banks

As we know, it is much more expensive to acquire a new client than to keep an existing one.

For banks it is very useful to know what leads a client towards the decision of leaving the company.

Churn prevention allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible.

In this example, we use customer data from a bank to construct a predictive model for the clients that are likely to churn.


  1. Application type.
  2. Data set.
  3. Neural network.
  4. Training strategy.
  5. Model selection.
  6. Testing analysis.
  7. Model deployment.

1. Application type

This is a classification project, since the variable to be predicted is binary (churn or loyal).

The goal here is to model the probability of churn, conditioned on the customer features.

2. Data set

The data set contains the information for creating our model. We need to configure three things here: the data source, the variables and the instances.

The data file bank_churn.csv contains 12 features about 10000 clients of the bank.

The features or variables are the following:

On the other hand, the instances are split at random into training (60%), selection (20%) and testing (20%) subsets.
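A minimal sketch of such a random split is shown here. The function name, the seed and the use of Python's standard random module are assumptions for illustration; the original tool may split the instances differently.

```python
import random

def split_indices(n_instances, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle instance indices and cut them into three disjoint subsets."""
    indices = list(range(n_instances))
    # A fixed seed (an illustrative choice) makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_training = round(fractions[0] * n_instances)
    n_selection = round(fractions[1] * n_instances)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    # Whatever remains goes to the testing subset.
    testing = indices[n_training + n_selection:]
    return training, selection, testing

training, selection, testing = split_indices(10000)
print(len(training), len(selection), len(testing))
```

With 10000 instances, the testing subset holds 2000 instances, which agrees with the total of the confusion matrix in the testing analysis.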

Once the variables and instances are configured, we can perform some analytics on the data.

The data distributions tell us the percentages of churn and loyal customers.

In this data set, the percentage of churn customers is about 20%.

The input-target correlations can indicate which variables are most strongly associated with attrition.

From the above chart, we can see that older customers are more likely to leave the bank.

3. Neural network

The next step is to choose a neural network to represent the classification function. For classification problems, it is composed of a scaling layer, perceptron layers and a probabilistic layer.

For the scaling layer, the minimum and maximum scaling method is set.
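As a rough illustration, the minimum and maximum method maps each input linearly onto the range [-1, 1]. The sketch below follows the form that the exported expression in section 7 uses for variables such as tenure; the function name is an assumption.

```python
def scale_minimum_maximum(x, minimum, maximum):
    """Map x linearly from [minimum, maximum] to [-1, 1]."""
    return 2.0 * (x - minimum) / (maximum - minimum) - 1.0

# Tenure ranges from 0 to 10 in the exported expression.
print(scale_minimum_maximum(0, 0, 10))
print(scale_minimum_maximum(5, 0, 10))
print(scale_minimum_maximum(10, 0, 10))
```

The smallest value of a variable maps to -1, the midpoint to 0 and the largest value to 1.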

We set 2 perceptron layers, one hidden layer with 3 neurons as a first guess and one output layer with 1 neuron, both layers having the logistic activation function.
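The forward pass through these two layers can be sketched as follows. The weights and biases are zero-valued placeholders chosen only to show the shapes (12 scaled inputs, 3 hidden neurons, 1 output neuron); they are not the trained parameters, and the function names are assumptions.

```python
import math

def logistic(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron_layer(inputs, weights, biases):
    """weights[j][i] connects input i to neuron j; returns one output per neuron."""
    return [
        logistic(bias + sum(w * x for w, x in zip(row, inputs)))
        for row, bias in zip(weights, biases)
    ]

n_inputs, n_hidden = 12, 3
scaled_inputs = [0.0] * n_inputs  # placeholder scaled inputs

# Placeholder parameters, for illustration only.
hidden_weights = [[0.0] * n_inputs for _ in range(n_hidden)]
hidden_biases = [0.0] * n_hidden
output_weights = [[0.0] * n_hidden]
output_biases = [0.0]

hidden = perceptron_layer(scaled_inputs, hidden_weights, hidden_biases)
output = perceptron_layer(hidden, output_weights, output_biases)
print(hidden, output)
```

Because both layers use the logistic activation, every neuron output lies between 0 and 1.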

Finally, we set the continuous probabilistic method for the probabilistic layer.

The next figure is a diagram for the neural network used in this example.

4. Training strategy

The training strategy is applied to the neural network to obtain the best possible performance. It is composed of two things: a loss index and an optimization algorithm.

The selected loss index is the weighted squared error with L2 regularization. The weighted squared error is very useful in applications where the targets are unbalanced. It gives a weight of 3.91 to churn customers and a weight of 1 to loyal customers.
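To make the weighting concrete, here is a minimal sketch of a weighted squared error that uses the two weights quoted above. It omits the L2 regularization term and any normalization the tool may apply, so its values are not directly comparable with the errors reported in this example; the function name is an assumption.

```python
def weighted_squared_error(outputs, targets, positives_weight=3.91, negatives_weight=1.0):
    """Sum of squared errors, with churn (target 1) instances weighted more heavily."""
    total = 0.0
    for output, target in zip(outputs, targets):
        weight = positives_weight if target == 1 else negatives_weight
        total += weight * (output - target) ** 2
    return total

# The same absolute mistake costs more on a churn customer than on a loyal one.
print(weighted_squared_error([0.5], [1]))
print(weighted_squared_error([0.5], [0]))
```

A weight close to 4 for the minority class is consistent with a data set in which about 20% of the customers churn, since loyal customers then outnumber churn customers by roughly four to one.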

The selected optimization algorithm is the quasi-Newton method.

The following chart shows how the training (blue) and selection (orange) errors decrease with the training epochs.

The final training and selection errors are training error = 0.609 WSE and selection error = 0.614 WSE, respectively. In the next section we will try to improve the generalization performance by reducing the selection error.

5. Model selection

Order selection is used to find the complexity of the neural network that optimizes the generalization performance. That is, the number of neurons that minimizes the error on the selection instances.

The following chart shows the training and selection errors for each different order after performing the incremental order method.

As the chart shows, the optimal number of neurons is 6, with selection error = 0.578. With more neurons the selection loss begins to grow, even though the training loss keeps decreasing.
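The idea behind an incremental search can be sketched as below. The evaluate callback, the stopping rule based on patience and the toy error curve are all assumptions for illustration; in practice evaluate would train a network with the given number of neurons and return its selection error.

```python
def incremental_order(evaluate, minimum_order=1, maximum_order=10, patience=2):
    """Grow the hidden layer one neuron at a time and keep the best selection error."""
    best_order, best_error = None, float("inf")
    failures = 0
    for order in range(minimum_order, maximum_order + 1):
        selection_error = evaluate(order)
        if selection_error < best_error:
            best_order, best_error = order, selection_error
            failures = 0
        else:
            # Stop once the selection error has failed to improve several times in a row.
            failures += 1
            if failures >= patience:
                break
    return best_order, best_error

# Made-up stand-in for "train and measure the selection error"; not data from this example.
def toy_selection_error(order):
    return (order - 4) ** 2 + 1.0

print(incremental_order(toy_selection_error))
```

The search returns the order with the lowest selection error seen so far, which mirrors the behavior described above: beyond the optimum, the selection error grows even if the training error keeps falling.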

Input selection (or feature selection) is used to find the set of inputs that produces the best generalization. The genetic algorithm has been applied here, but it does not reduce the selection error, so we keep all the input variables.

The following figure shows the final network architecture for this application.

6. Testing analysis

The next step is to perform an exhaustive testing analysis to validate the predictive capabilities of the neural network.

A standard way to assess the discrimination capacity of a binary classification model is the ROC curve.

We are interested in the area under the curve (AUC). A perfect classifier would have AUC = 1 and a random one would have AUC = 0.5. Our model has AUC = 0.861, which indicates a good ability to separate churn customers from loyal ones.

We can also look at the confusion matrix. Next, we show the elements of this matrix for a decision threshold = 0.5.
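For intuition, the AUC equals the probability that a randomly chosen churn customer receives a higher score than a randomly chosen loyal customer. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are illustrative assumptions, not results from this model.

```python
def area_under_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy scores, for illustration only.
print(area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of instances, so it suits small illustrations rather than large data sets.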

                 Predicted positive   Predicted negative
Real positive    316 (15.8%)          96 (4.8%)
Real negative    325 (16.3%)          1263 (63.1%)

From the above confusion matrix, we can calculate the following binary classification tests:
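As a sketch, the usual tests follow directly from the four counts in the matrix. The function and key names are assumptions; the formulas are the standard definitions.

```python
def binary_classification_tests(true_positives, false_negatives, false_positives, true_negatives):
    """Standard binary classification tests computed from confusion matrix counts."""
    total = true_positives + false_negatives + false_positives + true_negatives
    return {
        "accuracy": (true_positives + true_negatives) / total,
        "error_rate": (false_positives + false_negatives) / total,
        "sensitivity": true_positives / (true_positives + false_negatives),
        "specificity": true_negatives / (true_negatives + false_positives),
        "precision": true_positives / (true_positives + false_positives),
    }

# Counts taken from the confusion matrix above (decision threshold = 0.5).
tests = binary_classification_tests(316, 96, 325, 1263)
for name, value in tests.items():
    print(f"{name}: {value:.4f}")
```

For these counts, the accuracy is about 79%, the sensitivity about 77%, the specificity about 80% and the precision about 49%.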

Now, we can simulate the performance of a retention campaign. For that, we use the cumulative gain chart.

The above chart tells us that, if we contact the 25% of customers with the highest probability of churn, we reach 75% of the customers who will actually leave the bank.
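One point of a cumulative gain chart can be computed as in the following sketch: rank the customers by predicted probability, contact the top fraction and count how many actual churners fall in that group. The function name and the toy data are illustrative assumptions, not values from this example.

```python
def cumulative_gain(scores, labels, fraction):
    """Share of all positives found among the top `fraction` of instances ranked by score."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("no positive instances")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_contacted = int(round(fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:n_contacted])
    return captured / total_positives

# Toy scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(cumulative_gain(scores, labels, 0.25))
print(cumulative_gain(scores, labels, 0.50))
```

Evaluating the function over a grid of fractions from 0 to 1 traces the whole curve; a model with no predictive power would stay close to the diagonal.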

7. Model deployment

Once we have tested the churn model, we can use it to evaluate the probability of churn of our customers.

For instance, consider a customer with the following features:

The probability of churn for that customer is 38%.

We can export the mathematical expression of the neural network to any bank software in order to facilitate the work of the Retention Department. This expression is listed below.

scaled_credit_score = (credit_score-650.529)/96.6533;
scaled_France = 2*(France-0)/(1-0)-1;
scaled_Spain = (Spain-0.2477)/0.431698;
scaled_Germany = (Germany-0.2509)/0.433553;
scaled_gender = 2*(gender-0)/(1-0)-1;
scaled_age = (age-38.9218)/10.4878;
scaled_tenure = 2*(tenure-0)/(10-0)-1;
scaled_balance = (balance-76485.9)/62397.4;
scaled_products_number = (products_number-1.5302)/0.581654;
scaled_credit_card = 2*(credit_card-0)/(1-0)-1;
scaled_active_member = 2*(active_member-0)/(1-0)-1;
scaled_estimated_salary = 2*(estimated_salary-11.58)/(199992-11.58)-1;
y_1_1 = Logistic (0.848205+ (scaled_credit_score*-0.608944)+ (scaled_France*-0.261025)+ (scaled_Spain*0.412236)+ (scaled_Germany*-0.102466)+ (scaled_gender*-0.190523)+ (scaled_age*-5.79629)+ (scaled_tenure*-0.538913)+ (scaled_balance*-0.442531)+ (scaled_products_number*-2.72944)+ (scaled_credit_card*0.684301)+ (scaled_active_member*3.1411)+ (scaled_estimated_salary*1.5462));
y_1_2 = Logistic (-0.30529+ (scaled_credit_score*0.0542391)+ (scaled_France*-0.0197414)+ (scaled_Spain*-0.277012)+ (scaled_Germany*0.287529)+ (scaled_gender*-0.138025)+ (scaled_age*-1.67199)+ (scaled_tenure*-0.295799)+ (scaled_balance*-0.0519641)+ (scaled_products_number*-5.95291)+ (scaled_credit_card*-0.214941)+ (scaled_active_member*-1.43624)+ (scaled_estimated_salary*0.198904));
y_1_3 = Logistic (-0.0481312+ (scaled_credit_score*0.25511)+ (scaled_France*0.0844269)+ (scaled_Spain*0.108521)+ (scaled_Germany*-0.2049)+ (scaled_gender*0.125926)+ (scaled_age*0.0827378)+ (scaled_tenure*0.276278)+ (scaled_balance*-0.489973)+ (scaled_products_number*-0.776123)+ (scaled_credit_card*0.0203207)+ (scaled_active_member*0.525674)+ (scaled_estimated_salary*-0.17605));
y_1_4 = Logistic (1.52953+ (scaled_credit_score*-3.07592)+ (scaled_France*1.09842)+ (scaled_Spain*-1.4286)+ (scaled_Germany*0.153036)+ (scaled_gender*1.71313)+ (scaled_age*2.61432)+ (scaled_tenure*-3.80362)+ (scaled_balance*0.78056)+ (scaled_products_number*-1.88)+ (scaled_credit_card*-1.82242)+ (scaled_active_member*1.85776)+ (scaled_estimated_salary*1.40538));
y_1_5 = Logistic (-0.0116541+ (scaled_credit_score*0.144119)+ (scaled_France*-0.0170994)+ (scaled_Spain*0.0812705)+ (scaled_Germany*-0.0603271)+ (scaled_gender*-0.0485258)+ (scaled_age*-1.6572)+ (scaled_tenure*0.0583053)+ (scaled_balance*-0.135168)+ (scaled_products_number*-1.32794)+ (scaled_credit_card*0.0531906)+ (scaled_active_member*-1.13656)+ (scaled_estimated_salary*-0.128869));
y_1_6 = Logistic (-3.85516+ (scaled_credit_score*-0.0138554)+ (scaled_France*-0.753416)+ (scaled_Spain*-1.04647)+ (scaled_Germany*1.90095)+ (scaled_gender*0.0137635)+ (scaled_age*-0.191778)+ (scaled_tenure*0.343281)+ (scaled_balance*4.70446)+ (scaled_products_number*-6.3796)+ (scaled_credit_card*0.115022)+ (scaled_active_member*-0.153162)+ (scaled_estimated_salary*-0.0731349));
non_probabilistic_churn = Logistic (4.33579+ (y_1_1*-1.60163)+ (y_1_2*7.91345)+ (y_1_3*-6.65044)+ (y_1_4*-1.39552)+ (y_1_5*-5.56462)+ (y_1_6*-2.54043));
churn = probability(non_probabilistic_churn);

The expression above uses the following two functions:

Logistic(x):
   return 1/(1+exp(-x))

probability(x):
   if x < 0
       return 0
   else if x > 1
       return 1
   else
       return x
