Predicting churn of customers in a telecommunications company using neural networks
Customer churn is a major problem for telecommunications companies because it impedes growth. For that reason, companies need to monitor their churn rate so that they can develop strategies that allow them to retain as many clients as possible.
The aim of this study is to use advanced analytics to understand why customers churn and what can be done about it.
1. Data set
The data set contains a total of 3333 instances, each corresponding to a client of a telecommunications company. For each client, information has been collected about the type of plan contracted, the minutes talked, and the monthly charge, among other variables.
The target variable is "churn", which indicates whether the client is still with the company or not.
The first step of the analysis is to check the basic statistics of all the variables. The next table lists them for each of the 18 inputs and the target.
As we can see, the average monthly bill in this company is $59.45, and the average number of calls to customer service is 1.56.
The logistic correlations give us the importance of each input variable with respect to the target.
As we can see, most clients leave because of high prices, a high rate of day calls, high rates for international calls made without an international plan, and calls to technical service.
Before starting the predictive analysis, it is also important to know the ratio of negative to positive instances in the data set.
The chart shows that there are 2850 loyal customers and 483 churned customers in our data set; that is, roughly 6 loyal clients for each churned client.
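These figures can be checked with a few lines of Python; the counts are those reported in this section, not recomputed from the raw data:

```python
# Class balance of the target, using the counts reported above.
loyal, churned = 2850, 483

total = loyal + churned
churn_rate = churned / total
loyal_per_churned = loyal / churned

print(f"total instances: {total}")
print(f"churn rate: {churn_rate:.1%}")
print(f"loyal clients per churned client: {loyal_per_churned:.1f}")
```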
2. Neural network training
Once we have studied the available data, we can move on to the design of the predictive model and its training.
The predictive model is represented by a neural network. The architecture, which is shown below, consists of 18 scaling neurons (yellow), 5 neurons in the first layer (blue) and 1 probabilistic neuron (red).
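As an illustration, a minimal numpy sketch of this 18-5-1 forward pass; the hidden activation (tanh) and the random weights are placeholders for illustration, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters: 18 inputs -> 5 hidden neurons -> 1 output.
W1, b1 = rng.normal(size=(18, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

def scale(x, mean, std):
    # Scaling layer: standardizes each of the 18 inputs.
    return (x - mean) / std

def forward(x):
    h = np.tanh(x @ W1 + b1)                      # first (hidden) layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # probabilistic output in [0, 1]

x = scale(rng.normal(size=18), mean=0.0, std=1.0)
p = float(forward(x))
print(f"churn probability: {p:.3f}")
```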
As we saw before, the data set is unbalanced, so we set the weighted squared error as the error method. The next chart shows how the loss decreases during the training process over the iterations of the quasi-Newton method.
As we can see, the initial value is 1.24397, and the final value after 173 iterations is 0.333412.
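The weighted squared error mentioned above can be sketched as follows; the positive weight of 6 is an assumption based on the 6-to-1 class ratio reported earlier:

```python
import numpy as np

def weighted_squared_error(y_true, y_pred, positive_weight=6.0):
    # Errors on the rare positive (churn) class count more than
    # errors on the common negative (loyal) class.
    w = np.where(y_true == 1, positive_weight, 1.0)
    return float(np.mean(w * (y_true - y_pred) ** 2))

y_true = np.array([1, 0, 0, 0, 0, 0, 1])
y_pred = np.array([0.8, 0.1, 0.2, 0.0, 0.1, 0.3, 0.4])
loss = weighted_squared_error(y_true, y_pred)
print(f"weighted squared error: {loss:.4f}")
```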
The next table shows more information about the results of the training with the quasi-Newton method.
They show that the selection loss and the final parameters norm are small, and that the analysis of all the training instances took 12 seconds.
3. Testing analysis
Once the model has been trained, it is time to evaluate its performance on new data that have been used neither for training nor for selection.
Firstly, we are going to calculate the binary classification tests.
They are calculated from the values of the confusion matrix.
- Classification accuracy (ratio of correctly classified samples): 91.2%
- Error rate (ratio of misclassified samples): 8.8%
- Sensitivity (percentage of actual positive classified as positive): 76.9%
- Specificity (percentage of actual negative classified as negative): 93.9%
The accuracy shows that the model correctly predicts about 91% of all the testing instances, while the error rate shows that it only fails to predict around 9% of them. The value of the sensitivity is 0.769, which means that the model can detect around 77% of the positive instances. The specificity is 0.939, so it can detect around 94% of the negative instances.
These binary classification tests show that the model predicts most of the instances correctly.
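These tests are straightforward to derive from the confusion matrix. The counts below (tp, fn, fp, tn) are hypothetical values chosen to roughly reproduce the percentages above, not the actual testing matrix:

```python
def classification_tests(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # correctly classified
        "error_rate": (fp + fn) / total,  # misclassified
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }

tests = classification_tests(tp=50, fn=15, fp=26, tn=409)
for name, value in tests.items():
    print(f"{name}: {value:.3f}")
```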
We are now going to calculate the ROC curve. It will help us measure the discrimination capacity of the classifier between positive and negative instances. The next chart shows the ROC curve for our problem.
The proximity of the curve to the upper left corner means that the model has a good capacity to discriminate between the two classes.
The most important parameter that we can obtain from the ROC curve is the area under the curve. This value is 0.5 for a random classifier and 1 for a perfect one. In practice, it should be as close to 1 as possible. The next table shows the value of this parameter in our case.
The area under the curve is 0.896, which shows that our model predicts the churn of our customers well. We can therefore move it into the production phase to estimate the probability of churn of new clients.
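The area under the curve also has a useful probabilistic reading: it is the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one. A sketch with toy scores (not the model's actual outputs):

```python
def roc_auc(scores_pos, scores_neg):
    # Mann-Whitney formulation: fraction of positive/negative pairs
    # ranked correctly (ties count half).
    pairs = [(p, n) for p in scores_pos for n in scores_neg]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

pos = [0.9, 0.8, 0.7, 0.35]      # toy scores for churned clients
neg = [0.6, 0.4, 0.3, 0.2, 0.1]  # toy scores for loyal clients
print(f"AUC: {roc_auc(pos, neg):.3f}")
```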
4. Model deployment
The predictive model takes the form of a function of the outputs with respect to the inputs. The mathematical expression represented by the model, which is listed below, can be embedded into other software and used to prevent customers from churning.
churn = probability(non_probabilistic_churn);

probability(x):
    if x < 0, return 0;
    else if x > 1, return 1;
    else, return x;
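Embedded in another program, the clamp described above could look like this Python sketch; `non_probabilistic_churn` stands for the raw output of the trained network:

```python
def churn_probability(non_probabilistic_churn: float) -> float:
    # Probabilistic layer: clamp the raw network output to [0, 1]
    # so it can be read as a probability of churn.
    if non_probabilistic_churn < 0:
        return 0.0
    if non_probabilistic_churn > 1:
        return 1.0
    return non_probabilistic_churn

print(churn_probability(1.3))    # clamped to 1.0
print(churn_probability(-0.2))   # clamped to 0.0
print(churn_probability(0.42))   # already a valid probability
```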