By Pablo Martin, Artelnics.
One of the main problems for companies and human resources (HR) departments is employee churn. This phenomenon can be very expensive: the cost of retaining an existing employee is far lower than that of hiring a new one. Employee churn prevention aims to predict which employees will leave their jobs, when, and why.
Accurate methods are needed to identify which employees are most likely to move to another company. Such methods would allow the organization to adapt the specific aspects that drive attrition and, therefore, reduce costs.
The objective is to untangle the factors that lead to employee attrition and to determine the underlying causes, in order to prevent it. But analyzing multiple personal and social factors is complicated, to say the least. We need both rich employee data and complex predictive models to analyze it.
The data set used in this study will contain quantitative and qualitative information about a sample of employees at the company. These variables are classified into the following three groups:
- Personal factors: age, sex, education, own residence...
- Professional factors: position, work experience, salary, length of service...
- Socio-economic factors: unemployment rate, economic growth, quality of life, crime rate...
The data set contains about 1,500 employees. For each, around 35 personal, professional and socio-economic attributes are selected as the input variables. The target variable records whether the worker remains with the company or leaves it (loyal or attrition). The next table lists all the variables with their corresponding use.
As we can see, there are 48 inputs, which contain the characteristics of every employee; 1 target, which is the variable "Attrition" mentioned before; and 3 unused variables ("EmployeeCount", "Over18" and "StandardHours"), which are constant and are excluded from the analysis because they provide no useful information. The number of inputs exceeds the roughly 35 original attributes, presumably because each categorical attribute is expanded into one input per category.
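The variable roles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three toy records and their values are made up, "Age", "OverTime" and "JobRole" are used as example columns, and one-hot encoding of categorical columns is assumed as the reason the input count grows. The helper names `constant_columns` and `encode` are hypothetical.

```python
# Toy records: column names follow the article, all values are made up.
rows = [
    {"Age": 41, "OverTime": "Yes", "JobRole": "Sales Executive",
     "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
    {"Age": 49, "OverTime": "No", "JobRole": "Research Scientist",
     "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "No"},
    {"Age": 37, "OverTime": "Yes", "JobRole": "Laboratory Technician",
     "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
]


def constant_columns(rows):
    """Return the names of columns that take a single value in every row."""
    return [c for c in rows[0] if len({r[c] for r in rows}) == 1]


def encode(rows):
    """Drop constant columns, map Yes/No columns to 1/0, one-hot encode the rest."""
    unused = constant_columns(rows)
    encoded = [{} for _ in rows]
    for col in rows[0]:
        if col in unused:
            continue  # constant columns carry no information
        values = sorted({r[col] for r in rows}, key=str)
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: keep as a single input.
            for e, r in zip(encoded, rows):
                e[col] = float(r[col])
        elif set(values) <= {"Yes", "No"}:
            # Binary text column: a single 0/1 variable.
            for e, r in zip(encoded, rows):
                e[col] = 1.0 if r[col] == "Yes" else 0.0
        else:
            # Categorical column: one 0/1 variable per category (assumed encoding).
            for v in values:
                for e, r in zip(encoded, rows):
                    e[v] = 1.0 if r[col] == v else 0.0
    return encoded, unused


encoded, unused = encode(rows)
print("unused:", unused)
print("variables after encoding:", list(encoded[0]))
```

With this encoding, a job role such as "Research Scientist" becomes a separate 0/1 input, which matches the way it is treated as an individual input variable in the correlation analysis.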
In order to study the dependencies between each input variable and the target, we are going to calculate the logistic correlations between them. The next chart shows the results of these calculations.
As we can see, the input variables most strongly correlated with attrition are "OverTime" (0.246118), "TotalWorkingYears" (0.22332) and "YearsAtCompany" (0.196728), while the least correlated are "HourlyRate" (0.00678), "PerformanceRating" (0.00289) and "Research Scientist" (0.00036).
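The article does not define how the logistic correlation is computed, so the following is only one plausible reading, not the tool's actual method: fit a one-variable logistic function to the target and take the absolute Pearson correlation between the fitted probabilities and the target. The function names, the plain gradient descent fit and the toy data are all assumptions.

```python
import math


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def logistic_correlation(x, y, steps=2000, lr=0.5):
    """Fit p = sigmoid(a + b*z) by gradient descent on the log loss,
    then correlate the fitted probabilities with the 0/1 target."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n) or 1.0
    z = [(v - mean) / std for v in x]  # standardized input
    a = b = 0.0
    for _ in range(steps):
        p = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
        grad_a = sum(pi - yi for pi, yi in zip(p, y)) / n
        grad_b = sum((pi - yi) * zi for pi, yi, zi in zip(p, y, z)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    p = [1.0 / (1.0 + math.exp(-(a + b * zi))) for zi in z]
    return abs(pearson(p, y))


# Made-up toy data: 1 = works overtime, 1 = left the company.
overtime = [1, 1, 1, 0, 0, 0, 1, 0]
attrition = [1, 1, 0, 0, 0, 0, 1, 1]
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]
target = [1, 1, 0, 0, 1, 1, 0, 0]

strong = logistic_correlation(overtime, attrition)
weak = logistic_correlation(unrelated, target)
print(strong, weak)
```

Values near 0 indicate that the input, taken alone, says little about attrition; larger values indicate a stronger dependency.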
Before starting the predictive analysis, it is also important to know the ratio of negative and positive instances that we have in the data set.
The chart shows that the number of negative instances (1233) is much larger than the number of positive instances (237). We will use this information later to design the predictive model properly.
A neural network will take all the attributes of each of the employees and it will transform them into a probability of attrition. For that purpose, we will use a neural network with 48 inputs, one hidden layer with one neuron in it and one output.
The scaling and unscaling layers, found respectively between the input and hidden layers and between the hidden and output layers, both use the minimum-maximum method.
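The architecture described above can be sketched as a single function. Several details are assumptions, since the article does not state them: the minimum-maximum scaling is taken to map each input to [-1, 1], the hidden neuron uses tanh, the output uses a logistic function, and the unscaling step is omitted because the output is read directly as a probability. The example uses 3 inputs instead of 48, and all parameter values are made up.

```python
import math


def min_max_scale(x, minimum, maximum):
    """Map x from [minimum, maximum] to [-1, 1] (assumed target range)."""
    return 2.0 * (x - minimum) / (maximum - minimum) - 1.0


def attrition_probability(inputs, minimums, maximums,
                          hidden_weights, hidden_bias,
                          output_weight, output_bias):
    """Scaling layer -> one tanh hidden neuron -> one logistic output."""
    scaled = [min_max_scale(x, lo, hi)
              for x, lo, hi in zip(inputs, minimums, maximums)]
    hidden = math.tanh(hidden_bias +
                       sum(w * s for w, s in zip(hidden_weights, scaled)))
    return 1.0 / (1.0 + math.exp(-(output_bias + output_weight * hidden)))


# Hypothetical employee with 3 inputs (for example age, overtime flag, years at company).
probability = attrition_probability(
    inputs=[35.0, 1.0, 4.0],
    minimums=[18.0, 0.0, 0.0],
    maximums=[60.0, 1.0, 40.0],
    hidden_weights=[-0.4, 0.9, -0.6],  # made-up values, not trained parameters
    hidden_bias=0.1,
    output_weight=1.5,
    output_bias=-0.8,
)
print(probability)
```

With a single hidden neuron the model is close to a logistic regression, which keeps the number of parameters small relative to the roughly 1,500 instances.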
As noted before, the data set is unbalanced. Consequently, we will use the weighted squared error as the error method, with the positive and negative weights shown in the next table.
Now, the model is ready to be trained. We will use the conjugate gradient method as the training algorithm. The next chart shows how the loss decreases with the iterations.
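A minimal sketch of a weighted squared error follows. The weights used here are an assumption, because the table with the actual values is not reproduced in the text: a common choice sets the negative weight to 1 and the positive weight to the ratio of negative to positive instances, so that both classes contribute equally to the error. Any normalization constant the tool may apply is left out.

```python
def weighted_squared_error(targets, outputs, positive_weight, negative_weight):
    """Sum of squared errors, with each instance weighted by its class."""
    total = 0.0
    for target, output in zip(targets, outputs):
        weight = positive_weight if target == 1 else negative_weight
        total += weight * (target - output) ** 2
    return total


# Class counts reported in the article.
negatives, positives = 1233, 237

# Assumed weighting scheme: balance the two classes.
positive_weight = negatives / positives  # about 5.2
negative_weight = 1.0

# Made-up targets and network outputs for four instances.
example_error = weighted_squared_error(
    targets=[1, 0, 0, 0],
    outputs=[0.4, 0.2, 0.1, 0.3],
    positive_weight=positive_weight,
    negative_weight=negative_weight,
)
print(positive_weight, example_error)
```

Under this scheme, missing one positive instance costs about as much as missing five negative ones, which discourages the network from always predicting the majority class.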
As we can see, the initial value for the loss was 1.05565 and, after 222 iterations, it has decreased to 0.567598.
In order to check whether over-fitting has appeared during the training process, we will also plot the selection loss history, which is shown below.
In this case, the initial value for the selection loss was 1.02299 and it has decreased to 0.669642 after 222 iterations. As we can see, the loss and the selection loss behave in a similar way over the iterations, which suggests that no over-fitting has appeared.
Then, we can move to the next step, testing the predictive capacity of our model.
In this section, we will assess the quality of the model and decide whether it is ready to be used in the production phase, i.e., in a real-world situation.
To test the model, we will compare the outputs of the trained neural network against the real targets for a set of data that has been used neither for training nor for selection: the testing subset. For that purpose, we will make use of some testing methods commonly used in binary classification problems.
The next table shows the binary classification tests. They are calculated from the values of the confusion matrix.
The accuracy shows that the model correctly predicts almost 81% of the testing instances, while the error rate shows that it fails on around 19% of them. The sensitivity is 0.682927, which means that the model detects around 68% of the positive instances. The specificity is 0.83004, so it correctly identifies around 83% of the negative instances.
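The relation between the confusion matrix and these tests can be sketched as follows. The confusion matrix itself is not reproduced in the text, so the four counts below are an assumption: they are one set of integers consistent with the reported sensitivity and specificity, not values taken from the source.

```python
def binary_classification_tests(tp, fn, fp, tn):
    """Compute the standard tests from the four cells of a confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # correctly classified instances
        "error_rate": (fp + fn) / total,    # misclassified instances
        "sensitivity": tp / (tp + fn),      # positives correctly detected
        "specificity": tn / (tn + fp),      # negatives correctly detected
    }


# Assumed counts, reconstructed to match the reported rates.
tests = binary_classification_tests(tp=28, fn=13, fp=43, tn=210)
for name, value in tests.items():
    print(f"{name}: {value:.6f}")
```

Because the data set is unbalanced, sensitivity and specificity are more informative than accuracy alone: a model that always predicted "loyal" would still reach a high accuracy while detecting no attrition at all.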
In general, these binary classification tests show a good performance of the predictive model.
We are now going to calculate the ROC curve. It will help us to measure the capacity of the classifier to discriminate between positive and negative instances. The next chart shows the ROC curve for our problem.
For a perfect classifier, the ROC curve should pass through the upper left corner. In this case, the curve is close to it, which means that the quality of the model is good. The next table shows the value of the area under the previous ROC curve.
The closer the area under the curve is to 1, the better the classifier. In this case, the area takes the value 0.836, which confirms what we saw before in the ROC chart: the model discriminates well between employees who leave and employees who stay.
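The area under the ROC curve can be computed without drawing the curve, using its interpretation as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise definition; the function name and the toy targets and scores are made up for illustration and are not the article's data.

```python
def roc_auc(targets, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.
    Ties between a positive and a negative score count as half a win."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up targets (1 = attrition) and predicted probabilities.
toy_targets = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_targets, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random guessing and 1 to a perfect ranking, so the reported 0.836 means that in roughly 84% of such pairs the model ranks the employee who left above the one who stayed.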
Once we know that the model can predict employee attrition accurately, it can be used to evaluate the satisfaction of a given employee with the company. The predictive model also indicates which factors are most significant for a given employee, which allows the company to act on those variables.
The predictive model takes the form of a function of the outputs with respect to the inputs. The mathematical expression represented by the model, which is listed below, can be used to embed it into other software, in the so-called production mode.
- The data used for this example can be downloaded from GitHub.