Insurance risk assessment using machine learning

By Pablo Martin, Artelnics.

Nowadays, the risk assessment process carried out by insurance companies has become obsolete.

Indeed, evaluating a customer for life insurance takes an average of 30 days. Moreover, this process is subjective and different risk analysts usually give different evaluations for the same customer.

But machine learning can help to solve these problems. By analyzing the available information on the customers stored by the company, we can develop risk models that evaluate new customers in a faster and more accurate fashion.

As we will see, machine learning benefits both the company and the customer by providing better services and better-suited products to clients. These methods can help insurance companies save money and increase revenues.


  1. Data set.
  2. Risk model.
  3. Conclusions.

Data set

The data set used here is taken from Kaggle, one of the best-known data science repositories.

It contains information about 60,000 customers with 145 attributes. This adds up to more than 8 million data values.

This data set contains about 400,000 missing values, i.e., information about customers that is not available to the company. However, machine learning workflows can handle incomplete data, for example by filling in the missing values before training.
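As a rough illustration of what handling missing values can involve, the following Python sketch counts the missing entries in a small table and fills them with column means. The toy rows, the use of None to mark a missing value, and mean imputation itself are assumptions made for illustration; the article does not say which method was applied to this data set.

```python
# Toy customer table: None marks a value the company does not have.
# The rows and their meaning are made up for illustration.
rows = [
    [34.0, 27.5, None],
    [51.0, None, 1.0],
    [None, 31.2, 0.0],
    [45.0, 24.1, 1.0],
]

def count_missing(table):
    """Return the number of None entries in the table."""
    return sum(value is None for row in table for value in row)

def impute_column_means(table):
    """Return a copy of the table with each None replaced by its column mean."""
    n_cols = len(table[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in table if row[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if value is None else value
             for j, value in enumerate(row)]
            for row in table]

missing = count_missing(rows)
filled = impute_column_means(rows)

# Share of missing entries, using the figures quoted in the text.
total_values = 60_000 * 145
missing_share = 400_000 / total_values
```

With the figures quoted above, the missing entries amount to roughly 4.6% of all values in the data set.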

The input variables include personal data, family history, medical history and external variables.

The target variable is the risk of that client. This is a subjective variable because it is assigned by people, more specifically by risk analysts.

The following table shows how the risk evaluation is distributed.

Very low     7
Very high    1

It might be interesting to study how the different features affect the risk of a customer. For that, we can calculate the correlation coefficients between all the inputs and the target.

Correlations with an absolute value close to 1 mean that the target depends strongly, in a linear sense, on that input. Correlations close to 0 mean that the linear dependency is low. Note that, in general, the target depends on many inputs simultaneously.

The following chart shows the absolute value of the linear correlation between each input and the target variable.

As we can see, the three most influential variables for the risk are the body mass index BMI (0.382), the weight Wt (0.351) and the medical history 23 (0.287).
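The ranking described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the Pearson coefficient as the linear correlation; the column names and the six toy customers are hypothetical and do not come from the Kaggle data.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical inputs and risk scores for six customers.
inputs = {
    "bmi": [21.0, 24.0, 22.0, 33.0, 36.0, 28.0],
    "flag": [1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
}
risk = [8.0, 6.0, 7.0, 2.0, 1.0, 4.0]

# Rank the inputs by the absolute value of their correlation with the target.
ranking = sorted(
    ((name, abs(pearson(column, risk))) for name, column in inputs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Taking the absolute value puts strongly negative and strongly positive relationships on the same footing, which is why the ranking only says how influential an input is, not in which direction it moves the risk.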

The next step is to see how the target variable is distributed over its entire range. Box plots display the minimum, the first quartile, the median (second quartile), the third quartile and the maximum of a single variable in the data set. The following chart shows the box plot for the risk.

In this case, the box plot tells us that the minimum risk is 1, the first quartile is 4, the second quartile or median is 6, the third quartile is 8, and the maximum is also 8. Because the third quartile coincides with the maximum, the upper whisker is missing, which means that the risk variable is not well distributed: at least a quarter of the customers have the highest risk value.
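To make the box plot numbers concrete, here is a small Python sketch that computes the same five values with the standard library. The nine scores are made up so that they reproduce the summary quoted above, and the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different values.

```python
import statistics

# Made-up risk scores, chosen to reproduce the summary quoted in the text.
scores = [1, 3, 4, 5, 6, 7, 8, 8, 8]

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

summary = five_number_summary(scores)

# The upper whisker disappears when the third quartile equals the maximum.
upper_whisker_missing = summary[3] == summary[4]
```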

Histograms also give us valuable information on the distribution of a variable. The following chart shows the histogram for the risk.

The chart above shows that the most common risk value is 8 (low). The least common risk value is 3 (medium-high). We can see that this histogram does not follow a normal (or Gaussian) distribution. Indeed, analysts tend to give either low or high risk values. The values 3 and 4 have very low frequencies. The value 7 has a lower frequency than the values 6 and 8.

Therefore, the target variable is poorly distributed. It should follow a normal distribution, but clearly it does not. Indeed, most customers should have a medium risk. This happens because risk analysts tend to give some evaluations more often than others.
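A histogram of a discrete score such as this one is just a count per value. The Python sketch below shows the idea on made-up scores whose shape loosely mimics the description above; the actual frequencies in the data set are not reproduced here.

```python
from collections import Counter

# Made-up risk scores on the 1-to-8 scale, for illustration only.
scores = (
    [1] * 3 + [2] * 4 + [3] * 1 + [4] * 2 +
    [5] * 3 + [6] * 5 + [7] * 3 + [8] * 7
)

counts = Counter(scores)

# Include every value from 1 to 8, even if it never occurs.
frequencies = {value: counts.get(value, 0) for value in range(1, 9)}

most_common = max(frequencies, key=frequencies.get)
least_common = min(frequencies, key=frequencies.get)

# Text rendering of the histogram, one bar per risk value.
for value, frequency in frequencies.items():
    print(f"{value}: {'#' * frequency}")
```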

Our model should output risk values with a normal distribution.

Risk model

The next step of our study is to build a model capable of evaluating our potential customers' risk. This model will use the extensive information that they provide us.

For this purpose, we use neural networks. Neural networks are among the most powerful methods to discover intricate relationships, recognize complex patterns or predict trends in your data.

The following graph illustrates the neural network used in this application. Note that we do not plot the complete neural network since it is too large.
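For readers unfamiliar with how such a network turns inputs into a risk value, the following Python sketch shows the forward pass of a network with one hidden layer. The layer sizes, the tanh activation, the weights and the clipping to the 1-to-8 scale are all assumptions chosen for illustration; the article does not specify the architecture or the trained parameters.

```python
import math

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> tanh hidden layer -> single linear output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

def to_risk_scale(raw, low=1.0, high=8.0):
    """Clip a raw network output to the risk scale used in the text."""
    return min(high, max(low, raw))

# Hypothetical parameters for a network with 2 inputs and 3 hidden neurons.
hidden_weights = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.2]]
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [1.5, -0.8, 0.6]
output_bias = 4.5

# Two scaled input values for one hypothetical customer.
customer = [0.7, -1.2]
risk = to_risk_scale(forward(customer, hidden_weights, hidden_biases,
                             output_weights, output_bias))
```

Training consists of adjusting the weights and biases so that the outputs match the analysts' evaluations as closely as possible on the training data.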

After training the network, we evaluate its performance to check whether our risk model is useful for evaluating clients. The table below shows the mean absolute error and the mean percentage error of the neural network on the testing data.

Mean absolute error      1.534
Mean percentage error    21.920%

As we can see, the mean absolute error is about 1.5 on a risk scale that runs from 1 to 8, which is about 22% of that range. That means that the model evaluates the risk of customers with reasonably good accuracy.
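The two error figures can be related with a short calculation. The Python sketch below assumes that the percentage error is the mean absolute error divided by the range of the target (8 - 1 = 7); this normalization is an assumption, although it is consistent with the reported values. The four target and output pairs are made up.

```python
def mean_absolute_error(targets, outputs):
    """Average absolute difference between targets and model outputs."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)

def mean_percentage_error(targets, outputs, low, high):
    """Mean absolute error as a percentage of the target range (assumed)."""
    return 100.0 * mean_absolute_error(targets, outputs) / (high - low)

# Made-up analyst evaluations and model outputs for four customers.
targets = [8.0, 6.0, 2.0, 5.0]
outputs = [6.5, 5.0, 3.5, 5.0]

mae = mean_absolute_error(targets, outputs)
pct = mean_percentage_error(targets, outputs, low=1.0, high=8.0)

# Applying the same normalization to the reported mean absolute error.
reported_pct = 100.0 * 1.534 / (8.0 - 1.0)
```

Under that assumption, the reported mean absolute error of 1.534 corresponds to about 21.9% of the range, in line with the percentage shown in the table.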

Now, let's look at the histogram of the risk values produced by the predictive model.

The previous histogram shows that the risk variable now follows an approximately normal distribution, so it is well distributed. Most of the customers have average values, and the evaluations do not concentrate on the extremes. This is a desirable property of our risk model.


Conclusions

Insurance companies need to shorten the time it takes to evaluate the risk of customers and to make those evaluations more objective, in order to reduce costs and offer more suitable insurance policies.

In this case study, we have used machine learning to assess customers' risk for life insurance. Using this model, the company can reduce the costs associated with very long evaluation processes and increase its profits by offering each client the right insurance policy.
