This example assesses obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition, with the aim of guiding each patient's treatment.
- Application type.
- Data set.
- Neural network.
- Training strategy.
- Model selection.
- Testing analysis.
- Model deployment.
- Tutorial video.
1. Application type
The variable to be predicted is the obesity level, an ordered scale with seven values (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III) that is encoded numerically from 1 to 7 and treated as continuous. Therefore, this is an approximation project.
Here, the basic goal is to model the obesity level as a function of the input variables and to advise the patient on how to improve it.
2. Data set
The data set contains three concepts:
- Data source.
- Variables.
- Instances.
The ObesityDataSet.csv file contains the data for this application.
The number of instances (rows) in the data set is 2111, and the number of variables (columns) is 17. Of these, 14 are input variables, or attributes of each sample. Height and weight are left unused because they are directly related to the target variable. The input variables are numeric-valued, binary, and categorical. There is 1 target variable, which represents the estimated obesity level of the individual.
The following list summarizes the variables:
- gender: (1=Female or 0=Male).
- age: (Numeric).
- height: (Numeric).
- weight: (Numeric).
- family_history_with_overweight: (1=Yes/0=No).
- caloric_food: (0=Yes/1=No). Frequent consumption of high-caloric food.
- vegetables: (1, 2 or 3). Frequency of consumption of vegetables.
- number_meals: (1, 2, 3 or 4). The number of main meals.
- food_between_meals: (1=No, 2=Sometimes, 3=Frequently or 4=Always). Consumption of food between meals.
- smoke: (0=Yes/1=No).
- water: (1, 2 or 3). Consumption of water daily.
- calories: (0=Yes/1=No). Calories consumption monitoring.
- activity: (0, 1, 2 or 3). Physical activity frequency.
- technology: (0, 1 or 2). Time using technology devices.
- alcohol: (1=No, 2=Sometimes, 3=Frequently or 4=Always). Consumption of alcohol.
- transportation: (Automobile, motorbike, bike, public transportation or walking). Transportation used.
- obesity_level: (1=Insufficient_Weight, 2=Normal_Weight, 3=Overweight_Level_I, 4=Overweight_Level_II, 5=Obesity_Type_I, 6=Obesity_Type_II, 7=Obesity_Type_III)
Finally, we set the use of the instances.
Note that each instance contains the input and target variables of a different patient. The data set is divided into training, selection, and testing subsets: 60% of the instances are assigned to training, 20% to selection, and 20% to testing. More specifically, 1267 samples are used here for training, 422 for selection, and 422 for testing.
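The subset sizes above follow directly from the 60/20/20 split. The sketch below shows one way to reproduce them; the function name split_indices, the random shuffling, and the seed are assumptions made for illustration, not the procedure used by the software.

```python
import random


def split_indices(n_instances, training=0.6, selection=0.2, seed=0):
    """Shuffle instance indices and cut them into three subsets.

    The testing subset receives whatever remains after training and selection.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # hypothetical seed, for repeatability

    n_training = round(training * n_instances)
    n_selection = round(selection * n_instances)

    training_set = indices[:n_training]
    selection_set = indices[n_training:n_training + n_selection]
    testing_set = indices[n_training + n_selection:]
    return training_set, selection_set, testing_set


training_set, selection_set, testing_set = split_indices(2111)
print(len(training_set), len(selection_set), len(testing_set))  # 1267 422 422
```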
Once the data set has been configured, we can run a few analyses on it. We check the provided information and make sure that the data is of good quality. We can also calculate the distributions of the variables.
The following chart shows the histogram for the obesity level.
As we can see, the seven obesity levels are fairly evenly represented. The maximum frequency is 16.6272%, corresponding to the bin centered at 5. The minimum frequency is 12.8849%, corresponding to the bin centered at 1.
The inputs-targets correlations indicate which factors have the most influence on the patients' obesity level.
The variables most correlated with obesity_level are caloric_food, family_history_with_overweight, and age.
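As an illustration of how such a ranking can be obtained, the sketch below computes the Pearson correlation between each input column and the target and sorts the inputs by absolute value. It assumes linear (Pearson) correlation, which the text does not specify, and the column values are invented toy numbers, not rows from ObesityDataSet.csv.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


# Toy columns, invented purely for illustration.
columns = {
    "age": [20, 25, 30, 35, 40],
    "water": [2, 3, 1, 2, 3],
}
obesity_level = [2, 3, 4, 5, 6]

# Inputs sorted from most to least correlated with the target.
ranking = sorted(
    columns,
    key=lambda name: abs(pearson(columns[name], obesity_level)),
    reverse=True,
)
print(ranking)
```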
3. Neural network
The third step is to set the model parameters. For approximation projects, the neural network is composed of:
- Scaling layer.
- Perceptron layers.
- Unscaling layer.
The mean and standard deviation are set as the scaling method, while the minimum and maximum are set as the unscaling method. The hidden layer uses the hyperbolic tangent activation function, and the output layer uses the linear activation function.
A graphical representation of the neural network is depicted next.
It contains a scaling layer, two perceptron layers, and an unscaling layer.
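The number of inputs is 18, because the categorical variable transportation is encoded as five binary inputs in addition to the other 13 input variables. The number of outputs is 1. The complexity, represented by the number of neurons in the hidden layer, is 3.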
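The count of 18 inputs can be checked by encoding a single patient record. The sketch below assumes that transportation is one-hot encoded in the order public_transportation, walking, automobile, motorbike, bike, which is the order of the inputs in the mathematical expression listed in the deployment section. The function name encode and the sample record are illustrative.

```python
# The 13 inputs that are already numeric or binary, in the order of the variable list.
NUMERIC_INPUTS = [
    "gender", "age", "family_history_with_overweight", "caloric_food",
    "vegetables", "number_meals", "food_between_meals", "smoke", "water",
    "calories", "activity", "technology", "alcohol",
]

# Assumed one-hot order for the categorical variable transportation.
TRANSPORTATION = ["public_transportation", "walking", "automobile", "motorbike", "bike"]


def encode(patient):
    """Turn a patient record into the list of network inputs."""
    vector = [float(patient[name]) for name in NUMERIC_INPUTS]
    vector += [1.0 if patient["transportation"] == t else 0.0 for t in TRANSPORTATION]
    return vector


# Illustrative record using the numeric codes from the variable list.
patient = {
    "gender": 0, "age": 35, "family_history_with_overweight": 1, "caloric_food": 0,
    "vegetables": 2, "number_meals": 2, "food_between_meals": 1, "smoke": 0,
    "water": 2, "calories": 0, "activity": 1, "technology": 1, "alcohol": 3,
    "transportation": "public_transportation",
}

inputs = encode(patient)
print(len(inputs))  # 18
```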
4. Training strategy
The fourth step is to set the training strategy, which is composed of two components:
- A loss index.
- An optimization algorithm.
The learning problem can be stated as finding a neural network that minimizes the loss index, that is, a neural network that fits the data set (error term) and does not oscillate (regularization term). The loss index is the normalized squared error with L1 regularization. The optimization algorithm is the quasi-Newton method, which is also the standard optimization algorithm for this type of problem.
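To make the two terms of the loss index concrete, here is a minimal sketch. It assumes the usual definition of the normalized squared error (the sum of squared errors divided by the sum of squared deviations of the targets from their mean) and a hypothetical regularization weight of 0.01; neither value is stated in this example.

```python
def normalized_squared_error(outputs, targets):
    """Squared error divided by the spread of the targets around their mean.

    A value of 0 means a perfect fit; a value of 1 means the model is no
    better than always predicting the mean of the targets.
    """
    mean_target = sum(targets) / len(targets)
    squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    normalization = sum((t - mean_target) ** 2 for t in targets)
    return squared_error / normalization


def loss_index(outputs, targets, parameters, regularization_weight=0.01):
    """Error term plus L1 regularization term (weight is a hypothetical value)."""
    l1_norm = sum(abs(p) for p in parameters)
    return normalized_squared_error(outputs, targets) + regularization_weight * l1_norm


targets = [1.0, 2.0, 3.0]
print(normalized_squared_error([2.0, 2.0, 2.0], targets))  # predicting the mean
print(loss_index(targets, targets, parameters=[0.5, -0.5]))  # perfect fit, L1 term only
```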
This chart shows how the error decreases with the iterations during the training process.
The final training error is 0.0176 NSE, and the final selection error is 0.0236 NSE.
5. Model selection
The objective of model selection is to improve the generalization capabilities of the neural network or, in other words, to reduce the selection error.
6. Testing analysis
Once the model is trained, we perform a testing analysis to validate its prediction capacity. This will be done by comparing the neural network outputs against the real target values for a data set never seen before.
The testing analysis will determine if the model is ready to move to the production phase.
The next chart illustrates the linear regression analysis for the variable obesity_level.
For a perfect fit, the intercept, slope, and correlation values should be 0, 1, and 1, respectively. In this case, we have intercept = 0.263, slope = 0.947 and correlation = 0.844.
The achieved values are reasonably close to the ideal ones, so the model performs well.
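The three quantities can be computed with an ordinary least-squares fit. The sketch below assumes that the outputs are regressed on the targets (outputs = intercept + slope * targets); the function name and the sample values are illustrative and are not the testing data of this example.

```python
import math


def linear_regression_analysis(outputs, targets):
    """Return (intercept, slope, correlation) of outputs against targets."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    covariance = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    variance_t = sum((t - mean_t) ** 2 for t in targets)
    variance_o = sum((o - mean_o) ** 2 for o in outputs)

    slope = covariance / variance_t
    intercept = mean_o - slope * mean_t
    correlation = covariance / math.sqrt(variance_t * variance_o)
    return intercept, slope, correlation


# Illustrative values only: a model that reproduces the targets exactly.
targets = [1.0, 3.0, 5.0, 7.0]
intercept, slope, correlation = linear_regression_analysis(targets, targets)
print(intercept, slope, correlation)
```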
7. Model deployment
Once the neural network’s generalization performance has been tested, the neural network can be saved for future use in the so-called model deployment mode.
We can guide the treatment of new patients by calculating the neural network outputs.
For that, we need to know the following details about the patient. Here we have a new patient:
- gender: Male.
- age: 35.
- family_history_with_overweight: Yes.
- caloric_food: Yes.
- vegetables: 2.
- number_meals: 2.
- food_between_meals: No.
- smoke: Yes.
- water: 2.
- calories: Yes.
- activity: 1.
- technology: 1.
- alcohol: Frequently.
- transportation: public_transportation.
- obesity_level: 5.406 = Obesity_Type_I.
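Because the network returns a continuous value, it has to be mapped back to one of the seven categories. The sketch below assumes rounding to the nearest integer level and clamping to the range 1 to 7, which is consistent with 5.406 being reported as Obesity_Type_I; the function name to_label is illustrative.

```python
LEVELS = {
    1: "Insufficient_Weight",
    2: "Normal_Weight",
    3: "Overweight_Level_I",
    4: "Overweight_Level_II",
    5: "Obesity_Type_I",
    6: "Obesity_Type_II",
    7: "Obesity_Type_III",
}


def to_label(prediction):
    """Map a continuous network output to the nearest obesity level label."""
    level = min(max(round(prediction), 1), 7)  # clamp in case the output leaves [1, 7]
    return LEVELS[level]


print(to_label(5.406))  # Obesity_Type_I
```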
We can plot directional outputs to study the behavior of the output variable obesity_level as a function of a single input, with the other inputs held fixed.
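A directional output is obtained by varying one input over a range of values while the remaining inputs stay at a reference point. The sketch below shows the idea with a stand-in model; toy_model and its coefficients are invented for illustration and are not the trained network.

```python
def directional_output(model, reference, name, values):
    """Evaluate the model while a single input sweeps over the given values."""
    results = []
    for value in values:
        point = dict(reference)  # copy so the reference point is not modified
        point[name] = value
        results.append((value, model(point)))
    return results


def toy_model(patient):
    """Stand-in model with made-up coefficients, used only to run the sketch."""
    return 4.0 - 0.5 * patient["activity"] + 0.01 * patient["age"]


reference = {"activity": 1, "age": 35}
curve = directional_output(toy_model, reference, "activity", [0, 1, 2, 3])
for activity, level in curve:
    print(activity, level)
```

The resulting pairs can be passed to any plotting tool to draw the curve.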
Height and weight are closely tied to obesity_level: for a given weight, the level decreases as height increases, and for a given height, it increases with weight. However, both are unused variables in this model, so they do not appear among its inputs.
Leaving those two attributes aside, this patient can reduce their obesity level by changing two of their habits. Since obesity is not a problem that can be solved from one day to the next, the treatment must be applied gradually.
The treatment consists of eating more frequently between meals and doing more physical activity. We can see it below:
The first plot shows the output obesity level as a function of the input activity. The second one represents the output obesity level as a function of the input food between meals.
The last graph shows that the obesity level is higher when the patient consumes more calories.
To sum up, the treatment for this patient is:
- Eat more frequently between meals.
- Do more physical activity, such as walking at least 30 minutes each day.
- Reduce calorie consumption, keeping it monitored.
We can also use the mathematical expression of the neural network, which is listed next.
scaled_gender = (gender-(0.4940789938))/0.5000830293;
scaled_age = age*(1+1)/(61-(14))-14*(1+1)/(61-14)-1;
scaled_family_history_with_overweight = family_history_with_overweight*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
scaled_caloric_food = (caloric_food-(0.1160589978))/0.320371002;
scaled_vegetables = (vegetables-(2.419039965))/0.5339270234;
scaled_number_meals = number_meals*(1+1)/(4-(1))-1*(1+1)/(4-1)-1;
scaled_food_between_meals = (food_between_meals-(2.140690088))/0.4685429931;
scaled_smoke = smoke*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
scaled_water = (water-(2.008009911))/0.6129530072;
scaled_calories = calories*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
scaled_activity = (activity-(1.01030004))/0.8505920172;
scaled_technology = (technology-(0.6578660011))/0.6089270115;
scaled_alcohol = (alcohol-(1.731410027))/0.5154979825;
scaled_public_transportation = (public_transportation-(0.7484599948))/0.4340009987;
scaled_walking = walking*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
scaled_automobile = (automobile-(0.2164849937))/0.4119459987;
scaled_motorbike = motorbike*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
scaled_bike = bike*(1+1)/(1-(0))-0*(1+1)/(1-0)-1;
perceptron_layer_0_output_0 = tanh[ -1.16742 + (scaled_gender*0.68282)+ (scaled_age*-0.420455)+ (scaled_family_history_with_overweight*0.691903)+ (scaled_caloric_food*-0.244554)+ (scaled_vegetables*0.769029)+ (scaled_number_meals*0.546811)+ (scaled_food_between_meals*-0.593264)+ (scaled_smoke*-0.250848)+ (scaled_water*-0.00675811)+ (scaled_calories*0.564604)+ (scaled_activity*0.0689401)+ (scaled_technology*0.0396549)+ (scaled_alcohol*-0.0964615)+ (scaled_public_transportation*0.839636)+ (scaled_walking*0.748955)+ (scaled_automobile*0.268782)+ (scaled_motorbike*0.528234)+ (scaled_bike*1.18717) ];
perceptron_layer_0_output_1 = tanh[ 0.32342 + (scaled_gender*-0.195494)+ (scaled_age*0.686579)+ (scaled_family_history_with_overweight*0.0506046)+ (scaled_caloric_food*0.00697568)+ (scaled_vegetables*-0.0795011)+ (scaled_number_meals*-0.0896854)+ (scaled_food_between_meals*0.294397)+ (scaled_smoke*-0.0825197)+ (scaled_water*-0.0133305)+ (scaled_calories*0.0261315)+ (scaled_activity*-0.0911458)+ (scaled_technology*0.010449)+ (scaled_alcohol*0.0786926)+ (scaled_public_transportation*-0.1329)+ (scaled_walking*-0.293047)+ (scaled_automobile*-0.286349)+ (scaled_motorbike*-0.128273)+ (scaled_bike*-0.160357) ];
perceptron_layer_0_output_2 = tanh[ -0.392613 + (scaled_gender*-0.367176)+ (scaled_age*-0.808962)+ (scaled_family_history_with_overweight*-0.384984)+ (scaled_caloric_food*0.131679)+ (scaled_vegetables*-0.0930502)+ (scaled_number_meals*0.221046)+ (scaled_food_between_meals*1.75847)+ (scaled_smoke*0.0215909)+ (scaled_water*-0.178107)+ (scaled_calories*0.513794)+ (scaled_activity*-0.360404)+ (scaled_technology*0.210205)+ (scaled_alcohol*0.451744)+ (scaled_public_transportation*0.289853)+ (scaled_walking*-0.0703812)+ (scaled_automobile*-0.570434)+ (scaled_motorbike*0.979235)+ (scaled_bike*0.448767) ];
perceptron_layer_1_output_0 = [ 0.179259 + (perceptron_layer_0_output_0*1.37047)+ (perceptron_layer_0_output_1*1.60426)+ (perceptron_layer_0_output_2*-0.714662) ];
unscaling_layer_output_0 = perceptron_layer_1_output_0*(7-1)/(1+1)+1+1*(7-1)/(1+1);
This expression can be embedded in any tool to make predictions on new data.
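As one example of such an embedding, the sketch below transcribes the expression into Python. The coefficients are copied from the expression; the function names, the table layout, and the numeric encoding of the sample patient (taken from the variable list, with transportation one-hot encoded) are assumptions. The value it returns depends on those encoding assumptions, so it may differ from the figure quoted for this patient.

```python
import math

# (input name, method, a, b): "meanstd" uses a=mean, b=std; "minmax" uses a=min, b=max.
SCALING = [
    ("gender", "meanstd", 0.4940789938, 0.5000830293),
    ("age", "minmax", 14, 61),
    ("family_history_with_overweight", "minmax", 0, 1),
    ("caloric_food", "meanstd", 0.1160589978, 0.320371002),
    ("vegetables", "meanstd", 2.419039965, 0.5339270234),
    ("number_meals", "minmax", 1, 4),
    ("food_between_meals", "meanstd", 2.140690088, 0.4685429931),
    ("smoke", "minmax", 0, 1),
    ("water", "meanstd", 2.008009911, 0.6129530072),
    ("calories", "minmax", 0, 1),
    ("activity", "meanstd", 1.01030004, 0.8505920172),
    ("technology", "meanstd", 0.6578660011, 0.6089270115),
    ("alcohol", "meanstd", 1.731410027, 0.5154979825),
    ("public_transportation", "meanstd", 0.7484599948, 0.4340009987),
    ("walking", "minmax", 0, 1),
    ("automobile", "meanstd", 0.2164849937, 0.4119459987),
    ("motorbike", "minmax", 0, 1),
    ("bike", "minmax", 0, 1),
]

# Hidden layer: one (bias, weights) pair per neuron, weights in SCALING order.
HIDDEN = [
    (-1.16742, [0.68282, -0.420455, 0.691903, -0.244554, 0.769029, 0.546811,
                -0.593264, -0.250848, -0.00675811, 0.564604, 0.0689401, 0.0396549,
                -0.0964615, 0.839636, 0.748955, 0.268782, 0.528234, 1.18717]),
    (0.32342, [-0.195494, 0.686579, 0.0506046, 0.00697568, -0.0795011, -0.0896854,
               0.294397, -0.0825197, -0.0133305, 0.0261315, -0.0911458, 0.010449,
               0.0786926, -0.1329, -0.293047, -0.286349, -0.128273, -0.160357]),
    (-0.392613, [-0.367176, -0.808962, -0.384984, 0.131679, -0.0930502, 0.221046,
                 1.75847, 0.0215909, -0.178107, 0.513794, -0.360404, 0.210205,
                 0.451744, 0.289853, -0.0703812, -0.570434, 0.979235, 0.448767]),
]
OUTPUT_BIAS = 0.179259
OUTPUT_WEIGHTS = [1.37047, 1.60426, -0.714662]


def scale(value, method, a, b):
    """Apply mean/standard-deviation or minimum/maximum scaling to one input."""
    if method == "meanstd":
        return (value - a) / b
    return 2.0 * (value - a) / (b - a) - 1.0  # maps [min, max] to [-1, 1]


def obesity_level(patient):
    """Evaluate the exported expression for one encoded patient record."""
    scaled = [scale(patient[name], method, a, b) for name, method, a, b in SCALING]
    hidden = [math.tanh(bias + sum(w * x for w, x in zip(weights, scaled)))
              for bias, weights in HIDDEN]
    output = OUTPUT_BIAS + sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden))
    return output * (7 - 1) / 2 + 1 + (7 - 1) / 2  # unscale from [-1, 1] to [1, 7]


# The new patient, encoded with the codes from the variable list (assumed encoding).
patient = {
    "gender": 0, "age": 35, "family_history_with_overweight": 1, "caloric_food": 0,
    "vegetables": 2, "number_meals": 2, "food_between_meals": 1, "smoke": 0,
    "water": 2, "calories": 0, "activity": 1, "technology": 1, "alcohol": 3,
    "public_transportation": 1, "walking": 0, "automobile": 0, "motorbike": 0, "bike": 0,
}
level = obesity_level(patient)
print(level)
```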
8. Tutorial video
You can watch the step-by-step tutorial video below to help you complete this Machine Learning example for free using the easy-to-use machine learning software Neural Designer.
- The data for this problem has been taken from the UCI Machine Learning Repository.
- Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
- De-La-Hoz-Correa, E., Mendoza Palechor, F., De-La-Hoz-Manotas, A., Morales Ortega, R., & Sanchez Hernandez, A. B. (2019). Obesity level estimation software based on decision trees.