Introduction

Obesity level prediction using machine learning helps healthcare professionals provide personalized recommendations and support better decision-making.

Obesity is a growing health concern in Mexico, Peru, and Colombia, and accurately predicting obesity levels is critical for effective management and prevention of related diseases.

However, evaluating obesity based on eating habits, physical activity, and phenotypic characteristics can be complex.

We implemented a neural network model that learns from lifestyle, anthropometric, and family history variables to estimate obesity levels.

Using the ObesityDataSet.csv dataset, the model achieved strong performance (correlation = 0.844), showing high potential as a support tool for assessing and managing obesity.

Healthcare professionals can test this methodology with Neural Designer’s trial version.

1. Model type

The variable to be predicted is the obesity level, an ordinal variable with seven categories (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III) that is encoded as integers from 1 to 7 and treated as continuous. Therefore, this is an approximation project. The goal is to model the obesity level as a function of the input variables and to advise patients on how to improve it.

2. Data set

Data source

The ObesityDataSet.csv dataset contains 2,111 instances and 17 variables for this application.

Variables

The following list summarizes the variables’ information:

Patient information

  • gender (1=Female, 0=Male) – Sex of the individual.

  • age (numeric) – Age of the individual in years.

Anthropometric measurements

  • height (numeric) – Height of the individual in centimeters.

  • weight (numeric) – Weight of the individual in kilograms.

Family and lifestyle factors

  • family_history_with_overweight (1=Yes, 0=No) – Indicates if obesity runs in the family.

  • caloric_food (0=Yes, 1=No) – Frequent consumption of high-caloric food.

  • vegetables (1, 2, or 3) – Frequency of vegetable consumption.

  • number_meals (1–4) – Number of main meals per day.

  • food_between_meals (1=No, 2=Sometimes, 3=Frequently, 4=Always) – Consumption of food between meals.

  • smoke (0=Yes, 1=No) – Indicates whether the individual smokes.

  • water (1–3) – Daily water consumption.

  • calories (0=Yes, 1=No) – Indicates if calorie intake is monitored.

  • activity (0–3) – Frequency of physical activity.

  • technology (0–2) – Daily time using technology devices.

  • alcohol (1=No, 2=Sometimes, 3=Frequently, 4=Always) – Alcohol consumption frequency.

  • transportation (Automobile, motorbike, bike, public transportation, walking) – Mode of transportation used.

Target variable

  • obesity_level (1–7) – Classification of obesity level: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.

Instances

The dataset’s instances (one per patient, including input and target variables) are split into training (60%), selection, also called validation (20%), and testing (20%) subsets by default; the proportions can be adjusted as needed.
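
The default 60/20/20 split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_instances`, the random shuffle, and the fixed seed are hypothetical and are not taken from the tool used in this study.

```python
import random

def split_instances(n_instances, train=0.6, selection=0.2, seed=0):
    """Return shuffled index lists for training, selection, and testing."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # assumed: random assignment of instances
    n_train = round(train * n_instances)
    n_selection = round(selection * n_instances)
    training = indices[:n_train]
    selection_part = indices[n_train:n_train + n_selection]
    testing = indices[n_train + n_selection:]  # whatever remains
    return training, selection_part, testing

training, selection, testing = split_instances(2111)
print(len(training), len(selection), len(testing))
```

With 2,111 instances and this rounding, the three subsets contain 1,267, 422, and 422 instances.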

Variable distributions

With the data set configured, we can run a few preliminary analyses to check the provided information and confirm that the data is of good quality. We start by calculating the distributions of the variables.

The following chart shows the histogram for the obesity level.

As we can see, the seven obesity levels are fairly evenly represented rather than concentrated around a central value. The maximum frequency is 16.6272%, corresponding to the bin with center 5. The minimum frequency is 12.8849%, corresponding to the bin with center 1.
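
As a rough sketch of how such bin frequencies are obtained, the snippet below counts how often each level appears and converts the counts to percentages. The `sample_levels` list is a hypothetical toy sample, not the actual dataset, and `level_frequencies` is an illustrative helper name.

```python
from collections import Counter

def level_frequencies(levels):
    """Percentage of instances falling in each obesity level (1-7)."""
    counts = Counter(levels)
    total = len(levels)
    return {level: 100.0 * counts[level] / total for level in sorted(counts)}

# Hypothetical toy sample of encoded obesity levels, for illustration only.
sample_levels = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7]
frequencies = level_frequencies(sample_levels)
most_frequent = max(frequencies, key=frequencies.get)
print(frequencies)
print("Most frequent level:", most_frequent)
```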

Inputs-targets correlations

The input-target correlations might indicate what factors most influence patients’ obesity levels.

The variables most correlated with the obesity level are caloric_food, family_history_with_overweight, and age.
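
One common way to quantify such a relationship is the Pearson linear correlation coefficient. The sketch below assumes that measure; the software used here may apply other correlation types depending on the variable, and the `age` and `obesity_level` values are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation between two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical values for one input and the target, for illustration only.
age = [19, 23, 31, 40, 26]
obesity_level = [2, 3, 5, 6, 3]
print(round(pearson(age, obesity_level), 3))
```

Values near 1 or -1 suggest a strong linear relationship with the target, while values near 0 suggest a weak one.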

3. Neural network

The model uses a neural network, a type of artificial intelligence that learns to recognize complex patterns in clinical data. Patient values, such as age, height, weight, and other clinical measurements, are normalized using the minimum-maximum scaling method.
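
Minimum-maximum scaling maps each variable linearly onto a fixed range using its observed minimum and maximum. The sketch below assumes a default target range of [0, 1]; the actual range used by the scaling layer is not stated in the text, so `low` and `high` are illustrative parameters.

```python
def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values so that min -> low and max -> high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant variable carries no information; map it to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (high - low) * (v - v_min) / span for v in values]

ages = [18, 25, 32, 60]  # hypothetical ages in years
print(min_max_scale(ages))
print(min_max_scale(ages, low=-1.0, high=1.0))
```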

The network includes a scaling layer, two perceptron layers, and an unscaling layer. The first perceptron layer receives the scaled inputs and has a few neurons, while the second combines them into a single output, the estimated obesity level, which the unscaling layer maps back to the original scale.

The network analyzes the variables together, identifying patterns and relationships that allow accurate estimation of obesity levels.

The result is a predicted obesity level for each patient, providing healthcare professionals with an objective measure to support clinical decision-making. This approach helps prioritize interventions or lifestyle recommendations and complements traditional clinical assessments.

The number of inputs is 18, and the number of outputs is 1. The complexity, represented by the number of neurons in the hidden layer, is 3.

4. Training strategy

To train the neural network, we defined a loss function and an optimization algorithm. The loss function is the normalized squared error with L1 regularization: the error term measures how far the outputs are from the targets, and the regularization term penalizes large weights to limit overfitting.
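
A minimal sketch of such a loss follows. It assumes that the squared error is normalized by the sum of squared deviations of the targets from their mean, and that the L1 term is a weighted sum of absolute parameter values; the `strength` value is a hypothetical regularization weight, not the one used in this study.

```python
def normalized_squared_error(outputs, targets):
    """Squared error divided by the targets' squared deviation from their mean."""
    mean_target = sum(targets) / len(targets)
    squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    normalization = sum((t - mean_target) ** 2 for t in targets)
    return squared_error / normalization

def l1_penalty(weights, strength):
    """L1 regularization: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def loss(outputs, targets, weights, strength=0.01):
    return normalized_squared_error(outputs, targets) + l1_penalty(weights, strength)

targets = [1.0, 2.0, 3.0]
print(normalized_squared_error([2.0, 2.0, 2.0], targets))  # always predicting the mean
print(normalized_squared_error(targets, targets))          # perfect predictions
```

With this normalization, a value of 1 means the model does no better than always predicting the mean target, and a value of 0 means a perfect fit.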

For optimization, we used the quasi-Newton method, a standard and efficient approach for this type of problem.
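
Quasi-Newton methods build an approximation of the inverse Hessian from successive gradients instead of computing second derivatives. The sketch below uses the BFGS update with a backtracking line search, which is one common variant and may differ from the one used by the software; the quadratic test function stands in for the network's loss and is purely hypothetical.

```python
def bfgs(f, grad, x0, iterations=50, tol=1e-8):
    """Minimize f with the BFGS quasi-Newton method (dense inverse Hessian)."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(iterations):
        if max(abs(gi) for gi in g) < tol:
            break
        # Search direction p = -H g.
        p = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search with the Armijo sufficient-decrease condition.
        step, fx = 1.0, f(x)
        slope = sum(gi * pi for gi, pi in zip(g, p))
        while (f([xi + step * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * step * slope
               and step > 1e-12):
            step *= 0.5
        x_new = [xi + step * pi for xi, pi in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition; skip the update otherwise
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x

# Hypothetical stand-in for the loss: a simple quadratic with minimum at (1, -2).
f = lambda v: (v[0] - 1.0) ** 2 + 10.0 * (v[1] + 2.0) ** 2
grad = lambda v: [2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)]
solution = bfgs(f, grad, [0.0, 0.0])
print(solution)
```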

During training, the error decreased steadily, indicating improved accuracy in predicting obesity levels.

The chart below shows how the errors decrease with the iterations during training.

The final errors are training error = 0.0176 NSE and selection error = 0.0236 NSE.

5. Model selection

The objective of model selection is to improve the generalization capabilities of the neural network or, in other words, to reduce the selection error.

Given the very small selection error we have achieved (0.0236 NSE), there is no need to apply order selection or input selection.

6. Testing analysis

Once the model is trained, we perform a testing analysis to validate its prediction capacity. This will be done by comparing the neural network outputs against the real target values for a previously unseen data set.

The testing analysis will determine if the model is ready to move to the production phase.

Linear regression analysis

The next chart illustrates the linear regression analysis for the variable obesity_level.

The intercept, slope, and correlation values should be 0, 1, and 1 for a perfect fit. In this case, we have intercept = 0.263, slope = 0.947, and correlation = 0.844.
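
These three quantities come from an ordinary least-squares fit of the network outputs against the real targets. The sketch below shows the calculation; the `targets` and `outputs` lists are hypothetical and do not reproduce the reported values.

```python
import math

def regression_analysis(targets, outputs):
    """Least-squares line outputs = intercept + slope * targets, plus correlation."""
    n = len(targets)
    mean_t, mean_o = sum(targets) / n, sum(outputs) / n
    s_tt = sum((t - mean_t) ** 2 for t in targets)
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_to = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    slope = s_to / s_tt
    intercept = mean_o - slope * mean_t
    correlation = s_to / math.sqrt(s_tt * s_oo)
    return intercept, slope, correlation

# Hypothetical real levels and model outputs, for illustration only.
targets = [1, 2, 3, 4, 5, 6, 7]
outputs = [1.4, 2.1, 2.7, 4.5, 4.8, 6.3, 6.6]
intercept, slope, correlation = regression_analysis(targets, outputs)
print(round(intercept, 3), round(slope, 3), round(correlation, 3))
```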

The intercept and slope are close to their ideal values, and the correlation of 0.844 indicates a good, though not perfect, fit.

The following table contains the elements of the confusion matrix for the variable obesity_level. Each row corresponds to a real class and each column to a predicted class, so the cells on the diagonal count the correctly classified instances.

Predicted normal weight | Predicted Obese I | Predicted Obese II | Predicted Obese III | Predicted Overweight I | Predicted Overweight II | Predicted Underweight
Real normal weight | 260056118
Real Obese I | 871559273
Real Obese II | 104812150
Real Obese III | 10086000
Real Overweight I | 811510203
Real Overweight II | 62423242
Real Underweight | 150030625

The number of correctly classified instances is 226, and the number of misclassified instances is 196, which gives a classification accuracy of 226/422 ≈ 53.6% on the testing subset.
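
As an illustration of how these counts relate to a confusion matrix, the sketch below builds a matrix from real and predicted labels and computes the accuracy as the diagonal sum over the total. The three-class toy labels are hypothetical; only the counts 226 and 196 come from the text.

```python
def confusion_matrix(real, predicted, labels):
    """Rows are real classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for r, p in zip(real, predicted):
        matrix[index[r]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical toy example with three classes.
labels = ["Underweight", "Normal", "Obese"]
real = ["Underweight", "Normal", "Normal", "Obese", "Obese", "Obese"]
predicted = ["Underweight", "Normal", "Obese", "Obese", "Obese", "Normal"]
toy_matrix = confusion_matrix(real, predicted, labels)
print(toy_matrix, accuracy(toy_matrix))

# Accuracy implied by the counts reported in the text.
reported_accuracy = 226 / (226 + 196)
print(round(100 * reported_accuracy, 1))
```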

7. Model deployment

Once we have tested the neural network’s generalization performance, we can save the model for future use in the so-called deployment mode.

In this phase, the obesity level prediction machine learning model can be applied to new individuals by calculating the neural network outputs based on their input variables, such as age, gender, eating habits, and physical activity.
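
Because the network produces a single continuous value, a deployment script needs some rule to turn that value into one of the seven named levels. The sketch below assumes rounding to the nearest integer and clamping to the 1–7 range; this rule and the helper name `output_to_level` are illustrative assumptions, not a documented step of the original workflow.

```python
import math

# Level names in the order given for obesity_level (1-7).
LEVEL_NAMES = {
    1: "Insufficient Weight",
    2: "Normal Weight",
    3: "Overweight Level I",
    4: "Overweight Level II",
    5: "Obesity Type I",
    6: "Obesity Type II",
    7: "Obesity Type III",
}

def output_to_level(output):
    """Round a continuous model output to the nearest valid level (assumed rule)."""
    level = math.floor(output + 0.5)  # round half up
    level = max(1, min(7, level))     # clamp to the valid range
    return level, LEVEL_NAMES[level]

print(output_to_level(4.6))
print(output_to_level(0.2))
```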

This approach allows healthcare professionals and nutrition specialists to use the trained model as a decision-support tool, providing guidance on obesity management and personalized recommendations for improving the patient’s obesity level.

The model represented by the neural network is shown below.

Conclusions

The obesity level prediction machine learning model developed from the ObesityDataSet.csv demonstrated high performance (correlation = 0.844) in estimating individuals’ obesity levels.

The most influential variables, including caloric food consumption, family history of overweight, and age, align with established medical and nutritional knowledge, supporting the model’s reliability.

Due to its strong generalization capacity, this obesity prediction machine learning model can serve as an effective tool to assist healthcare professionals and nutrition specialists in assessing obesity, providing personalized recommendations, and supporting interventions to improve patients’ health outcomes.

References

  • The data for this problem was obtained from the UCI Machine Learning Repository.
  • Palechor, F. M., & de la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 104344.
  • De-La-Hoz-Correa, E., Mendoza Palechor, F., De-La-Hoz-Manotas, A., Morales Ortega, R., & Sanchez Hernandez, A. B. (2019). Obesity level estimation software based on decision trees.
