Introduction
Diabetic retinopathy risk prediction using machine learning helps physicians improve early detection, enabling timely intervention and reducing the risk of vision loss.
Diabetic retinopathy is a common complication of diabetes, and an accurate prognosis is critical for preserving patients’ quality of life.
However, predicting disease progression from routine clinical and laboratory data can be challenging.
We implemented a neural network model that learns from patient features such as age, systolic blood pressure, and cholesterol levels.
Using a dataset of 6,000 patients, the model achieved an accuracy of 74.7% and an area under the ROC curve (AUC) of 0.826, demonstrating strong potential as a diabetic prognosis support tool.
Healthcare professionals interested in testing this methodology can explore Neural Designer through its trial version, enabling them to apply the diabetic prognosis machine learning model to new patient data.
Contents
The following index outlines the steps for performing the analysis.

1. Model type
2. Data set
3. Neural network
4. Training strategy
5. Model selection
6. Testing analysis
7. Model deployment
1. Model type
This is a binary classification project: the goal is to predict whether a patient will develop diabetic retinopathy, conditioned on routine clinical and laboratory features. The target variable can take only two values, positive or negative for diabetic retinopathy.
2. Data set
Data source
The diabetic_retinopathy.csv file contains the data for this application. In a classification project, the target variable can only take two values: 0 (false) or 1 (true). The data set contains 6,000 instances (rows) and 6 variables (columns).
Variables
The following list summarizes the variables’ information:
Inputs
- age – The patient's age in years.
- systolic_blood_pressure – Systolic blood pressure (mmHg).
- diastolic_blood_pressure – Diastolic blood pressure (mmHg).
- cholesterol – Total cholesterol level.
Target variable
- diagnose (0 or 1) – Negative (0) or positive (1) for diabetic retinopathy.
Instances
Finally, all instances are used. Note that each instance contains a different patient's input and target variables.
The dataset is split into three parts: 60% for training (3600 samples), 20% for validation (1200 samples), and 20% for testing (1200 samples).
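As a minimal sketch, the 60/20/20 split described above can be reproduced by shuffling the instance indices and cutting them at the corresponding fractions (the indices 0–5999 stand in for the 6,000 patients; the seed is illustrative):

```python
import random

def split_indices(n_instances, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle instance indices and split them into training,
    validation (selection), and testing subsets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = int(n_instances * train_frac)
    n_val = int(n_instances * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(6000)
print(len(train), len(val), len(test))  # 3600 1200 1200
```

Shuffling before splitting avoids any ordering bias in the file (for example, patients sorted by date or severity).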
Variables distributions
We can also calculate the distributions of all the variables.
The following pie chart shows the number of patients with and without diabetic retinopathy in the data set.
Inputs-targets correlations
The input-target correlations are also worth noting, as they indicate which factors influence the disease the most.
From the chart above, we can see that all the input variables have a similar influence on the target variable, except for diastolic blood pressure, which is less strongly related.
3. Neural network
The model uses a neural network, a type of artificial intelligence that learns to recognize complex patterns in clinical data.
The network analyzes the variables together, identifying patterns and relationships associated with a higher likelihood of diabetic complications.
Patient values, such as age, systolic and diastolic blood pressure, and cholesterol, are normalized using the minimum-maximum scaling method.
The network includes a scaling layer and a dense layer.
The scaling layer applies the minimum-maximum method to map each input into a common range before it reaches the dense layer.
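As a sketch, minimum-maximum scaling maps each input onto [0, 1] using the extremes observed in the training data (the age values below are illustrative, not taken from the real dataset):

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale a list of raw values to the [0, 1] range using
    minimum-maximum normalization."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    return [(v - lo) / (hi - lo) for v in values]

ages = [34, 51, 67, 45, 72]  # hypothetical patient ages
scaled = min_max_scale(ages)
print(scaled)  # the youngest patient maps to 0.0, the oldest to 1.0
```

In deployment, the minimum and maximum from training must be reused (passed as `lo` and `hi`), so that new patients are scaled consistently with the data the network learned from.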
The dense layer uses the logistic function, allowing the output to be interpreted as a probability of developing diabetic complications.
The yellow circles represent scaling neurons, and the red circles represent dense neurons.
The number of inputs is 4, and the number of outputs is 1.
The result is presented as predicted probabilities, providing healthcare professionals with an objective measure to support decision-making.
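The forward pass through the dense layer can be sketched as a weighted sum followed by the logistic function. The weights and bias below are placeholders; a trained network would use learned values:

```python
import math

def logistic(x):
    """Logistic (sigmoid) activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(scaled_inputs, weights, bias):
    """One dense output neuron: weighted sum of the scaled inputs plus
    a bias, passed through the logistic function to yield a probability."""
    z = sum(w * x for w, x in zip(weights, scaled_inputs)) + bias
    return logistic(z)

# Four scaled inputs, one output, as in the network described above.
prob = dense_layer([0.4, 0.7, 0.5, 0.6],
                   weights=[1.2, 0.8, -0.3, 0.9],  # hypothetical weights
                   bias=-1.0)                       # hypothetical bias
```

Because the logistic output always lies strictly between 0 and 1, it can be read directly as the predicted probability of developing diabetic complications.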
4. Training strategy
To train the neural network for diabetic prognosis, we defined a loss function and an optimization algorithm.
The loss function is the weighted squared error with L1 regularization: the error term penalizes prediction mistakes, while the regularization term discourages overly large parameters and so helps prevent overfitting.
For optimization, we used the quasi-Newton method, a standard and efficient approach for this type of problem.
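A minimal sketch of such a loss function follows, assuming the "weighted" term assigns different costs to positive and negative instances (the exact weighting and regularization constant used by the tool may differ; the values here are illustrative):

```python
def weighted_squared_error_l1(outputs, targets, params,
                              positive_weight=1.0, negative_weight=1.0,
                              l1_lambda=0.01):
    """Mean squared error with per-class weights, plus an L1 penalty
    on the network parameters."""
    error = 0.0
    for y, t in zip(outputs, targets):
        w = positive_weight if t == 1 else negative_weight
        error += w * (y - t) ** 2
    error /= len(targets)
    # The L1 term penalizes large parameter magnitudes, which
    # discourages overfitting.
    penalty = l1_lambda * sum(abs(p) for p in params)
    return error + penalty

loss = weighted_squared_error_l1([0.9, 0.2], [1, 0],
                                 params=[1.2, -0.8], l1_lambda=0.01)
```

Raising `positive_weight` relative to `negative_weight` would make missed positive cases more costly, which can be useful when the classes are imbalanced.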
The chart below illustrates how the error progressively decreased as the model became more accurate at predicting the likelihood of diabetic complications.
During training, the error decreased steadily, reaching a training error of 1.147 and a selection error of 1.156 (weighted squared error), showing a good balance between fitting known data and generalizing to new patients.
5. Model selection
The objective of model selection is to find the network architecture with the best generalization properties, that is, the one that minimizes the error on the selection instances of the data set.
Since the selection error we have achieved so far (1.156) is already low, we do not need to apply order selection or input selection here.
6. Testing analysis
The objective of the testing analysis is to validate the generalization performance of the trained neural network.
To validate a classification model, we need to compare the values provided by this model to the observed values.
ROC curve
The ROC curve is the standard method to evaluate binary classifiers.
A random model has an area under the curve (AUC) of 0.5, while a perfect model has an AUC of 1.
In this example, the AUC is 0.826, indicating good discriminative performance.
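The AUC can be understood as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (the Mann-Whitney formulation). A sketch with toy scores:

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison: fraction of (positive, negative)
    pairs in which the positive instance is ranked higher."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0 for a perfect ranking
```

Note that the AUC depends only on the ranking of the scores, not on the 0.5 decision threshold used later for the confusion matrix.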
Confusion matrix
This confusion matrix contains the true positives, false positives, false negatives, and true negatives for the variable diagnose.
The following table contains the elements of this matrix for a decision threshold of 0.5.
| | Predicted positive | Predicted negative |
|---|---|---|
| Real positive | 471 | 164 |
| Real negative | 140 | 425 |
The number of correctly classified instances is 896 (471 + 425), and the number of misclassified instances is 304 (140 + 164).
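Building the confusion matrix amounts to thresholding each predicted probability at 0.5 and tallying the four outcomes. A sketch with toy probabilities and targets:

```python
def confusion_matrix(probabilities, targets, threshold=0.5):
    """Count true positives, false positives, false negatives,
    and true negatives at the given decision threshold."""
    tp = fp = fn = tn = 0
    for p, t in zip(probabilities, targets):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and t == 1:
            tp += 1
        elif predicted == 1 and t == 0:
            fp += 1
        elif predicted == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_matrix([0.9, 0.4, 0.7, 0.1], [1, 1, 0, 0])
```

Lowering the threshold trades false negatives for false positives, which may be appropriate here, since missing a patient at risk of vision loss is usually costlier than a false alarm.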
Binary classification tests
- Classification accuracy (ratio of instances correctly classified): 74.7%
- Error rate (ratio of instances misclassified): 25.3%
- Sensitivity (ratio of real positives that the model predicts as positives): 74.2%
- Specificity (ratio of real negatives that the model predicts as negatives): 75.2%
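These four tests follow directly from the confusion matrix entries reported above (tp = 471, fn = 164, fp = 140, tn = 425 at threshold 0.5):

```python
tp, fn, fp, tn = 471, 164, 140, 425
total = tp + fn + fp + tn  # 1200 testing instances

accuracy = (tp + tn) / total      # correctly classified
error_rate = (fp + fn) / total    # misclassified
sensitivity = tp / (tp + fn)      # real positives predicted as positives
specificity = tn / (tn + fp)      # real negatives predicted as negatives

print(f"{accuracy:.1%} {error_rate:.1%} "
      f"{sensitivity:.1%} {specificity:.1%}")  # 74.7% 25.3% 74.2% 75.2%
```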
7. Model deployment
Once we have tested the neural network's generalization performance, we can save the model for future use; this is known as deployment mode.
In this phase, the diabetic retinopathy prognosis model can be applied to new patients by calculating the neural network outputs based on their input variables.
The inputs required for the model are age, systolic blood pressure, diastolic blood pressure, and cholesterol.
These variables allow the model to estimate the probability that a patient may develop diabetic retinopathy.
The model represented by the neural network is shown below.
The neural network can assist physicians in early detection and timely intervention.
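As a minimal sketch, deployment amounts to evaluating the network's mathematical expression on a new patient's values. The scaling ranges, weights, and bias below are placeholders; a real deployment would use the values exported after training:

```python
import math

# Hypothetical training-time ranges for minimum-maximum scaling.
RANGES = {"age": (20, 90), "systolic_bp": (90, 200),
          "diastolic_bp": (50, 120), "cholesterol": (100, 320)}
WEIGHTS = [1.1, 0.9, 0.2, 0.8]  # hypothetical trained weights
BIAS = -1.4                      # hypothetical trained bias

def predict_risk(patient):
    """Return the predicted probability of diabetic retinopathy
    for a dict of raw patient values."""
    scaled = [(patient[k] - lo) / (hi - lo)
              for k, (lo, hi) in RANGES.items()]
    z = sum(w * x for w, x in zip(WEIGHTS, scaled)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

risk = predict_risk({"age": 62, "systolic_bp": 150,
                     "diastolic_bp": 85, "cholesterol": 240})
```

The resulting probability can then be compared against the chosen decision threshold to flag patients for closer follow-up.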
Conclusions
The diabetic retinopathy prognosis machine learning model developed from the Coursera dataset demonstrated good performance (accuracy = 74.7%, AUC = 0.826) in predicting whether patients would develop diabetic retinopathy.
The most influential variables, including age, systolic blood pressure, and cholesterol levels, align with established clinical knowledge, supporting the model’s reliability.
Due to its strong generalization capacity, this diabetic retinopathy prognosis model can serve as an effective tool to assist healthcare professionals in early risk assessment, complement clinical evaluations, and improve timely intervention for patients at risk of vision loss.
References
The data for this problem has been taken from the Coursera repository.