Introduction

Diabetic retinopathy risk prediction using machine learning helps physicians improve early detection, enabling timely intervention and reducing the risk of vision loss.

Diabetic retinopathy is a common complication of diabetes, and an accurate prognosis is critical for preserving patients’ quality of life.

However, predicting disease progression from routine clinical and laboratory data can be challenging.

We implemented a neural network model that learns from patient features such as age, systolic blood pressure, and cholesterol levels.

Using a dataset of 6,000 patients, the model achieved a test accuracy of 74.7% and an area under the ROC curve of 0.826, showing potential as a support tool for diabetic retinopathy prognosis.

Healthcare professionals interested in testing this methodology can explore Neural Designer through its trial version, enabling them to apply the diabetic prognosis machine learning model to new patient data.

Contents

The following index outlines the steps for performing the analysis.

1. Model type

2. Data set

3. Neural network

4. Training strategy

5. Model selection

6. Testing analysis

7. Model deployment

1. Model type

This project focuses on diabetic retinopathy risk prediction: the goal is to determine whether a patient will develop diabetic retinopathy, given routine clinical features.

The variable to be predicted can take two values (positive or negative for diabetic retinopathy), so this is a binary classification project.

2. Data set

Data source

The diabetic_retinopathy.csv file contains the data for this application. In a binary classification project, the target variable can take only two values: 0 (false) or 1 (true). The data set has 6000 instances (rows) and 6 variables (columns).

Variables

The following list summarizes the variables’ information:

Patient features

  • age – Patient’s age.
  • systolic blood pressure – Systolic blood pressure reading.
  • diastolic blood pressure – Diastolic blood pressure reading.
  • cholesterol – Cholesterol level.

Target variable

  • diabetic retinopathy (0 or 1) – Negative (0) or positive (1) for diabetic retinopathy.

Instances

Finally, all instances are used in the analysis. Each instance contains the input and target variables of a different patient.

The dataset is split into three parts: 60% for training (3600 samples), 20% for validation (1200 samples), and 20% for testing (1200 samples).
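The 60/20/20 split can be sketched in a few lines of Python. This is only an illustration: the source does not say how instances are assigned to each subset, so the random shuffle, the seed, and the function name `split_indices` are assumptions.

```python
import random


def split_indices(n_samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle sample indices and split them into training,
    validation, and testing subsets (random assignment is assumed)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


train_idx, val_idx, test_idx = split_indices(6000)
print(len(train_idx), len(val_idx), len(test_idx))  # 3600 1200 1200
```

Fixing the seed keeps the split reproducible, so the testing subset stays untouched across training runs.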

Variables distributions

We can also calculate the distributions for all the variables.

The following pie chart shows the number of patients with diabetic retinopathy and without it in the data set.

About 48.6% of the samples have diabetic retinopathy, while 51.4% do not.

Inputs-targets correlations

The input-target correlations are also worth examining, since they indicate which factors are most strongly related to the disease.

From the picture above, we can gather that all the variables have a similar influence on the target variable, except for the diastolic blood pressure, which is less related.
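As a rough illustration of how an input-target correlation can be computed, the sketch below uses the Pearson coefficient between one input and the binary target. The source does not state which correlation coefficient is used, so Pearson is a stand-in here, and the sample values are made up rather than taken from the dataset.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Made-up illustrative values, not taken from the dataset.
age = [45, 52, 61, 58, 70, 66]
target = [0, 0, 1, 0, 1, 1]

print(round(pearson_correlation(age, target), 3))
```

A value near 0 means the input carries little linear information about the target; values closer to 1 or -1 indicate a stronger relationship.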

3. Neural network

The model uses a neural network, a type of artificial intelligence that learns to recognize complex patterns in clinical data.

The network analyzes the variables together, identifying patterns and relationships associated with a higher likelihood of diabetic complications.

Patient values, such as age, blood pressure, and cholesterol, are normalized using the minimum-maximum scaling method.

The network includes a scaling layer and a dense layer.

The scaling layer applies this minimum-maximum normalization to each input.

The dense layer uses the logistic function, allowing the output to be interpreted as a probability of developing diabetic complications.

The yellow circles represent scaling neurons, and the red circles represent dense neurons.

The number of inputs is 4, and the number of outputs is 1.

The result is presented as predicted probabilities, providing healthcare professionals with an objective measure to support decision-making.
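The architecture described in this section (a scaling layer followed by a single logistic neuron) can be sketched as follows. All numbers below are hypothetical: the input order, the minimum and maximum values, the weights, and the bias are placeholders, not the trained model's parameters. The scaling range [0, 1] is also an assumption, since the source does not specify it.

```python
import math


def min_max_scale(value, minimum, maximum):
    """Map a raw value to [0, 1] (one common min-max convention, assumed here)."""
    return (value - minimum) / (maximum - minimum)


def logistic(z):
    """Logistic activation, which maps any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(inputs, minimums, maximums, weights, bias):
    """Scaling layer followed by a dense layer with one logistic neuron."""
    scaled = [min_max_scale(v, lo, hi)
              for v, lo, hi in zip(inputs, minimums, maximums)]
    z = bias + sum(w * s for w, s in zip(weights, scaled))
    return logistic(z)


# Hypothetical order: age, systolic BP, diastolic BP, cholesterol.
# Ranges and parameters are placeholders for illustration only.
minimums = [30.0, 80.0, 50.0, 70.0]
maximums = [90.0, 180.0, 120.0, 130.0]
weights = [1.2, 1.0, 0.2, 0.9]
bias = -1.5

patient = [60.0, 130.0, 85.0, 100.0]
probability = predict_probability(patient, minimums, maximums, weights, bias)
print(f"{probability:.3f}")
```

Because the output neuron is logistic, the result always lies strictly between 0 and 1 and can be read as a probability.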

4. Training strategy

To train the neural network for diabetic prognosis, we defined a loss function and an optimization algorithm.

The loss function is the weighted squared error with L1 regularization, which allows the model to learn from mistakes while preventing overfitting.
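A minimal sketch of a weighted squared error with an L1 penalty is shown below. The exact weighting and normalization used in the original training are not given in the text, so the per-class weights, the division by the number of samples, and the regularization strength are assumptions. Values produced by this sketch are therefore not expected to match the reported WSE figures.

```python
def weighted_squared_error(targets, outputs, positive_weight, negative_weight):
    """Squared error in which positives and negatives carry different weights.
    Dividing by the number of samples is an assumed normalization."""
    total = 0.0
    for t, o in zip(targets, outputs):
        weight = positive_weight if t == 1 else negative_weight
        total += weight * (o - t) ** 2
    return total / len(targets)


def l1_penalty(parameters, strength):
    """L1 regularization term: strength times the sum of absolute parameter values."""
    return strength * sum(abs(p) for p in parameters)


def loss(targets, outputs, parameters,
         positive_weight=1.0, negative_weight=1.0, strength=0.01):
    """Weighted squared error plus L1 regularization."""
    error = weighted_squared_error(targets, outputs,
                                   positive_weight, negative_weight)
    return error + l1_penalty(parameters, strength)


# Toy example with made-up targets, outputs, and parameters.
print(loss([1, 0, 1], [0.8, 0.3, 0.6], parameters=[0.5, -1.2, 0.1]))
```

The L1 term penalizes large parameter values and tends to push unimportant weights toward zero, which is how it helps limit overfitting.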

For optimization, we used the quasi-Newton method, a standard and efficient approach for this type of problem.

The chart below illustrates how the error progressively decreased as the model became more accurate at predicting the likelihood of diabetic complications.

During training, the error decreased steadily, reaching a training error of 1.147 WSE and a selection error of 1.156 WSE, showing a good balance between fitting known data and generalizing to new patients.

5. Model selection

The objective of model selection is to find the network architecture with the best generalization properties, that is, the one that minimizes the error on the selection instances of the data set.

Since the selection error achieved so far (1.156) is already close to the training error (1.147), we don’t need to apply order selection or input selection here.

6. Testing analysis

The objective of the testing analysis is to validate the generalization performance of the trained neural network.

To validate a classification model, we need to compare the values provided by this model to the observed values.

ROC curve

The ROC curve is the standard method to evaluate binary classifiers.

A random model has an area under the curve (AUC) of 0.5, while a perfect model has an AUC of 1.
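To make the AUC values above concrete, the sketch below computes the AUC from its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting as one half. The function name `roc_auc` and the toy data are illustrative. The quadratic loop is fine for small samples but is not an efficient implementation.

```python
def roc_auc(targets, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties between a positive and a negative score count as half a win."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


targets = [1, 1, 0, 0]
print(roc_auc(targets, [0.9, 0.8, 0.3, 0.1]))  # perfect ranking: 1.0
print(roc_auc(targets, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores: 0.5
```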

In this example, the AUC is 0.826, indicating good discriminative performance.

Confusion matrix

The confusion matrix contains the true positives, false positives, false negatives, and true negatives for the target variable.

The following table contains the elements of this matrix for a decision threshold of 0.5.

                 Predicted positive   Predicted negative
Real positive    471                  164
Real negative    140                  425

The number of correctly classified instances is 896 (471 + 425), and the number of misclassified instances is 304 (164 + 140), out of 1200 testing instances.

Binary classification tests

From the confusion matrix, we can calculate metrics that measure the performance of a binary classifier:

  • Classification accuracy (ratio of instances correctly classified): 74.7%
  • Error rate (ratio of instances misclassified): 25.3%
  • Sensitivity (ratio of real positives that the model predicts as positives): 74.2%
  • Specificity (ratio of real negatives that the model predicts as negatives): 75.2%

The model shows balanced performance, correctly classifying about three out of four cases.
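These metrics follow directly from the four cells of the confusion matrix. The short check below recomputes them with the standard definitions, where sensitivity is the true positive rate and specificity is the true negative rate. The variable names are illustrative.

```python
# Cells of the confusion matrix at a decision threshold of 0.5.
tp, fn = 471, 164  # real positives: predicted positive / predicted negative
fp, tn = 140, 425  # real negatives: predicted positive / predicted negative

total = tp + fn + fp + tn

accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"accuracy={accuracy:.3f}")        # accuracy=0.747
print(f"error_rate={error_rate:.3f}")    # error_rate=0.253
print(f"sensitivity={sensitivity:.3f}")  # sensitivity=0.742
print(f"specificity={specificity:.3f}")  # specificity=0.752
```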

7. Model deployment

Once we have tested the neural network’s generalization performance, we can save the model for future use in the so-called deployment mode.

In this phase, the diabetic retinopathy prognosis model can be applied to new patients by calculating the neural network outputs based on their input variables.

The inputs required for the model are age, systolic blood pressure, diastolic blood pressure, and cholesterol.

These variables allow the model to estimate the probability that a patient may develop diabetic retinopathy.

The model represented by the neural network is shown below.

The neural network can assist physicians in early detection and timely intervention.

Conclusions

The diabetic retinopathy prognosis machine learning model developed from the Coursera dataset demonstrated good performance (accuracy = 74.7%, AUC = 0.826) in predicting whether patients would develop diabetic retinopathy.

The most influential variables, including age, systolic blood pressure, and cholesterol levels, align with established clinical knowledge, supporting the model’s reliability.

Because its selection error stayed close to its training error, the model generalizes reasonably well to unseen patients. It can therefore serve as a support tool to assist healthcare professionals in early risk assessment, complement clinical evaluations, and improve timely intervention for patients at risk of vision loss.

Solve this case study or build your own model with your data using Neural Designer.

References

The data for this problem has been taken from the Coursera repository.
