Diabetic retinopathy, also known as diabetic eye disease, is a medical condition in which diabetes damages the retina.
This example predicts whether a patient has diabetic retinopathy.
The database used for this study was taken from the Coursera repository.
- Application type.
- Data set.
- Neural network.
- Training strategy.
- Model selection.
- Testing analysis.
- Model deployment.
- Tutorial video.
1. Application type
The variable to be predicted can have two values (positive or negative for diabetic retinopathy). Thus, this is a binary classification project.
The goal here is to predict whether a patient will suffer from diabetic retinopathy or not, based on the patient's age, blood pressure, and cholesterol.
2. Data set
The diabetic_retinopathy.csv file contains the data for this application. In a binary classification project, the target variable can only take two values: 0 (false) or 1 (true). The data set has 6000 instances (rows) and 6 variables (columns).
The number of input variables, or attributes for each sample, is 4. All input variables are numeric and describe the age and clinical measurements of each patient. The single target variable represents the presence or absence of retinopathy in each individual. The following list summarizes the variables:
- age: (numeric).
- systolic_bp: (normal range: below 120 mmHg). When the heart beats, it squeezes and pushes blood through the arteries to the rest of the body. This force creates pressure on the blood vessels, which is the systolic blood pressure.
- diastolic_bp: (normal range: below 80 mmHg). It is the pressure in the arteries when the heart rests between beats. This is the time when the heart fills with blood and gets oxygen.
- cholesterol: (normal range: between 125 and 200 mg/dl). It is a waxy, fat-like substance found in every cell in the body.
- prognosis: (0 or 1). It is 1 if the patient has retinopathy and 0 otherwise.
Finally, all instances are set to be used. Note that each instance contains the input and target variables of a different patient.
The data set is divided into training, selection (validation), and testing subsets. 60% of the instances are assigned for training, 20% for selection, and 20% for testing. Specifically, 3600 instances are used for training, 1200 for selection, and 1200 for testing.
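The subset sizes follow directly from the percentages. The sketch below is only an illustration: it assumes a random shuffle with a fixed seed, since the text does not say how instances are assigned to each subset, and `split_indices` is a hypothetical helper name.

```python
import random


def split_indices(n_instances, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle instance indices and cut them into training, selection, and testing.

    ASSUMPTION: random assignment with a fixed seed; the source does not
    describe the actual assignment procedure.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_training = round(fractions[0] * n_instances)
    n_selection = round(fractions[1] * n_instances)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]
    return training, selection, testing


training, selection, testing = split_indices(6000)
print(len(training), len(selection), len(testing))  # prints 3600 1200 1200
```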
Once the data set has been configured, we can run a few analyses on it. They let us check the provided information and ensure that the data is of good quality.
We can calculate the data statistics and draw a table with the minimums, maximums, means, and standard deviations of all attributes in the data set. The next table depicts these values.
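These four statistics are simple to compute for any column. A minimal sketch, using Python's standard library and a handful of made-up ages (hypothetical values, not taken from the data set):

```python
import statistics


def describe(values):
    """Return the minimum, maximum, mean, and sample standard deviation."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "standard_deviation": statistics.stdev(values),
    }


# Hypothetical ages, for illustration only.
sample_ages = [40.0, 50.0, 60.0, 70.0]
age_statistics = describe(sample_ages)
for name, value in age_statistics.items():
    print(f"{name}: {value:.4f}")
```

Applying `describe` to every column of the data file would reproduce the rows of the statistics table.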
We can also calculate the distributions for all the variables. The following pie chart shows the number of patients with diabetic retinopathy and without it in the data set.
As we can see, 48.55% of the patients in the samples suffer from diabetic retinopathy, and approximately 51.45% do not.
Other relevant numbers to keep in mind are the inputs-targets correlations, which indicate the factors that influence the disease the most.
From the picture above, we can gather that all the variables have a similar influence on the target variable, except for the diastolic blood pressure, which is less related.
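The text does not say which correlation coefficient is used. As a rough illustration, the sketch below computes the Pearson correlation between one input and the binary target; this choice is an assumption, and the sample values are hypothetical.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical sample: ages and the corresponding 0/1 prognosis.
ages = [40.0, 50.0, 60.0, 70.0]
prognoses = [0, 0, 1, 1]
correlation = pearson_correlation(ages, prognoses)
print(f"{correlation:.4f}")
```

A value close to 1 or -1 indicates a strong relationship with the target, while a value close to 0 indicates a weak one.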
3. Neural network
The third step is to set a neural network to represent the classification function. For this type of model, the neural network is composed of:
- Scaling layer.
- Perceptron layers.
- Probabilistic layer.
The scaling layer contains the statistics on the inputs calculated from the data file and the method for scaling the input variables. Here the minimum-maximum method has been set.
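As a sketch of the minimum-maximum method, the function below maps a value to the interval [-1, 1], which is the range used by the scaling terms in the mathematical expression at the end of this example. The age minimum and maximum are copied from that expression; the function names are hypothetical.

```python
def scale_minimum_maximum(value, minimum, maximum):
    """Map value from [minimum, maximum] to [-1, 1]."""
    return 2.0 * (value - minimum) / (maximum - minimum) - 1.0


def unscale_minimum_maximum(scaled, minimum, maximum):
    """Inverse mapping, from [-1, 1] back to [minimum, maximum]."""
    return minimum + (scaled + 1.0) * (maximum - minimum) / 2.0


# Age statistics as they appear in the exported expression.
AGE_MINIMUM = 35.16479874
AGE_MAXIMUM = 103.2789993

scaled_age = scale_minimum_maximum(55.0, AGE_MINIMUM, AGE_MAXIMUM)
print(f"{scaled_age:.4f}")
```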
Next, a perceptron layer with a hyperbolic tangent activation function is used. The neural network must have four inputs, one for each input variable, and one output for the target variable. As an initial guess, we use three neurons in the hidden layer.
The probabilistic layer only contains the method for interpreting the outputs as probabilities. As the output layer’s activation function is the logistic function, the output can be understood as a probability of class membership already.
The following figure is a graphical representation of this neural network for diabetic retinopathy prognosis.
The yellow circles represent scaling neurons, the blue circles represent perceptron neurons, and the red circles represent probabilistic neurons. The number of inputs is 4, and the number of outputs is 1.
4. Training strategy
The fourth step is to set the training strategy, which is composed of two terms:
- A loss index.
- An optimization algorithm.
The following chart shows how the error decreases with each iteration during the training process. The final training error is 0.681 WSE, and the final selection error is 0.705 WSE.
The blue line represents the training error, and the orange line represents the selection error. The initial value of the training error is 0.950331, and the final value after 48 epochs is 0.673678. The initial value of the selection error is 1.06847, and the final value after 48 epochs is 0.682127.
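The errors are reported in WSE, which we take here to mean weighted squared error; the text does not define it. The sketch below is therefore an assumption rather than the exact loss index used: squared errors on positive and negative instances receive separate weights, and the sum is divided by the number of instances.

```python
def weighted_squared_error(outputs, targets,
                           positives_weight=1.0, negatives_weight=1.0):
    """Weighted squared error between predicted probabilities and 0/1 targets.

    ASSUMPTION: both the weighting scheme and the normalization by the
    number of instances are illustrative choices, not taken from the source.
    """
    total = 0.0
    for output, target in zip(outputs, targets):
        weight = positives_weight if target == 1 else negatives_weight
        total += weight * (output - target) ** 2
    return total / len(targets)


# Hypothetical outputs for one positive and one negative patient.
error = weighted_squared_error([0.8, 0.3], [1, 0])
print(f"{error:.3f}")
```

Raising `positives_weight` penalizes missed positive cases more heavily, which can be useful when the two classes are unbalanced.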
5. Model selection
The objective of model selection is to find the network architecture with the best generalization properties. More specifically, we want to find a neural network with a selection error of less than 0.705 WSE, which is the value that we have achieved so far.
Order selection algorithms train several network architectures with different numbers of neurons or inputs and choose the one with the smallest selection error.
The incremental order method starts with a small number of neurons and increases the complexity at each iteration. The following chart shows the training error (blue) and the selection error (orange) as a function of the number of neurons.
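The loop behind this kind of method can be sketched as follows. The callable `train_and_evaluate` is hypothetical: it stands for whatever trains a network with the given number of hidden neurons and returns its training and selection errors. The toy error curve is made up for illustration.

```python
def incremental_order(train_and_evaluate, minimum_neurons=1, maximum_neurons=10):
    """Try increasing numbers of hidden neurons and keep the best selection error."""
    best_neurons = None
    best_selection_error = float("inf")
    history = []
    for neurons in range(minimum_neurons, maximum_neurons + 1):
        training_error, selection_error = train_and_evaluate(neurons)
        history.append((neurons, training_error, selection_error))
        if selection_error < best_selection_error:
            best_neurons = neurons
            best_selection_error = selection_error
    return best_neurons, best_selection_error, history


def toy_errors(neurons):
    """Hypothetical error curves: training error keeps falling with more
    neurons, while selection error rises again once the model overfits."""
    training_error = 1.0 / neurons
    selection_error = 1.0 / neurons + 0.1 * (neurons - 3) ** 2
    return training_error, selection_error


best_neurons, best_error, history = incremental_order(toy_errors, 1, 6)
print(best_neurons)  # prints 3
```

The selection error, not the training error, decides the winner, because the training error alone would always favor the largest network.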
The figure below shows the final architecture for the neural network. We can see that it does not use the diastolic bp input. This is supported by the results that we obtained for the inputs-targets correlations.
The number of inputs is 3, and the number of outputs is 1. Therefore, the architecture of the final network is 3:3:1, that is, 3 inputs, 3 hidden neurons, and 1 output.
6. Testing analysis
The objective of the testing analysis is to validate the generalization performance of the trained neural network. To validate a classification model, we need to compare the values provided by this model to the observed values. We can use the ROC curve as it is the standard testing method for binary classification projects.
The following table contains the elements of the confusion matrix. This matrix contains the true positives, false positives, false negatives, and true negatives for the variable prognosis. The total number of testing samples is 1200. The number of correctly classified samples is 893 (74%), and the number of misclassified samples is 307 (25%).
| |Predicted positive|Predicted negative|
|Real positive|464 (38%)|145 (12%)|
|Real negative|162 (13%)|429 (35%)|
The binary classification tests measure the performance of a classifier on a problem with two classes:
- Classification accuracy (ratio of instances correctly classified): 74.33%
- Error rate (ratio of instances misclassified): 25.66%
- Sensitivity (ratio of real positives that are predicted positive): 75.041%
- Specificity (ratio of real negatives that are predicted negative): 74.11%
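Each of these tests is a ratio of confusion matrix entries. The sketch below applies the standard definitions to the counts in the table as printed. Since it works from the printed counts, its results need not match the percentages listed above to the last digit.

```python
def binary_classification_tests(true_positives, false_negatives,
                                false_positives, true_negatives):
    """Standard binary classification tests from confusion matrix counts."""
    total = true_positives + false_negatives + false_positives + true_negatives
    correct = true_positives + true_negatives
    return {
        "accuracy": correct / total,
        "error_rate": (total - correct) / total,
        "sensitivity": true_positives / (true_positives + false_negatives),
        "specificity": true_negatives / (true_negatives + false_positives),
    }


# Counts from the confusion matrix: rows are real classes, columns predictions.
tests = binary_classification_tests(
    true_positives=464, false_negatives=145,
    false_positives=162, true_negatives=429,
)
for name, value in tests.items():
    print(f"{name}: {100.0 * value:.2f}%")
```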
7. Model deployment
Once the neural network’s generalization performance has been tested, the neural network can be saved for future use in the so-called model deployment mode.
We can predict the prognosis of new patients by calculating the neural network outputs. For that, we need to know their input variables. An example is the following:
- age: 55
- systolic_bp: 100.89 mmHg
- cholesterol: 140.78 mg/dl
- prognosis: 0.85, so the patient has an 85% probability of suffering from diabetic retinopathy.
We can also use Response Optimization. The objective of the response optimization algorithm is to exploit the mathematical model to look for optimal operating conditions.
An example is to minimize the probability of suffering diabetic retinopathy while maintaining the age between two values.
The next table summarizes the conditions for this problem.
The next list shows the optimum values for these conditions.
- age: 65.
- systolic_bp: 72.23 mmHg.
- diastolic_bp: 93.89 mmHg.
- cholesterol: 84.46 mg/dl.
- prognosis: 23.69%.
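The text does not describe the response optimization algorithm itself. As a stand-in, the sketch below uses a plain grid search, which is an assumption made for illustration: it evaluates a model on evenly spaced points inside the allowed ranges and keeps the point with the lowest output. The toy model and its bounds are hypothetical.

```python
import itertools


def grid_search_minimum(model, bounds, steps=5):
    """Evaluate model on a regular grid inside bounds and return the minimum.

    bounds is a list of (lower, upper) pairs, one per model input.
    """
    axes = []
    for lower, upper in bounds:
        if steps == 1 or lower == upper:
            axes.append([lower])
        else:
            step = (upper - lower) / (steps - 1)
            axes.append([lower + i * step for i in range(steps)])
    best_point = min(itertools.product(*axes), key=lambda point: model(*point))
    return best_point, model(*best_point)


def toy_model(age, systolic_bp):
    """Hypothetical response that grows with both inputs."""
    return 0.01 * age + 0.002 * systolic_bp


# Keep age between 60 and 70 while searching over the blood pressure range.
best_point, best_value = grid_search_minimum(
    toy_model, [(60.0, 70.0), (90.0, 120.0)]
)
print(best_point, round(best_value, 2))
```

In practice, `toy_model` would be replaced by the trained neural network, and the bounds by the conditions of the optimization problem.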
The mathematical expression represented by the neural network is written below. It takes age, systolic_bp, and cholesterol to produce the output prognosis. In classification models, the information is propagated feed-forward through the scaling, perceptron, and probabilistic layers.
scaled_age = age*(1+1)/(103.2789993-(35.16479874))-35.16479874*(1+1)/(103.2789993-35.16479874)-1;
scaled_systolic_bp = systolic_bp*(1+1)/(151.6999969-(69.67539978))-69.67539978*(1+1)/(151.6999969-69.67539978)-1;
scaled_cholesterol = cholesterol*(1+1)/(148.2339935-(69.96749878))-69.96749878*(1+1)/(148.2339935-69.96749878)-1;
perceptron_layer_0_output_0 = sigma[ 0.851161 + (scaled_age*1.62563)+ (scaled_systolic_bp*1.05418)+ (scaled_cholesterol*1.01386) ];
perceptron_layer_0_output_1 = sigma[ 0.688599 + (scaled_age*1.48316)+ (scaled_systolic_bp*1.12786)+ (scaled_cholesterol*0.744651) ];
perceptron_layer_0_output_2 = sigma[ -1.40861 + (scaled_age*-2.32242)+ (scaled_systolic_bp*-1.61186)+ (scaled_cholesterol*-1.35143) ];
probabilistic_layer_combinations_0 = -0.31941 +2.39474*perceptron_layer_0_output_0 +2.16092*perceptron_layer_0_output_1 -3.6845*perceptron_layer_0_output_2;
prognosis = 1.0/(1.0 + exp(-probabilistic_layer_combinations_0));
The above expression can be exported anywhere, for instance, to diagnostic software to be used by doctors.
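As an illustration of such an export, the expression can be transcribed into Python almost line by line. One point is an assumption: the expression does not define `sigma`. We take it to be the logistic function here, because that choice reproduces the 0.85 prognosis of the example patient above. The neural network section names the hyperbolic tangent for the hidden layer, so swap in `math.tanh` if that is the activation actually used.

```python
import math

# Minimums and maximums copied from the scaling terms of the expression.
RANGES = {
    "age": (35.16479874, 103.2789993),
    "systolic_bp": (69.67539978, 151.6999969),
    "cholesterol": (69.96749878, 148.2339935),
}


def scale(value, minimum, maximum):
    """Minimum-maximum scaling to the interval [-1, 1]."""
    return 2.0 * (value - minimum) / (maximum - minimum) - 1.0


def sigma(x):
    """ASSUMPTION: logistic activation; the expression leaves sigma undefined."""
    return 1.0 / (1.0 + math.exp(-x))


def prognosis(age, systolic_bp, cholesterol):
    """Probability of diabetic retinopathy according to the exported model."""
    a = scale(age, *RANGES["age"])
    s = scale(systolic_bp, *RANGES["systolic_bp"])
    c = scale(cholesterol, *RANGES["cholesterol"])
    hidden_0 = sigma(0.851161 + 1.62563 * a + 1.05418 * s + 1.01386 * c)
    hidden_1 = sigma(0.688599 + 1.48316 * a + 1.12786 * s + 0.744651 * c)
    hidden_2 = sigma(-1.40861 - 2.32242 * a - 1.61186 * s - 1.35143 * c)
    combination = (-0.31941 + 2.39474 * hidden_0
                   + 2.16092 * hidden_1 - 3.6845 * hidden_2)
    return 1.0 / (1.0 + math.exp(-combination))


# Example patient from the model deployment section.
probability = prognosis(age=55, systolic_bp=100.89, cholesterol=140.78)
print(round(probability, 2))  # prints 0.85
```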
8. Tutorial video
You can watch the step-by-step tutorial video below to help you complete this machine learning example for free with the easy-to-use machine learning software Neural Designer.
- The data for this problem has been taken from the Coursera repository.