Diabetic retinopathy, also known as diabetic eye disease, is a medical condition in which damage occurs to the retina due to diabetes.
This example aims to assess whether a patient has diabetic retinopathy. The data set used for this study was taken from the Coursera repository.
This example is solved with Neural Designer. To follow it step by step, you can use the free trial.
This is a binary classification project, since the variable to be predicted can have two values (diabetic retinopathy or not).
The goal here is to predict the probability that a patient will suffer from diabetic retinopathy, conditioned on blood test features.
The diabetic_retinopathy.csv file contains the data for this application. In this type of classification project, the target variable can only have two values: 0 (false) or 1 (true). The number of instances (rows) in the data set is 6000, and the number of variables (columns) is 6.
The number of input variables, or attributes for each sample, is 3. All input variables are numeric-valued and represent the value of each test. The number of target variables is 1, and it represents the absence or presence of diabetic retinopathy in an individual. The following list summarizes the variable information:
Finally, the use of all instances is set. Note that each instance contains the input and target variables of a different patient. The data set is divided into training, selection (validation), and testing subsets. 60% of the instances are assigned to training, 20% to selection, and 20% to testing: 3600 training samples, 1200 selection samples, and 1200 testing samples.
Once the data set has been configured, we can run a few descriptive analyses to check the provided information and make sure that the data is of good quality.
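The 60/20/20 division can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Neural Designer assigns instances internally; the function name `split_indices` and the fixed random seed are assumptions made for the example.

```python
import random


def split_indices(n_instances, train_frac=0.6, selection_frac=0.2, seed=0):
    """Shuffle instance indices and cut them into three disjoint subsets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seed is arbitrary (assumption)
    n_train = round(n_instances * train_frac)
    n_selection = round(n_instances * selection_frac)
    training = indices[:n_train]
    selection = indices[n_train:n_train + n_selection]
    testing = indices[n_train + n_selection:]  # whatever remains
    return training, selection, testing


training, selection, testing = split_indices(6000)
print(len(training), len(selection), len(testing))  # 3600 1200 1200
```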
We can calculate the data statistics and draw a table with the minimums, maximums, means, and standard deviations of all variables in the data set. The next table depicts the values.
Also, we can calculate the distributions for all variables. The following figure shows a pie chart with the numbers of patients with and without diabetic retinopathy in the data set.
As we can see, patients who will suffer from diabetic retinopathy make up approximately 48.55% of the samples, and those who will not make up approximately 51.45%.
Another relevant piece of information to keep in mind is the inputs-targets correlations, which indicate the factors that most influence a disease like diabetic retinopathy.
From the above picture, we can conclude that all the variables have a considerable influence on the target variable.
The second step is to set a neural network to represent the classification function. For this class of applications, the neural network is composed of a scaling layer, a perceptron layer, and a probabilistic layer.
The scaling layer contains the statistics on the inputs calculated from the data file and the method for scaling the input variables. Here the minimum and maximum method has been set. Nevertheless, the mean and standard deviation method would produce very similar results.
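As a minimal sketch of the minimum and maximum method, the function below maps each input linearly so that its minimum becomes -1 and its maximum becomes +1, which is the range used by the deployment expression later in this example. The function name is an assumption; the age bounds are the ones that appear in that expression.

```python
def scale_minimum_maximum(value, minimum, maximum):
    """Map value linearly so that minimum -> -1 and maximum -> +1."""
    return 2.0 * (value - minimum) / (maximum - minimum) - 1.0


# Minimum and maximum of the age input, as used in the exported expression.
AGE_MIN, AGE_MAX = 35.16479874, 103.2789993

print(scale_minimum_maximum(AGE_MIN, AGE_MIN, AGE_MAX))  # -1.0
print(scale_minimum_maximum(AGE_MAX, AGE_MIN, AGE_MAX))  # 1.0
```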
A perceptron layer with a hyperbolic tangent activation function is used. The neural network must have 3 inputs since there are 3 input variables, and 1 output since there is one target variable. As an initial guess, we use 3 neurons in the hidden layer.
The probabilistic layer only contains the method for interpreting the outputs as probabilities. A method that forces the outputs to sum to 1 would always yield 1 here, since there is only one output. Moreover, as the output layer's activation function is the logistic, that output can already be interpreted as a probability of class membership.
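To get a feel for the size of this model, the sketch below counts the weights and biases of a fully connected network, taking the 3 input variables from the data set description, 3 hidden neurons, and 1 output. The helper `count_parameters` is a hypothetical name introduced for this illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights plus one bias per neuron
    return total


print(count_parameters([3, 3, 1]))  # 16
```

The result, 16, matches the number of numeric coefficients in the perceptron and probabilistic layers of the exported expression: 12 in the hidden layer and 4 in the output layer.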
The next figure is a graphical representation of this neural network for diabetic retinopathy prognosis.
The yellow circles represent scaling neurons, the blue circles perceptron neurons, and the red circles probabilistic neurons. The number of inputs is 3, and the number of outputs is 1.
The third step is to set the training strategy, which is composed of two terms: a loss index and an optimization algorithm.
The loss index is the weighted squared error with L1 regularization. This is the default loss index for binary classification applications.
The learning problem can be stated as finding a neural network that minimizes the loss index. That is, a neural network that fits the data set (error term) and does not oscillate (regularization term).
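The structure of such a loss index (an error term plus a regularization term) can be sketched as follows. This is an illustration under stated assumptions, not Neural Designer's exact formula: the normalization by the number of instances, the class weights, and the `regularization_weight` value are all assumptions made for the example.

```python
def weighted_squared_error(outputs, targets, positives_weight=1.0, negatives_weight=1.0):
    """Squared error in which positive and negative instances get different weights."""
    total = 0.0
    for output, target in zip(outputs, targets):
        weight = positives_weight if target == 1 else negatives_weight
        total += weight * (output - target) ** 2
    return total / len(targets)  # normalization is an assumption


def l1_norm(parameters):
    """Sum of the absolute values of the network parameters."""
    return sum(abs(p) for p in parameters)


def loss_index(outputs, targets, parameters, regularization_weight=0.01):
    """Error term plus L1 regularization term (weight value is hypothetical)."""
    error = weighted_squared_error(outputs, targets)
    return error + regularization_weight * l1_norm(parameters)


# Two instances predicted at 0.5 with unequal class weights.
print(weighted_squared_error([0.5, 0.5], [1, 0], positives_weight=2.0))  # 0.375
```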
The optimization algorithm that we use is the quasi-Newton method. This is also the standard optimization algorithm for this type of problem.
The following chart shows how the error decreases with the iterations during the training process. The final training and selection errors are training error = 0.681 WSE and selection error = 0.705 WSE, respectively.
The blue line represents the training error, and the orange line represents the selection error. The initial value of the training error is 0.950331, and the final value after 48 epochs is 0.673678. The initial value of the selection error is 1.06847, and the final value after 48 epochs is 0.682127.
The objective of model selection is to find the network architecture with the best generalization properties, that is, the one that minimizes the error on the selection instances of the data set.
More specifically, we want to find a neural network with a selection error of less than 0.705 WSE, which is the value that we have achieved so far.
Order selection algorithms train several network architectures with different numbers of neurons and select the one with the smallest selection error.
The incremental order method starts with a small number of neurons and increases the complexity at each iteration. The following chart shows the training error (blue) and the selection error (orange) as a function of the number of neurons.
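The idea behind incremental order selection can be sketched as a simple loop. The callable `train_and_evaluate` is a hypothetical stand-in for training a network with a given number of hidden neurons and returning its selection error; the toy error values below are made up for illustration and are not results from this study.

```python
def incremental_order(train_and_evaluate, max_neurons):
    """Try 1..max_neurons hidden neurons and keep the lowest selection error."""
    best_neurons, best_error = None, float("inf")
    for neurons in range(1, max_neurons + 1):
        selection_error = train_and_evaluate(neurons)
        if selection_error < best_error:
            best_neurons, best_error = neurons, selection_error
    return best_neurons, best_error


# Made-up selection errors, used only to exercise the loop.
TOY_ERRORS = {1: 0.90, 2: 0.80, 3: 0.70, 4: 0.75, 5: 0.85}

best_neurons, best_error = incremental_order(TOY_ERRORS.get, max_neurons=5)
print(best_neurons, best_error)  # 3 0.7
```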
The figure below shows the final architecture for the neural network.
The number of inputs is 3, and the number of outputs is 1. The complexity, represented by the number of neurons in each layer, is 3:3:1.
The objective of the testing analysis is to validate the generalization performance of the trained neural network. To validate a classification technique, we need to compare the values provided by this technique to the observed values. We can use the ROC curve as it is the standard testing method for binary classification projects.
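As a minimal sketch of how a ROC curve and its area are obtained, the code below sweeps the decision threshold over the predicted probabilities and integrates the resulting curve with the trapezoidal rule. The scores and targets are toy values made up for the example, and the function names are assumptions.

```python
def roc_points(scores, targets):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    positives = sum(targets)
    negatives = len(targets) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 1)
        fp = sum(1 for s, t in zip(scores, targets) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy predictions (made up): higher scores should indicate the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
targets = [1, 1, 0, 1, 0, 0]

auc = area_under_curve(roc_points(scores, targets))
print(round(auc, 3))  # 0.889
```

An area of 1 corresponds to a perfect classifier, and an area of 0.5 to random guessing.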
The following table contains the elements of the confusion matrix. This matrix contains the true positives, false positives, false negatives, and true negatives for the variable diagnose. The total number of testing samples is 1200. The number of correctly classified samples is 893 (74.4%), and the number of misclassified samples is 307 (25.6%).
| | Predicted positive | Predicted negative |
| Real positive | 464 (38%) | 145 (12%) |
| Real negative | 162 (13%) | 429 (35%) |
The binary classification tests are parameters for measuring the performance of a classification problem with two classes:
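The most common of these measures can be recomputed directly from the confusion matrix above. The sketch below uses the standard definitions of accuracy, error rate, sensitivity, and specificity; the variable names are chosen for this illustration.

```python
# Confusion matrix entries taken from the table above.
true_positives, false_negatives = 464, 145
false_positives, true_negatives = 162, 429
total = true_positives + false_negatives + false_positives + true_negatives

accuracy = (true_positives + true_negatives) / total
error_rate = (false_positives + false_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)  # true positive rate
specificity = true_negatives / (true_negatives + false_positives)  # true negative rate

print(f"accuracy:    {accuracy:.3f}")     # 0.744
print(f"error rate:  {error_rate:.3f}")   # 0.256
print(f"sensitivity: {sensitivity:.3f}")  # 0.762
print(f"specificity: {specificity:.3f}")  # 0.726
```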
Once the neural network's generalization performance has been tested, the neural network can be saved for future use in the so-called model deployment mode.
We can make a prognosis for new patients by calculating the neural network outputs. To do so, we need to know the values of their input variables. An example is the following:
The mathematical expression represented by the neural network is written below. It takes the inputs age, systolic_bp and cholesterol to produce the output prognosis. In classification models, the information is propagated in a feed-forward fashion through the scaling layer, the perceptron layers and the probabilistic layer.
scaled_age = age*(1+1)/(103.2789993-(35.16479874))-35.16479874*(1+1)/(103.2789993-35.16479874)-1;
scaled_systolic_bp = systolic_bp*(1+1)/(151.6999969-(69.67539978))-69.67539978*(1+1)/(151.6999969-69.67539978)-1;
scaled_cholesterol = cholesterol*(1+1)/(148.2339935-(69.96749878))-69.96749878*(1+1)/(148.2339935-69.96749878)-1;
perceptron_layer_0_output_0 = sigma[ 0.851161 + (scaled_age*1.62563) + (scaled_systolic_bp*1.05418) + (scaled_cholesterol*1.01386) ];
perceptron_layer_0_output_1 = sigma[ 0.688599 + (scaled_age*1.48316) + (scaled_systolic_bp*1.12786) + (scaled_cholesterol*0.744651) ];
perceptron_layer_0_output_2 = sigma[ -1.40861 + (scaled_age*-2.32242) + (scaled_systolic_bp*-1.61186) + (scaled_cholesterol*-1.35143) ];
probabilistic_layer_combinations_0 = -0.31941 + 2.39474*perceptron_layer_0_output_0 + 2.16092*perceptron_layer_0_output_1 - 3.6845*perceptron_layer_0_output_2;
prognosis = 1.0/(1.0 + exp(-probabilistic_layer_combinations_0));
The above expression can be exported anywhere, for instance, to a dedicated diagnosis software to be used by doctors.
The file diabetic-retinopathy.py implements the mathematical expression of the neural network in Python. This piece of software can be embedded in any tool to make predictions on new data.
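The contents of that file are not reproduced here, but a hand-written transcription of the expression above might look like the following sketch. It assumes that sigma denotes the hyperbolic tangent, as stated for the perceptron layer; the function names and the sample inputs are hypothetical.

```python
import math


def scale(value, minimum, maximum):
    """Minimum-maximum scaling to the range [-1, 1]."""
    return 2.0 * (value - minimum) / (maximum - minimum) - 1.0


def prognosis(age, systolic_bp, cholesterol, sigma=math.tanh):
    """Probability of diabetic retinopathy according to the exported expression.

    sigma defaults to tanh, which is an assumption about the activation used.
    """
    scaled_age = scale(age, 35.16479874, 103.2789993)
    scaled_systolic_bp = scale(systolic_bp, 69.67539978, 151.6999969)
    scaled_cholesterol = scale(cholesterol, 69.96749878, 148.2339935)

    hidden_0 = sigma(0.851161 + 1.62563 * scaled_age
                     + 1.05418 * scaled_systolic_bp + 1.01386 * scaled_cholesterol)
    hidden_1 = sigma(0.688599 + 1.48316 * scaled_age
                     + 1.12786 * scaled_systolic_bp + 0.744651 * scaled_cholesterol)
    hidden_2 = sigma(-1.40861 - 2.32242 * scaled_age
                     - 1.61186 * scaled_systolic_bp - 1.35143 * scaled_cholesterol)

    combination = (-0.31941 + 2.39474 * hidden_0
                   + 2.16092 * hidden_1 - 3.6845 * hidden_2)
    return 1.0 / (1.0 + math.exp(-combination))  # logistic output


# Hypothetical patients, chosen only to exercise the function.
low_risk = prognosis(age=40.0, systolic_bp=80.0, cholesterol=80.0)
high_risk = prognosis(age=80.0, systolic_bp=120.0, cholesterol=110.0)
print(low_risk < high_risk)  # True
```

With these coefficients the output increases with each of the three inputs, so the second patient receives the higher probability.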
You can watch the step-by-step tutorial video below to help you complete this machine learning example for free using the Neural Designer machine learning software.