The purpose of this example is to diagnose which type of leukemia a patient has, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML), from their DNA microarray data.
Furthermore, because of the large number of genes, a gene selection will be performed to simplify the model and better understand the disease.
This is a microarray analysis application.
The DNA is coded in 7129 genes, each of which takes a value between 0 and 1. The output of the model is a binary value: 0 for acute lymphoblastic leukemia (ALL) and 1 for acute myeloid leukemia (AML).
This example is solved with Neural Designer. To follow it step by step, you can use the free trial.
This is a classification project since the variable to be predicted is binary (ALL or AML).
The goal here is to model the probability of ALL, conditioned on the microarray signals. Note that the probability of AML is 1 minus the probability of ALL.
The data file contains a total of 7129 genes and 72 patients. The first row in the data file contains the names of the variables, and the rest represent the instances.
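The file layout described above (a header row of variable names followed by one row per patient) can be read with a few lines of Python. This is only a minimal sketch: the comma delimiter, the column names `gene_1`, `gene_2`, `diagnosis`, and the helper `load_dataset` are assumptions made for illustration, not details of the actual data file.

```python
import csv
import io

def load_dataset(handle, target_column):
    """Split a header-first CSV into input names, input rows, and targets."""
    reader = csv.reader(handle)
    header = next(reader)  # first row: variable names
    target_index = header.index(target_column)
    input_names = [name for i, name in enumerate(header) if i != target_index]
    inputs, targets = [], []
    for row in reader:  # remaining rows: one instance (patient) each
        if not row:
            continue
        targets.append(int(row[target_index]))
        inputs.append([float(v) for i, v in enumerate(row) if i != target_index])
    return input_names, inputs, targets

# Toy stand-in for the real file (hypothetical column names and values).
toy_file = io.StringIO("gene_1,gene_2,diagnosis\n0.12,0.80,0\n0.95,0.33,1\n")
input_names, inputs, targets = load_dataset(toy_file, "diagnosis")
print(input_names, inputs, targets)
```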
The data distribution tells us the percentages of ALL and AML for the current dataset.
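Outside Neural Designer, the class percentages can be computed directly from the target column. The sketch below uses made-up toy labels (0 = ALL, 1 = AML), not the real 72 patients.

```python
from collections import Counter

def class_distribution(targets):
    """Return the percentage of each class label found in `targets`."""
    counts = Counter(targets)
    total = len(targets)
    return {label: 100.0 * n / total for label, n in counts.items()}

# Toy labels for illustration only: 0 = ALL, 1 = AML.
toy_targets = [0, 0, 0, 1, 0, 1, 0, 1]
distribution = class_distribution(toy_targets)
print(distribution)
```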
The inputs-targets correlations indicate which genes are most strongly related to the ALL or AML diagnosis.
As we can see in the previous figure, some genes are highly correlated with the diagnosis. Genes 4847 and 2288 have a perfect correlation with the target variable.
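To illustrate how genes can be ranked by their relation to the diagnosis, the sketch below computes a Pearson correlation between each gene and the binary target and sorts the genes by absolute value. Pearson correlation is an assumption here; it stands in for whatever correlation measure the tool actually uses, and the gene names and values are toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

def rank_genes(expression, targets):
    """Sort genes by the absolute value of their correlation with the target."""
    scores = {gene: pearson(values, targets) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Toy data: six patients, 0 = ALL, 1 = AML (hypothetical values).
toy_targets = [0, 0, 0, 1, 1, 1]
toy_expression = {
    "gene_a": [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],
    "gene_b": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
    "gene_c": [0.9, 0.8, 0.7, 0.2, 0.3, 0.1],
}
ranking = rank_genes(toy_expression, toy_targets)
for gene, score in ranking:
    print(gene, round(score, 3))
```

A correlation close to +1 or -1 marks a gene whose value alone nearly separates the two classes, while a value near 0 marks a gene that is uninformative on its own.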
The second step is to choose a neural network to represent the classification function. For classification problems, it is composed of:
However, due to the very large number of variables in this dataset, we are not defining the neural network in this step.
The training strategy is applied to the neural network to obtain the best possible performance. It is composed of two things:
We will not perform the training strategy for this example. As stated previously, the dataset contains a large number of variables. Therefore, before making choices about the neural network or the training strategy, we will perform a proper model selection.
Due to the large number of variables in microarray problems, a feature selection should be performed. The input selection is used to find the subset of inputs that gives the model its best performance.
In this example, the input selection algorithm chosen is growing inputs. This method is well suited to this kind of problem.
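The idea behind growing inputs can be sketched as a forward selection: inputs are added one at a time in order of relevance, and the subset with the lowest selection error is kept. The code below is only an illustration under stated assumptions. It takes the gene ranking as given, uses a simple nearest-class-mean rule as a stand-in for the neural network, and runs on toy data; the function names are hypothetical and this is not Neural Designer's implementation.

```python
def nearest_mean_error(train, select, genes):
    """Misclassification rate on `select` of a nearest-class-mean rule
    fitted on `train`, using only the listed gene indices."""
    means = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    errors = 0
    for x, y in select:
        dist = {
            label: sum((x[g] - m) ** 2 for g, m in zip(genes, means[label]))
            for label in (0, 1)
        }
        predicted = min(dist, key=dist.get)
        errors += predicted != y
    return errors / len(select)

def growing_inputs(ranked_genes, train, select, max_inputs=None):
    """Add genes in ranked order; keep the subset with the lowest selection error."""
    best_subset, best_error = [], float("inf")
    current = []
    for gene in ranked_genes[:max_inputs]:
        current.append(gene)
        error = nearest_mean_error(train, select, current)
        if error < best_error:
            best_subset, best_error = list(current), error
        if best_error == 0.0:
            break  # nothing left to improve
    return best_subset, best_error

# Toy data: (gene values, label); gene 0 separates the classes, gene 1 is noise.
train = [([0.1, 0.5], 0), ([0.2, 0.9], 0), ([0.9, 0.4], 1), ([0.8, 0.1], 1)]
select = [([0.15, 0.2], 0), ([0.85, 0.8], 1)]
subset, error = growing_inputs([0, 1], train, select)
print(subset, error)
```

In this toy case the first gene already separates the two classes on the selection instances, so the search stops with a single input.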
The next table shows the results of the input selection.
| Optimal number of inputs | 1 |
| Optimum training error | 0.0297108 |
| Optimum selection error | 0.0926624 |
We can observe that the algorithm performed no iterations. This is because it found a variable with perfect correlation, and it stopped with this variable as the only input.
Order selection algorithms train several network architectures with different numbers of neurons and select the one with the smallest selection error.
The incremental order method starts with a small number of neurons and increases the complexity at each iteration. The following chart shows the training error (blue) and the selection error (orange) as a function of the number of neurons.
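The incremental order loop can be sketched as follows. The `evaluate` callback, the `patience` stopping rule, and the toy error curve are assumptions made for illustration; in practice `evaluate` would train a network with the given number of neurons and return its training and selection errors.

```python
def incremental_order(evaluate, min_neurons=1, max_neurons=10, patience=2):
    """Try increasing hidden-layer sizes and return the size with the
    lowest selection error, plus the full error history."""
    best_neurons, best_error = None, float("inf")
    history = []
    failures = 0
    for neurons in range(min_neurons, max_neurons + 1):
        training_error, selection_error = evaluate(neurons)
        history.append((neurons, training_error, selection_error))
        if selection_error < best_error:
            best_neurons, best_error = neurons, selection_error
            failures = 0
        else:
            failures += 1
            if failures >= patience:
                break  # selection error stopped improving
    return best_neurons, best_error, history

def toy_evaluate(neurons):
    """Made-up error curves: training error falls, selection error rises."""
    return 0.05 / neurons, 0.03 + 0.01 * (neurons - 1)

best_neurons, best_error, history = incremental_order(toy_evaluate)
print(best_neurons, best_error)
```

Choosing by selection error rather than training error is the point of the method: larger networks keep lowering the training error but can generalize worse.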
The final selection error achieved is 0.029 for an optimal number of neurons of 1.
The final neural network is displayed below.
A standard method for testing the prediction capabilities is to compare the outputs from the neural network against an independent set of data. The confusion matrix shows which instances are misclassified.
|Predicted positive||Predicted negative|
As we can see in this confusion matrix, the model predicts the leukemia class perfectly on data independent of those used for training and input selection.
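For reference, a binary confusion matrix can be computed from actual and predicted labels as in the sketch below. Treating 1 (AML) as the positive class is an assumption, and the labels are toy values rather than the testing instances of this example.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["tp" if p == positive else "fn"] += 1
        else:
            counts["fp" if p == positive else "tn"] += 1
    return counts

def accuracy(counts):
    """Fraction of correctly classified instances."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

# Toy labels: 0 = ALL, 1 = AML; every prediction matches the actual class.
toy_actual = [1, 0, 0, 1, 0, 1]
toy_predicted = [1, 0, 0, 1, 0, 1]
matrix = confusion_matrix(toy_actual, toy_predicted)
print(matrix, accuracy(matrix))
```

A perfect prediction corresponds to zero false positives and zero false negatives, so all counts lie on the diagonal of the matrix.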
Once the model is obtained, Neural Designer provides the user with its mathematical expression in several programming languages. The file leukemia.py contains the model in Python.
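The contents of leukemia.py are not reproduced here. As a rough idea of what such an expression might look like for a network with one input and one hidden neuron, consider the sketch below. Everything in it is an assumption: the scaling to [-1, 1], the tanh and logistic activations, the function name, and all parameter values are hypothetical placeholders, not the trained model.

```python
import math

# Hypothetical parameters, chosen only so that the example runs.
INPUT_MIN, INPUT_MAX = 0.0, 1.0
HIDDEN_WEIGHT, HIDDEN_BIAS = 4.0, 0.0
OUTPUT_WEIGHT, OUTPUT_BIAS = 3.0, 0.0

def neural_network(gene_value):
    """Map one gene value in [0, 1] to a probability between 0 and 1."""
    # Scale the input to the range [-1, 1].
    scaled = 2.0 * (gene_value - INPUT_MIN) / (INPUT_MAX - INPUT_MIN) - 1.0
    # Single hidden neuron with a hyperbolic tangent activation.
    hidden = math.tanh(HIDDEN_BIAS + HIDDEN_WEIGHT * scaled)
    # Logistic output, interpreted as the probability of class 1 (AML).
    return 1.0 / (1.0 + math.exp(-(OUTPUT_BIAS + OUTPUT_WEIGHT * hidden)))

for value in (0.0, 0.5, 1.0):
    print(value, neural_network(value))
```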