
Leukemia microarray diagnosis

By Fernando Gomez, Artelnics.

The purpose of this example is to diagnose the type of leukemia of a patient from their DNA coding. Furthermore, because the DNA code is large, a gene selection is performed in order to simplify the model and gain better knowledge of the disease.

The DNA is coded in 7129 genes, each of which takes a value between 0 and 1. The output of the model is a binary value: 0 for acute lymphoblastic leukemia (ALL) and 1 for acute myeloid leukemia (AML).

Microarray representation.

Contents:

  1. Data set
  2. Inputs selection
  3. Testing analysis
  4. Model deployment

1. Data set

The data file contains a total of 7129 genes and 72 patients. The first row in the data file contains the names of the variables, and each of the remaining rows represents one instance (a patient). The following image contains a preview of the data set obtained by using the task "Report data set".

Data set preview.

In this kind of application, it is useful to look for logistic dependencies between each single input and the target variable.

Correlations chart.

As we can see in the previous figure, some genes are highly correlated with the diagnosis on their own. In fact, gene 4847 and gene 2288 have a perfect correlation with the target variable.
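
One way to quantify such a dependency is sketched below in R. It is only an illustration, under the assumption that the logistic correlation is the linear correlation between the target and a logistic curve fitted to a single input; Neural Designer may compute it differently. The function name `logistic_correlation` and the toy vectors are hypothetical and are not taken from the leukemia data file.

```r
# Hypothetical helper: fit a one-input logistic regression and correlate
# its fitted probabilities with the binary target.
logistic_correlation <- function(x, y) {
  fit <- glm(y ~ x, family = binomial())
  cor(fitted(fit), y)
}

# Toy data (assumed, not from the data file): 0 = ALL, 1 = AML.
target <- c(0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1)

# An input whose value rises with the class, with a small overlap.
informative_gene <- c(0.05, 0.10, 0.15, 0.20, 0.30, 0.35,
                      0.45, 0.55, 0.70, 0.80, 0.90, 0.95)

# An input with almost the same mean in both classes.
noise_gene <- c(0.50, 0.20, 0.90, 0.40, 0.70, 0.10,
                0.30, 0.80, 0.60, 0.95, 0.15, 0.45)

informative_correlation <- logistic_correlation(informative_gene, target)
noise_correlation <- logistic_correlation(noise_gene, target)

print(informative_correlation)
print(noise_correlation)
```

When a gene separates the two classes perfectly, as reported for genes 4847 and 2288, `glm` typically warns that fitted probabilities are numerically 0 or 1, and the resulting correlation is essentially 1.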

2. Inputs selection

Due to the large number of variables in microarray problems, a feature selection should be performed. The inputs selection is used to find the subset of inputs that gives the best performance of the model.

In this example, the inputs selection algorithm chosen is growing inputs. This method is well suited to this kind of problem, in which a few inputs are strongly correlated with the target.

The following table shows the final results of the algorithm, including its losses.

Growing inputs results.

In the previous table, we can see that the algorithm performed no iterations. This is because it found a variable with perfect correlation, and it stopped with this variable as the only input.

Finally, Neural Designer shows the final architecture of the neural network, which appears in the next figure.
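
The stopping behaviour described above can be sketched in R as follows. This is a rough illustration, not Neural Designer's implementation: it assumes that growing inputs ranks the candidates by correlation with the target and adds them one at a time until the error on a separate selection set stops improving. A linear model (`lm`) stands in for the neural network, and the function name `growing_inputs`, the `tolerance` parameter and the toy data frames are all hypothetical.

```r
# Hypothetical growing-inputs loop; a linear model replaces the neural network.
growing_inputs <- function(train, selection, target, tolerance = 1e-12) {
  inputs <- setdiff(names(train), target)

  # Rank candidate inputs by absolute correlation with the target.
  scores <- sapply(inputs, function(v) abs(cor(train[[v]], train[[target]])))
  ranked <- inputs[order(scores, decreasing = TRUE)]

  # Mean squared error on the selection instances for a given set of inputs.
  selection_error <- function(vars) {
    fit <- lm(reformulate(vars, response = target), data = train)
    mean((predict(fit, newdata = selection) - selection[[target]])^2)
  }

  # Start from the most correlated input.
  chosen <- ranked[1]
  best <- selection_error(chosen)

  for (v in ranked[-1]) {
    if (best < tolerance) break            # already perfect: nothing to gain
    candidate_error <- selection_error(c(chosen, v))
    if (candidate_error >= best) break     # adding the input did not help
    chosen <- c(chosen, v)
    best <- candidate_error
  }

  list(inputs = chosen, selection_error = best)
}

# Toy data (assumed): gene_a is perfectly correlated with the target.
train <- data.frame(
  gene_a = c(1, 1, 0, 0, 1, 0),
  gene_b = c(0.3, 0.7, 0.5, 0.2, 0.9, 0.4),
  leukemia = c(0, 0, 1, 1, 0, 1)
)
selection <- data.frame(
  gene_a = c(1, 0, 1, 0),
  gene_b = c(0.6, 0.1, 0.8, 0.35),
  leukemia = c(0, 1, 0, 1)
)

result <- growing_inputs(train, selection, target = "leukemia")
print(result$inputs)
```

Because the first input in the ranking already gives a selection error of essentially zero, the loop exits before adding any other variable, which mirrors the "no iterations" result in the table.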

Final architecture.

3. Testing analysis

A standard method for testing the prediction capabilities is to compare the outputs from the neural network against an independent set of data. The confusion matrix shows how many instances of each class have been classified correctly and how many have been misclassified.

Leukemia confusion matrix.

As we can see in this confusion matrix, the model predicts the leukemia class perfectly on data that are independent of those used for the training and the inputs selection.
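
For readers who want to reproduce this kind of check outside Neural Designer, a confusion matrix can be built in R with `table`. The sketch below is only an illustration: the function name `confusion_matrix`, the decision threshold of 0.5 and the toy testing values are assumptions, not figures from this example.

```r
# Hypothetical helper: cross-tabulate actual classes against thresholded outputs.
confusion_matrix <- function(actual, predicted_probability, threshold = 0.5) {
  predicted <- as.integer(predicted_probability >= threshold)
  table(
    actual = factor(actual, levels = c(0, 1)),
    predicted = factor(predicted, levels = c(0, 1))
  )
}

# Toy testing instances (assumed): 0 = ALL, 1 = AML.
actual_class <- c(0, 0, 0, 1, 1)
model_output <- c(0.01, 0.10, 0.30, 0.99, 0.80)

cm <- confusion_matrix(actual_class, model_output)
print(cm)
```

A perfect prediction corresponds to a matrix whose off-diagonal cells are all zero.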

4. Model deployment

Once the model is obtained, Neural Designer provides the user with its mathematical expression in several programming languages. The next listing shows that expression in the R language.

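    # Logistic activation: maps any real number into the interval (0, 1).
    Logistic <- function(x) {
        1/(1+exp(-x))
    }

    # Clamps a value to the interval [0, 1].
    Probability <- function(x) {
        if(x < 0)  0
        else if(x > 1)  1
        else  x
    }

    # Model output for one patient: close to 0 for ALL, close to 1 for AML.
    # Note: this name masks base R's expression() where it is defined.
    expression <- function(Gene_2288) {
        # Scale the input from [0, 1] to [-1, 1].
        scaled_Gene_2288 <- 2*(Gene_2288-0)/(1-0)-1
        non_probabilistic_Leukemia_class <- Logistic(-11.4153
            -17.362500000000001*scaled_Gene_2288)

        outputs <- Probability(non_probabilistic_Leukemia_class)
        outputs
    }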
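
As a usage illustration, the same formula can be wrapped in a small vectorized function that returns a class label for several patients at once. This is a sketch rather than code exported by Neural Designer: the function name `predict_leukemia` and the decision threshold of 0.5 are assumptions, while the weights and the scaling are copied from the listing above.

```r
# Hypothetical wrapper around the exported formula (weights taken from the listing).
predict_leukemia <- function(gene_2288, threshold = 0.5) {
  # Scale the input from [0, 1] to [-1, 1], as in the exported expression.
  scaled <- 2 * gene_2288 - 1
  # Logistic output of the single-input network.
  probability <- 1 / (1 + exp(-(-11.4153 - 17.3625 * scaled)))
  # Assumed decision rule: outputs at or above the threshold are labelled AML.
  ifelse(probability >= threshold, "AML", "ALL")
}

# Two illustrative values at the extremes of the gene range.
labels <- predict_leukemia(c(0, 1))
print(labels)
```

With these weights the output crosses 0.5 at a gene value of roughly 0.17, so lower values of gene 2288 are labelled AML and higher values ALL.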
				

Bibliography

  • Golub, T.R., Slonim, D.K., Tamayo, P., et al., "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring", Science, Vol. 286, pp. 531-537 (1999).