Breast cancer diagnosis
By Roberto Lopez, Artelnics.
In this tutorial a classification application in medicine is solved by means of a neural network. The data for this problem has been taken from the UCI Machine Learning Repository.
The aim of this classification problem is to assess whether a lump in a breast could be malignant (cancerous) or benign (non-cancerous) from digitized images of a fine-needle aspiration biopsy. The following figure illustrates this example.
The breast cancer database used here was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
The first step is to prepare the data file, which is the source of information for the classification problem. The breastcancer.dat file contains the data for this application. The input to Neural Designer is a data set, which can have different formats (CSV, XLS, etc.). Decimal marks should be points, not commas. In a classification project, target variables can only have two values: 0 (false) or 1 (true). The following listing is a preview of the data file. The number of instances (rows) in the data set is 683, and the number of variables (columns) is 10.
The next figure shows the data set tab in Neural Designer. It contains four sections:
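The format rules above can be checked before importing the file. The following is a minimal sketch, assuming a comma-separated layout with a header row; the actual layout of breastcancer.dat is not stated here, and the function name check_data_rows and the sample rows are made up for illustration.

```python
import csv
import io


def check_data_rows(text, target_column="diagnose"):
    """Parse CSV text and check that the target column holds only 0 or 1."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for line_number, record in enumerate(reader, start=2):
        # float() raises ValueError on non-numeric text such as "?".
        row = {name: float(value) for name, value in record.items()}
        if row[target_column] not in (0.0, 1.0):
            raise ValueError(f"line {line_number}: target must be 0 or 1")
        rows.append(row)
    return rows


# Two made-up rows, for illustration only (not taken from breastcancer.dat).
sample = "clump_thickness,mitoses,diagnose\n5,1,0\n8,3,1\n"
rows = check_data_rows(sample)
print(len(rows), "rows,", len(rows[0]), "columns")
```

On the real file, the same check would be expected to report 683 rows and 10 columns.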
- Data file.
- Variables information.
- Instances information.
- Missing values information.
To set a data file, click the "Import data file" button or the "import database" button (depending on the type of data), and select the file through the dialog that appears. If the data has a correct format, the first, second and last instances will be shown in the data preview table. The numbers of variables and instances are shown at the bottom right.
Then the information about the variables is edited. The number of input variables, or attributes, for each sample is 9. All input variables are numeric-valued, and represent measurements from digitized images of a fine-needle aspiration biopsy. The number of target variables is 1, and it represents the absence or presence of cancer in an individual. This is a binary classification application, with one target which represents two classes. The following list summarizes the variables information:
- clump_thickness: (1-10). Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
- cell_size_uniformity: (1-10). Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.
- cell_shape_uniformity: (1-10). Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.
- marginal_adhesion: (1-10). Normal cells tend to stick together. Cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
- single_epithelial_cell_size: (1-10). It is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
- bare_nuclei: (1-10). This is a term used for nuclei that are not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumours.
- bland_chromatin: (1-10). Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be coarser.
- normal_nucleoli: (1-10). Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
- mitoses: (1-10). Cancer is essentially a disease of uncontrolled mitosis.
- diagnose: (0 or 1). Benign (non-cancerous) or malignant (cancerous) lump in a breast.
Finally, the use of all instances is set. Note that each instance contains the input and target variables of a different patient. The data set is divided into training, validation and testing subsets. 60% of the instances will be assigned for training, 20% for validation and 20% for testing. Note that this data set has many repeated instances, which will not be used, since they provide redundant information.
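The handling of instances described above can be sketched as follows. This is an illustration under assumptions, not the procedure Neural Designer applies internally: the function name split_instances, the random shuffle and the rounding rule are all hypothetical.

```python
import random


def split_instances(instances, fractions=(0.6, 0.2, 0.2), seed=0):
    """Drop repeated instances, shuffle, and split into three subsets."""
    # dict.fromkeys keeps the first occurrence of each instance, in order.
    unique = list(dict.fromkeys(tuple(instance) for instance in instances))
    random.Random(seed).shuffle(unique)
    n_training = round(fractions[0] * len(unique))
    n_validation = round(fractions[1] * len(unique))
    training = unique[:n_training]
    validation = unique[n_training:n_training + n_validation]
    testing = unique[n_training + n_validation:]
    return training, validation, testing


# Made-up instances: 10 distinct rows plus 2 repeats.
data = [(i, i % 2) for i in range(10)] + [(0, 0), (1, 1)]
training, validation, testing = split_instances(data)
print(len(training), len(validation), len(testing))  # 6 2 2
```

The two repeated rows are discarded first, so the 60/20/20 split is taken over the 10 distinct instances.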
Once the data set page has been edited we are ready to run a few related tasks. With that, we check again the provided information and make sure that the data has good quality. Some data set tasks also perform minor adjustments to the variables information or the instances information sections.
The "Calculate data statistics" task draws a table with the minimums, maximums, means and standard deviations of all variables in the data set. The next figure depicts those values. All input variables range from 1 to 10. On the other hand, note that the mean of all variables is less than 5. Also note that the input variable with the smallest standard deviation is "mitoses".
The "Calculate data histograms" task draws a histogram for each variable to see how they are distributed. The following figures show the histograms with ten bins for two input variables, clump thickness and mitoses. The clump thickness histogram is well distributed, but the mitoses histogram has many instances in the first bin.
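The four statistics in that table are straightforward to reproduce for a single variable. The sketch below uses made-up values and assumes the sample standard deviation; the convention used by the tool is not stated here.

```python
import statistics


def column_statistics(values):
    """Return minimum, maximum, mean and standard deviation of one variable."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        # Sample standard deviation; the tool may use a different convention.
        "standard_deviation": statistics.stdev(values),
    }


# Made-up values on the 1-10 scale, for illustration only.
mitoses = [1, 1, 1, 2, 1, 10, 1, 3]
print(column_statistics(mitoses))
```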
The next chart shows the number of instances belonging to each class in the data set. The number of instances with negative Diagnose (blue) is 444, and the number of instances with positive Diagnose (purple) is 239.
The second step is to choose a network architecture to represent the classification function. For this class of applications, the neural network page is composed of:
- Scaling layer.
- Learning layers.
- Probabilistic layer.
The following figure shows the neural network tab in Neural Designer.
In the inputs section, the basic information about those variables is set. By default, the names, units and descriptions are those edited in the data set page for the input variables.
The scaling layer section contains the statistics on the inputs calculated from the data file and the method for scaling the input variables. Here the minimum and maximum method has been set. Nevertheless, the mean and standard deviation method would produce very similar results.
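The two scaling methods mentioned above can be compared on a single variable. This is a minimal sketch that assumes the minimum and maximum method maps values to the range [0, 1]; the range used by the tool may differ, and both function names are hypothetical.

```python
import statistics


def scale_minimum_maximum(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    return [(value - low) / (high - low) for value in values]


def scale_mean_standard_deviation(values):
    """Shift and rescale values to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    deviation = statistics.pstdev(values)
    return [(value - mean) / deviation for value in values]


# Made-up values on the 1-10 scale, for illustration only.
clump_thickness = [1, 5.5, 10]
print(scale_minimum_maximum(clump_thickness))  # [0.0, 0.5, 1.0]
print(scale_mean_standard_deviation(clump_thickness))
```

Both methods are linear maps of the input, which is one reason they tend to give similar results once the network is trained.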
A multilayer perceptron with a logistic hidden layer and a logistic output layer is used. Note that, since the logistic function ranges from 0 to 1, the outputs of this multilayer perceptron can be interpreted as probabilities. The neural network must have 9 inputs, since there are nine input variables, and 1 output, since there is one target variable. As an initial guess, we use 6 neurons in the hidden layer. This neural network can be denoted as a 9:6:1 multilayer perceptron.
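To make the 9:6:1 architecture concrete, the following sketch counts its free parameters and propagates one input vector through two logistic layers. The parameters are random and untrained, and all names (count_parameters, forward, random_layers) are hypothetical; this is not the code Neural Designer runs.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network such as (9, 6, 1)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def random_layers(layer_sizes, seed=0):
    """Build layers as lists of (weights, bias) pairs, one pair per neuron."""
    rng = random.Random(seed)
    return [
        [([rng.uniform(-1, 1) for _ in range(n_in)], rng.uniform(-1, 1))
         for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(inputs, layers):
    """Propagate inputs through logistic layers in a feed-forward fashion."""
    activations = inputs
    for layer in layers:
        activations = [
            logistic(bias + sum(w * a for w, a in zip(weights, activations)))
            for weights, bias in layer
        ]
    return activations


sizes = (9, 6, 1)
print(count_parameters(sizes))  # 67
output = forward([0.5] * 9, random_layers(sizes))
print(output)  # one value strictly between 0 and 1
```

The count comes from (9 + 1) * 6 parameters in the hidden layer plus (6 + 1) * 1 in the output layer.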
The probabilistic layer only contains the method for interpreting the outputs as probabilities. As the number of outputs is one, the softmax and competitive methods would not work. Indeed, as the sum of all outputs from a probabilistic layer must be 1, those two methods would always yield 1 here, since there is only one output. Therefore the no probabilistic method must be used for binary classification applications. Moreover, as the activation function of the output layer is the logistic, that output can already be interpreted as a probability of class membership.
Finally, in the outputs section, the basic information about those variables is set. As for the inputs, the default names, units and descriptions are those edited in the data set page for the target variables.
The next figure is a graphical representation of this neural network for medical diagnosis.
It defines a family V of parameterized functions y(x) of dimension s = 67, which is the number of free parameters. Elements of V map the nine scaled inputs to a single output through the logistic hidden layer and the logistic output layer.
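The claim about a single output is easy to verify for the softmax method. The sketch below is a plain softmax written for illustration, not the tool's implementation.

```python
import math


def softmax(outputs):
    """Normalize outputs so that they are positive and sum to 1."""
    largest = max(outputs)
    exponentials = [math.exp(output - largest) for output in outputs]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# With one output the result is 1 whatever the network produces.
print(softmax([-3.2]))  # [1.0]
print(softmax([4.7]))  # [1.0]
```

Because the single value is divided by itself, the result carries no information about the class, which is why the logistic output is used directly.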
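The original expression is not reproduced here. Under the architecture described above (nine scaled inputs, six logistic hidden neurons, one logistic output), a function in V could be written as in the following sketch. The notation is an assumption: $\tilde{x}_i$ denotes the scaled inputs, and the superscripts (1) and (2) label the hidden and output layers.

```latex
y(x) = \sigma\!\left( b^{(2)} + \sum_{j=1}^{6} w^{(2)}_{j}\,
       \sigma\!\left( b^{(1)}_{j} + \sum_{i=1}^{9} w^{(1)}_{ji}\,\tilde{x}_i \right) \right),
\qquad
\sigma(a) = \frac{1}{1 + e^{-a}}
```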
The third step is to set the loss index. A general loss index for classification is composed of two terms:
- An error term.
- A regularization term.
The following figure shows the loss index tab in Neural Designer.
The error term chosen is the normalized squared error. It divides the squared error between the outputs from the neural network and the targets in the data set by a normalization coefficient. If the normalized squared error has a value of unity then the neural network is predicting the data 'in the mean', while a value of zero means perfect prediction of the data. This error term does not have any parameters to set.
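The two reference values mentioned above (zero and unity) can be reproduced with a short sketch. It assumes that the normalization coefficient is the sum of squared deviations of the targets from their mean, which is consistent with the 'in the mean' interpretation but is not confirmed here.

```python
def normalized_squared_error(outputs, targets):
    """Squared error divided by an assumed normalization coefficient."""
    mean_target = sum(targets) / len(targets)
    squared_error = sum((o - t) ** 2 for o, t in zip(outputs, targets))
    # Assumption: normalize by the squared deviation of targets from their mean.
    normalization = sum((t - mean_target) ** 2 for t in targets)
    return squared_error / normalization


# Made-up targets, for illustration only.
targets = [0, 0, 1, 1]
print(normalized_squared_error(targets, targets))  # 0.0, perfect prediction
print(normalized_squared_error([0.5] * 4, targets))  # 1.0, predicting the mean
```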
The neural parameters norm is used as regularization term. It is applied to control the complexity of the neural network by reducing the value of the parameters. The weight of this regularization term in the loss index is 0.001.
The learning problem can be stated as finding a neural network which minimizes the loss index, i.e., a neural network that fits the data set (error term) and that does not oscillate (regularization term).
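The way the two terms combine can be sketched as follows, assuming the parameters norm is the Euclidean norm of the parameter vector; the tool may define it differently. The error value and parameters in the example are made up.

```python
import math


def parameters_norm(parameters):
    """Euclidean norm of the parameter vector (an assumed definition)."""
    return math.sqrt(sum(p * p for p in parameters))


def loss_index(error, parameters, regularization_weight=0.001):
    """Error term plus the weighted regularization term."""
    return error + regularization_weight * parameters_norm(parameters)


# Made-up error value and parameters, for illustration only.
print(loss_index(0.05, [3.0, 4.0]))
```

With a weight of 0.001, the regularization term only matters once the error term is small or the parameters grow large.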
The fourth step is to choose a training algorithm for solving the reduced function optimization problem. We will use the quasi-Newton method for training.
The following figure shows the training strategy tab in Neural Designer.
It is very easy for gradient algorithms to get stuck in local minima when learning multilayer perceptron weights. This means that we should always repeat the learning process from several different starting positions.
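The advice to restart from several positions can be illustrated on a toy problem. The sketch below swaps in plain gradient descent on a one-parameter function with several local minima; it is not the quasi-Newton method and not the network loss, and all names are hypothetical.

```python
import math
import random


def loss(w):
    """Toy one-parameter loss with several local minima."""
    return math.sin(3.0 * w) + 0.1 * w * w


def gradient(w):
    return 3.0 * math.cos(3.0 * w) + 0.2 * w


def descend(w, rate=0.01, steps=2000):
    """Plain gradient descent from the starting position w."""
    for _ in range(steps):
        w -= rate * gradient(w)
    return w


def multi_start(n_starts=20, seed=0):
    """Run descent from several random starts and keep the best result."""
    rng = random.Random(seed)
    candidates = [descend(rng.uniform(-5.0, 5.0)) for _ in range(n_starts)]
    return min(candidates, key=loss)


best = multi_start()
print(best, loss(best))
```

Each run ends in whichever local minimum is closest downhill from its start, so keeping the best of many runs lowers the risk of reporting a poor one.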
The following chart shows how the performance decreases with the iterations during the training process. The initial value is 1.88731, and the final value after 102 iterations is 0.0360315.
The next table shows the training results by the quasi-Newton method. They include some final states from the neural network, the loss index and the training algorithm. The parameters norm is not very big, the performance and generalization performance are small and the gradient norm is almost zero.
The last step is to validate the generalization performance of the trained neural network. To validate a classification technique we need to compare the values provided by this technique to the actually observed values.
The following table contains the elements of the confusion matrix. The element (0,0) contains the true positives, the element (0,1) contains the false positives, the element (1,0) contains the false negatives, and the element (1,1) contains the true negatives for the variable diagnose. The number of correctly classified instances is 166, and the number of misclassified instances is 4.
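The layout described above can be sketched as a small function. It assumes that an output of 0.5 or more is read as a positive diagnosis; the threshold used by the tool is not stated here, and the outputs and targets in the example are made up.

```python
def confusion_matrix(outputs, targets, threshold=0.5):
    """Return [[TP, FP], [FN, TN]], following the layout in the text."""
    true_pos = false_pos = false_neg = true_neg = 0
    for output, target in zip(outputs, targets):
        predicted = output >= threshold  # assumed decision threshold
        actual = target == 1
        if predicted and actual:
            true_pos += 1
        elif predicted:
            false_pos += 1
        elif actual:
            false_neg += 1
        else:
            true_neg += 1
    return [[true_pos, false_pos], [false_neg, true_neg]]


# Made-up outputs and targets, for illustration only.
outputs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
targets = [1, 1, 0, 0, 0, 1]
print(confusion_matrix(outputs, targets))  # [[2, 1], [1, 2]]
```

The correctly classified instances are the sum of the diagonal, and the misclassified ones are the sum of the off-diagonal elements.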
The classification accuracy, error rate, sensitivity, specificity, positive likelihood and negative likelihood are parameters for testing the performance of a classification problem with two classes. The classification accuracy is the ratio of instances correctly classified. The error rate is the ratio of instances misclassified. The sensitivity, or true positive rate, is the proportion of actual positives which are predicted positive. The specificity, or true negative rate, is the proportion of actual negatives which are predicted negative. The positive likelihood is the likelihood that a predicted positive is an actual positive. The negative likelihood is the likelihood that a predicted negative is an actual negative. Those values are computed through the "Calculate binary classification tests" task.
Once the generalization performance of the neural network has been tested, the neural network can be saved for future use in the so-called production mode.
We can diagnose a patient by running the "Calculate outputs" task. For that we need to edit the input variables through the corresponding dialog.
Then the diagnosis is written in the viewer.
The mathematical expression represented by the neural network is written below. It takes the inputs clump_thickness, cell_size_uniformity, cell_shape_uniformity, marginal_adhesion, single_epithelial_cell_size, bare_nuclei, bland_chromatin, normal_nucleoli and mitoses to produce the output diagnose. For classification problems, the information is propagated in a feed-forward fashion through the scaling layer, the perceptron layers and the probabilistic layer. This expression can be exported anywhere, for instance, to dedicated diagnosis software to be used by doctors.
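The first four tests follow directly from the confusion matrix counts. The sketch below uses made-up counts, since the individual matrix elements are not listed here, and it leaves out the two likelihood measures because their exact formula in the tool is not given. The last two lines use only the totals reported above (166 correct, 4 misclassified).

```python
def binary_classification_tests(true_pos, false_pos, false_neg, true_neg):
    """Accuracy, error rate, sensitivity and specificity from the four counts."""
    total = true_pos + false_pos + false_neg + true_neg
    return {
        "classification_accuracy": (true_pos + true_neg) / total,
        "error_rate": (false_pos + false_neg) / total,
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
    }


# Made-up counts, for illustration only.
print(binary_classification_tests(8, 1, 2, 9))

# Accuracy and error rate from the totals reported in the text.
correct, misclassified = 166, 4
print(correct / (correct + misclassified))
print(misclassified / (correct + misclassified))
```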