The aim of this example is to assess whether a lump in a breast could be malignant (cancerous) or benign (non-cancerous) from digitized images of a fine-needle aspiration biopsy.

The breast cancer database used here was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg.

This is a binary classification project, since the variable to be predicted has two values (malignant or benign tumor).

The goal here is to model the probability that a tumor is malignant, conditioned on the fine needle aspiration test features.

The breast_cancer.csv file contains the data for this application. In a binary classification project, the target variable can only take two values: 0 (false) or 1 (true). The number of instances (rows) in the data set is 683, and the number of variables (columns) is 10.

The number of input variables, or attributes, for each sample is 9. All input variables are numeric-valued and represent measurements from digitized images of a fine-needle aspiration biopsy. There is 1 target variable, which represents the absence or presence of cancer in an individual. The following list summarizes the variables:

- **clump_thickness**: (1-10). Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
- **cell_size_uniformity**: (1-10). Cancer cells tend to vary in size and shape. That is why these parameters are valuable in determining whether the cells are cancerous or not.
- **cell_shape_uniformity**: (1-10). As with cell size, cancer cells tend to vary in shape, which makes this parameter valuable in determining whether the cells are cancerous or not.
- **marginal_adhesion**: (1-10). Normal cells tend to stick together. Cancer cells tend to lose this ability, so loss of adhesion is a sign of malignancy.
- **single_epithelial_cell_size**: (1-10). It is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
- **bare_nuclei**: (1-10). This is a term used for nuclei that are not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumours.
- **bland_chromatin**: (1-10). Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be coarser.
- **normal_nucleoli**: (1-10). Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
- **mitoses**: (1-10). Cancer is essentially a disease of uncontrolled mitosis.
- **diagnose**: (0 or 1). Benign (non-cancerous) or malignant (cancerous) lump in a breast.

Finally, the use of each instance is set. Note that each instance contains the input and target variables of a different patient. The data set is divided into training, selection (validation) and testing subsets: 60% of the instances are assigned to training, 20% to selection and 20% to testing.
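The 60/20/20 division described above can be sketched as a random shuffle of the instance indices followed by slicing. This is only an illustration: the function name `split_indices`, the seed, and the rounding rule are assumptions, and the original tool may assign instances differently.

```python
import random

def split_indices(n_instances, train_frac=0.6, selection_frac=0.2, seed=0):
    """Shuffle instance indices and cut them into three disjoint subsets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # fixed seed, for reproducibility only
    n_train = round(train_frac * n_instances)
    n_selection = round(selection_frac * n_instances)
    training = indices[:n_train]
    selection = indices[n_train:n_train + n_selection]
    testing = indices[n_train + n_selection:]  # the remainder goes to testing
    return training, selection, testing

training, selection, testing = split_indices(683)
print(len(training), len(selection), len(testing))
```

Because 683 is not divisible by 5, the three subsets cannot match the nominal percentages exactly; here the testing subset simply receives whatever is left.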

Once the data set has been configured, we can run a few descriptive analyses to check the provided information and make sure that the data is of good quality.

We can calculate the data statistics and draw a table with the minimums, maximums, means and standard deviations of all variables in the data set. The next figure depicts those values.

All input variables range from 1 to 10. Note that the mean of every variable is less than 5, and that the input variable with the smallest standard deviation is "mitoses".
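As a rough illustration of how such a statistics table is computed, the sketch below summarizes a single column with the Python standard library. The `describe` helper and the ten sample values are hypothetical and are not taken from breast_cancer.csv.

```python
import statistics

def describe(values):
    """Return the four summary statistics shown in the data statistics table."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical toy column, NOT real data from the file.
mitoses_sample = [1, 1, 1, 2, 1, 1, 3, 1, 10, 1]
stats = describe(mitoses_sample)
print(stats)
```

Applying the same helper to every column of the real file would reproduce the table described above.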

We can also calculate the data distributions and draw a histogram for each variable. The following figures show the histograms with ten bins for two input variables, clump thickness and mitoses. The clump thickness values are well spread across the bins, but the mitoses histogram has most of its instances in the first bin.

The next chart shows the number of instances belonging to each class in the data set. The number of patients with a negative diagnosis is 444, and the number of patients with a positive diagnosis is 239.

The second step is to set a neural network to represent the classification function. For this class of applications, the neural network is composed of:

- Scaling layer.
- Perceptron layers.
- Probabilistic layer.

The scaling layer contains the statistics on the inputs calculated from the data file and the method for scaling the input variables. Here the minimum and maximum method has been set. Nevertheless, the mean and standard deviation method would produce very similar results.
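The two scaling methods mentioned above can be sketched as follows. The target range [-1, 1] for the minimum and maximum method is an assumption (some tools use [0, 1] instead), and both function names are illustrative.

```python
def minmax_scale(x, x_min, x_max):
    """Minimum and maximum method: map [x_min, x_max] linearly onto [-1, 1] (assumed range)."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def mean_std_scale(x, mean, std):
    """Mean and standard deviation method: centre on the mean, divide by the deviation."""
    return (x - mean) / std

# clump_thickness ranges from 1 to 10; the mean and standard deviation are the
# ones that appear in the exported expression of this example.
print(minmax_scale(7, 1, 10))
print(mean_std_scale(7, 4.44217, 2.82076))
```

Both methods are linear transformations of each input, which is why they tend to give similar training results.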

Two perceptron layers are used: a logistic hidden layer and a logistic output layer. Note that, since the logistic function ranges from 0 to 1, the outputs from that layer can be interpreted as probabilities. The neural network must have 9 inputs, since there are nine input variables, and 1 output, since there is one target variable. As an initial guess, we use 3 neurons in the hidden layer.

The probabilistic layer only contains the method for interpreting the outputs as probabilities. A method that forces the outputs to sum to 1 would always yield 1 here, since there is only one output. Moreover, as the activation function of the output layer is the logistic, that output can already be interpreted as a probability of class membership.

The next figure is a graphical representation of this neural network for breast cancer diagnosis.
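To make the 9-3-1 architecture concrete, here is a minimal forward pass through two logistic layers in plain Python. The random initialization and the helper names are assumptions made for illustration; they do not reproduce the trained network.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_logistic(inputs, weights, biases):
    """One perceptron layer: logistic(b_j + sum_i w_ji * x_i) for each neuron j."""
    return [logistic(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights and biases in [-1, 1] (illustrative initialization)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
hidden_w, hidden_b = init_layer(9, 3, rng)   # 9 inputs -> 3 hidden neurons
output_w, output_b = init_layer(3, 1, rng)   # 3 hidden neurons -> 1 output

n_parameters = (9 * 3 + 3) + (3 * 1 + 1)     # weights plus biases of both layers

scaled_inputs = [0.0] * 9                    # one already-scaled instance
hidden = dense_logistic(scaled_inputs, hidden_w, hidden_b)
probability = dense_logistic(hidden, output_w, output_b)[0]
print(n_parameters, probability)
```

With 3 hidden neurons the network has 34 adjustable parameters, and the single output always lies strictly between 0 and 1.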

The third step is to set the training strategy, which is composed of two terms:

- A loss index.
- An optimization algorithm.

The loss index is the weighted squared error with L2 regularization. This is the default loss index for binary classification applications.

The learning problem can be stated as finding a neural network which minimizes the loss index. That is, a neural network that fits the data set (error term) and that does not oscillate (regularization term).
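The loss index can be sketched as an error term plus a regularization term. The formulation below is one plausible reading, not the tool's exact definition: weighting positives by the ratio of negatives to positives, normalizing by the sum of weights, and the regularization strength of 0.01 are all assumptions.

```python
def weighted_squared_error(outputs, targets, positives_weight, negatives_weight=1.0):
    """Squared error in which each instance is weighted according to its class."""
    total, norm = 0.0, 0.0
    for y, t in zip(outputs, targets):
        w = positives_weight if t == 1 else negatives_weight
        total += w * (y - t) ** 2
        norm += w
    return total / norm  # assumed normalization: sum of the instance weights

def l2_penalty(parameters, strength):
    """L2 regularization: penalizes large weights, discouraging oscillating fits."""
    return strength * sum(p * p for p in parameters)

def loss_index(outputs, targets, parameters, positives_weight, strength=0.01):
    return (weighted_squared_error(outputs, targets, positives_weight)
            + l2_penalty(parameters, strength))

# Assumed class weighting: 444 negatives and 239 positives in the data set.
positives_weight = 444 / 239
print(loss_index([0.9, 0.2], [1, 0], [0.5, -0.5], positives_weight))
```

Weighting the minority class more heavily keeps the 444 negative instances from dominating the error term.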

The optimization algorithm that we use is the quasi-Newton method. This is also the standard optimization algorithm for this type of problem.

The following chart shows how the error decreases with the iterations during the training process.
The final training and selection errors are **training error = 0.054 WSE** and **selection error = 0.072 WSE**, respectively.

The objective of model selection is to find the network architecture with best generalization properties, that is, that which minimizes the error on the selection instances of the data set.

More specifically, we want to find a neural network with a selection error less than **0.072 WSE**,
which is the value that we have achieved so far.

Order selection algorithms train several network architectures with different numbers of neurons and select the one with the smallest selection error.

The incremental order method starts with a small number of neurons and increases the complexity at each iteration. The following chart shows the training error (blue) and the selection error (orange) as a function of the number of neurons.
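The incremental order method amounts to a loop over hidden-layer sizes that keeps the size with the lowest selection error. In this sketch, `train_and_evaluate` is a hypothetical callable supplied by the caller: it trains a network with the given number of neurons and returns its training and selection errors. The bounds of 1 and 10 neurons are also assumptions.

```python
def incremental_order(train_and_evaluate, min_neurons=1, max_neurons=10):
    """Try increasing hidden-layer sizes and keep the one with the lowest selection error."""
    best_neurons, best_error = None, float("inf")
    history = []
    for neurons in range(min_neurons, max_neurons + 1):
        training_error, selection_error = train_and_evaluate(neurons)
        history.append((neurons, training_error, selection_error))
        if selection_error < best_error:  # strict: ties keep the smaller network
            best_neurons, best_error = neurons, selection_error
    return best_neurons, best_error, history
```

The returned `history` holds the two curves plotted in the chart, one point per number of neurons. Keeping the smaller network on ties favours the simpler architecture.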

The objective of testing analysis is to validate the generalization performance of the trained neural network. To validate a classification technique we need to compare the values provided by this technique to the actually observed values.

The following table contains the elements of the confusion matrix for the variable diagnose. Reading row by row, it contains the true positives, false positives, false negatives, and true negatives, respectively.

| | Real positive | Real negative |
|---|---|---|
| Predicted positive | 129 | 3 |
| Predicted negative | 1 | 37 |

The binary classification tests are parameters for measuring the performance of a classifier on a problem with two classes:

- **Classification accuracy** (ratio of instances correctly classified): 97.6%
- **Error rate** (ratio of instances misclassified): 2.4%
- **Sensitivity** (ratio of real positives which are predicted positive): 99.2%
- **Specificity** (ratio of real negatives which are predicted negative): 92.5%
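These four figures follow directly from the confusion matrix counts. Taking them in the order stated above (129 true positives, 3 false positives, 1 false negative, 37 true negatives), a short check reproduces them. The function name is illustrative.

```python
def binary_classification_tests(tp, fp, fn, tn):
    """Compute the standard binary classification tests from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),   # real positives predicted positive
        "specificity": tn / (tn + fp),   # real negatives predicted negative
    }

tests = binary_classification_tests(tp=129, fp=3, fn=1, tn=37)
for name, value in tests.items():
    print(f"{name}: {value:.1%}")
# accuracy: 97.6%
# error_rate: 2.4%
# sensitivity: 99.2%
# specificity: 92.5%
```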

Once the generalization performance of the neural network has been tested, the neural network can be saved for future use in the so-called model deployment mode.

We can diagnose new patients by calculating the neural network outputs. For that we need to know the values of their input variables:

- **clump_thickness** (1-10)
- **cell_size_uniformity** (1-10)
- **cell_shape_uniformity** (1-10)
- **marginal_adhesion** (1-10)
- **single_epithelial_cell_size** (1-10)
- **bare_nuclei** (1-10)
- **bland_chromatin** (1-10)
- **normal_nucleoli** (1-10)
- **mitoses** (1-10)

From these values, the neural network computes the output **diagnose**.

The mathematical expression represented by the neural network is written below. It takes the inputs clump_thickness, cell_size_uniformity, cell_shape_uniformity, marginal_adhesion, single_epithelial_cell_size, bare_nuclei, bland_chromatin, normal_nucleoli and mitoses to produce the output diagnose. For classification problems, the information is propagated in a feed-forward fashion through the scaling layer, the perceptron layers and the probabilistic layer.

    scaled_clump_thickness = (clump_thickness-4.44217)/2.82076;
    scaled_cell_size_uniformity = (cell_size_uniformity-3.15081)/3.06514;
    scaled_cell_shape_uniformity = (cell_shape_uniformity-3.21523)/2.98858;
    scaled_marginal_adhesion = (marginal_adhesion-2.83016)/2.86456;
    scaled_single_epithelial_cell_size = (single_epithelial_cell_size-3.23426)/2.22309;
    scaled_bare_nuclei = (bare_nuclei-3.54466)/3.64386;
    scaled_bland_chromatin = (bland_chromatin-3.4451)/2.4497;
    scaled_normal_nucleoli = (normal_nucleoli-2.86969)/3.05267;
    scaled_mitoses = (mitoses-1.60322)/1.73267;
    y_1_1 = Logistic (-1.35621
        + (scaled_clump_thickness*-2.54409)
        + (scaled_cell_size_uniformity*-5.01572)
        + (scaled_cell_shape_uniformity*-3.39576)
        + (scaled_marginal_adhesion*-0.278873)
        + (scaled_single_epithelial_cell_size*-2.61646)
        + (scaled_bare_nuclei*-5.51018)
        + (scaled_bland_chromatin*-0.979982)
        + (scaled_normal_nucleoli*-1.71412)
        + (scaled_mitoses*0.410197));
    non_probabilistic_diagnose = Logistic (3.94959 + (y_1_1*-9.14654));
    diagnose = Probability(non_probabilistic_diagnose);
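As an illustration of how this expression could be exported, the sketch below transcribes it into Python with the same coefficients. Two assumptions are made: `Logistic` is the standard logistic function 1/(1+e^(-z)), and `Probability` leaves a single output unchanged. The function and constant names are illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# For each input: (mean, standard deviation, hidden-neuron weight), from the expression.
PARAMS = {
    "clump_thickness": (4.44217, 2.82076, -2.54409),
    "cell_size_uniformity": (3.15081, 3.06514, -5.01572),
    "cell_shape_uniformity": (3.21523, 2.98858, -3.39576),
    "marginal_adhesion": (2.83016, 2.86456, -0.278873),
    "single_epithelial_cell_size": (3.23426, 2.22309, -2.61646),
    "bare_nuclei": (3.54466, 3.64386, -5.51018),
    "bland_chromatin": (3.4451, 2.4497, -0.979982),
    "normal_nucleoli": (2.86969, 3.05267, -1.71412),
    "mitoses": (1.60322, 1.73267, 0.410197),
}
HIDDEN_BIAS = -1.35621
OUTPUT_BIAS = 3.94959
OUTPUT_WEIGHT = -9.14654

def diagnose(**inputs):
    """Return the network output for one patient (inputs are scores from 1 to 10)."""
    z = HIDDEN_BIAS
    for name, (mean, std, weight) in PARAMS.items():
        z += weight * (inputs[name] - mean) / std   # scaling, then weighted sum
    y_1_1 = logistic(z)
    # Probability() is assumed to be the identity for a single output.
    return logistic(OUTPUT_BIAS + OUTPUT_WEIGHT * y_1_1)

lowest = diagnose(**{name: 1 for name in PARAMS})
highest = diagnose(**{name: 10 for name in PARAMS})
print(lowest, highest)
```

Setting every input to 1 gives an output close to 0, and setting every input to 10 gives an output close to 1. That matches the interpretation of the output as the probability of malignancy.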

The above expression can be exported anywhere, for instance to dedicated diagnosis software to be used by doctors.

- The data for this problem has been taken from the UCI Machine Learning Repository.
- Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193-9196.
- Zhang, J. (1992). Selecting typical instances in instance-based learning. In Proceedings of the Ninth International Machine Learning Conference (pp. 470-479). Aberdeen, Scotland: Morgan Kaufmann.