Iris flowers classification
By Roberto Lopez, Artelnics.
This is perhaps the best known data set to be found in the classification literature. The aim is to classify iris flowers among three species (setosa, versicolor or virginica) from measurements of length and width of sepals and petals.
The iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The central goal here is to design a model which makes good classifications for new data, in other words one which exhibits good generalization. The next figure is a picture of an iris flower of the versicolor species.
The first step is to prepare the data set, which is the source of information for the classification problem.
The file irisflowers.csv contains the data for this example in comma-separated values (CSV) format. A sample of the contents of that file is listed below.
Once the data file is ready, we import it into Neural Designer using the "Import data file" wizard.
The next figure shows the data set page in Neural Designer.
It contains four sections:
- Data file.
- Missing values.
Neural Designer shows a preview of the data file and says that the number of columns is 5 and the number of rows is 150.
The variables are:
- sepal_length: Sepal length, in centimeters, used as input.
- sepal_width: Sepal width, in centimeters, used as input.
- petal_length: Petal length, in centimeters, used as input.
- petal_width: Petal width, in centimeters, used as input.
- setosa: Iris setosa, true or false, used as target.
- versicolour: Iris versicolour, true or false, used as target.
- virginica: Iris virginica, true or false, used as target.
The three target variables encode the species as follows:
- Setosa: 1 0 0.
- Versicolor: 0 1 0.
- Virginica: 0 0 1.
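The encoding above can be sketched in a few lines of Python. This is only an illustration of the mapping, not code taken from Neural Designer; the names SPECIES, one_hot and decode are hypothetical.

```python
# Species order follows the listing above: setosa, versicolor, virginica.
SPECIES = ("setosa", "versicolor", "virginica")


def one_hot(species):
    """Return the three target values (setosa, versicolor, virginica) for one flower."""
    if species not in SPECIES:
        raise ValueError(f"unknown species: {species!r}")
    return [1 if species == name else 0 for name in SPECIES]


def decode(targets):
    """Inverse mapping: return the species whose target value is 1."""
    return SPECIES[targets.index(1)]


print(one_hot("versicolor"))  # [0, 1, 0]
print(decode([0, 0, 1]))      # virginica
```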
The instances are divided into training, selection and testing subsets. They represent 60% (90), 20% (30) and 20% (30) of the original instances, respectively, and have been split at random.
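A random 60/20/20 split of the 150 instances can be sketched as follows. The function name split_indices and the fixed seed are assumptions made for reproducibility of the example; Neural Designer's own splitting procedure is not described here.

```python
import random


def split_indices(n_instances, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle instance indices and cut them into training, selection and testing subsets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_training = round(fractions[0] * n_instances)
    n_selection = round(fractions[1] * n_instances)
    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]  # the remainder, so no instance is lost
    return training, selection, testing


training, selection, testing = split_indices(150)
print(len(training), len(selection), len(testing))  # 90 30 30
```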
The "Report data set" task transfers to Neural Viewer the information contained in the Data set page of Neural Editor.
The "Calculate data statistics" task draws a table with the minimums, maximums, means and standard deviations of all variables in the data set. The next figure shows the data statistics.
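For readers who want to reproduce such a table outside the tool, the four statistics of one column can be computed with the Python standard library. This is a minimal sketch: column_statistics is a hypothetical helper, the sample values are made up, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which variant Neural Designer reports.

```python
import statistics


def column_statistics(values):
    """Minimum, maximum, mean and standard deviation of one data set column."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        # Sample (n - 1) standard deviation; this choice is an assumption.
        "standard_deviation": statistics.stdev(values),
    }


example = column_statistics([1.0, 2.0, 3.0, 4.0])  # made-up values, not iris data
print(example["minimum"], example["maximum"], example["mean"])  # 1.0 4.0 2.5
```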
Finally, the "Calculate data histograms" task draws a histogram for each variable to see how they are distributed. The user must specify here the number of bins for all the histograms. The next figure is the histogram for the first attribute.
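The role of the number of bins can be illustrated with a small equal-width histogram routine. This is a sketch under the assumption that the bins span the range from the minimum to the maximum of the variable; the function name and the input values are hypothetical.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for value in values:
        if width == 0:
            index = 0  # all values are identical, so they share the first bin
        else:
            # The maximum lies on the right edge, so clamp it into the last bin.
            index = min(int((value - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts


print(histogram([0, 1, 2, 3, 4, 5, 6, 7, 8], 3))  # [3, 3, 3]
```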
The second step is to choose a network architecture to represent the classification function. For classification problems, it is composed of:
- Scaling layer.
- Neural network.
- Probabilistic layer.
Note that all settings on the neural network page are left at their default values for this example.
The inputs section contains information about the input variables in the neural network.
- Sepal length, in centimeters.
- Sepal width, in centimeters.
- Petal length, in centimeters.
- Petal width, in centimeters.
The scaling layer section contains information about the method for scaling the input variables and the statistic values to be used by that method. In this example, we will use the minimum and maximum method for scaling the inputs. The mean and standard deviation would also be appropriate here.
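The minimum and maximum method is a linear map from the observed range of a variable onto a fixed interval. The sketch below assumes a target interval of [-1, 1]; the text does not state which interval Neural Designer uses, so low and high are left as parameters. The function name is hypothetical.

```python
def minimum_maximum_scale(value, minimum, maximum, low=-1.0, high=1.0):
    """Map value linearly from [minimum, maximum] onto [low, high].

    The default interval [-1, 1] is an assumption, not taken from the text.
    """
    if maximum == minimum:
        return (low + high) / 2.0  # constant variable: place it mid-interval
    return low + (high - low) * (value - minimum) / (maximum - minimum)


print(minimum_maximum_scale(5.0, 0.0, 10.0))            # 0.0
print(minimum_maximum_scale(5.0, 0.0, 10.0, 0.0, 1.0))  # 0.5
```

The minimum and maximum passed in would come from the data statistics of each input variable.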
Since the number of input variables is only 4, we won't apply principal components in this application.
The neural network must have four inputs, since there are four input variables; and three output neurons, since there are three target variables. We use one hidden layer of size five. This neural network can be denoted as 4:5:3. All the activation functions have been set to logistic. All the parameters are initialized at random with a normal distribution of mean 0 and standard deviation 1.
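To make the 4:5:3 architecture concrete, here is a small sketch of its two layers with logistic activations and parameters drawn from a normal distribution of mean 0 and standard deviation 1. The helper names and the all-zero input vector are hypothetical; the softmax of the probabilistic layer is not included here.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_inputs, n_neurons, rng):
    """One bias and n_inputs synaptic weights per neuron, drawn from N(0, 1)."""
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_inputs)] for _ in range(n_neurons)]
    return biases, weights


def layer_outputs(inputs, biases, weights):
    """Logistic activation of bias plus weighted sum, for every neuron in the layer."""
    return [logistic(bias + sum(w * x for w, x in zip(row, inputs)))
            for bias, row in zip(biases, weights)]


rng = random.Random(0)           # fixed seed only to make the sketch repeatable
hidden = init_layer(4, 5, rng)   # 4 inputs -> 5 hidden neurons
output = init_layer(5, 3, rng)   # 5 hidden neurons -> 3 output neurons

n_parameters = sum(len(b) + sum(len(row) for row in w) for b, w in (hidden, output))
print(n_parameters)  # 43 = (4*5 + 5) + (5*3 + 3)

scaled_inputs = [0.0, 0.0, 0.0, 0.0]  # hypothetical, already scaled inputs
activations = layer_outputs(layer_outputs(scaled_inputs, *hidden), *output)
```

Counting biases and weights this way gives the 43 parameters that the training algorithm has to adjust.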
The probabilistic layer allows the outputs to be interpreted as probabilities, i.e., all outputs are between 0 and 1 and their sum is 1. The probabilistic method to be used is the softmax.
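The softmax method itself is short enough to write out. The version below is a generic sketch (the function name is ours); subtracting the largest value before exponentiating is a common numerical safeguard and does not change the result.

```python
import math


def softmax(values):
    """Turn arbitrary real values into probabilities that sum to 1."""
    shift = max(values)  # avoids overflow in exp without changing the result
    exponentials = [math.exp(v - shift) for v in values]
    total = sum(exponentials)
    return [e / total for e in exponentials]


probabilities = softmax([0.2, 0.1, 0.9])  # hypothetical outputs of the last layer
print(round(sum(probabilities), 12))       # 1.0
```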
The outputs from this neural network are:
- iris_setosa, probability.
- iris_versicolour, probability.
- iris_virginica, probability.
The next figure is a graphical representation of the neural network for the iris flowers classification example, taken from the "Report neural network" task.
This neural network defines a function of the form
[setosa, versicolor, virginica] = function(sepal_length, sepal_width, petal_length, petal_width)
The function above is parameterized by all the biases and synaptic weights in the neural network.
The third step is to set the loss index, which is composed of:
- Error term.
- Regularization term.
The error term chosen for this application is the normalized squared error.
The regularization term is the neural parameters norm, with a weight of 0.001. Regularization has two effects here: (i) it makes the model stable, without oscillations, and (ii) it avoids saturation of the logistic activation functions.
The learning problem can be stated as finding a neural network which minimizes the loss index, i.e., a neural network that fits the data set (objective) and that does not oscillate (regularization).
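The loss index described above can be sketched as the sum of the error term and the weighted regularization term. Two assumptions are made here and should be checked against the tool's documentation: the normalized squared error is taken as the sum of squared errors divided by the sum of squared deviations of the targets from their mean, and the parameters norm is taken as the Euclidean norm. All function names are hypothetical.

```python
import math


def normalized_squared_error(outputs, targets):
    """Sum of squared errors divided by the targets' squared deviation from their mean.

    This normalization is an assumption about how the error term is defined.
    """
    n_targets = len(targets[0])
    means = [sum(row[j] for row in targets) / len(targets) for j in range(n_targets)]
    squared_error = sum((o - t) ** 2
                        for out, tar in zip(outputs, targets)
                        for o, t in zip(out, tar))
    normalization = sum((t - m) ** 2 for tar in targets for t, m in zip(tar, means))
    return squared_error / normalization


def parameters_norm(parameters):
    """Euclidean norm of all biases and synaptic weights (assumed definition)."""
    return math.sqrt(sum(p * p for p in parameters))


def loss_index(outputs, targets, parameters, regularization_weight=0.001):
    return (normalized_squared_error(outputs, targets)
            + regularization_weight * parameters_norm(parameters))


targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
perfect_fit = loss_index(targets, targets, [3.0, 4.0])  # error term is 0, norm is 5
```

Under this definition a model that always outputs the mean of the targets has an error of 1, so values well below 1 indicate that the network has learned something from the inputs.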
The next step in solving this problem is to assign the training strategy.
The next figure shows the training strategy page in Neural Designer.
The quasi-Newton method is applied as the main training algorithm.
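To give an idea of what a quasi-Newton method does, the sketch below implements the BFGS update with a simple backtracking line search and applies it to a small quadratic function instead of the neural network loss. BFGS is one common quasi-Newton variant; the text does not say which update formula or line search Neural Designer uses, so treat every detail here as an assumption.

```python
def bfgs_minimize(f, grad, x0, max_iterations=100, tolerance=1e-8):
    """Minimize f using gradients only, building an estimate H of the inverse Hessian."""
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iterations):
        if sum(gi * gi for gi in g) ** 0.5 < tolerance:
            break
        # Training direction d = -H g.
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        # Backtracking line search: halve the step until the loss decreases enough.
        step, fx = 1.0, f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while (f([xi + step * di for xi, di in zip(x, d)]) > fx + 1e-4 * step * slope
               and step > 1e-12):
            step *= 0.5
        x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # curvature condition; otherwise keep the previous estimate
            rho = 1.0 / sy
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x


# Toy problem with its minimum at (1, -2); it stands in for the loss index.
def toy_loss(p):
    return (p[0] - 1.0) ** 2 + 10.0 * (p[1] + 2.0) ** 2


def toy_gradient(p):
    return [2.0 * (p[0] - 1.0), 20.0 * (p[1] + 2.0)]


solution = bfgs_minimize(toy_loss, toy_gradient, [0.0, 0.0])
```

The method never computes second derivatives: the inverse Hessian estimate is updated from successive changes in the parameters and in the gradient.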
The following chart shows how the performance decreases with the iterations during the training process. The initial value is 1.21313, and the final value after 94 iterations is 0.0376633.
The following table shows the training results for the problem considered here. The final parameters norm is not very big, the final performance and the final generalization performance are both small, and the final gradient norm is almost zero. The table also lists the number of epochs and the training time on a PC.
The last step is to test the generalization performance of the trained neural network. Here we compare the values provided by this technique to the actually observed values.
Since the testing analysis does not depend on any parameter, there is no page in Neural Designer for that component.
In the confusion matrix the rows represent the target classes and the columns the output classes for the testing instances. The diagonal cells show the number of cases that were correctly classified, and the off-diagonal cells show the misclassified cases. The next table shows the confusion elements for this application. The number of correctly classified instances is 28, and the number of misclassified instances is 2. In particular, the neural network has said that two flowers are virginica when they are actually versicolor. Also, note that the confusion matrix depends on the particular testing instances that we have.
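A confusion matrix with this layout can be built as sketched below. Only the totals (28 correct, 2 versicolor flowers classified as virginica) come from the text; the split of the 30 testing instances into 10 per species is invented purely for illustration, and the names used are hypothetical.

```python
CLASSES = ("setosa", "versicolor", "virginica")


def confusion_matrix(targets, outputs):
    """Rows are target classes, columns are output (predicted) classes."""
    index = {name: i for i, name in enumerate(CLASSES)}
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for target, output in zip(targets, outputs):
        matrix[index[target]][index[output]] += 1
    return matrix


# Hypothetical testing subset: 10 flowers per species (an assumption).
targets = ["setosa"] * 10 + ["versicolor"] * 10 + ["virginica"] * 10
outputs = (["setosa"] * 10
           + ["versicolor"] * 8 + ["virginica"] * 2  # two versicolor taken for virginica
           + ["virginica"] * 10)

matrix = confusion_matrix(targets, outputs)
correct = sum(matrix[i][i] for i in range(len(CLASSES)))
misclassified = len(targets) - correct
print(matrix)                  # [[10, 0, 0], [0, 8, 2], [0, 0, 10]]
print(correct, misclassified)  # 28 2
```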
The neural network is now ready to predict outputs for inputs that it has never seen.
The "Calculate outputs" task will classify a given iris flower from the lengths and widths of its sepals and petals. The next figure shows the dialog where the user types the input values.
The results from that task are written in the viewer. For this particular case, the neural network would classify that flower as being of the virginica species with 55% probability. The probability of being setosa is 22%, and the probability of being versicolor is 23%.
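The final decision is simply the species with the largest probability. Using the three probabilities quoted above, this can be written as follows (the variable names are ours):

```python
# Probabilities reported for this particular flower.
probabilities = {"setosa": 0.22, "versicolor": 0.23, "virginica": 0.55}

# The predicted species is the one with the highest probability.
predicted = max(probabilities, key=probabilities.get)
print(predicted)  # virginica
```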
The "Write expression" task exports to the report the mathematical expression of the trained and tested neural network. That expression is listed below.
The data for this problem has been taken from the UCI Machine Learning Repository.