Classify Palmer penguins using machine learning

Environment

Alberto Berral
https://www.linkedin.com/in/aberralgonzalez/
August 31, 2023

This example builds a machine learning model to classify size measurements for adult foraging penguins near Palmer Station, Antarctica.

We used data on 11 variables obtained during penguin sampling in Antarctica.

This data is obtained from the palmerpenguins package.

This example is solved with Neural Designer. We recommend you follow it step by step using the free trial.

Application type.
Data set.
Neural network.
Training strategy.
Testing analysis.
Model deployment.

1. Application type

The predicted variable can have three values corresponding to a penguin species: Adelie, Gentoo, and Chinstrap. Therefore, this is a multiple classification project.

The goal of this example is to model the probability of each sample belonging to a penguin species.

2. Data set

Data source

The penguin_dataset.csv file contains the data for this example.

Target variables have three values in our classification model: Adelie (0), Gentoo (1), and Chinstrap (2).

The number of rows (instances) in the data set is 334, and the number of variables (columns) is 11.

Variables

The following list summarizes the variables’ information:

clutch_completion: a character string denoting if the study nest was observed with a full clutch, i.e., 2 eggs.
date_egg: a date denoting the date the study nest was observed with 1 egg (sampled).
culmen_length_mm: a number denoting the length of the dorsal ridge of a bird’s bill (millimeters).
culmen_depth_mm: a number denoting the depth of the dorsal ridge of a bird’s bill (millimeters).
flipper_length_mm: an integer denoting the length of the penguin’s flipper (millimeters).
body_mass_g: an integer denoting the penguin’s body mass (grams).
sex: a factor denoting penguin sex (female, male).
delta_15_N: a number denoting the measure of the ratio of stable isotopes 15N:14N.
delta_13_C: a number denoting the measure of the ratio of stable isotopes 13C:12C.

The number of input variables, or attributes for each sample, is 9.

The number of target variables is 1, species (adelie, gentoo, and chinstrap).

Instances

Each row contains the input and target variables of a different instance.

The dataset is divided into training, validation, and testing subsets.

By default, Neural Designer assigns 60% for training, 20% for validation, and 20% for testing, but you can adjust these percentages as needed.

Variables distributions

Also, we can calculate the distributions of all variables. The following pie chart shows the number of species we have.

The image shows the proportion of each penguin species: Adelie (44.18%), Gentoo (36,04%), and Chinstrap (19.76%).

Inputs-targets correlations

The input-target correlations might indicate which factors most differentiate between penguins and, therefore, be more relevant to our analysis.

Here, the most correlated variables with penguin species are date_egg, culmen_depth_mm, delta_13_C, and delta_15_N.

3. Neural network

The next step is to set a neural network as the classification function. Usually, the neural network is composed of:

Scaling layer

The scaling layer contains the inputs scaled from the data file and the method used for scaling. Here, the method selected is the minimum-maximum. As we use ten input variables, the scaling layer has ten inputs.

Dense layer

The dense layer applies the Softmax activation to interpret outputs as class probabilities.

It takes ten inputs and produces three outputs.

Each output gives the probability of a sample belonging to a class.

Neural network graph

The following figure represents the neural network:

The network takes ten inputs and produces three outputs, which represent the probability of a penguin belonging to a given class.

4. Training strategy

The fourth step is to set the training strategy, which is composed of two terms:

A loss index.
An optimization algorithm.

Loss index

The aim is to find a neural network that fits the data set (error term) and does not oscillate (regularization term).

The loss index is the normalized squared error with L2 regularization, the default loss index for classification applications.

Optimization algorithm

The optimization algorithm we use is the quasi-Newton method, the standard optimization algorithm for this type of problem.

Training

The following image shows how the error decreases with the iterations during the training process.

The curves have converged, as we can see in the previous image.

The final errors are 0.004 for training and 0.005 for validation.

However, the selection error is a bit higher than the training error.

6. Testing analysis

The objective of the testing analysis is to validate the generalization properties of the trained neural network.

The method to validate the performance of our model is to compare the predicted values to the real values, using a confusion matrix.

Confusion matrix

The rows represent the real classes in the confusion matrix, and the columns are the predicted classes for the testing data.

The following table contains the values of the confusion matrix.

	Predicted Adelie	Predicted Gentoo	Predicted Chinstrap
Real Adelie	22 (32.353%)	0	1 (1.471%)
Real Gentoo	0	31 (45.588%)	1 (1.471%)
Real Chinstrap	0	0	13 (19.118%)

As we can see, we can classify 66 (97.1%) of the samples, while we fail to do so for 2 (2.9%) samples

7. Model deployment

Once we have tested the neural network’s performance, we can save it for the future using the model deployment mode.

References

Artwork by @allison_horst
This data is obtained from the palmerpenguins package
Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081.
GitHub: Palmerpenguins