
Wine quality improvement

By Sergio Sanchez, Artelnics.

The objective of this study is to predict human wine taste preferences. Such a model can support oenologists' wine tasting evaluations and help improve wine production. Furthermore, similar techniques can help in target marketing by modeling consumer tastes from niche markets.

Fixed Acidity is a fundamental property of wine, imparting sourness and resistance to microbial infection. Volatile acidity refers to the steam distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids.

The variables used in this study are not related to grape type, wine brand or selling price; they are only related to physicochemical tests. The model outputs a score between 0 and 10, which defines the wine quality.

Wine tasting.

Contents:

  1. Data set
  2. Automatic model selection
  3. Results
  4. Production

1. Data set

The data file contains a total of 1599 rows and 12 columns. The first row contains the names of the variables, and the remaining rows represent the instances. The instances are divided at random into training, selection and testing subsets, containing 60%, 20% and 20% of the instances, respectively. The following image contains a description of the variables, obtained by using the task "Report data set".

Wine quality improvement variables.
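
The random 60%/20%/20% split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure Neural Designer runs internally; the function name `split_indices`, its parameters and the fixed seed are assumptions made for this example.

```python
import random

def split_indices(n_instances, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle instance indices and cut them into training, selection and testing subsets.

    Hypothetical helper: the name, signature and seed are illustrative assumptions.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # reproducible random order

    n_training = round(fractions[0] * n_instances)
    n_selection = round(fractions[1] * n_instances)

    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]  # whatever remains
    return training, selection, testing

# Example with a round number of instances
training, selection, testing = split_indices(1000)
print(len(training), len(selection), len(testing))
```

Every instance lands in exactly one subset, so the testing subset stays independent of the data used for training and for model selection.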

As we can see in the next figure, the data set is not well balanced: there are many quality scores around 6 and far fewer near 0 or 10.

Not balanced quality score histogram.

As we can see, our target variable does not follow a normal distribution. Therefore, we select the minimum-maximum method for unscaling.
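
As a minimal sketch, the minimum-maximum method maps a variable linearly between its observed range and the interval [-1, 1], which is the form that appears in the mathematical expression of the Production section. The function names below are assumptions chosen for this example.

```python
def scale_min_max(value, minimum, maximum):
    """Map a value from [minimum, maximum] to [-1, 1]."""
    return 2.0 * (value - minimum) / (maximum - minimum) - 1.0

def unscale_min_max(scaled, minimum, maximum):
    """Inverse mapping: take a value in [-1, 1] back to [minimum, maximum]."""
    return 0.5 * (scaled + 1.0) * (maximum - minimum) + minimum

# The Production listing unscales the quality with minimum 3 and maximum 8.
scaled = scale_min_max(6, 3, 8)
print(scaled, unscale_min_max(scaled, 3, 8))
```

Unlike mean and standard deviation scaling, this mapping does not assume a bell-shaped distribution; it depends only on the extreme values of the variable.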

2. Automatic model selection

The next step is to set the model selection mode. The loss index page and the training page are not used in this example because model selection sets them automatically.

Some data sets contain redundant inputs, which degrade the performance of the neural network. Inputs selection is used to find the optimal subset of inputs for the best performance of the model.

In this example, the inputs selection algorithm chosen is the genetic algorithm, with a population size of 100 individuals in each generation. The remaining parameters take their default values.
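
To make the idea concrete, the following is a generic sketch of genetic inputs selection, not Neural Designer's implementation, whose operators and default parameters are not described here. Each individual is a mask saying which inputs are used. The `loss` callback is hypothetical: in practice it would train a network on the chosen inputs and return its selection loss. The tournament selection, uniform crossover and `mutation_rate` are assumptions made for this example.

```python
import random

def genetic_inputs_selection(n_inputs, loss, population_size=100,
                             generations=100, mutation_rate=0.05, seed=0):
    """Search for the input mask with the lowest loss (illustrative sketch only)."""
    rng = random.Random(seed)

    def repair(mask):
        # Make sure at least one input is always selected.
        if not any(mask):
            mask[rng.randrange(n_inputs)] = True
        return mask

    population = [repair([rng.random() < 0.5 for _ in range(n_inputs)])
                  for _ in range(population_size)]
    best = min(population, key=loss)

    for _ in range(generations):
        offspring = []
        while len(offspring) < population_size:
            # Tournament selection: the better of two random individuals.
            parent_a = min(rng.sample(population, 2), key=loss)
            parent_b = min(rng.sample(population, 2), key=loss)
            # Uniform crossover: each gene comes from either parent.
            child = [a if rng.random() < 0.5 else b
                     for a, b in zip(parent_a, parent_b)]
            # Bit-flip mutation.
            child = [(not gene) if rng.random() < mutation_rate else gene
                     for gene in child]
            offspring.append(repair(child))
        population = offspring
        candidate = min(population, key=loss)
        if loss(candidate) < loss(best):
            best = candidate  # keep the best mask seen so far
    return best

# Toy loss standing in for the selection loss: distance to a made-up "ideal" mask.
ideal = [True, False, True, True, False, False, True, False, True, True, False]
toy_loss = lambda mask: sum(m != t for m, t in zip(mask, ideal))
best_mask = genetic_inputs_selection(11, toy_loss)
print(best_mask, toy_loss(best_mask))
```

With a real selection loss every evaluation requires training a network, so the population size and the number of generations dominate the running time.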

The next chart shows the performance history for the different subsets during the genetic algorithm inputs selection process. The blue line represents the training loss; its initial value is 0.580421, and its final value after 100 generations is 0.548342. The red line represents the selection loss; its initial value is 0.633424, and its final value after 100 generations is 0.632917.

Inputs selection mean history

The next chart shows the history of the mean of the selection losses in each generation during the genetic algorithm inputs selection process. The initial value is 0.76545, and the final value after 100 generations is 0.634.

Finally, Neural Designer shows the final architecture of the neural network, which is depicted in the next figure.

Inputs selection architecture

3. Results

A standard method for testing the prediction capabilities is to compare the outputs from the neural network against an independent set of data. The linear regression analysis, performed by the task "Perform linear regression analysis", yields 3 parameters for each output: intercept, slope and correlation.

Regression parameters.

For a perfect prediction the intercept would be 0 and the slope would be 1. If the correlation is equal to 1, then there is perfect correlation between the outputs from the neural network and the targets in the testing subset. In this case, the parameters show good results.
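
The three parameters can be computed with an ordinary least-squares fit of the network outputs against the targets. The sketch below assumes that the outputs are regressed on the targets; the function name and the sample values are illustrative and are not results from this study.

```python
import math

def linear_regression_parameters(targets, outputs):
    """Return (intercept, slope, correlation) of outputs regressed on targets."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n

    # Sums of squares and cross products around the means.
    s_tt = sum((t - mean_t) ** 2 for t in targets)
    s_oo = sum((o - mean_o) ** 2 for o in outputs)
    s_to = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))

    slope = s_to / s_tt
    intercept = mean_o - slope * mean_t
    correlation = s_to / math.sqrt(s_tt * s_oo)  # Pearson correlation
    return intercept, slope, correlation

# Made-up targets and outputs, for illustration only.
targets = [5, 5, 6, 6, 7, 4]
outputs = [5.2, 5.4, 5.9, 6.1, 6.5, 4.8]
print(linear_regression_parameters(targets, outputs))
```

When the outputs equal the targets exactly, the function returns an intercept of 0, a slope of 1 and a correlation of 1, which is the perfect-prediction case described above.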

4. Production

Once the model is obtained, Neural Designer provides the user with its mathematical expression. The next listing shows that result.

				scaled_Fixed acidity=2*(Fixed acidity-4.6)/(15.9-4.6)-1;
				scaled_Volatile acidity=2*(Volatile acidity-0.12)/(1.58-0.12)-1;
				scaled_Citric acid=2*(Citric acid-0)/(1-0)-1;
				scaled_Residual sugar=2*(Residual sugar-0.9)/(15.5-0.9)-1;
				scaled_Chlorides=2*(Chlorides-0.012)/(0.611-0.012)-1;
				scaled_Free sulfur dioxide=2*(Free sulfur dioxide-1)/(72-1)-1;
				scaled_Total sulfur dioxide=2*(Total sulfur dioxide-6)/(289-6)-1;
				scaled_Density=2*(Density-0.99007)/(1.00369-0.99007)-1;
				scaled_pH=2*(pH-2.74)/(4.01-2.74)-1;
				scaled_Sulphates=2*(Sulphates-0.33)/(2-0.33)-1;
				scaled_Alcohol=2*(Alcohol-8.4)/(14.9-8.4)-1;

				y_1_1=tanh(-5.82479
				+0.110015*scaled_Fixed acidity+1.53483*scaled_Volatile acidity-2.76848*scaled_Citric acid+4.28537*scaled_Residual sugar-2.79891*scaled_Chlorides
				-1.46431*scaled_Free sulfur dioxide+3.02312*scaled_Total sulfur dioxide+0.931677*scaled_Density+1.39366*scaled_pH
				-1.82487*scaled_Sulphates-3.29735*scaled_Alcohol);
				y_1_2=tanh(0.945211
				+1.87145*scaled_Fixed acidity-6.13372*scaled_Volatile acidity+4.72878*scaled_Citric acid-0.339682*scaled_Residual sugar-2.6105*scaled_Chlorides
				-1.23659*scaled_Free sulfur dioxide-1.2991*scaled_Total sulfur dioxide-3.69043*scaled_Density+4.66422*scaled_pH
				+3.38811*scaled_Sulphates+10.0286*scaled_Alcohol);
				y_1_3=tanh(9.49629
				-0.599709*scaled_Fixed acidity-4.45104*scaled_Volatile acidity+3.34046*scaled_Citric acid-2.91677*scaled_Residual sugar+9.20604*scaled_Chlorides
				-0.320459*scaled_Free sulfur dioxide+1.39114*scaled_Total sulfur dioxide+1.50174*scaled_Density-6.05182*scaled_pH
				+3.9604*scaled_Sulphates+9.21738*scaled_Alcohol);
				scaled_Quality=(0.446394
				+0.444612*y_1_1+0.370473*y_1_2+0.278908*y_1_3);
				Quality=0.5*(scaled_Quality+1.0)*(8-3)+3;
				

The formula above can be exported to the software tool required by the customer.
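
As one possible export, the expression can be transcribed into Python. The ranges, biases and weights below are copied from the listing; the data layout, the function name `predict_quality` and the sample wine are assumptions made for this sketch.

```python
import math

# (input name, minimum, maximum), copied from the scaling lines of the listing.
INPUT_RANGES = [
    ("Fixed acidity", 4.6, 15.9),
    ("Volatile acidity", 0.12, 1.58),
    ("Citric acid", 0.0, 1.0),
    ("Residual sugar", 0.9, 15.5),
    ("Chlorides", 0.012, 0.611),
    ("Free sulfur dioxide", 1.0, 72.0),
    ("Total sulfur dioxide", 6.0, 289.0),
    ("Density", 0.99007, 1.00369),
    ("pH", 2.74, 4.01),
    ("Sulphates", 0.33, 2.0),
    ("Alcohol", 8.4, 14.9),
]

# (bias, weights in the order of INPUT_RANGES) for y_1_1, y_1_2 and y_1_3.
HIDDEN_NEURONS = [
    (-5.82479, [0.110015, 1.53483, -2.76848, 4.28537, -2.79891, -1.46431,
                3.02312, 0.931677, 1.39366, -1.82487, -3.29735]),
    (0.945211, [1.87145, -6.13372, 4.72878, -0.339682, -2.6105, -1.23659,
                -1.2991, -3.69043, 4.66422, 3.38811, 10.0286]),
    (9.49629, [-0.599709, -4.45104, 3.34046, -2.91677, 9.20604, -0.320459,
               1.39114, 1.50174, -6.05182, 3.9604, 9.21738]),
]
OUTPUT_BIAS = 0.446394
OUTPUT_WEIGHTS = [0.444612, 0.370473, 0.278908]

def predict_quality(wine):
    """Evaluate the exported expression for a dict of physicochemical values."""
    # Minimum-maximum scaling of every input to [-1, 1].
    scaled = [2.0 * (wine[name] - lo) / (hi - lo) - 1.0
              for name, lo, hi in INPUT_RANGES]
    # Hidden layer with hyperbolic tangent activations.
    hidden = [math.tanh(bias + sum(w * x for w, x in zip(weights, scaled)))
              for bias, weights in HIDDEN_NEURONS]
    # Linear output layer, then unscaling to the quality range 3 to 8.
    scaled_quality = OUTPUT_BIAS + sum(w * y for w, y in zip(OUTPUT_WEIGHTS, hidden))
    return 0.5 * (scaled_quality + 1.0) * (8 - 3) + 3

# Illustrative input values, not a result reported in this study.
sample = {"Fixed acidity": 7.4, "Volatile acidity": 0.7, "Citric acid": 0.0,
          "Residual sugar": 1.9, "Chlorides": 0.076, "Free sulfur dioxide": 11,
          "Total sulfur dioxide": 34, "Density": 0.9978, "pH": 3.51,
          "Sulphates": 0.56, "Alcohol": 9.4}
print(predict_quality(sample))
```

Keeping the coefficients in tables rather than in one long formula makes it easier to check them against the listing line by line.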

Bibliography

  • P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.