The objective of this study is to predict human wine taste preferences.

This can be useful to improve the wine production and to support the oenologist wine tasting evaluations.

Furthermore, similar techniques can help in target marketing by modeling consumer tastes from niche markets.

The variables used for this proposal are not related to grape type, wine brand or wine selling price; they come only from physicochemical tests. The model outputs a score between 0 and 10, which defines the wine quality.

This is an approximation (regression) project, since the variable to be predicted, wine quality, is continuous.

The basic goal here is to model the quality of a wine as a function of its features.

The data file wine_quality.csv contains a total of 1599 rows and 12 columns. The first row contains the names of the variables, and the remaining rows contain the instances.

The data set contains the following variables:

- **fixed_acidity**. This is a fundamental property of wine, imparting sourness and resistance to microbial infection.
- **volatile_acidity**. It refers to the steam-distillable acids present in wine, primarily acetic acid but also lactic, formic, butyric, and propionic acids.
- **citric_acid**
- **residual_sugar**
- **chlorides**
- **free_sulfur_dioxide**
- **total_sulfur_dioxide**
- **density**
- **pH**
- **sulphates**
- **alcohol**
- **quality**

On the other hand, the instances are divided at random into training, selection and testing subsets, containing 60%, 20% and 20% of the instances, respectively.
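The random split above can be sketched as follows; the 60/20/20 proportions come from the text, while the function name and seed are illustrative:

```python
import random

def split_instances(n_instances, seed=0):
    """Randomly assign instance indices to training (60%),
    selection (20%) and testing (20%) subsets."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = int(0.6 * n_instances)
    n_sel = int(0.2 * n_instances)
    return (indices[:n_train],
            indices[n_train:n_train + n_sel],
            indices[n_train + n_sel:])
```

With the 1599 instances of this data set, the call `split_instances(1599)` produces subsets of 959, 319 and 321 instances.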

We can calculate the data distribution and plot a histogram for the wine quality.
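The distribution behind that histogram is just a count of instances per quality score; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def quality_distribution(qualities):
    """Count how many wines received each quality score.
    These counts are what the histogram of the target displays."""
    return dict(sorted(Counter(qualities).items()))
```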

As we can see, the target variable is not well balanced: a large number of quality scores cluster around 6, and far fewer lie near 0 or 10.

In this example, the input selection algorithm is the genetic algorithm, with a population size of 100 individuals in each generation. The remaining parameters take their default values.

The next chart shows the performance history of the different subsets during the genetic algorithm input selection process. The blue line represents the training loss, whose initial value is 0.580421 and whose final value after 100 generations is 0.548342. The red line represents the selection loss, whose initial value is 0.633424 and whose final value after 100 generations is 0.632917.

The next chart shows the history of the mean selection loss in each generation during the genetic algorithm input selection process. The initial value is 0.76545, and the final value after 100 generations is 0.634.
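The genetic algorithm described above can be sketched in a few lines. This is an illustration under common assumptions (binary input masks, elitist selection, one-point crossover, bit-flip mutation), not the exact implementation used in the study; `loss` stands for whatever selection-loss evaluation the model provides:

```python
import random

def genetic_input_selection(loss, n_inputs, population_size=100,
                            generations=20, mutation_rate=0.1, seed=0):
    """Minimal genetic algorithm for input selection.

    Each individual is a binary mask over the inputs; `loss`
    scores a mask (lower is better). The better half of each
    generation survives, and offspring are produced by one-point
    crossover followed by bit-flip mutation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_inputs)]
                  for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=loss)
        parents = population[:population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_inputs)
            child = a[:cut] + b[cut:]
            children.append([1 - bit if rng.random() < mutation_rate
                             else bit for bit in child])
        population = parents + children
    return min(population, key=loss)
```

In practice the loss of a mask would be obtained by training the network on the selected inputs and measuring the selection error, which is why input selection is computationally expensive.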

The next figure shows the final architecture of the neural network.

Some data sets have redundant inputs, which degrades the performance of the neural network. Input selection is used to find the subset of inputs that yields the best model performance.

A standard method for testing the prediction capabilities is to compare the outputs from the neural network against an independent set of data. The linear regression analysis yields three parameters for each output: intercept, slope and correlation.

For a perfect prediction, the intercept would be 0 and the slope would be 1. If the correlation equals 1, the outputs from the neural network and the targets in the testing subset are perfectly correlated. In this case, the parameters show good results.
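The three parameters are those of an ordinary least-squares fit of the targets against the network outputs. A self-contained sketch (the function name is illustrative):

```python
def regression_parameters(outputs, targets):
    """Ordinary least-squares fit of targets against network
    outputs, returning the intercept, slope and linear
    correlation used to judge the testing performance."""
    n = len(outputs)
    mean_x = sum(outputs) / n
    mean_y = sum(targets) / n
    sxx = sum((x - mean_x) ** 2 for x in outputs)
    syy = sum((y - mean_y) ** 2 for y in targets)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(outputs, targets))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / (sxx * syy) ** 0.5
    return intercept, slope, correlation
```

When the outputs equal the targets exactly, this returns an intercept of 0, a slope of 1 and a correlation of 1, matching the perfect-prediction case described above.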

The next listing shows the mathematical expression of the predictive model.

The formula below can be exported to the software tool required by the customer.

- P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.