When building a machine learning model, it is essential to know the range of every variable. Basic statistics provide this information and put the data set in context.

The most important statistical parameters are the minimum, the maximum, the mean, and the standard deviation. We must always perform a simple statistical analysis to check data consistency.

To do so, we calculate these statistics for each variable in the data set. Recall that the data matrix $d$ comprises the variables

\begin{eqnarray}v_{j} := column_{j}(d), \quad j=1,\ldots,q\quad\end{eqnarray}

and the samples

\begin{eqnarray}u_{i} := row_{i}(d), \quad i=1,\ldots,p\quad\end{eqnarray}

that is, the columns and rows of the data set, respectively.
The values that a variable takes for each sample in the data set are

\begin{eqnarray}\quad v_j=(d_{1j}, \ldots, d_{pj}), \quad j=1,\ldots,q.\end{eqnarray}
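The column and row notation above can be illustrated with a minimal NumPy sketch. The matrix values and the helper names `variable` and `sample` are made up for illustration; only the convention (variables as columns, samples as rows, 1-based indices) comes from the text.

```python
import numpy as np

# Hypothetical data matrix d with p = 4 samples (rows) and q = 3 variables (columns).
d = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

p, q = d.shape  # number of samples, number of variables


def variable(d, j):
    """Return v_j, the j-th column of d (j is 1-based, as in the text)."""
    return d[:, j - 1]


def sample(d, i):
    """Return u_i, the i-th row of d (i is 1-based, as in the text)."""
    return d[i - 1, :]


v2 = variable(d, 2)  # values of the second variable for every sample
u1 = sample(d, 1)    # values of every variable for the first sample
```

Here `v2` holds the $p$ values $(d_{12}, \ldots, d_{p2})$, matching the last equation above with $j=2$.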


2. Minimum and maximum

The minimum of a variable is the smallest value of that variable in the data set.

The minimum of the variable is denoted by $v_{jmin}$, and it is defined as follows,

\begin{eqnarray}
\boxed{
v_{jmin} = \min_{i = 1, \ldots, p} d_{ij}.
}
\end{eqnarray}

The maximum of a variable is the largest value of that variable in the data set.

Similarly, the maximum is denoted by $v_{jmax}$, and it is defined as follows,

\begin{eqnarray}
\boxed{
v_{jmax} = \max_{i=1, \ldots, p} d_{ij}.
}
\end{eqnarray}
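As a rough sketch, the two definitions above correspond to column-wise reductions of the data matrix. The example assumes NumPy and a small made-up matrix; `axis=0` reduces over the samples $i=1,\ldots,p$, leaving one value per variable.

```python
import numpy as np

# Hypothetical data matrix: 3 samples (rows) and 2 variables (columns).
d = np.array([
    [0.5, 200.0],
    [1.5, 800.0],
    [-2.0, 500.0],
])

# Reduce over the rows (samples) to get one minimum and one maximum per variable.
v_min = d.min(axis=0)
v_max = d.max(axis=0)
```

For this matrix, `v_min` is `[-2.0, 200.0]` and `v_max` is `[1.5, 800.0]`.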

3. Mean and standard deviation

The mean of a variable is the average value of that variable in the data set. The mean is denoted by $v_{jmean}$ and is defined as,

\begin{eqnarray}
\boxed{v_{jmean} = \frac{1}{p}\sum_{i=1}^{p} d_{ij}.}
\end{eqnarray}

where $d_{ij}$ is the data matrix element and $p$ is the number of samples.

The standard deviation measures how dispersed the data is about the mean. The standard deviation of a variable is denoted by $v_{jstd}$ and is defined as,

\begin{eqnarray}
\boxed{
v_{jstd} = \sqrt{\frac{1}{p}\sum_{i=1}^{p}\left(d_{ij}-v_{jmean}\right)^2}.
}
\end{eqnarray}

where $v_{jmean}$ is the mean of the variable $v_{j}$.
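The following sketch, assuming NumPy and a made-up data matrix, evaluates both formulas term by term. Note that the standard deviation defined above divides by $p$ (the population form); `np.std` does the same by default (`ddof=0`), whereas some tools, such as pandas, divide by $p-1$ unless told otherwise.

```python
import numpy as np

# Hypothetical data matrix: 8 samples (rows) and 2 variables (columns).
d = np.array([
    [2.0, 1.0],
    [4.0, 1.0],
    [4.0, 1.0],
    [4.0, 1.0],
    [5.0, 1.0],
    [5.0, 1.0],
    [7.0, 1.0],
    [9.0, 1.0],
])

p = d.shape[0]  # number of samples

# Mean of each variable: (1/p) * sum_i d_ij
v_mean = d.sum(axis=0) / p

# Standard deviation of each variable: sqrt((1/p) * sum_i (d_ij - v_jmean)^2)
v_std = np.sqrt(((d - v_mean) ** 2).sum(axis=0) / p)

# NumPy's built-in uses the same 1/p normalization when ddof=0 (the default).
v_std_builtin = np.std(d, axis=0, ddof=0)
```

The first variable has mean 5 and standard deviation 2; the second is constant, so its standard deviation is 0.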

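When the values of a variable are approximately normally distributed, the mean and standard deviation can be represented graphically by the normal distribution curve, also called the Gaussian bell: the mean marks the center of the bell and the standard deviation determines its width.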

A low standard deviation means that most values are close to the mean.
Conversely, a high standard deviation means the values are spread far from the mean and from each other.
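To make this contrast concrete, here is a small sketch with two made-up variables that share the same mean but differ in spread. The variable names and values are purely illustrative.

```python
import numpy as np

# Two hypothetical variables with the same mean (10) but very different spread.
tight = np.array([9.9, 10.0, 10.1])
wide = np.array([0.0, 10.0, 20.0])

tight_mean, wide_mean = tight.mean(), wide.mean()

# Population standard deviation (divides by p), as defined in the text.
tight_std = tight.std(ddof=0)
wide_std = wide.std(ddof=0)
```

Both means are 10, but `tight_std` is roughly 0.08 while `wide_std` is roughly 8.2, so the mean alone does not describe how the values are distributed.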

Example: Predict the noise generated by airfoil blades

NASA conducted a study of the noise generated by an aircraft in order to build a model to reduce it.

The file airfoil_self_noise.csv contains the data for this example.
Here the number of variables (columns) is 6, and the number of instances (rows) is 1503.

We can calculate the basic statistics of each variable using the formulas described above.
The following table displays the minimum, maximum, mean, and standard deviation for every variable in the data set.

| Name | Minimum | Maximum | Mean | Standard deviation |
|---|---|---|---|---|
| frequency | 200 | 20000 | 2890 | 3150 |
| angle_of_attack | 0 | 22.2 | 6.78 | 5.92 |
| chord_length | 0.0254 | 0.305 | 0.137 | 0.0935 |
| free_stream_velocity | 31.7 | 71.3 | 50.9 | 15.6 |
| suction_side_displacement_thickness | 0.000401 | 0.0584 | 0.0111 | 0.0132 |
| scaled_sound_pressure_level | 103 | 141 | 125 | 6.9 |

By performing this simple statistical analysis, we can check the consistency of the data.
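One possible way to produce such a table and run a basic consistency check is sketched below with pandas. The helper names `basic_statistics` and `inconsistent_variables` are hypothetical, the small in-memory frame stands in for the real file, and the layout of airfoil_self_noise.csv (separator, header row) is an assumption that should be verified before loading it.

```python
import pandas as pd


def basic_statistics(data: pd.DataFrame) -> pd.DataFrame:
    """Minimum, maximum, mean and standard deviation of every numeric column."""
    numeric = data.select_dtypes(include="number")
    return pd.DataFrame({
        "minimum": numeric.min(),
        "maximum": numeric.max(),
        "mean": numeric.mean(),
        # ddof=0 divides by p, matching the formula in the text
        # (the pandas default is ddof=1, which divides by p - 1).
        "deviation": numeric.std(ddof=0),
    })


def inconsistent_variables(stats: pd.DataFrame) -> list:
    """Names of variables whose mean lies outside [minimum, maximum]."""
    bad = (stats["mean"] < stats["minimum"]) | (stats["mean"] > stats["maximum"])
    return stats.index[bad].tolist()


# Small made-up stand-in for the real data set.
data = pd.DataFrame({
    "frequency": [200.0, 1000.0, 20000.0],
    "chord_length": [0.0254, 0.1524, 0.3048],
})
stats = basic_statistics(data)

# A hand-typed statistics table with a deliberately wrong row (hypothetical values).
reported = pd.DataFrame(
    {"minimum": [0.0], "maximum": [1.0], "mean": [5.0], "deviation": [0.5]},
    index=["x"],
)
flagged = inconsistent_variables(reported)  # the mean of "x" exceeds its maximum

# With the real file, something like the following could be used
# (separator and header are assumptions about the file layout):
# data = pd.read_csv("airfoil_self_noise.csv")
```

A mean that falls outside the minimum and maximum, as in the `reported` row, usually points to a typing or unit error rather than to a property of the data.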

4. Conclusions

Statistics put the data set in context.

It is essential to perform a simple statistical analysis to check the consistency of the data before building the model.

This is done by calculating each variable’s most important statistical parameters, such as the minimum and maximum values, mean, and standard deviation.
