## 1. Definition

A machine learning dataset is a collection of the data needed to create and train an approximation, classification, or forecasting model.

A data source is the location where the data being used originates. There are different types of data sources, such as Excel files, CSV files, databases, and image collections.

Before building a model, it is necessary to transform the data into numbers, i.e., we have to collect the data in a matrix of real numbers, creating the data matrix.

Every column represents a particular variable, and each row corresponds to a given sample of the data set in question.

A variable is any characteristic, number, or quantity that can be measured or counted. It is an attribute that describes a person, place, thing, or idea.

The variable’s value can “vary” from one entity to another. According to their type, we can consider different types of variables: numeric, ordinal, binary, or categorical variables.

Variables can be used as inputs or targets. Input variables are the independent variables in the model (they are also called features or attributes), and target variables are the dependent variables in the model.

A sample is an observation of all variables. The samples will also have different uses. We divide the samples into three different subsets. These are the training set (used to build different candidate models), the selection set (used to select the model that exhibits the best properties), and the test set (used to validate the final model).
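The division into training, selection, and test subsets can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `split_samples`, the 60/20/20 proportions, and the fixed random seed are all assumptions made for the example.

```python
import random

def split_samples(n_samples, training=0.6, selection=0.2, seed=0):
    """Return three disjoint lists of row indices: training, selection, test.

    The 60/20/20 proportions are an assumption for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so each subset is a random draw

    n_training = int(n_samples * training)
    n_selection = int(n_samples * selection)

    training_idx = indices[:n_training]
    selection_idx = indices[n_training:n_training + n_selection]
    test_idx = indices[n_training + n_selection:]  # every remaining sample
    return training_idx, selection_idx, test_idx

training_idx, selection_idx, test_idx = split_samples(10)
print(len(training_idx), len(selection_idx), len(test_idx))
```

Working with row indices rather than copies of the rows keeps the data matrix itself intact, so the same split can be reused across candidate models.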

Sometimes, the data set may be incomplete or have missing values. This is one of the main problems when applying neural networks to real-world problems. To solve it, we can either discard the whole sample or impute the missing value.
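Both options for handling missing values can be sketched in a few lines. The example below assumes that missing values are stored as `NaN` and that imputation uses the column mean; both are illustrative choices, and the helper names and the small `data` matrix are hypothetical.

```python
import math
from statistics import mean

# Hypothetical data matrix; NaN marks a missing value (an assumption).
data = [
    [5.0, 1.0],
    [math.nan, 2.0],
    [7.0, 3.0],
]

def drop_incomplete(rows):
    """Option 1: discard every sample that has at least one missing value."""
    return [row for row in rows if not any(math.isnan(x) for x in row)]

def impute_mean(rows):
    """Option 2: replace each missing value with the mean of its column."""
    n_columns = len(rows[0])
    column_means = []
    for j in range(n_columns):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        column_means.append(mean(observed))
    return [
        [column_means[j] if math.isnan(x) else x for j, x in enumerate(row)]
        for row in rows
    ]

print(drop_incomplete(data))
print(impute_mean(data))
```

Discarding samples is simpler but loses information, while imputation keeps every sample at the cost of introducing estimated values.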

## 2. Data analysis

Before building a model, we need to analyze the data statistically to understand what it represents.

The most basic analysis is the statistics for each variable, and the most important statistical parameters are the minimum, maximum, mean, and standard deviation.
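As a small sketch, these four parameters can be computed for one variable with the Python standard library. The function name `describe` is hypothetical, and the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator is meant.

```python
from statistics import mean, stdev

def describe(column):
    """Return the basic statistics of one variable (one column of the data matrix)."""
    return {
        "minimum": min(column),
        "maximum": max(column),
        "mean": mean(column),
        # Sample standard deviation (n - 1 in the denominator); an assumption.
        "standard_deviation": stdev(column),
    }

print(describe([1.0, 2.0, 3.0]))
```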

Another descriptive analysis is that of the distributions of each variable. Models tend to be of higher quality when the variables are well distributed, so it is worth checking whether each variable in the data set follows an approximately uniform or normal distribution.

We can visualize these distributions with histograms, pie charts, or box plots, and summarize them with statistics such as the median and the quartiles.

We can also discover dependencies between the variables of the data set from the correlations.

A correlation is a numerical value between -1 and 1 that expresses the strength of the relationship between two variables.
If the correlation between two variables is close to 1, they are positively related; if it is close to 0, they are not linearly related; and if it is close to -1, they are negatively related.
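The three cases can be illustrated with the Pearson linear correlation coefficient. The text does not name a specific coefficient, so Pearson is an assumption here, and the function name is hypothetical.

```python
import math

def pearson_correlation(x, y):
    """Pearson linear correlation between two variables of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(pearson_correlation(x, [2 * a for a in x]))   # increasing linear relation
print(pearson_correlation(x, [-2 * a for a in x]))  # decreasing linear relation
print(pearson_correlation(x, [a ** 2 for a in x]))  # dependent, but not linearly
```

The last case shows why a correlation near 0 should be read with care: the second variable is fully determined by the first, yet the linear correlation does not detect it.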

We can also analyze the data to detect potential problems. One of the most common problems is outliers.

Outliers are observations that lie abnormally far from the rest of the data and can spoil and confound the training process.

We can use Tukey’s test, a univariate method, or the local outlier factor, a multivariate method, to detect and deal with outliers.
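Tukey's test can be sketched as follows. The idea is to flag values lying more than a multiple `k` of the interquartile range below the first quartile or above the third quartile. The function name, the conventional `k = 1.5`, and the quartile interpolation used by `statistics.quantiles` by default are assumptions of this example; other quartile conventions may give slightly different fences.

```python
from statistics import quantiles

def tukey_outliers(values, k=1.5):
    """Return the values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1                       # interquartile range
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements of a single variable.
print(tukey_outliers([10, 11, 12, 13, 14, 15, 100]))
```

Because the test looks at one variable at a time, it cannot detect samples that are unusual only in the combination of several variables; that is the case multivariate methods address.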

We can also filter the data to create models with subsets of it. Filtering is usually temporary: we keep the entire data set, but only part of it is used for the calculation.

Filtering requires that you specify a rule or logic to identify the cases you want to include in the analysis.
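A filtering rule can be expressed as a function that decides, sample by sample, whether a row takes part in the analysis. The sketch below returns the indices of the selected rows instead of deleting anything, which reflects the temporary nature of filtering. The function name, the example matrix, and the rule itself are hypothetical.

```python
def select_rows(data, rule):
    """Return the indices of the rows that satisfy the rule; data is left untouched."""
    return [i for i, row in enumerate(data) if rule(row)]

# Hypothetical data matrix with two variables per sample.
data = [
    [1.0, 20.0],
    [4.0, 35.0],
    [9.0, 50.0],
]

# Hypothetical rule: use only samples whose first variable is at least 3.
used = select_rows(data, lambda row: row[0] >= 3.0)
print(used)
print(len(data))  # the full data set is still available
```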

It is also advisable to scale the variables so that their values are of order zero, that is, close to one in magnitude, before training a neural network.

The objective of data scaling is to convert the data into an appropriate range for its computation. Data scaling is generally performed variable-by-variable, as different variables may require different types of scaling.

Some of the most widely used scaling methods are the minimum and maximum, the mean and standard deviation, and the logarithm.
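The three methods can be sketched for a single variable as shown below. The function names are hypothetical, and several details are assumptions: the minimum and maximum method maps to the range [0, 1] (other target ranges, such as [-1, 1], are also common), the standard deviation is the sample one, and the logarithm is the natural logarithm applied to strictly positive values.

```python
import math
from statistics import mean, stdev

def scale_minimum_maximum(column):
    """Map the values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lowest, highest = min(column), max(column)
    return [(x - lowest) / (highest - lowest) for x in column]

def scale_mean_standard_deviation(column):
    """Shift and rescale the values to mean 0 and standard deviation 1."""
    center, spread = mean(column), stdev(column)
    return [(x - center) / spread for x in column]

def scale_logarithm(column):
    """Apply the natural logarithm; only valid for strictly positive values."""
    return [math.log(x) for x in column]

print(scale_minimum_maximum([2.0, 4.0, 6.0]))
print(scale_mean_standard_deviation([1.0, 2.0, 3.0]))
print(scale_logarithm([1.0, 10.0, 100.0]))
```

Note that the first two functions divide by the spread of the variable, so a constant column would need separate handling, which this sketch omits.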

Training algorithms for neural networks do not work with the data matrix directly. Instead, they use data structures called data batches.

A batch of data contains two tensors, one with input data and one with target data, and the rank of these tensors depends on the model type.
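The idea of a batch can be sketched with plain nested lists standing in for tensors. The generator below walks through the data matrix in chunks and separates each chunk into an input part and a target part. The function name, its parameters, and the example matrix are hypothetical, and real training libraries use their own tensor types.

```python
def make_batches(data, input_columns, target_columns, batch_size):
    """Yield (inputs, targets) pairs, each covering up to batch_size samples."""
    for start in range(0, len(data), batch_size):
        rows = data[start:start + batch_size]
        inputs = [[row[j] for j in input_columns] for row in rows]
        targets = [[row[j] for j in target_columns] for row in rows]
        yield inputs, targets

# Hypothetical data matrix: column 0 is the input, column 1 is the target.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

for inputs, targets in make_batches(data, [0], [1], batch_size=2):
    print(inputs, targets)
```

Here each batch holds two rank-2 structures (samples by variables); the last batch is smaller when the number of samples is not a multiple of the batch size.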

## 3. Data matrix

Before building a model, we need to collect the data in a matrix of real numbers.

Let $p$ denote the number of rows and $q$ the number of columns. The data matrix is then $d \in {R}^{p \times q}$.

As we can see, machine learning models require all data to be real numbers.

The data matrix has the following form,

\begin{eqnarray}
d = \left(
\begin{array}{ccc}
d_{1,1} & \cdots & d_{1,q}\\
\vdots & \ddots & \vdots \\
d_{p,1} & \cdots & d_{p,q}\\
\end{array}
\right).
\end{eqnarray}

A sample is a vector $u \in {R}^{q}$, where $q$ is the number of columns in the data matrix. In this regard, the data matrix contains $p$ samples, one per row,

\begin{eqnarray}
d = \left(
\begin{array}{c}
u_{1}\\
\vdots\\
u_{p}\\
\end{array}
\right).
\end{eqnarray}

A variable is a vector $v \in {R}^{p}$, where $p$ is the number of rows in the data matrix.
In this regard, the data matrix contains $q$ variables, one per column,

\begin{eqnarray}
d = \left(
\begin{array}{ccc}
v_{1} & \cdots & v_{q}\\
\end{array}
\right).
\end{eqnarray}

Our source of information might not be directly in the format of a matrix.

For example, the information may be distributed in several tables of a database.

We can also find sets of images to, for example, diagnose a tumor.

In addition, some data may not be real numbers.

For example, a customer’s country (Spain, France, etc.) is categorical.

This means that an essential part of building machine learning models is the creation of a data matrix with the correct format.
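One common way to turn a categorical variable such as the customer's country into real numbers is one-hot encoding, where each category becomes its own column of zeros and ones. The text does not prescribe a particular encoding, so this choice is an assumption, as are the function name and the alphabetical ordering of the categories.

```python
def one_hot_encode(values):
    """Encode a categorical variable as one 0/1 column per category."""
    categories = sorted(set(values))  # fixed column order (alphabetical)
    encoded = [
        [1.0 if value == category else 0.0 for category in categories]
        for value in values
    ]
    return categories, encoded

categories, encoded = one_hot_encode(["Spain", "France", "Spain"])
print(categories)
print(encoded)
```

The encoded columns can then be placed in the data matrix alongside the numeric variables.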

The following example from the industry sector shows the data matrix of a real model: a wind turbine data matrix.

A wind turbine manufacturer wants to know the electrical power generated by the device at different wind speeds. To do this, they measure different operating scenarios and generate the following data matrix.

\begin{eqnarray}\nonumber
d = \left(
\begin{array}{cc}
380.048 & 5.311\\
453.769 & 5.672\\
\vdots & \vdots \\
2820.466 & 9.973\\
\end{array}
\right).
\end{eqnarray}

The number of columns in the data matrix is $q=2$, so this is a simple matrix. Each column corresponds to a variable. In this case, the variables are the wind speed (in meters per second) and the corresponding power generated by the turbine (in kilowatts). Judging by the magnitudes of the values shown, the first column holds the power, which is the target, and the second column holds the wind speed, which is the input.

The number of rows in the data matrix is $p=48007$. Each row corresponds to a sample. Each sample contains values of the two variables.

## Conclusions

Datasets collect the data needed to create and train a model. In general, the data must be transformed to adapt it to machine learning and create the data matrix. Subsequently, it is advisable to perform a statistical study of the data to deal with potential problems such as outliers.