The data set contains information for creating our model. It is a collection of data structured as a table, in rows and columns.
One of the best-known data sets in the machine learning field is the Iris flower data set, introduced by the British statistician and biologist Ronald Fisher in 1936.
We can identify the following concepts in a data set:
The data is usually stored in a data file. Some common data sources are the following:
Of these, the most widely used format for a data set is the CSV file. When possible, export your spreadsheet, SQL query result, etc. to that format.
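As a minimal sketch of what a CSV data set looks like once loaded, the following Python snippet reads a small in-memory table with the standard library. The column names and values are invented for illustration, and `load_data_set` is a hypothetical helper, not part of any particular tool.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file
# exported from a spreadsheet or an SQL query.
CSV_TEXT = """temperature,pressure,power
14.9,1024.1,463.3
25.2,1020.0,444.4
5.1,1012.2,488.6
"""


def load_data_set(text):
    """Return (variable_names, instances) parsed from CSV text.

    Each instance is a list of floats, one per variable (column).
    """
    reader = csv.reader(io.StringIO(text))
    variable_names = next(reader)  # the first row holds the column names
    instances = [[float(value) for value in row] for row in reader if row]
    return variable_names, instances


names, rows = load_data_set(CSV_TEXT)
print(names)      # the variables (columns)
print(len(rows))  # the number of instances (rows)
```

This sketch assumes every column is numeric; categorical columns would need their own parsing.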
The variables are the columns in the data table. Variables might represent physical measurements (temperature, velocity...), personal characteristics (gender, age...), marketing dimensions (recency, frequency, monetary...), etc.
Regarding their use, we can talk about:
Input variables are the independent variables in the model. They are also called features or attributes.
Input variables can be continuous, binary, or categorical.
Target variables are the dependent variables in the model.
In regression problems, targets are continuous variables (power consumption, product quality...).
In classification problems, targets are binary (fault, churn...) or categorical (the type of object, activity...). In this type of application, targets are also called categories or labels.
Unused variables are neither inputs nor targets. We can set a variable to Unused when it does not provide any information to the model (id number, address...).
Constant variables are columns in the data matrix that always have the same value. They should be set as Unused since they provide no information to the model but increase its complexity.
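Constant columns are easy to detect automatically. The sketch below, using a hypothetical `constant_variables` helper and made-up data, returns the names of the columns that hold a single distinct value and are therefore candidates for Unused.

```python
def constant_variables(names, instances):
    """Return the names of the columns whose value never changes."""
    constant = []
    for index, name in enumerate(names):
        distinct_values = {row[index] for row in instances}
        if len(distinct_values) <= 1:
            constant.append(name)
    return constant


# Hypothetical data: "sensor_type" is the same in every instance.
names = ["id", "sensor_type", "temperature"]
instances = [
    [1, "A", 20.5],
    [2, "A", 21.0],
    [3, "A", 19.8],
]

unused = constant_variables(names, instances)
print(unused)  # ['sensor_type']
```

Note that the `id` column is not constant, so it is not flagged here; it would still be set to Unused by hand because it carries no information for the model.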
Instances are the rows in the data table. They are also called samples or points.
It is not useful to design a neural network that simply memorizes a set of data. Instead, we want the neural network to perform accurately on new data, that is, to generalize.
To achieve that, we divide the data set into different subsets:
During the design of a model, we usually need to try different settings. For instance, we can build several models with different architectures and compare their performance. Training instances are used to construct all these models.
Selection instances are used for choosing the neural network with the best generalization properties. In this way, we construct different models with the training subset and select the one that works best on the selection subset.
Testing instances are used to validate the functioning of the model. We train different models with the training instances, select the one that performs best on the selection instances, and test its capabilities with the testing instances.
Instead of providing useful information to the model, some instances might distort it. For example, outliers in the data can degrade the performance of the neural network. To fix these problems, those instances are set to Unused.
Usually, 60% of the instances are used for training, 20% for selection, and 20% for testing. The instances can be split sequentially or at random.
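The 60/20/20 split can be sketched in a few lines of Python. The function name, its parameters, and the fixed random seed are assumptions made for this example; setting `shuffle=False` gives the sequential split instead of the random one.

```python
import random


def split_instances(n_instances, training=0.6, selection=0.2, shuffle=True, seed=0):
    """Return three lists of instance indices: training, selection, testing."""
    indices = list(range(n_instances))
    if shuffle:
        # A seeded generator makes the random split reproducible.
        random.Random(seed).shuffle(indices)
    n_training = round(n_instances * training)
    n_selection = round(n_instances * selection)
    training_indices = indices[:n_training]
    selection_indices = indices[n_training:n_training + n_selection]
    testing_indices = indices[n_training + n_selection:]  # whatever remains
    return training_indices, selection_indices, testing_indices


train, select, test = split_instances(100)
print(len(train), len(select), len(test))  # 60 20 20
```

Every instance lands in exactly one subset, so the testing instances are never seen while training or selecting the model.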
Repeated instances can also be set to Unused since they provide redundant information to the model.
A data set can also contain missing values, which are those elements that are not present. Usually, missing values are denoted by a label in the data set. Some standard labels used for representing missing values are NA (not available), NaN (not a number), Unknown, or ?. Do not use numeric values such as -999 for this purpose, since they might be confused with actual values.
There are two ways to deal with missing values:
If the number of instances in the data set is large and the number of missing values is small, the instances with missing values can simply be excluded from the analysis.
In this way, the unusing method sets the instances with missing values to Unused.
If the data set is small or the number of missing values is large, you probably cannot afford to set the instances with missing values to Unused. In these cases, it is advisable to assign probable values to the missing data.
The most common imputation method is to substitute the missing values with the mean value of the corresponding variable.
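Both strategies can be sketched as follows. The helper names and the sample values are hypothetical; the missing-value labels are the ones listed above, and missing entries are represented internally as `None`.

```python
MISSING_LABELS = {"NA", "NaN", "Unknown", "?"}


def parse(value):
    """Convert a raw string to a float, or to None if it is a missing label."""
    return None if value.strip() in MISSING_LABELS else float(value)


def unuse_missing(instances):
    """Unusing method: keep only the instances without missing values."""
    return [row for row in instances if None not in row]


def impute_mean(instances):
    """Mean imputation: replace each missing value with its column mean.

    Assumes every column has at least one value present.
    """
    n_columns = len(instances[0])
    means = []
    for j in range(n_columns):
        present = [row[j] for row in instances if row[j] is not None]
        means.append(sum(present) / len(present))
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in instances
    ]


raw = [["1.0", "10.0"], ["NA", "20.0"], ["3.0", "?"]]
instances = [[parse(value) for value in row] for row in raw]

kept = unuse_missing(instances)   # [[1.0, 10.0]]
filled = impute_mean(instances)   # [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
print(kept)
print(filled)
```

In this tiny example the unusing method discards two of the three instances, which illustrates why imputation is preferred when data is scarce.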
Some of the essential techniques for data analysis and preparation are the following:
Basic statistics provide very valuable information when designing a model since they can alert us to the presence of spurious data. It is essential to check that the most important statistical measures of every variable are correct.
As an example, the following table depicts the minimum, maximum, mean, and standard deviation of the variables used to improve the performance of a combined cycle power plant.
In classification applications, it is also interesting to compare the statistics for each category. For instance, we can compare the mean age of the customers who buy a certain product with the mean age of those who do not.
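The statistics mentioned above can be computed with Python's standard `statistics` module. The sketch below uses invented ages and purchase labels; `describe` and `mean_by_category` are hypothetical helper names, and the standard deviation shown is the sample standard deviation.

```python
import statistics


def describe(values):
    """Return the basic statistics of one variable."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "standard_deviation": statistics.stdev(values),  # sample (n - 1) version
    }


def mean_by_category(values, categories):
    """Return the mean of `values` for each distinct category."""
    groups = {}
    for value, category in zip(values, categories):
        groups.setdefault(category, []).append(value)
    return {category: statistics.mean(group) for category, group in groups.items()}


# Hypothetical customer ages and whether each customer bought the product.
ages = [25, 35, 45, 55]
bought = ["no", "no", "yes", "yes"]

summary = describe(ages)
means = mean_by_category(ages, bought)
print(summary)
print(means)  # {'no': 30, 'yes': 50}
```

A minimum or maximum far outside the plausible range of a variable is usually the first hint of spurious data.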
Distributions show how the data is distributed over its entire range. If the data is very irregularly distributed, the resulting model will probably be of poor quality.
Histograms are used to see how continuous variables are distributed. Continuous variables often follow an approximately normal (or Gaussian) distribution.
For example, the following figure depicts a histogram for the noise generated by different airfoil blades.
As we can see, this variable has a normal distribution. In the data set, 22% of the airfoil blades emit a sound of around 127 dB.
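The percentages shown in a histogram come from counting how many values fall into each equal-width bin. The following is a minimal sketch with made-up sound levels, not the airfoil data itself; the function names are assumptions for this example.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        # Degenerate case: every value is identical.
        return [len(values)] + [0] * (n_bins - 1)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value is placed in the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


def to_percentages(counts):
    """Express the bin counts as percentages of the total."""
    total = sum(counts)
    return [100 * count / total for count in counts]


# Hypothetical sound levels in dB.
sound_levels = [120, 123, 126, 127, 128, 129, 131, 133, 136, 140]
counts = histogram_counts(sound_levels, n_bins=4)
percentages = to_percentages(counts)
print(counts)       # [2, 4, 2, 2]
print(percentages)  # [20.0, 40.0, 20.0, 20.0]
```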
Pie charts are used to see the distribution of binary or nominal variables. Ideally, this type of variable should be uniformly distributed, that is, balanced across its categories.
The following figure shows the pie chart for the customers of a bank that purchase a bank deposit in a marketing campaign, which is a binary variable.
Approximately 85.5% of the customers do not purchase the product, while 14.5% do. Therefore, this variable is not well balanced: there are many more negatives than positives.
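Checking the balance of a binary or nominal variable amounts to computing the proportion of each category. In the sketch below, the label counts are hypothetical and chosen only to reproduce the percentages quoted above; `class_proportions` is an illustrative helper.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of instances that belong to each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical campaign outcomes: 855 negatives and 145 positives.
purchased = ["no"] * 855 + ["yes"] * 145
proportions = class_proportions(purchased)
print(proportions)
```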
Box plots also provide information about the shape of the data. They display the minimum, the maximum, the first quartile, the second quartile (or median), and the third quartile of every variable. They consist of two parts: a box and two whiskers.
The length of the box represents the interquartile range (IQR), which is the distance between the third quartile and the first quartile. The middle half of the data falls inside the interquartile range. The whisker below the box shows the minimum of the variable, and the whisker above the box shows the maximum. Within the box, a line is also drawn to represent the median of the variable.
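The five values that a box plot displays can be computed with `statistics.quantiles` from the Python standard library. The ages below are hypothetical, and the `"inclusive"` quartile convention is an assumption: other conventions give slightly different quartiles on small samples.

```python
import statistics


def box_plot_summary(values):
    """Return the five values drawn in a box plot, plus the IQR."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "minimum": min(values),
        "first_quartile": q1,
        "median": median,
        "third_quartile": q3,
        "maximum": max(values),
        "interquartile_range": q3 - q1,
    }


# Hypothetical ages.
ages = [22, 30, 35, 38, 40, 42, 45, 52, 60]
box = box_plot_summary(ages)
print(box)
```

With these values the box spans from 35 to 45, so the middle half of the ages lies in that interval.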
The following figure illustrates the box plot for the age of the employees of a company.
As we can see, 50% of the employees are between 35 and 45 years old.
In forecasting applications, it is very useful to see how the different variables evolve over time. Time series plots display the observations on the y-axis against time on the x-axis.
The next figure is an example of a time series plot. It depicts the temperature in a city over the years.
The figure above shows a cyclic variable. It also shows some outliers in the data (wrong temperature observations).
It is always very useful to see how the targets depend on the inputs. Scatter charts show points with target values versus input values.
As an example, the following chart shows the compressive strength of concrete as a function of the quantity of cement. As we can see, there is quite a strong correlation between both variables: the greater the quantity of cement, the higher the compressive strength.
Sometimes, data sets contain redundant data that complicate the design of the neural network. To discover redundancies between the input variables, we use a correlation matrix.
The correlation is a numerical value between -1 and 1 that expresses the strength of the relationship between two variables. A value close to 1 indicates a positive relationship (one variable increases when the other increases); a value close to 0 indicates that there is no relationship; a value close to -1 indicates a negative relationship (one variable decreases when the other increases).
When both variables are continuous, a linear function is used to calculate the correlation. When one or both variables are binary, a logistic function is used for calculating the correlation.
Next, we depict the correlations among the features used to target donors in a blood donation campaign. This example uses a recency, frequency, monetary, and time (RFMT) marketing model.
In this case, the quantity variable has a perfect correlation with the frequency variable (1.00). That means that we can remove one of those two variables from our model without any loss of information.
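For continuous variables, the linear correlation is commonly computed as the Pearson coefficient. The sketch below implements it and flags pairs of inputs whose absolute correlation exceeds a threshold. The data values, the `redundant_pairs` helper, and the 0.99 threshold are assumptions made for this example, and the logistic correlation for binary variables is not covered.

```python
import itertools
import math


def pearson(x, y):
    """Pearson linear correlation between two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


def redundant_pairs(columns, threshold=0.99):
    """Return the pairs of variable names with |correlation| >= threshold."""
    pairs = []
    for name_a, name_b in itertools.combinations(columns, 2):
        if abs(pearson(columns[name_a], columns[name_b])) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical inputs in which quantity is an exact multiple of frequency.
columns = {
    "recency": [2, 0, 1, 4, 3],
    "frequency": [1, 2, 3, 4, 5],
    "quantity": [250, 500, 750, 1000, 1250],
}

redundant = redundant_pairs(columns)
print(redundant)  # [('frequency', 'quantity')]
```

One variable of each flagged pair can then be set to Unused.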
It is beneficial to know how the target variables depend on the input variables. For this kind of diagnostic analytics, we also use correlation coefficients. If the correlation is 0, the variables show no such relationship, that is, increasing or decreasing one does not imply that the other increases or decreases. On the other hand, if the correlation coefficient is 1 or -1, they are directly or inversely dependent.
Linear correlations are used when both the input and the target are continuous. Logistic correlations are used when the input, the target, or both are binary.
The following figure shows the correlations between the dimensions and velocity of a sailing yacht and its corresponding hydrodynamic performance. In that figure, we can see one variable with a high correlation (more than 0.5) and about 10 variables with a small correlation (less than 0.1).
In forecasting applications, autocorrelations refer to the correlations of a time series with its own past values.
Positively autocorrelated series are sometimes called persistent because positive deviations from the mean tend to be followed by positive deviations, and negative deviations from the mean tend to be followed by negative deviations.
On the other hand, negatively autocorrelated time series are characterized by a tendency for positive deviations from the mean to be followed by negative deviations from the mean and vice versa.
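The difference between persistent and alternating series can be seen with the usual sample autocorrelation estimator, sketched below for a given lag. The two series are invented, and the function name is an assumption for this example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps.

    Assumes 0 < lag < len(series) and that the series is not constant.
    """
    n = len(series)
    mean = sum(series) / n
    denominator = sum((value - mean) ** 2 for value in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator


# A steadily rising series: deviations keep their sign from one step to the next.
persistent = [1, 2, 3, 4, 5, 6, 7, 8]
# An alternating series: each deviation is followed by one of the opposite sign.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

print(autocorrelation(persistent, lag=1))   # 0.625
print(autocorrelation(alternating, lag=1))  # -0.875
```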
The next figure depicts an example of the autocorrelations of several meteorological variables from the past 5 days.
As we can see, the most autocorrelated variable is the temperature, and the least autocorrelated variable is the rainfall.
In forecasting applications, cross-correlation charts show the correlation between a target variable and the lags of an input variable.
As an illustration, the following figure shows the correlations between the rainfall of today and the maximum pressure of yesterday, the day before yesterday, etc.
The chart above shows negative and decreasing correlations.
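A cross-correlation at a given lag can be sketched by pairing each target value with the input value observed `lag` steps earlier and computing the linear (Pearson) correlation of those pairs. The pressure and rainfall values below are invented so that today's rainfall depends exactly on yesterday's pressure; all names are assumptions for this example.

```python
import math


def pearson(x, y):
    """Pearson linear correlation (assumes neither sequence is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(variance_x * variance_y)


def cross_correlation(inputs, targets, lag):
    """Correlation between targets[t] and inputs[t - lag]."""
    if lag == 0:
        return pearson(inputs, targets)
    # Drop the last `lag` inputs and the first `lag` targets to align them.
    return pearson(inputs[:-lag], targets[lag:])


# Hypothetical daily series: rainfall[t] = (1030 - pressure[t - 1]) / 5 for t >= 1.
pressure = [1010, 1020, 1005, 1015, 1000, 1025, 1012]
rainfall = [3, 4, 2, 5, 3, 6, 1]

for lag in (1, 2, 3):
    print(lag, cross_correlation(pressure, rainfall, lag))
```

In this artificial example the correlation at lag 1 is a perfect -1, since high pressure yesterday always means little rain today.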
An outlier is an instance that is distant from the other instances. Outliers may be due to variability in the measurement or may indicate experimental errors.
If possible, outliers should be excluded from the data set, i.e., set as unused instances. However, detecting those anomalous instances might be very difficult, and usually requires lots of work.
The first thing we can do is to check for the correctness of the data statistics. Indeed, spurious minimums and maximums are a clear sign of the presence of outliers.
We can also plot the histograms of the data and check that there are no isolated bins at the ends.
Box plots are also a suitable method for detecting the presence of outliers since they depict groups of data through their quartiles.
In this regard, Tukey's method defines outliers as those values of the data set that fall too far from the central point, the median. The maximum allowed distance from the center of the data is set by the cleaning parameter. As the parameter grows, the test becomes less sensitive to outliers; if it is too small, many values will be flagged as outliers.
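The sketch below uses the conventional formulation of Tukey's fences, which measures the distance from the quartiles rather than from the median, so it may differ from the exact rule described above. The `cleaning_parameter` argument plays the role of the multiplier of the interquartile range (1.5 is the customary default), and the balances are hypothetical.

```python
import statistics


def tukey_outliers(values, cleaning_parameter=1.5):
    """Return the values lying outside Tukey's fences.

    Fences: Q1 - k * IQR and Q3 + k * IQR, with k = cleaning_parameter.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - cleaning_parameter * iqr
    upper_fence = q3 + cleaning_parameter * iqr
    return [value for value in values if value < lower_fence or value > upper_fence]


# Hypothetical customer balances with one very large value.
balances = [100, 200, 250, 300, 350, 400, 450, 500, 20000]

print(tukey_outliers(balances))                          # [20000]
print(tukey_outliers(balances, cleaning_parameter=100))  # []
```

The second call shows the effect described above: a very large cleaning parameter makes the test so insensitive that nothing is flagged.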
As an example, the following chart is a box plot of the balance of a bank's customers. As we can see, there are a few clients with a very high balance, and they might be treated as outliers.
There are other methods for dealing with outliers, such as the Minkowski error. For more information, see the post "3 methods to deal with outliers" on our blog.
The objective of filtering is to reduce the noise or errors in the data.
Here, the instances that do not fall within a specified range are set to Unused.
When filtering data, the minimum and maximum allowed values for all the variables must be set.
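Filtering can be sketched as a range check on every variable. In the example below, the `ranges` dictionary (column index mapped to minimum and maximum allowed values), the helper name, and the data are all assumptions for illustration.

```python
def filter_instances(instances, ranges):
    """Split instances into used and unused according to allowed ranges.

    `ranges` maps a column index to its (minimum, maximum) allowed values.
    """
    used, unused = [], []
    for row in instances:
        in_range = all(
            minimum <= row[column] <= maximum
            for column, (minimum, maximum) in ranges.items()
        )
        (used if in_range else unused).append(row)
    return used, unused


# Hypothetical instances: temperature in degrees Celsius and pressure in hPa.
instances = [
    [21.5, 1013.0],
    [999.0, 1010.0],  # implausible temperature, likely a sensor error
    [18.2, 1008.5],
]
ranges = {0: (-50.0, 60.0), 1: (900.0, 1100.0)}

used, unused = filter_instances(instances, ranges)
print(len(used), len(unused))  # 2 1
```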
Unusing uncorrelated variables is a statistical technique that allows us to identify underlying patterns in a data set so that it can be expressed in terms of another data set of lower dimension without much loss of information.
The resulting data set, which has fewer variables, should be able to explain most of the variance of the original data set.