The data set contains the information for creating our model. It is a collection of data structured as a table, in rows and columns.
The most popular data set in the machine learning field is the Iris flower data set, which was introduced by the British statistician and biologist Ronald Fisher in 1936.
We can identify the following concepts in a data set:
The data is usually stored in a data file or a database. Some common data sources are the following:
The variables are the columns in the data table. Variables might represent physical measurements (temperature, velocity...), personal characteristics (gender, age...), marketing dimensions (recency, frequency, monetary...), etc.
Regarding their use, we can talk about input, target or unused variables.
Input variables are the independent variables in the model. They are also called features or attributes.
Target variables are the dependent variables in the model.
In regression problems, targets are continuous variables (power consumption, product quality...).
In classification problems, targets are binary (fault, churn...) or categorical (type of object, activity...). In this type of application, targets are also called categories or labels.
Unused variables are neither inputs nor targets. We can set a variable to Unused when it does not provide any information to the model (id number, address...).
Constant variables are those columns in the data matrix that always have the same value. They should be set as Unused, since they do not provide any information to the model but increase its complexity.
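As an illustration, constant columns can be found by counting the distinct values in each column. The following is a minimal sketch; the function name constant_columns and the sample data are hypothetical and only serve the example.

```python
def constant_columns(rows, names):
    """Return the names of the columns that hold a single distinct value."""
    constants = []
    for index, name in enumerate(names):
        distinct = {row[index] for row in rows}
        if len(distinct) <= 1:
            constants.append(name)
    return constants


# Hypothetical data: "plant_id" has the same value in every instance.
names = ["plant_id", "temperature", "power"]
rows = [
    (7, 14.9, 463.2),
    (7, 25.1, 444.3),
    (7, 5.1, 488.5),
]

constant = constant_columns(rows, names)
print(constant)  # candidates to be set as Unused
```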
It is not useful to design a neural network to simply memorize a set of data. Instead, we want the neural network to perform accurately on new data, that is, to be able to generalize.
To achieve that, we divide the data set into three different subsets: training, selection and testing.
During the design of a model, we usually need to try different settings. For instance, we can build several models with different architectures and compare their performance. Training instances are used to construct all those models.
Selection instances are used for choosing the neural network with the best generalization properties. In this way, we construct different models with the training subset, and select the one that works best on the selection subset.
Testing instances are used to validate the functioning of the model. That is, we train different models with the training instances, select the one that performs best on the selection instances and test its capabilities with testing instances.
Instead of providing useful information to the model, some instances might distort it. For example, outliers in the data can make the neural network perform poorly. To fix these problems, those instances are set to Unused.
Usually, 60% of the instances are used for training, 20% for selection and 20% for testing. Splitting of the instances might be performed in a sequential order or randomly.
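The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name split_instances, its parameters and the fixed random seed are hypothetical choices, and instances are identified by their row index.

```python
import random


def split_instances(n_instances, ratios=(0.6, 0.2, 0.2), shuffle=True, seed=0):
    """Split instance indices into training, selection and testing subsets."""
    indices = list(range(n_instances))
    if shuffle:
        # Random split; with shuffle=False the split is sequential.
        random.Random(seed).shuffle(indices)

    n_training = int(round(ratios[0] * n_instances))
    n_selection = int(round(ratios[1] * n_instances))

    training = indices[:n_training]
    selection = indices[n_training:n_training + n_selection]
    testing = indices[n_training + n_selection:]  # whatever remains
    return training, selection, testing


training, selection, testing = split_instances(100)
print(len(training), len(selection), len(testing))
```

With shuffle=False the first 60% of the rows go to training, the next 20% to selection and the rest to testing, which only makes sense if the rows are not sorted in a meaningful way.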
Repeated instances can also be set to Unused, since they provide redundant information to the model.
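A simple way to locate repeated instances is to remember the rows already seen. The sketch below assumes that two instances are repeated only when all their values are identical; the function name repeated_instances and the data are hypothetical.

```python
def repeated_instances(rows):
    """Return the indices of the rows that repeat an earlier row."""
    seen = set()
    repeated = []
    for index, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            repeated.append(index)  # candidate to be set to Unused
        else:
            seen.add(key)
    return repeated


# Hypothetical data: rows 2 and 3 repeat row 0.
rows = [(1, 2), (3, 4), (1, 2), (1, 2)]
repeated = repeated_instances(rows)
print(repeated)
```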
A data set can also contain missing values, which are those elements that are not present. Usually, missing values are denoted by a label in the data set. Some common labels used for representing missing values are NA (not available), NaN (not a number), Unknown or ?. Do not use numeric values here, such as -999, since they might be confused with actual values.
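The following sketch shows how such labels might be recognized when reading a column of text values. The set MISSING_LABELS and the function parse_value are assumptions made for this example; note how -999 is read as an ordinary number, which is exactly why it makes a poor missing-value marker.

```python
# Labels treated as missing values (taken from the list above).
MISSING_LABELS = {"NA", "NaN", "Unknown", "?"}


def parse_value(text):
    """Convert a text cell to a float, or to None if it is a missing label."""
    text = text.strip()
    if text in MISSING_LABELS:
        return None
    return float(text)


# Hypothetical raw column read from a data file.
raw = ["21.5", "NA", "-999", "?", "18.0"]
values = [parse_value(item) for item in raw]
print(values)
```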
There are two ways to deal with missing values: removing them from the data or assigning them a value.
If the number of instances in the data set is big and the number of missing values is small, the instances with missing values can simply be excluded from the analysis. In this way, the unusing method sets those instances with missing values to Unused.
If the data set is small or the number of missing values is big, you probably cannot afford to unuse the instances with missing values. In these cases it is advisable to assign probable values to the missing data. The most common imputation method is to substitute the missing values with the mean value of the corresponding variable.
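Both approaches can be sketched in a few lines. This is a minimal illustration that assumes missing values are stored as None; the function names unuse_missing and impute_mean are hypothetical.

```python
from statistics import mean


def unuse_missing(rows):
    """Keep only the instances without missing values."""
    return [row for row in rows if None not in row]


def impute_mean(rows):
    """Replace every missing value with the mean of its variable."""
    n_columns = len(rows[0])
    means = []
    for j in range(n_columns):
        present = [row[j] for row in rows if row[j] is not None]
        # Raises StatisticsError if a variable has no values at all.
        means.append(mean(present))
    return [
        tuple(means[j] if row[j] is None else row[j] for j in range(n_columns))
        for row in rows
    ]


# Hypothetical data with two missing values.
rows = [(1.0, 10.0), (None, 20.0), (3.0, None)]
print(unuse_missing(rows))
print(impute_mean(rows))
```

In this small example, unusing leaves a single instance, while mean imputation keeps all three.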
Basic statistics are very valuable information when designing a model, since they might alert us to the presence of spurious data. It is essential to check the correctness of the most important statistical measures of every single variable.
As an example, the following table depicts the minimum, maximum, mean and standard deviation of the variables used to improve the performance of a combined cycle power plant.
In classification applications, it is also interesting to compare the statistics for each category. For instance, we can compare the mean age of customers that buy a certain product with the mean age of customers that do not buy the product.
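As a sketch of these checks, the code below computes the basic statistics of one variable and its mean for each category. The function names and the customer data are hypothetical and are not taken from any of the data sets mentioned here.

```python
from statistics import mean, stdev


def basic_statistics(values):
    """Minimum, maximum, mean and standard deviation of a variable."""
    return {
        "minimum": min(values),
        "maximum": max(values),
        "mean": mean(values),
        "deviation": stdev(values),
    }


def mean_by_category(values, categories):
    """Mean of a variable computed separately for each category."""
    groups = {}
    for value, category in zip(values, categories):
        groups.setdefault(category, []).append(value)
    return {category: mean(group) for category, group in groups.items()}


# Hypothetical data: customer ages and whether they buy the product (1) or not (0).
ages = [25, 35, 45, 55, 30, 40]
buys = [0, 1, 1, 1, 0, 0]

stats = basic_statistics(ages)
means = mean_by_category(ages, buys)
print(stats)
print(means)
```

A spurious value, such as an age of 999, would show up immediately in the maximum.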
Distributions show how the data is distributed over its entire range. If the data is very irregularly distributed, the resulting model will probably be of bad quality.
Histograms are used to see how continuous variables are distributed. Continuous variables usually have a normal (or Gaussian) distribution.
For example, the following figure depicts a histogram for the noise generated by different airfoil blades.
As we can see, this variable has a normal distribution. 22% of the airfoil blades in the data set emit a sound around 127 dB.
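The percentages shown in a histogram can be computed as follows. This is a minimal sketch that assumes equal-width bins; the function name histogram_percentages and the sample values are illustrative and do not come from the airfoil data set.

```python
def histogram_percentages(values, n_bins):
    """Percentage of instances falling in each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("a constant variable has no range to divide into bins")
    width = (high - low) / n_bins

    counts = [0] * n_bins
    for value in values:
        # The maximum value is assigned to the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return [100.0 * count / len(values) for count in counts]


# Hypothetical sound levels in dB.
noise = [120.1, 123.4, 125.0, 126.8, 127.2, 127.9, 128.5, 130.3, 133.0, 136.7]
percentages = histogram_percentages(noise, 5)
print(percentages)
```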
Pie charts are used to see the distribution of binary or nominal variables. It is desirable that these types of variables be uniformly distributed.
The following figure shows the pie chart for the customers of a bank that purchase a bank deposit in a marketing campaign, which is a binary variable.
Approximately 85.5% of the customers do not purchase the product and 14.5% do purchase it. Therefore, this variable is not well-balanced, which means that there are many more negatives than positives.
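The balance of a binary or nominal variable can be checked by counting its categories. In the sketch below, the function name category_percentages is hypothetical, and the counts are made up only so that the result matches the percentages quoted above.

```python
from collections import Counter


def category_percentages(labels):
    """Percentage of instances belonging to each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Hypothetical counts: 855 customers do not purchase, 145 do.
labels = ["no"] * 855 + ["yes"] * 145
percentages = category_percentages(labels)
print(percentages)
```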
Box plots also provide information about the shape of the data. They display information about the minimum, maximum, first quartile, second quartile or median and third quartile of every variable. They consist of two parts: a box and two whiskers.
The length of the box represents the interquartile range (IQR), which is the distance between the third quartile and the first quartile. The middle half of the data falls inside the interquartile range. The whisker below the box shows the minimum of the variable, while the whisker above the box shows the maximum of the variable. A line inside the box represents the median of the variable.
The following figure illustrates the box plot for the age of the employees of a company.
As we can see, 50% of the employees are between 35 and 45 years old.
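The quantities drawn in a box plot can be computed with the standard library. This sketch assumes the "inclusive" quartile convention of statistics.quantiles (other conventions give slightly different quartiles); the function name box_plot_summary and the ages are hypothetical.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Minimum, quartiles, maximum and interquartile range of a variable."""
    first, median, third = quantiles(values, n=4, method="inclusive")
    return {
        "minimum": min(values),
        "first_quartile": first,
        "median": median,
        "third_quartile": third,
        "maximum": max(values),
        "iqr": third - first,
    }


# Hypothetical ages, chosen so that the box spans 35 to 45 years.
ages = [25, 30, 35, 38, 40, 42, 45, 50, 60]
summary = box_plot_summary(ages)
print(summary)
```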
It is always very useful to see how the targets depend on the inputs. Scatter charts show points with target values versus input values.
As an example, the following chart shows the compressive strength of concretes as a function of the quantity of cement. As we can see, there is quite a strong correlation between both variables: the greater the quantity of cement, the greater the compressive strength.
Sometimes, data sets contain redundant data that complicate the design of the neural network. To discover redundancies between the input variables we use a correlation matrix.
The correlation is a numerical value between -1 and 1 that expresses the strength of the relationship between two variables. When it is close to 1 it indicates a positive relationship (one variable increases when the other increases); a value close to 0 indicates that there is no relationship; and a value close to -1 indicates a negative relationship (one variable decreases when the other increases).
When both variables are continuous, a linear function is used to calculate the correlation. When one or both variables are binary, a logistic function is used for calculating the correlation.
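For two continuous variables, the linear correlation can be sketched with the usual Pearson coefficient. This is an assumption about the exact formula, since the text only states that a linear function is used; the logistic correlation for binary variables is not covered here. The function name linear_correlation is hypothetical.

```python
from math import sqrt


def linear_correlation(x, y):
    """Pearson correlation coefficient between two continuous variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either variable is constant.
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical data: y grows with x, z shrinks as x grows.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
z = [8, 6, 4, 2]
print(linear_correlation(x, y))  # close to 1
print(linear_correlation(x, z))  # close to -1
```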
Next, we depict the correlations among the features used to target donors in a blood donation campaign. This example uses a recency, frequency, monetary and time (RFMT) marketing model.
In this case, the quantity variable has a perfect correlation with the frequency variable (1.00). That means that we can remove one of those two variables from our model without any loss of information.
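Redundant inputs can be located by scanning every pair of variables for a correlation close to 1 or -1. The sketch below assumes a Pearson correlation and a hypothetical threshold of 0.99; the function names and the data are made up for the example, with quantity defined as an exact multiple of frequency.

```python
from itertools import combinations
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two variables."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance_x = sum((a - mean_x) ** 2 for a in x)
    variance_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(variance_x * variance_y)


def redundant_pairs(columns, threshold=0.99):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        if abs(correlation(a, b)) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical data: quantity is always 250 times the frequency.
columns = {
    "recency": [2, 0, 1, 4, 16],
    "frequency": [50, 13, 16, 20, 24],
    "quantity": [12500, 3250, 4000, 5000, 6000],
    "time": [98, 28, 35, 45, 77],
}
redundant = redundant_pairs(columns)
print(redundant)
```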
It is very useful to know how the target variables depend on the input variables. For this kind of diagnostic analytics we also use correlation coefficients. If the correlation is 0, they are independent of each other, that is, increasing or decreasing one does not imply that the other increases or decreases. On the other hand, if the correlation coefficient is 1 or -1, they are directly or inversely dependent, respectively.
Linear correlations are used when both the input and the target are continuous. Logistic correlations are used when the input, the target, or both are binary.
The following figure shows the correlations between different variables and the churn of customers in a telecommunications company. From the figure, we can see that there is one variable with a high correlation (more than 0.5) and about 10 variables with a small correlation (less than 0.1).
An outlier is an instance that is distant from other instances. Outliers may be due to variability in the measurement or may indicate experimental errors.
If possible, outliers should be excluded from the data set. However, detecting those anomalous instances might be very difficult, and usually requires lots of work.
The first thing we can do is to check for the correctness of the data statistics.
Indeed, spurious minimums and maximums are a clear sign of the presence of outliers.
We can also plot the data histograms and check that there are no isolated bins at the ends.
Box plots are also a good method for detecting the presence of outliers, since they depict groups of data through their quartiles.
In this regard, Tukey's method defines outliers as those values of the data set that fall too far from the central point, the median. The maximum allowed distance to the center of the data is defined by the cleaning parameter. As this parameter grows, the test becomes less sensitive to outliers; if it is too small, many values will be detected as outliers.
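A possible sketch of this test is given below. It assumes the standard formulation of Tukey's fences, in which the allowed range is measured from the quartiles in units of the interquartile range, and a commonly used default of 1.5 for the cleaning parameter. The function name tukey_outliers, the quartile convention and the balances are assumptions made for the example.

```python
from statistics import quantiles


def tukey_outliers(values, cleaning_parameter=1.5):
    """Indices of the values lying outside Tukey's fences."""
    first, _, third = quantiles(values, n=4, method="inclusive")
    iqr = third - first
    lower = first - cleaning_parameter * iqr
    upper = third + cleaning_parameter * iqr
    return [
        index
        for index, value in enumerate(values)
        if value < lower or value > upper
    ]


# Hypothetical balances: the last client has a very high balance.
balances = [120, 250, 300, 310, 450, 500, 620, 700, 15000]
print(tukey_outliers(balances))        # the last instance is flagged
print(tukey_outliers(balances, 50.0))  # a large parameter flags nothing
```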
As an example, the following chart is a box plot of the balance of a bank's customers. As we can see, there are a few clients with very high balance, and they might be treated as outliers.
There are other methods for dealing with outliers, such as the Minkowski error. For more information you can read the article "3 methods to deal with outliers" in our blog.
Data filtering is the task of reducing the content of noise or errors from data.
Here, the instances that do not fall within a specified range are set to Unused.
When filtering data, the minimum and maximum allowed values for all the variables must be set.
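Range filtering can be sketched as follows, assuming that an instance is kept only when every variable lies between its minimum and maximum allowed values. The function name filter_instances, the variables and the limits are hypothetical.

```python
def filter_instances(rows, minimums, maximums):
    """Separate instance indices into used and unused according to the ranges."""
    used = []
    unused = []
    for index, row in enumerate(rows):
        inside = all(
            low <= value <= high
            for value, low, high in zip(row, minimums, maximums)
        )
        if inside:
            used.append(index)
        else:
            unused.append(index)
    return used, unused


# Hypothetical data: (age in years, height in meters).
rows = [(25, 1.70), (130, 1.80), (40, 0.20)]
used, unused = filter_instances(rows, minimums=(0, 0.5), maximums=(120, 2.5))
print(used, unused)
```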