In machine learning, plotting each variable in the data set is a quick visual way to study how its values are distributed.

Sometimes, the variables in the data set are not well balanced. That means we have much more information about some data regions than others.

This situation is frequent in binary variables with many samples of one class and only a few of the other. But it can also occur in continuous variables with many samples with similar values and only a few with other values.

Ideally, the variables in a data set have a uniform or normal distribution. If the distribution of some variables is very irregular, then the predictive model will probably be of poor quality.

Contents

    1. Histograms.
    2. Pie charts.
    3. Median and quartiles.
    4. Box plots.
    5. Conclusions.

1. Histograms

A histogram is a graphical representation of a numerical variable that shows the frequency distribution of the data in the form of vertical bars. Each bar represents a range of values, and its height indicates the number of times the data falls within that range. The horizontal axis shows the values of the variable, and the vertical axis shows the frequency or probability density of the data in each interval. Histograms are useful for visualizing the shape of the data distribution and identifying patterns and trends in the data.

Let $v_{j}$ be a data set variable, with minimum $v_{jmin}$ and maximum $v_{jmax}$. The first step in building a histogram is splitting the data into $k$ disjoint categories called bins. A common choice for the number of bins is $k = 10$. The length of every bin is \begin{eqnarray} l = \frac{v_{jmax}-v_{jmin}}{k}. \label{BinLength} \end{eqnarray}

Therefore, the $k$ disjoint categories are \begin{eqnarray}\nonumber b=&& [v_{jmin}, v_{jmin}+l),\\\nonumber &&\ldots, \\\nonumber && [v_{jmin}+(k-2)l, v_{jmin}+(k-1)l),\\ && [v_{jmin}+(k-1)l, v_{jmax}]. \end{eqnarray}

Every bin is closed on the left and open on the right, except the last one, which also includes the maximum. Then, the number of observations in each bin, called its frequency, is counted. Lastly, a bar chart represents the frequencies and bins.
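The binning steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `histogram_counts` and the sample values are hypothetical and do not come from the data set discussed in this post.

```python
def histogram_counts(values, k=10):
    """Count how many values fall into each of k equal-length bins."""
    v_min, v_max = min(values), max(values)
    length = (v_max - v_min) / k  # bin length l
    counts = [0] * k
    for v in values:
        if length == 0:
            index = 0  # all values are identical: use a single bin
        else:
            # The last bin is closed on the right, so v_max is
            # assigned to bin k-1 instead of a non-existent bin k.
            index = min(int((v - v_min) / length), k - 1)
        counts[index] += 1
    edges = [v_min + i * length for i in range(k + 1)]
    return edges, counts


# Hypothetical sample: eight values split into k = 3 bins.
edges, counts = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10], k=3)
print(edges)   # bin boundaries
print(counts)  # frequency of each bin
```

With this sample the bins are $[1, 4)$, $[4, 7)$ and $[7, 10]$. Plotting libraries usually perform this step internally, so the sketch is mainly useful for checking how the frequencies are obtained.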

Example: Histogram example

A banking company wants to predict which customers will leave the bank and why. With that information, the bank can develop loyalty programs and retention campaigns to keep as many customers as possible. The objective is to model the probability of churn conditional on customer characteristics. The dataset we use to create the model contains 12 characteristics of over 10000 bank customers.

We can calculate the histogram of each numerical variable in the data set to visualize the data distribution. The first numeric variable is $credit\_score$. To create the histogram, we first choose the number of bins. In this case, we take $k=10$. Second, we identify the minimum and maximum of the variable. \begin{equation} credit\_score_{min}=350 \qquad credit\_score_{max}=850 \end{equation}

The length of every bin is \begin{eqnarray} l = \frac{credit\_score_{max} - credit\_score_{min}}{k} = \frac{850 - 350}{10} = 50. \end{eqnarray}

Therefore, the $k=10$ disjoint categories are \begin{eqnarray}\nonumber b=&& [350, 400), [400, 450), [450, 500), [500, 550), [550, 600), \\\nonumber && [600, 650), [650, 700), [700, 750), [750, 800), [800, 850]. \end{eqnarray}

The following figure depicts the $credit\_score$ histogram. As we can see, histograms help visualize the shape of the data distribution and identify patterns and trends in the data.

2. Pie charts

A pie chart shows the relative proportions of the different values or categories within a single variable. Pie charts (or circle charts) are better suited than histograms to representing the distribution of binary and categorical variables. To create a pie chart of a variable, you first determine the categories or values the variable can take. Then, you count the number of observations in each category and calculate each category’s share of the total as a percentage or proportion. Finally, you draw a pie chart with one slice per category. The length of each slice’s arc (and its central angle and area) is proportional to the quantity it represents. Pie charts can be a helpful way to quickly and easily visualize the distribution of a single variable.
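As a rough sketch of the counting step, the following Python snippet computes the share and central angle of each slice. The function name `pie_slices` and the category labels `"A"`, `"B"` and `"C"` are placeholders chosen for illustration, not values from the bank data set.

```python
from collections import Counter


def pie_slices(labels):
    """Return count, share and central angle (degrees) per category."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        category: {
            "count": n,
            "share": n / total,            # proportion of the total
            "angle_deg": 360 * n / total,  # central angle of the slice
        }
        for category, n in counts.items()
    }


# Hypothetical categorical variable with three categories.
slices = pie_slices(["A", "A", "B", "C"])
for category, info in slices.items():
    print(category, info)
```

The central angles always add up to 360 degrees, which is a simple sanity check before drawing the chart.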

Example: Pie chart example

Using the dataset of the previous example, the first categorical variable is $country\_distribution$. The following figure shows the pie chart of this variable, which has three categories. The pie chart is a helpful way to visualize its distribution quickly and easily.

3. Median and quartiles

The median is the value in the middle of a distribution: half of the values lie above it and half below it. It measures the central tendency of a distribution and is less sensitive to extreme values than the mean. In machine learning, the median is frequently used to summarize the typical value of a feature or target variable.

To calculate the median, the data are first ordered from smallest to largest, $x_1 \leq x_2 \leq \ldots \leq x_q$, where $q$ is the number of elements. The calculation then varies slightly depending on whether $q$ is odd or even.

If the data set has an odd number of elements, the median is the value that occupies the exact center position.

If the data set has an even number of elements, the median is the arithmetic mean of the two values that occupy the central positions.

The median is denoted as $Me$, and we can compute it using the following equation, \begin{eqnarray} \boxed{ Me = \left\{ \begin{array}{ll} x_{\frac{q+1}{2}} & \textrm{if $\mathbf{q}$ is odd.}\\ \displaystyle\frac{1}{2} (x_{\frac{q}{2}}+x_{1+\frac{q}{2}}) & \textrm{if $\mathbf{q}$ is even.} \end{array} \right.} \label{MedianFormula} \end{eqnarray}

Quartiles are the three points that divide the ordered data set into four groups of the same size.
  1. The first quartile ($Q_1$) is the median of the lower half of the data. About $25\%$ of the data lies below this value.
  2. The second quartile ($Q_2$) is the median of the data. About $50\%$ of the data lies below this value.
  3. The third quartile ($Q_3$) is the median of the upper half of the data. About $75\%$ of the data lies below this value.
We can calculate all of them using the formula of the median for each data subgroup.
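The median formula and the quartile definitions can be sketched in Python as follows. The function names are hypothetical. The sketch also assumes one particular convention that the text does not specify: when the number of elements is odd, the middle value is excluded from both halves. Statistical libraries may use other conventions and return slightly different quartiles.

```python
def median(sorted_values):
    """Median of an already sorted, non-empty list."""
    q = len(sorted_values)
    mid = q // 2
    if q % 2 == 1:
        # Odd q: the center element, x_{(q+1)/2} in 1-based notation.
        return sorted_values[mid]
    # Even q: mean of the two central elements x_{q/2} and x_{1+q/2}.
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3); needs at least two values."""
    x = sorted(values)
    q = len(x)
    half = q // 2
    lower = x[:half]            # lower half of the data
    upper = x[half + q % 2:]    # upper half (skips the middle if q is odd)
    return median(lower), median(x), median(upper)


# Hypothetical sample with an odd number of elements.
q1, q2, q3 = quartiles([7, 1, 3, 5, 9, 11, 13])
print(q1, q2, q3)
```

For this sample the ordered data are 1, 3, 5, 7, 9, 11, 13, so the lower half is 1, 3, 5 and the upper half is 9, 11, 13.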

4. Box plots

Box plots also provide information about the shape of the data. Box plots (also called box-and-whisker plots) display the data’s minimum, first quartile, second quartile (the median), third quartile, and maximum. A box plot consists of three different parts:
  1. A central rectangle (or box) that goes from the first quartile ($Q_1$) to the third quartile ($Q_3$).
  2. A line drawn within the rectangle that represents the second quartile or median ($Q_2$).
  3. Two whiskers, one on each side of the box. One goes from the third quartile to the maximum and the other from the first quartile to the minimum.
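The five values that a box plot draws can be collected with a short Python sketch. The function name `box_plot_summary` and the sample values are hypothetical. As an assumption, the quartiles are computed as medians of the lower and upper halves, leaving out the middle value when the number of elements is odd, and the whiskers extend to the minimum and maximum as described above.

```python
from statistics import median


def box_plot_summary(values):
    """Five-number summary used to draw a box plot (needs >= 2 values)."""
    x = sorted(values)
    half = len(x) // 2
    return {
        "minimum": x[0],                      # end of the lower whisker
        "q1": median(x[:half]),               # lower edge of the box
        "median": median(x),                  # line inside the box
        "q3": median(x[half + len(x) % 2:]),  # upper edge of the box
        "maximum": x[-1],                     # end of the upper whisker
    }


# Hypothetical sample with eight values.
summary = box_plot_summary([2, 4, 4, 5, 7, 8, 9, 12])
print(summary)
```

Note that many plotting tools limit the whiskers to a multiple of the interquartile range and draw the remaining points separately, so their output can differ from this simple summary.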

Example: Box plot example

Using the dataset from the preceding examples, the first numeric variable is $credit\_score$. The following figure shows the box plot for this variable. The minimum of the variable is $350$, the first quartile is $584$, the second quartile or median is $652$, the third quartile is $718$, and the maximum is $850$.

5. Conclusions

Understanding the distribution of the variables is essential in machine learning, as it provides a framework for modeling uncertainty and capturing the variability inherent in the data. By selecting and using appropriate distributions, we can make accurate predictions, estimate uncertainties, and gain meaningful insights from our machine learning models.
