{"id":3382,"date":"2023-08-31T10:59:22","date_gmt":"2023-08-31T10:59:22","guid":{"rendered":"https:\/\/neuraldesigner.com\/blog\/distributions\/"},"modified":"2025-08-29T15:49:19","modified_gmt":"2025-08-29T13:49:19","slug":"distributions","status":"publish","type":"blog","link":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/","title":{"rendered":"How to check if the data distribution is correct in machine learning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"3382\" class=\"elementor elementor-3382\" data-elementor-post-type=\"blog\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-53fd379c elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"53fd379c\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-3177a18e\" data-id=\"3177a18e\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-3c48eb0c elementor-widget elementor-widget-text-editor\" data-id=\"3c48eb0c\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t\t\t\t\t\t<section><p>In machine learning, plotting the variables in a data set is a visual aid for studying how their values are distributed.<\/p><\/section><section>\n<p>Sometimes, the variables in the data set are not well balanced. That means we have much more information about some regions of the data than about others.<\/p>\n\n<\/section><section><p>This situation is frequent in binary variables, which often have many samples of one class and only a few of the other. 
It can also occur in continuous variables that have many samples with similar values and only a few with other values.<\/p><\/section><section><p>Ideally, the variables in a data set have a uniform or normal distribution. If the distribution of some variables is very irregular, the predictive model will probably be of poor quality. Plotting the distribution of every variable in the data set helps to detect these irregularities.<\/p>\n\n<\/section><section>\n<h2><span style=\"color: inherit; font-family: inherit; font-size: 1.75rem;\">Contents<\/span><\/h2>\n<\/section><section>\n<ol>\n \t<li style=\"list-style-type: none;\">\n<ol>\n \t<li><a href=\"#Histograms\"> Histograms<\/a>.<\/li>\n \t<li><a href=\"#PieCharts\"> Pie charts<\/a>.<\/li>\n \t<li><a href=\"#MedianQuartiles\"> Median and quartiles<\/a>.<\/li>\n \t<li><a href=\"#BoxPlots\"> Box plots<\/a>.<\/li>\n \t<li><a href=\"#Conclusions\">Conclusions<\/a>.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<!--\n \t<li><a href=\"#TutorialVideo\">Tutorial video<\/a>.<\/li>\n-->\n<!--\n \t<li><a href=\"#References\">References<\/a>.<\/li>\n-->\n\n<\/section><section><\/section><section>\n<h2>1. Histograms<\/h2>\nA histogram is a graphical representation of a numerical variable that shows the frequency distribution of the data as vertical bars. Each bar represents a range of values, and its height indicates the number of samples that fall within that range. The horizontal axis shows the values of the variable, and the vertical axis shows the frequency or probability density of the data in each interval. Histograms are useful for visualizing the shape of the data distribution and identifying patterns and trends in the data.\n\nLet $v_{j}$ be a data set variable, with minimum $v_{jmin}$ and maximum $v_{jmax}$.\n\nThe first step in building a histogram is to split the range of the variable into $k$ disjoint intervals called bins. 
A typical choice for the number of bins is $k = 10$.\n\nThe length of every bin is\n\n\\begin{eqnarray}\nl = \\frac{v_{jmax}-v_{jmin}}{k}.\n\\label{BinLength}\n\\end{eqnarray}\n\nTherefore, the $k$ disjoint bins are\n\n\\begin{eqnarray}\\nonumber\nb=&amp;&amp; [v_{jmin}, v_{jmin}+l),\\\\\\nonumber\n&amp;&amp;\\ldots, \\\\\\nonumber\n&amp;&amp; [v_{jmin}+(k-2)l, v_{jmin}+(k-1)l),\\\\\n&amp;&amp; [v_{jmin}+(k-1)l, v_{jmax}].\n\\end{eqnarray}\n\nThen, the number of observations that fall in each bin, called its frequency, is counted.\n\nLastly, a bar chart plots the frequency of each bin.\n<h3>Example: Histogram<\/h3>\nA <a href=\"https:\/\/www.neuraldesigner.com\/learning\/examples\/bank-churn\/\">banking company<\/a> wants to predict which customers will leave the bank and identify the possible causes. With this information, the bank can develop loyalty programs and retention campaigns to keep as many customers as possible.\n\nThe objective is to model the probability of churn conditional on customer characteristics. The dataset we use to create the model contains 12 characteristics of over 10,000 bank customers.\n\nWe can calculate the histogram of each numerical variable in the data set to visualize the data distribution. The first numeric variable is $credit\\_score$.\n\nTo create the histogram, we first choose the number of bins. In this case, we take $k=10$. 
Next, we identify the minimum and maximum of the variable.\n\n\\begin{equation}\ncredit\\_score_{min}=350\n\\qquad\ncredit\\_score_{max}=850\n\\end{equation}\n\nThe length of every bin is\n\n\\begin{eqnarray}\nl = \\frac{credit\\_score_{max} - credit\\_score_{min}}{k} = \\frac{850 - 350}{10} = 50.\n\\end{eqnarray}\n\nTherefore, the $k=10$ disjoint bins are\n\n\\begin{eqnarray}\\nonumber\nb=&amp;&amp; [350, 400), [400, 450), [450, 500), [500, 550), [550, 600), \\\\\\nonumber\n&amp;&amp; [600, 650), [650, 700), [700, 750), [750, 800), [800, 850].\n\\end{eqnarray}\n\nThe following figure depicts the $credit\\_score$ histogram.\n\n<img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/histogram.webp\" \/>\n\nAs we can see, the histogram helps visualize the shape of the distribution and identify patterns and trends in the data.\n\n<\/section><section>\n<h2>2. Pie charts<\/h2>\nA pie chart shows the relative proportions of the different values or categories within a single variable. Pie charts (or circle charts) are better suited than histograms to binary and categorical variables.\n\nTo create a pie chart of a variable, first determine the categories or values the variable can take.\nThen, count the number of observations in each category and calculate each category&#8217;s percentage or proportion of the total. Finally, draw a pie chart with one slice for each category. The length of each slice&#8217;s arc (and its central angle and area) is proportional to the quantity it represents.\n\nPie charts can be a helpful way to quickly and easily visualize the distribution of a single variable.\n<h3>Example: Pie chart<\/h3>\nUsing the dataset of the previous example, the first categorical variable is $country\\_distribution$. 
The following figure shows the pie chart of this variable, which has three categories.\n\n<img decoding=\"async\" src=\"https:\/\/www.neuraldesigner.com\/images\/pie-chart.webp\" \/>\n\nThe pie chart shows at a glance how the samples are split among the three categories.\n\n<\/section><section>\n<h2>3. Median and quartiles<\/h2>\nThe median is the value in the middle of a distribution. It separates the distribution into two equal parts, with half of the values above it and half below it.\n\nIt measures the central tendency of a distribution and is less sensitive to extreme values than the mean.\n\nIn machine learning, the median is often used to summarize the typical value of a feature or target variable.\n\nThe calculation of the median varies slightly depending on whether the data set has an odd or even number of elements:\n\nIf the data set has an odd number of elements, the median is the value that occupies the exact center position.\n\nTo calculate it, the data are ordered from smallest to largest, and the value in the center position is selected.\n\nIf the data set has an even number of elements, the median is the arithmetic mean of the two values that occupy the central positions.\n\nTo calculate it, the data are ordered from smallest to largest, the two values occupying the central positions are selected, and their arithmetic mean is calculated.\n\nThe median is denoted as $Me$. For $q$ values sorted in ascending order, $x_1 \\leq \\ldots \\leq x_q$, we can compute it using the following equation,\n\n\\begin{eqnarray}\n\\boxed{\nMe = \\left\\{ \\begin{array}{ll}\nx_{\\frac{q+1}{2}} &amp; \\textrm{if $\\mathbf{q}$ is odd.}\\\\\n\\displaystyle\\frac{1}{2} (x_{\\frac{q}{2}}+x_{1+\\frac{q}{2}}) &amp; \\textrm{if $\\mathbf{q}$ is even.}\n\\end{array} \\right.}\n\\label{MedianFormula}\n\\end{eqnarray}\n\nQuartiles are the three values that divide the sorted data set into four groups of the same size.\n<ol>\n \t<li>The first quartile ($Q_1$) is the median of the lower half of the data.\nAbout $25\\%$ of the data lies below 
this value.<\/li>\n \t<li>The second quartile ($Q_2$) is the median of the data.<\/li>\n \t<li>The third quartile ($Q_3$) is the median of the upper half of the data set.\nAbout $75\\%$ of the data lies below this value.<\/li>\n<\/ol>\nWe can calculate all of them by applying the median formula to the corresponding subset of the data.\n\n<\/section><section>\n<h2>4. Box plots<\/h2>\nBox plots also provide information about the shape of the data.\n\nBox plots (also called box-and-whisker plots) display the data&#8217;s minimum, first quartile, second quartile (median), third quartile, and maximum.\n\nA box plot consists of three different parts:\n<ol>\n \t<li>A central rectangle (or box) that goes from the first quartile ($Q_1$) to the third quartile ($Q_3$).<\/li>\n \t<li>A line drawn within the rectangle that represents the second quartile or median ($Q_2$).<\/li>\n \t<li>Two whiskers, one on each side of the box. One goes from the third quartile to the maximum and the other from the first quartile to the minimum.<\/li>\n<\/ol>\n<h3>Example: Box plot<\/h3>\nUsing the dataset from the previous examples, the first numeric variable is $credit\\_score$.\n\nThe following figure shows the box plot for the variable $credit\\_score$.\n\nThe minimum of the variable is $350$, the first quartile is $584$, the second quartile or median is $652$, the third quartile is $718$, and the maximum is $850$.\n\n<\/section><section>\n<h2>Conclusions<\/h2>\nUnderstanding the distributions of the variables is essential in machine learning, as they provide a framework for modeling uncertainty and capturing the variability inherent in the data.\n\nBy selecting and using appropriate distributions, we can make accurate predictions, estimate uncertainties, and gain meaningful insights from our machine learning models.\n<h2>Related 
posts<\/h2>\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"author":122,"featured_media":1594,"template":"","categories":[],"tags":[36],"class_list":["post-3382","blog","type-blog","status-publish","has-post-thumbnail","hentry","tag-tutorials"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to check if the data distribution is correct in machine learning<\/title>\n<meta name=\"description\" content=\"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their distribution.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to check if the data distribution is correct in machine learning\" \/>\n<meta property=\"og:description\" content=\"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their distribution.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/\" \/>\n<meta property=\"og:site_name\" content=\"Neural Designer\" \/>\n<meta property=\"article:modified_time\" content=\"2025-08-29T13:49:19+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1653\" \/>\n\t<meta property=\"og:image:height\" content=\"1165\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" 
content=\"@NeuralDesigner\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/\",\"name\":\"How to check if the data distribution is correct in machine learning\",\"isPartOf\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg\",\"datePublished\":\"2023-08-31T10:59:22+00:00\",\"dateModified\":\"2025-08-29T13:49:19+00:00\",\"description\":\"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their 
distribution.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg\",\"width\":1653,\"height\":1165},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.neuraldesigner.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Blog\",\"item\":\"https:\/\/www.neuraldesigner.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"How to check if the data distribution is correct in machine learning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"name\":\"Neural Designer\",\"description\":\"Explanable AI Platform\",\"publisher\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.neuraldesigner.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\",\"name\":\"Neural 
Designer\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"width\":1024,\"height\":223,\"caption\":\"Neural Designer\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/NeuralDesigner\",\"https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How to check if the data distribution is correct in machine learning","description":"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their distribution.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/","og_locale":"en_US","og_type":"article","og_title":"How to check if the data distribution is correct in machine learning","og_description":"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their distribution.","og_url":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/","og_site_name":"Neural Designer","article_modified_time":"2025-08-29T13:49:19+00:00","og_image":[{"width":1653,"height":1165,"url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_site":"@NeuralDesigner","twitter_misc":{"Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/","url":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/","name":"How to check if the data distribution is correct in machine learning","isPartOf":{"@id":"https:\/\/www.neuraldesigner.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage"},"thumbnailUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg","datePublished":"2023-08-31T10:59:22+00:00","dateModified":"2025-08-29T13:49:19+00:00","description":"In machine learning, plotting the distribution of the variables in the data set is a visual aid for studying their distribution.","breadcrumb":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.neuraldesigner.com\/blog\/distributions\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#primaryimage","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/sample.jpg","width":1653,"height":1165},{"@type":"BreadcrumbList","@id":"https:\/\/www.neuraldesigner.com\/blog\/distributions\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.neuraldesigner.com\/"},{"@type":"ListItem","position":2,"name":"Blog","item":"https:\/\/www.neuraldesigner.com\/blog\/"},{"@type":"ListItem","position":3,"name":"How to check if the data distribution is correct in machine learning"}]},{"@type":"WebSite","@id":"https:\/\/www.neuraldesigner.com\/#website","url":"https:\/\/www.neuraldesigner.com\/","name":"Neural 
Designer","description":"Explanable AI Platform","publisher":{"@id":"https:\/\/www.neuraldesigner.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.neuraldesigner.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.neuraldesigner.com\/#organization","name":"Neural Designer","url":"https:\/\/www.neuraldesigner.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","width":1024,"height":223,"caption":"Neural Designer"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/NeuralDesigner","https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/"]}]}},"_links":{"self":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3382","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/users\/122"}],"version-history":[{"count":0,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3382\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media\/1594"}],"wp:attachment":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media?parent=3382"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/categories?post=3382"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp
\/v2\/tags?post=3382"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}