{"id":3419,"date":"2023-08-31T10:59:21","date_gmt":"2023-08-31T10:59:21","guid":{"rendered":"https:\/\/neuraldesigner.com\/blog\/statistics\/"},"modified":"2025-09-12T17:11:48","modified_gmt":"2025-09-12T15:11:48","slug":"statistics","status":"publish","type":"blog","link":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/","title":{"rendered":"Useful statistics of a dataset in machine learning"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"3419\" class=\"elementor elementor-3419\" data-elementor-post-type=\"blog\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-25a502c0 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"25a502c0\" data-element_type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-7cb37c49\" data-id=\"7cb37c49\" data-element_type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1fb8990a elementor-widget elementor-widget-text-editor\" data-id=\"1fb8990a\" data-element_type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t\t\t\t\t\t<div id=\"contenido\" class=\"contenido\"><section>When building a machine learning model, it is essential to know the range of every variable, and basic statistics of the data provide that information. 
Indeed, they put the data set in context.<\/section><section><\/section><section id=\"Introduction\">The most important statistical parameters are the minimum, the maximum, the mean, and the standard deviation.\u00a0<span style=\"color: var( --e-global-color-text ); font-family: var( --e-global-typography-text-font-family ), Sans-serif; font-size: var( --e-global-typography-text-font-size ); font-weight: var( --e-global-typography-text-font-weight );\">We must always perform a simple statistical analysis to check data consistency.<\/span><\/section><section id=\"Introduction\"><\/section><section id=\"Introduction\">To do so, we calculate statistics for each variable in the data set. Recall that the data matrix $d$ comprises the variables<p>\u00a0<\/p><p>\\begin{eqnarray}v_{j} := column_{j}(d), \\quad j=1,\\ldots,q\\quad\\end{eqnarray}<\/p><p>and the samples<\/p><p>\\begin{eqnarray}u_{i} := row_{i}(d), \\quad i=1,\\ldots,p\\quad\\end{eqnarray}<\/p><p>(the columns and rows of the data set, respectively).<br \/>The values that a variable takes for each sample in the data set are<\/p><p>\\begin{eqnarray}\\quad v_j=(d_{1j}, \\ldots, d_{pj}), \\quad j=1,\\ldots,q.\\end{eqnarray}<\/p><h3>Contents<\/h3><div id=\"contenido\" class=\"contenido\"><section><ol><li style=\"list-style-type: none;\"><ol><li><a href=\"#MinimumAndMaximum\">Minimum and maximum<\/a><\/li><li><a href=\"#MeanAndStandardDeviation\">Mean and standard deviation<\/a><\/li><li><a href=\"#Conclusions\">Conclusions<\/a><\/li><\/ol><\/li><\/ol><\/section><\/div><\/section><section id=\"MinimumAndMaximum\"><h2>2. 
Minimum and maximum<\/h2><p>The minimum of a variable is the smallest value of that variable in the data set.<\/p><p>The minimum of the variable $v_{j}$ is denoted by $v_{jmin}$ and is defined as follows,<\/p><p>\\begin{eqnarray}<br \/>\\boxed{<br \/>v_{jmin} = \\min_{i = 1, \\ldots, p} d_{ij}.<br \/>}<br \/>\\end{eqnarray}<\/p><p>The maximum of a variable is the largest value of that variable in the data set.<\/p><p>Similarly, the maximum of the variable $v_{j}$ is denoted by $v_{jmax}$ and is defined as follows,<\/p><p>\\begin{eqnarray}<br \/>\\boxed{<br \/>v_{jmax} = \\max_{i=1, \\ldots, p} d_{ij}.<br \/>}<br \/>\\end{eqnarray}<\/p><\/section><section id=\"MeanAndStandardDeviation\"><h2>3. Mean and standard deviation<\/h2><p>The mean of a variable is the average value of that variable in the data set. The mean is denoted by $v_{jmean}$ and is defined as<\/p><p>\\begin{eqnarray}<br \/>\\boxed{v_{jmean} = \\frac{1}{p}\\sum_{i=1}^{p} d_{ij},}<br \/>\\end{eqnarray}<\/p><p>where $d_{ij}$ is the element of the data matrix for sample $i$ and variable $j$, and $p$ is the number of samples.<\/p><p>The standard deviation measures how dispersed the data are about the mean. The standard deviation of a variable is denoted by $v_{jstd}$ and is defined as<\/p><p>\\begin{eqnarray}<br \/>\\boxed{<br \/>v_{jstd} = \\sqrt{\\frac{1}{p}\\sum_{i=1}^{p}\\left(d_{ij}-v_{jmean}\\right)^2},<br \/>}<br \/>\\end{eqnarray}<\/p><p>where $v_{jmean}$ is the mean of the variable $v_{j}$.<\/p><p>For normally distributed data, the mean and the standard deviation are represented graphically by the normal distribution curve, also called the Gaussian bell.<\/p><p><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone\" style=\"width: 50%; height: auto;\" src=\"https:\/\/www.neuraldesigner.com\/images\/gaussian.png\" alt=\"The normal distribution curve, also called the Gaussian bell, which represents the mean and the standard deviation of a variable graphically.\" width=\"1526\" height=\"1098\" \/><\/p><p>A low standard deviation means that the values tend to be close to the mean.<br \/>Conversely, a high standard deviation means that the values are spread out, far from the mean and from each other.<\/p><h3>Example: Predict the noise generated by airfoil blades<\/h3><p>NASA conducted a study of the noise generated by airfoil blades in order to build a model for reducing it.<\/p><p>The file <a href=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/10\/airfoil_self_noise.zip\">airfoil_self_noise.csv<\/a> contains the data for this example.<br \/>Here the number of variables (columns) is 6, and the number of samples (rows) is 1503.<\/p><p>We can calculate the basic statistics of each variable using the formulas described above.<br \/>The following table displays the minimum, maximum, mean, and standard deviation for every variable in the data set.<\/p><table><tbody><tr><td>Name<\/td><td>Minimum<\/td><td>Maximum<\/td><td>Mean<\/td><td>Standard deviation<\/td><\/tr><tr><td>frequency<\/td><td style=\"text-align: right;\">200<\/td><td style=\"text-align: right;\">20000<\/td><td style=\"text-align: right;\">2890<\/td><td style=\"text-align: right;\">3150<\/td><\/tr><tr><td>angle_of_attack<\/td><td style=\"text-align: right;\">0<\/td><td style=\"text-align: right;\">22.2<\/td><td style=\"text-align: right;\">6.78<\/td><td style=\"text-align: right;\">5.92<\/td><\/tr><tr><td>chord_length<\/td><td style=\"text-align: right;\">0.0254<\/td><td style=\"text-align: right;\">0.305<\/td><td style=\"text-align: right;\">0.137<\/td><td style=\"text-align: right;\">0.0935<\/td><\/tr><tr><td>free_stream_velocity<\/td><td style=\"text-align: right;\">31.7<\/td><td style=\"text-align: right;\">71.3<\/td><td style=\"text-align: right;\">50.9<\/td><td style=\"text-align: right;\">15.6<\/td><\/tr><tr><td>suction_side_displacement_thickness<\/td><td style=\"text-align: right;\">0.000401<\/td><td style=\"text-align: right;\">0.0584<\/td><td 
style=\"text-align: right;\">0.0111<\/td><td style=\"text-align: right;\">0.0132<\/td><\/tr><tr><td>scaled_sound_pressure_level<\/td><td style=\"text-align: right;\">103<\/td><td style=\"text-align: right;\">141<\/td><td style=\"text-align: right;\">125<\/td><td style=\"text-align: right;\">6.9<\/td><\/tr><\/tbody><\/table><p>By performing this simple statistical analysis, we can check the consistency of the data. For instance, the table shows that the variables span very different scales: frequency ranges from 200 to 20000, while suction_side_displacement_thickness never exceeds 0.0584.<\/p><\/section><section id=\"Conclusions\"><h2>4. Conclusions<\/h2><p>Statistics put the data set in context.<\/p><p>It is essential to perform a simple statistical analysis to check the consistency of the data before building the model.<\/p><p>This is done by calculating the most important statistical parameters of each variable: the minimum, the maximum, the mean, and the standard deviation.<\/p><\/section><\/div>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false,"author":10,"featured_media":1488,"template":"","categories":[],"tags":[36],"class_list":["post-3419","blog","type-blog","status-publish","has-post-thumbnail","hentry","tag-tutorials"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Useful statistics of a dataset in machine learning<\/title>\n<meta name=\"description\" content=\"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data set.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Useful statistics 
of a dataset in machine learning\" \/>\n<meta property=\"og:description\" content=\"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data set.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/\" \/>\n<meta property=\"og:site_name\" content=\"Neural Designer\" \/>\n<meta property=\"article:modified_time\" content=\"2025-09-12T15:11:48+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1337\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@NeuralDesigner\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/\",\"name\":\"Useful statistics of a dataset in machine learning\",\"isPartOf\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg\",\"datePublished\":\"2023-08-31T10:59:21+00:00\",\"dateModified\":\"2025-09-12T15:11:48+00:00\",\"description\":\"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data 
set.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg\",\"width\":2560,\"height\":1337},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.neuraldesigner.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Blog\",\"item\":\"https:\/\/www.neuraldesigner.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Useful statistics of a dataset in machine learning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"name\":\"Neural Designer\",\"description\":\"Explanable AI Platform\",\"publisher\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.neuraldesigner.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\",\"name\":\"Neural 
Designer\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"width\":1024,\"height\":223,\"caption\":\"Neural Designer\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/NeuralDesigner\",\"https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Useful statistics of a dataset in machine learning","description":"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data set.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/","og_locale":"en_US","og_type":"article","og_title":"Useful statistics of a dataset in machine learning","og_description":"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data set.","og_url":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/","og_site_name":"Neural Designer","article_modified_time":"2025-09-12T15:11:48+00:00","og_image":[{"width":2560,"height":1337,"url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg","type":"image\/jpeg"}],"twitter_card":"summary_large_image","twitter_site":"@NeuralDesigner","twitter_misc":{"Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/","url":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/","name":"Useful statistics of a dataset in machine learning","isPartOf":{"@id":"https:\/\/www.neuraldesigner.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage"},"thumbnailUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg","datePublished":"2023-08-31T10:59:21+00:00","dateModified":"2025-09-12T15:11:48+00:00","description":"In machine learning, it is important to know the ranges of all variables; data statistics help to contextualize the data set.","breadcrumb":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.neuraldesigner.com\/blog\/statistics\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#primaryimage","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/statistics-scaled.jpg","width":2560,"height":1337},{"@type":"BreadcrumbList","@id":"https:\/\/www.neuraldesigner.com\/blog\/statistics\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.neuraldesigner.com\/"},{"@type":"ListItem","position":2,"name":"Blog","item":"https:\/\/www.neuraldesigner.com\/blog\/"},{"@type":"ListItem","position":3,"name":"Useful statistics of a dataset in machine learning"}]},{"@type":"WebSite","@id":"https:\/\/www.neuraldesigner.com\/#website","url":"https:\/\/www.neuraldesigner.com\/","name":"Neural Designer","description":"Explanable AI 
Platform","publisher":{"@id":"https:\/\/www.neuraldesigner.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.neuraldesigner.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.neuraldesigner.com\/#organization","name":"Neural Designer","url":"https:\/\/www.neuraldesigner.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","width":1024,"height":223,"caption":"Neural Designer"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/NeuralDesigner","https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/"]}]}},"_links":{"self":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3419","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/users\/10"}],"version-history":[{"count":0,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3419\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media\/1488"}],"wp:attachment":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media?parent=3419"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/categories?post=3419"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/tags?post=3419"}],"curies":[{"name
":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}