{"id":3381,"date":"2025-11-20T10:59:22","date_gmt":"2025-11-20T09:59:22","guid":{"rendered":"https:\/\/neuraldesigner.com\/blog\/dataset-datamatrix\/"},"modified":"2025-11-27T14:54:53","modified_gmt":"2025-11-27T13:54:53","slug":"dataset-datamatrix","status":"publish","type":"blog","link":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/","title":{"rendered":"How is a dataset for machine learning?"},"content":{"rendered":"<section>\n<p id=\"HowIsADataSet?\">A machine learning dataset collects data to create and train an approximation, classification, or forecasting model. Central to this process is the data source, the specific origin point of the data being utilized. These data sources come in various formats, such as Excel files, .csv files, databases, image data, and more.<\/p>\n<p>Before building a model, it is necessary to transform the data into numbers; this involves collecting the data in a matrix of real numbers, thereby creating the data matrix. Each column symbolizes a specific variable within a dataset, while each row corresponds to an individual sample within the dataset.<\/p>\n<p>A variable encompasses any measurable or countable characteristic, number, or quantity. It serves as an attribute that describes a person, place, thing, or concept.<\/p>\n<p>The variable\u2019s value can &#8220;vary&#8221; from one entity to another. According to their type, we can consider different types of variables: numeric, ordinal, binary, or categorical variables.<\/p>\n<p>Variables can be used as inputs or targets. Input variables are the independent variables in the model (they are also called features or attributes), and target variables are the dependent variables in the model.<\/p>\n<p>A sample represents an observation encompassing all variables within a dataset. Additionally, these samples serve various purposes, prompting their division into three distinct subsets. 
These are the training set (used to build different candidate models), the selection set (used to select the model that exhibits the best properties), and the test set (used to validate the final model).<\/p>\n<p>Sometimes, the dataset may be incomplete or have missing values. This is one of the main problems when applying neural networks to real-world problems. To solve it, we can either discard the whole sample or impute the missing value.<\/p>\n<h3>Contents<\/h3>\n<ol>\n<li><a href=\"#DataMatrix\">Data matrix<\/a>.<\/li>\n<li><a href=\"#DataAnalysis\">Data analysis<\/a>.<\/li>\n<li><a href=\"#Conclusions\">Conclusions<\/a>.<\/li>\n<\/ol>\n<section id=\"DataMatrix\">\n<h2>1. Data matrix<\/h2>\n<p>Before building a model, we need to collect the data in a matrix of real numbers.<\/p>\n<p>Let $p$ denote the number of rows and $q$ the number of columns.<br \/>\nThe data matrix is then a matrix $d \\in \\mathbb{R}^{p \\times q}$.<\/p>\n<p>As we can see, machine learning models require all data to be real numbers.<\/p>\n<p>The data matrix has the following form,<\/p>\n<p>\\begin{eqnarray}<br \/>\nd = \\left(<br \/>\n\\begin{array}{ccc}<br \/>\nd_{1,1} &amp; \\cdots &amp; d_{1,q}\\\\<br \/>\n\\vdots &amp; \\ddots &amp; \\vdots \\\\<br \/>\nd_{p,1} &amp; \\cdots &amp; d_{p,q}<br \/>\n\\end{array}<br \/>\n\\right).<br \/>\n\\end{eqnarray}<\/p>\n<h3>Samples<\/h3>\n<p>A sample is a vector $u \\in \\mathbb{R}^{q}$, where $q$ is the number of columns in the data matrix.<\/p>\n<p>In this regard, the data matrix contains $p$ samples, one per row,<\/p>\n<p>\\begin{eqnarray}<br \/>\nu_{i}:=row_{i}(d), \\quad i=1,\\ldots,p.<br \/>\n\\end{eqnarray}<\/p>\n<p>The samples will also have different uses. We divide the samples into three different subsets. 
These are the training set (used to build different candidate models), the selection set (used to select the model that exhibits the best properties), and the test set (used to validate the final model).<\/p>\n<h3>Variables<\/h3>\n<p>A variable is a vector $v \\in \\mathbb{R}^{p}$, where $p$ is the number of rows in the data matrix.<br \/>\nIn this regard, the data matrix contains $q$ variables, one per column,<\/p>\n<p>\\begin{eqnarray}<br \/>\nv_{i}:=col_{i}(d), \\quad i=1,\\ldots,q.<br \/>\n\\end{eqnarray}<\/p>\n<p>The variable\u2019s value can &#8220;vary&#8221; from one entity to another. According to their type, we can consider different types of variables: numeric, ordinal, binary, or categorical variables.<\/p>\n<p>Variables can be used as inputs or targets. Input variables are the independent variables in the model (they are also called features or attributes), and target variables are the dependent variables in the model.<\/p>\n<p>Our source of information may not come directly in matrix format.<\/p>\n<p>For example, the information may be distributed across several tables of a database.<\/p>\n<p>The data may also be a set of images used, for example, to diagnose a tumor.<\/p>\n<p>In addition, some data may not be real numbers.<\/p>\n<p>For example, a customer&#8217;s country (Spain, France, etc.) 
is categorical.<\/p>\n<p>This means that an essential part of building machine learning models is the creation of a data matrix with the correct format.<\/p>\n<p><b>The following example from the industry sector shows a data matrix of a real model: Wind turbine data matrix<\/b><\/p>\n<p>A wind turbine manufacturer wants to know the electrical power generated by the device at different wind speeds.<br \/>\nTo do this, they measure different operating scenarios and generate the following data matrix,<\/p>\n<p>\\begin{eqnarray}\\nonumber<br \/>\nd = \\left(<br \/>\n\\begin{array}{cc}<br \/>\n380.048 &amp; 5.311\\\\<br \/>\n453.769 &amp; 5.672\\\\<br \/>\n\\vdots &amp; \\vdots \\\\<br \/>\n2820.466 &amp; 9.973<br \/>\n\\end{array}<br \/>\n\\right).<br \/>\n\\end{eqnarray}<\/p>\n<p>The number of columns in the data matrix is $q=2$, which makes this a simple data matrix.<br \/>\nEach column corresponds to a variable.<br \/>\nIn this case, the first column is the power generated by the turbine (in kilowatts), and the second column is the corresponding wind speed (in meters per second).<br \/>\nThe wind speed is the input, and the power is the target.<\/p>\n<p>The number of rows in the data matrix is $p=48007$.<br \/>\nEach row corresponds to a sample.<br \/>\nEach sample contains values of the two variables.<\/p>\n<\/section>\n<section id=\"DataAnalysis\">\n<h2>2. 
Data analysis<\/h2>\n<p>Before building a model, we need to analyze the data statistically to understand what it represents.<\/p>\n<h3>Statistics<\/h3>\n<p>The most basic analysis involves examining the statistics for each variable. The most important statistical parameters are the minimum, maximum, mean, and standard deviation.<\/p>\n<h3>Distributions<\/h3>\n<p>Another descriptive analysis is that of the distribution of each variable.<\/p>\n<p>Models tend to be of higher quality when the variables have a roughly uniform or normal distribution, so we should check the distribution of every variable in the dataset.<\/p>\n<p>We can study distributions using histograms, pie charts, and box plots, together with summary values such as medians and quartiles.<\/p>\n<h3>Correlations<\/h3>\n<p>We can also discover dependencies between the variables of the dataset from the correlations.<\/p>\n<p>A correlation is a numerical value between -1 and 1 that expresses the strength and direction of the relationship between two variables.<\/p>\n<p>When the correlation between two variables approaches 1, they are positively related. If the correlation approaches -1, the variables are negatively related. A correlation near 0 suggests no discernible relationship between the variables under study.<\/p>\n<h3>Outliers<\/h3>\n<p>We can also analyze the data to detect potential problems. One of the most common problems is outliers.<\/p>\n<p>Outliers are abnormal observations in the data that can disrupt the training process.<\/p>\n<p>To detect these outliers, we can use Tukey\u2019s test, a univariate method, or the Local Outlier Factor, a multivariate method.<\/p>\n<h3>Filtering<\/h3>\n<p>We can also filter the data to create models with subsets of it. 
Filtering is usually temporary; we keep the entire dataset, but only a part of it is used for the calculation.<\/p>\n<p>Filtering requires that you specify a rule or logic to identify the cases you want to include in the analysis.<\/p>\n<h3>Scaling<\/h3>\n<p>In addition, you should always scale the variables before training a neural network.<\/p>\n<p>Scaling puts the data into a suitable range for computation. It is usually done variable by variable, since each variable can have a different range.<\/p>\n<p>Some of the most common methods are minimum\u2013maximum, mean\u2013standard deviation, and logarithmic scaling.<\/p>\n<\/section>\n<section>\n<h3>Batching<\/h3>\n<p>Training algorithms for neural networks do not work with the data matrix directly. Instead, they use data structures called data batches.<\/p>\n<p>A batch of data contains two tensors, one with input data and one with target data. The rank of these tensors depends on the model type.<\/p>\n<\/section>\n<section id=\"Conclusions\">\n<h2>3. Conclusions<\/h2>\n<p>Datasets collect the data needed to create and train a model. 
In general, the data must be transformed to adapt it to machine learning and generate the data matrix.<\/p>\n<p>Subsequently, it is advisable to perform a statistical study of the data to deal with potential problems such as outliers.<\/p>\n<\/section>\n<\/section>\n","protected":false},"author":10,"featured_media":2308,"template":"","categories":[],"tags":[36],"class_list":["post-3381","blog","type-blog","status-publish","has-post-thumbnail","hentry","tag-tutorials"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How is a dataset for machine learning?<\/title>\n<meta name=\"description\" content=\"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How is a dataset for machine learning?\" \/>\n<meta property=\"og:description\" content=\"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/\" \/>\n<meta property=\"og:site_name\" content=\"Neural Designer\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-27T13:54:53+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"2000\" \/>\n\t<meta property=\"og:image:height\" content=\"1166\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta 
name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@NeuralDesigner\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/\",\"name\":\"How is a dataset for machine learning?\",\"isPartOf\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp\",\"datePublished\":\"2025-11-20T09:59:22+00:00\",\"dateModified\":\"2025-11-27T13:54:53+00:00\",\"description\":\"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting 
model.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp\",\"width\":2000,\"height\":1166},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.neuraldesigner.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Blog\",\"item\":\"https:\/\/www.neuraldesigner.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"How is a dataset for machine learning?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#website\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"name\":\"Neural Designer\",\"description\":\"Explanable AI Platform\",\"publisher\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.neuraldesigner.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#organization\",\"name\":\"Neural 
Designer\",\"url\":\"https:\/\/www.neuraldesigner.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"contentUrl\":\"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png\",\"width\":1024,\"height\":223,\"caption\":\"Neural Designer\"},\"image\":{\"@id\":\"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/NeuralDesigner\",\"https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"How is a dataset for machine learning?","description":"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/","og_locale":"en_US","og_type":"article","og_title":"How is a dataset for machine learning?","og_description":"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.","og_url":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/","og_site_name":"Neural Designer","article_modified_time":"2025-11-27T13:54:53+00:00","og_image":[{"width":2000,"height":1166,"url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp","type":"image\/webp"}],"twitter_card":"summary_large_image","twitter_site":"@NeuralDesigner","twitter_misc":{"Est. 
reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/","url":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/","name":"How is a dataset for machine learning?","isPartOf":{"@id":"https:\/\/www.neuraldesigner.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage"},"thumbnailUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp","datePublished":"2025-11-20T09:59:22+00:00","dateModified":"2025-11-27T13:54:53+00:00","description":"A machine learning dataset collects data needed to create and train an approximation, classification, or forecasting model.","breadcrumb":{"@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#primaryimage","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/06\/datascience.webp","width":2000,"height":1166},{"@type":"BreadcrumbList","@id":"https:\/\/www.neuraldesigner.com\/blog\/dataset-datamatrix\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.neuraldesigner.com\/"},{"@type":"ListItem","position":2,"name":"Blog","item":"https:\/\/www.neuraldesigner.com\/blog\/"},{"@type":"ListItem","position":3,"name":"How is a dataset for machine learning?"}]},{"@type":"WebSite","@id":"https:\/\/www.neuraldesigner.com\/#website","url":"https:\/\/www.neuraldesigner.com\/","name":"Neural 
Designer","description":"Explanable AI Platform","publisher":{"@id":"https:\/\/www.neuraldesigner.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.neuraldesigner.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.neuraldesigner.com\/#organization","name":"Neural Designer","url":"https:\/\/www.neuraldesigner.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/","url":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","contentUrl":"https:\/\/www.neuraldesigner.com\/wp-content\/uploads\/2023\/05\/logo-neural-1.png","width":1024,"height":223,"caption":"Neural Designer"},"image":{"@id":"https:\/\/www.neuraldesigner.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/NeuralDesigner","https:\/\/es.linkedin.com\/showcase\/neuraldesigner\/"]}]}},"_links":{"self":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3381","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/users\/10"}],"version-history":[{"count":1,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3381\/revisions"}],"predecessor-version":[{"id":21397,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/blog\/3381\/revisions\/21397"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media\/2308"}],"wp:attachment":[{"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/media?parent=3381"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/
wp\/v2\/categories?post=3381"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.neuraldesigner.com\/api\/wp\/v2\/tags?post=3381"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}