There are numerous repositories with a large number of datasets for machine learning. Two of the most important are the UCI Machine Learning Repository and Kaggle.

However, using those datasets for performance benchmarking can be difficult. They lack the consistency needed to measure key performance indicators such as data capacity, training speed, model accuracy and inference speed.

This post introduces a family of datasets, known as the Rosenbrock Dataset Suite. The objective is to facilitate benchmarking of machine learning platforms.

The data capacity of a machine learning platform can be defined as the biggest dataset that it is capable of processing. That is, the tool should be able to perform all the essential tasks with that dataset.

Data capacity can be measured as the number of samples that a machine learning platform can process, for a given number of variables.
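As a rough illustration of how dataset size scales with these two numbers, the sketch below estimates the memory footprint of a dataset. It assumes dense storage with 4-byte floats and a single output column; the function name `dataset_bytes` is hypothetical, and a real platform will add its own overhead.

```python
def dataset_bytes(samples_number, inputs_number, bytes_per_value=4):
    """Estimate the dense in-memory size of a dataset with one output column."""
    columns_number = inputs_number + 1  # input variables plus one target
    return samples_number * columns_number * bytes_per_value


# Example: one million samples with 10 inputs, stored as 4-byte floats
size = dataset_bytes(1_000_000, 10)
print(f"{size / 1e6:.0f} MB")  # 44 MB
```

Doubling either the number of samples or the number of columns doubles the estimate, which is why capacity is reported as a number of samples for a given number of variables.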

The most important drawback of datasets from public repositories is that they usually have a fixed number of variables and samples. This makes it difficult to test how a machine learning platform behaves with different dataset sizes.

Training speed is defined as the number of samples per second that a machine learning platform processes during training.

The training speed depends very much on the dataset size. For instance, CPUs might provide faster training than GPUs for small datasets, and slower training for big datasets.

Therefore, we need to generate datasets with an arbitrary number of variables and samples, to see how these sizes affect training performance.
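A throughput measurement of this kind could be sketched as follows. The helper `samples_per_second` is a hypothetical name, and the ordinary least-squares fit is only a stand-in for a real training run; the point is the pattern of timing the same task at several dataset sizes.

```python
import time

import numpy as np


def samples_per_second(task, samples_number):
    """Run task() once and return the number of samples processed per second."""
    start = time.perf_counter()
    task()
    elapsed = time.perf_counter() - start
    return samples_number / elapsed


# Stand-in "training" task (an assumption): a least-squares fit on random data.
rng = np.random.default_rng(0)

for samples_number in (1_000, 10_000, 100_000):
    x = rng.uniform(-1.0, 1.0, size=(samples_number, 10))
    y = rng.uniform(-1.0, 1.0, size=samples_number)
    speed = samples_per_second(lambda: np.linalg.lstsq(x, y, rcond=None), samples_number)
    print(f"{samples_number:>7} samples: {speed:,.0f} samples/second")
```

A single run is noisy, so in practice it is worth repeating each measurement and reporting an average.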

Precision can be defined as the mean error of a model against a testing dataset.

Most real datasets are noisy. This means that we cannot verify whether a model is able to fit the data fully, because some error always remains.

Therefore, it is desirable to have datasets with which we can potentially build models with zero error.
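The difference between noisy and noise-free data can be illustrated with a small sketch. It assumes that "mean error" is taken as the mean absolute error, and it uses the toy function y = x² as a stand-in for a deterministic data source; both choices are assumptions made for illustration.

```python
import numpy as np


def mean_absolute_error(targets, outputs):
    """Mean absolute difference between targets and model outputs."""
    return float(np.mean(np.abs(targets - outputs)))


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1000)

clean_targets = x**2  # deterministic, noise-free targets
noisy_targets = clean_targets + rng.normal(0.0, 0.1, size=x.size)

model_outputs = x**2  # a model that matches the underlying function exactly

error_clean = mean_absolute_error(clean_targets, model_outputs)
error_noisy = mean_absolute_error(noisy_targets, model_outputs)

print(error_clean)        # 0.0
print(error_noisy > 0.0)  # True
```

Even a model that reproduces the underlying function exactly shows a non-zero error on the noisy targets, so only the noise-free dataset can confirm a full fit.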

The inference speed can be defined as the rate at which a model calculates its outputs as a function of the inputs. It is measured as the number of samples per second.

As before, we need to generate datasets with an arbitrary number of variables and samples, to see how these sizes affect inference performance.

The Rosenbrock function is a non-convex function, introduced by Howard H. Rosenbrock in 1960. It is also known as Rosenbrock's valley or Rosenbrock's banana function.

It is used as a performance test problem for optimization algorithms.

To build a dataset, let n be the number of samples and m the number of input variables. Each input is drawn at random from the interval [-1, +1], and the output of each sample is the Rosenbrock function of its inputs:

$$x_{i,j} = \operatorname{rand}(-1,+1), \quad i = 1,\ldots,m, \quad j = 1,\ldots,n$$

$$y_{j} = \sum_{i=1}^{m-1}\left[ 100\left(x_{i+1,j}-x_{i,j}^{2} \right)^{2}+\left(1-x_{i,j} \right)^{2} \right]$$

As the outputs from the Rosenbrock function are real values, this dataset suite is suitable for approximation problems. It cannot be used to test the performance of classification or forecasting applications.
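As a quick check of the formula, consider a single sample with two input variables, both equal to 0.5 (values chosen only for illustration). The sum then has one term:

$$y = 100\left(0.5 - 0.5^{2}\right)^{2} + \left(1 - 0.5\right)^{2} = 100 \cdot 0.0625 + 0.25 = 6.5$$

At the point where every input equals 1, each term vanishes and the output is 0, which is the global minimum of the function.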

The Rosenbrock dataset suite makes it possible to create datasets with any number of variables and samples. Thus, this suite is well suited to data capacity, training speed and inference speed tests.

The Rosenbrock data is extracted from a deterministic function with a quite complex shape. It should be possible to build a machine learning model of that function with any desired degree of accuracy. Therefore, Rosenbrock datasets are ideal for model precision tests.

The following code shows how to generate a Rosenbrock dataset using C++.

    // System includes
    #include <iostream>
    #include <fstream>
    #include <string>
    #include <random>

    using namespace std;

    int main()
    {
        cout << "Rosenbrock Dataset Generator." << endl;

        const int inputs_number = 2;
        const int samples_number = 10000;

        const string filename = "G:/R__" + to_string(samples_number) + "_samples_" + to_string(inputs_number) + "_inputs.csv";

        float inputs[inputs_number];

        default_random_engine generator;
        uniform_real_distribution<float> distribution(-1.0, 1.0);

        ofstream file(filename);

        // Stop early if the output file cannot be created
        if(!file.is_open())
        {
            cerr << "Cannot open file " << filename << endl;
            return 1;
        }

        for(int j = 0; j < samples_number; j++)
        {
            float rosenbrock = 0.0;

            // Draw the inputs of this sample and write them as CSV columns
            for(int i = 0; i < inputs_number; i++)
            {
                inputs[i] = distribution(generator);
                file << inputs[i] << ",";
            }

            // Accumulate the Rosenbrock function over consecutive input pairs
            for(int i = 0; i < inputs_number - 1; i++)
            {
                rosenbrock += (1 - inputs[i])*(1 - inputs[i])
                            + 100*(inputs[i+1] - inputs[i]*inputs[i])*(inputs[i+1] - inputs[i]*inputs[i]);
            }

            // The output is the last column of the row
            file << rosenbrock << endl;
        }

        file.close();

        return 0;
    }

You can also generate a Rosenbrock dataset with the following Python code.

    import numpy as np
    import pandas as pd

    samples_number = 10000
    inputs_number = 2

    # Draw all the inputs uniformly at random from [-1, 1]
    inputs = np.random.uniform(-1.0, 1.0, size=(samples_number, inputs_number))

    rosenbrock = []

    for j in range(samples_number):
        r = 0.0
        # Accumulate the Rosenbrock function over consecutive input pairs
        for i in range(inputs_number - 1):
            r += (1.0 - inputs[j][i])**2 + 100.0*(inputs[j][i+1] - inputs[j][i]**2)**2
        rosenbrock.append(r)

    # Inputs first, output as the last column
    data = pd.concat([pd.DataFrame(inputs), pd.DataFrame(rosenbrock)], axis=1)

    filename = "G:/R_" + str(samples_number) + "_samples_" + str(inputs_number) + "_variables_python.csv"

    data.to_csv(filename, index=False, sep=",")
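For large datasets, the nested Python loops become slow. One possible alternative, sketched here under the assumption that the whole dataset fits in memory, is to compute every output at once with NumPy array slicing. The function name `rosenbrock` and the fixed seed are illustrative choices, not part of the original generator.

```python
import numpy as np


def rosenbrock(inputs):
    """Rosenbrock output for each row of a (samples, inputs) array."""
    x_current = inputs[:, :-1]  # x_i for i = 1 .. m-1
    x_next = inputs[:, 1:]      # x_{i+1} for the same i
    return np.sum(100.0 * (x_next - x_current**2) ** 2 + (1.0 - x_current) ** 2, axis=1)


rng = np.random.default_rng(seed=0)  # fixed seed (an assumption) for reproducible data
inputs = rng.uniform(-1.0, 1.0, size=(10000, 2))
outputs = rosenbrock(inputs)

# Inputs first, output as the last column
data = np.column_stack([inputs, outputs])
print(data.shape)  # (10000, 3)
```

The resulting array can be written to disk with any CSV writer; fixing the seed makes it possible to regenerate exactly the same dataset on different machines.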

Notice that the input values lie in the range [-1, 1].

We provide the following datasets:

This post introduces a family of datasets, based on the Rosenbrock function, to measure the data capacity, training speed, model accuracy, and inference speed of any machine learning platform. Rosenbrock datasets are highly consistent and free of noise. For this reason, they are a powerful alternative to datasets from common repositories for benchmarking.

The data science and machine learning platform Neural Designer contains many utilities to perform descriptive, diagnostic, predictive, and prescriptive analytics easily.

You can download Neural Designer now and try it for free.