By Carlos Barranquero, Artelnics. 19 November 2020.
TensorFlow, PyTorch and Neural Designer are three popular machine learning platforms developed by Google, Facebook and Artelnics, respectively.
Although all these frameworks are based on neural networks, they present some important differences in terms of functionality, usability, performance, etc.
This post compares the GPU training speed of TensorFlow, PyTorch and Neural Designer for an approximation benchmark.
As we will see, Neural Designer trains this neural network 1.63 times faster than TensorFlow and 2.30 times faster than PyTorch on an NVIDIA Tesla T4.
One of the most important factors in machine learning platforms is their training speed. Indeed, modeling very large data sets is very expensive in computational terms.
To speed up model training, major machine learning tools use GPU computing techniques, such as NVIDIA CUDA.
The objective of this article is to measure the GPU training times of TensorFlow, PyTorch and Neural Designer for a benchmark application, and to compare the speeds obtained by those platforms.
The following table summarizes the technical features of these tools that might impact their GPU performance.
                 TensorFlow         PyTorch            Neural Designer
Written in       C++, CUDA, Python  C++, CUDA, Python  C++, CUDA
Interface        Python             Python             Graphical User Interface
Differentiation  Automatic          Automatic          Analytical
From the above table, we can see that TensorFlow and PyTorch are programmed in C++ and Python, while Neural Designer is entirely programmed in C++.
Interpreted languages like Python have some advantages over compiled languages like C++, such as their ease of use.
However, the performance of Python is, in general, lower than that of C++. Indeed, Python spends significant time interpreting sentences during the execution of the program.
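The overhead of interpretation is easy to observe: the same reduction computed in interpreted Python bytecode and with NumPy's compiled C kernels differs sharply in run time. A minimal sketch (timings vary by machine):

```python
import time
import numpy as np

data = np.random.rand(1_000_000)

# pure Python loop: every iteration is interpreted bytecode
start = time.time()
loop_sum = 0.0
for value in data:
    loop_sum += value * value
loop_time = time.time() - start

# NumPy: the same reduction runs in compiled C code
start = time.time()
vector_sum = float(np.dot(data, data))
vector_time = time.time() - start

print(f"Python loop: {loop_time:.4f} s, NumPy: {vector_time:.4f} s")
```

On a typical machine the compiled version is orders of magnitude faster, which is why performance-critical training loops are written in C++ or CUDA rather than in Python itself.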
On the other hand, TensorFlow and PyTorch use automatic differentiation, while Neural Designer uses analytical differentiation.
Likewise, automatic differentiation has some advantages over analytical differentiation. In particular, it simplifies obtaining the gradient for new architectures or loss indices.
However, the performance of automatic differentiation is, in general, lower than that of analytical differentiation: the former derives the gradient during the execution of the program, while the latter starts from a precalculated formula.
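The distinction can be illustrated with the mean squared error of a simple linear model. The analytical route evaluates a closed-form gradient expression, while any scheme that derives the gradient at run time must do extra work during training. In this minimal NumPy sketch (an illustration, not the Neural Designer implementation), a central-difference approximation stands in for run-time differentiation; note that automatic differentiation is exact rather than numerical, but it too computes the gradient while the program runs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))   # 100 samples, 5 inputs
y = rng.standard_normal(100)        # targets
w = rng.standard_normal(5)          # model parameters

def mse(w):
    residuals = X @ w - y
    return float(np.mean(residuals ** 2))

# analytical gradient: precalculated closed-form expression
analytical = 2.0 / len(y) * X.T @ (X @ w - y)

# gradient derived at run time (central differences)
eps = 1e-6
numerical = np.array([
    (mse(w + eps * np.eye(5)[i]) - mse(w - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

max_error = float(np.max(np.abs(analytical - numerical)))
print("max gradient error:", max_error)
```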
Next, we measure the training speed for a benchmark problem on a reference computer using TensorFlow, PyTorch and Neural Designer. The results produced by those platforms are then compared.
The first step is to choose a benchmark application that is general enough to draw conclusions about the performance of the machine learning platforms. As previously stated, we will train a neural network that approximates a set of input-target samples.
In this regard, an approximation application is defined by a data set, a neural network and an associated training strategy. The next table uniquely defines these three components.

Data set:           Rosenbrock data file (R_new.csv) with 1,000,000 samples, 1000 input variables and 1 target variable, read as float32
Neural network:     Dense layer of 1000 neurons with tanh activation, followed by an output layer of 1 neuron with linear activation; weights and biases initialized uniformly in [-1, 1]
Training strategy:  Mean squared error loss, Adam optimizer (learning rate 0.001), batch size 1000, 1000 epochs
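The data file R_new.csv is not distributed with this post. A hypothetical way to generate a Rosenbrock data set with the shape used by the scripts below (the sampling range is an assumption, and only 1,000 samples are generated here to keep the sketch fast, versus 1,000,000 in the benchmark):

```python
import numpy as np
import pandas as pd

samples, inputs = 1_000, 1000  # the benchmark uses 1,000,000 samples

rng = np.random.default_rng(0)
X = rng.uniform(-2.048, 2.048, size=(samples, inputs)).astype(np.float32)

# Rosenbrock function: sum_i 100*(x_{i+1} - x_i^2)^2 + (1 - x_i)^2
y = np.sum(100.0 * (X[:, 1:] - X[:, :-1] ** 2) ** 2
           + (1.0 - X[:, :-1]) ** 2, axis=1)

# last column is the target, as expected by the training scripts
data = pd.DataFrame(np.column_stack([X, y]))
data.to_csv("R_new.csv", index=False)
```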
Once the TensorFlow, PyTorch and Neural Designer applications have been created, we need to run them.
The next step is to choose the computer on which we will train the neural networks with TensorFlow, PyTorch and Neural Designer. For training speed tests, the most important feature of the computer is the GPU.
To make the results easier to reproduce, all calculations have been done on an Amazon Web Services instance. The next table lists some basic information about the computer used here.
Instance type:     AWS g4dn.xlarge
Operating system:  Windows 10 Enterprise
Processor:         Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Physical RAM:      16.0 GB
Device (GPU):      NVIDIA Tesla T4
Once the computer has been chosen, we install TensorFlow (2.1.0), PyTorch (1.7.0) and Neural Designer (5.0.0) on it. The TensorFlow application is built with the Python script listed below.
# TENSORFLOW CODE

import tensorflow as tf
import pandas as pd
import time
import numpy as np

# read data as float32
start_time = time.time()

filename = "C:/opennncuda/examples/rosenbrock/bin/R_new.csv"

df_test = pd.read_csv(filename, nrows=100)
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

data = pd.read_csv(filename, engine='c', dtype=float32_cols)

print("Loading time: ", round(time.time() - start_time), " seconds")

x = data.iloc[:, :-1].values
y = data.iloc[:, [-1]].values

initializer = tf.keras.initializers.RandomUniform(minval=-1., maxval=1.)

# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1000, activation='tanh',
                          kernel_initializer=initializer,
                          bias_initializer=initializer),
    tf.keras.layers.Dense(1, activation='linear',
                          kernel_initializer=initializer,
                          bias_initializer=initializer)])

# compile model
model.compile(optimizer='adam', loss='mean_squared_error')

# train model
start_time = time.time()

history = model.fit(x, y, batch_size=1000, epochs=1000)

print("Training time: ", round(time.time() - start_time), " seconds")
Building this application with PyTorch also requires some Python scripting. The code is listed below.
# PYTORCH CODE

import pandas as pd
import time
import torch
import numpy as np
import statistics

def init_weights(m):
    if type(m) == torch.nn.Linear:
        torch.nn.init.uniform_(m.weight, a=-1.0, b=1.0)
        torch.nn.init.uniform_(m.bias.data, a=-1.0, b=1.0)

epoch = 1000
total_samples, batch_size, input_variables, hidden_neurons, output_variables = 1000000, 1000, 1000, 1000, 1

device = torch.device("cuda:0")

# read data as float32
start_time = time.time()

filename = "C:/opennncuda/examples/rosenbrock/bin/R_new.csv"

df_test = pd.read_csv(filename, nrows=100)
float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
float32_cols = {c: np.float32 for c in float_cols}

dataset = pd.read_csv(filename, engine='c', dtype=float32_cols)

print("Loading time: ", round(time.time() - start_time), " seconds")

x = torch.tensor(dataset.iloc[:, :-1].values, dtype=torch.float32)
y = torch.tensor(dataset.iloc[:, [-1]].values, dtype=torch.float32)

# build model
model = torch.nn.Sequential(
    torch.nn.Linear(input_variables, hidden_neurons),
    torch.nn.Tanh(),
    torch.nn.Linear(hidden_neurons, output_variables)).cuda()

# initialize weights
model.apply(init_weights)

# compile model
learning_rate = 0.001
loss_fn = torch.nn.MSELoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

indices = np.arange(0, total_samples)

start = time.time()

for j in range(epoch):
    mse = []
    t0 = time.time()
    for i in range(0, total_samples, batch_size):
        batch_indices = indices[i:i+batch_size]
        batch_x, batch_y = x[batch_indices], y[batch_indices]
        batch_x = batch_x.cuda()
        batch_y = batch_y.cuda()
        outputs = model.forward(batch_x)
        loss = loss_fn(outputs, batch_y)
        model.zero_grad()
        loss.backward()
        optimizer.step()
        mse.append(loss.item())
    print("Epoch:", j+1, "/1000", "[================================]  ", "loss: ", statistics.mean(mse))
    t1 = time.time() - t0
    print("Elapsed time: ", int(round(t1)), "sec")

end = time.time()
elapsed = end - start
print("Training time: ", int(round(elapsed)), "seconds")
The last step is to run the benchmark application on the selected machine with TensorFlow, PyTorch and Neural Designer, and to compare the training times provided by those platforms.
The next figure shows the training results with TensorFlow.
As we can see, TensorFlow takes 3,892 seconds (01:04:52) to train the neural network for 1000 epochs. The final mean squared error is 0.075. With TensorFlow, the average GPU usage during training is 40%, approximately.
Similarly, the following figure is a screenshot of PyTorch at the end of the process.
In this case, PyTorch takes 5,507 seconds (01:31:47) to train the neural network for 1000 epochs, reaching a mean squared error of 0.038. With PyTorch, the average GPU usage during training is 60%, approximately.
Finally, the next figure shows the training results with Neural Designer.
Neural Designer takes 2,392 seconds (00:39:52) to train the neural network for 1000 epochs. During that time, it reaches a mean squared error of 0.00981. With Neural Designer, the average GPU usage during training is 95%, approximately.
The next table summarizes the most important metrics yielded by the three machine learning platforms.
                TensorFlow              PyTorch                 Neural Designer
Training time   01:04:52                01:31:47                00:39:52
Epoch time      3.892 seconds/epoch     5.507 seconds/epoch     2.392 seconds/epoch
Training speed  256,937 samples/second  181,587 samples/second  418,060 samples/second
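The throughput figures follow directly from the epoch times, since every epoch processes the full 1,000,000-sample data set. The arithmetic can be checked in a few lines:

```python
samples_per_epoch = 1_000_000

# seconds per epoch, from the table above
epoch_times = {"TensorFlow": 3.892, "PyTorch": 5.507, "Neural Designer": 2.392}

# samples processed per second on each platform
speeds = {name: samples_per_epoch / t for name, t in epoch_times.items()}

# relative speed of Neural Designer over the other two platforms
ratio_tf = speeds["Neural Designer"] / speeds["TensorFlow"]
ratio_pt = speeds["Neural Designer"] / speeds["PyTorch"]

for name, speed in speeds.items():
    print(f"{name}: {speed:,.0f} samples/second")
print(f"Neural Designer vs TensorFlow: x{ratio_tf:.2f}, vs PyTorch: x{ratio_pt:.2f}")
```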
Finally, the next chart depicts graphically the training speeds of TensorFlow, PyTorch and Neural Designer for this case.
As we can see, the training speed of Neural Designer for this application is 1.63 times that of TensorFlow and 2.30 times that of PyTorch.
Neural Designer is entirely written in C++, uses analytical differentiation, and has been optimized to minimize the number of operations during training.
This means that, for the benchmark described in this post, its training speed is 1.63 times that of TensorFlow and 2.30 times that of PyTorch.
If you want to reproduce these results, ask for a temporary license of Neural Designer at info@neuraldesigner.com and we will be happy to provide it.