Training speed of dense networks on GPU:
TensorFlow vs PyTorch vs Neural Designer

By Carlos Barranquero, Artelnics. 19 November 2020.

TensorFlow, PyTorch and Neural Designer are three popular machine learning platforms developed by Google, Facebook and Artelnics, respectively.

Although all these frameworks are based on neural networks, they present some important differences in terms of functionality, usability, and performance.

This post compares the GPU training speed of TensorFlow, PyTorch and Neural Designer for an approximation benchmark.

As we will see, Neural Designer trains this neural network 1.55 times faster than TensorFlow and 2.50 times faster than PyTorch on an NVIDIA Tesla T4.

Contents:
  • Introduction
  • Benchmark application
  • Reference computer
  • Results
  • Conclusions

Introduction

One of the most important factors in machine learning platforms is their training speed. Indeed, modeling very large data sets is very expensive in computational terms.

To speed up model training, major machine learning tools use GPU computing techniques, such as NVIDIA CUDA.

The objective of this article is to measure the GPU training times of TensorFlow, PyTorch and Neural Designer for a benchmark application, and to compare the speeds obtained by those platforms.

The following table summarizes the technical features of these tools that might impact their GPU performance.

                  TensorFlow           PyTorch              Neural Designer
Written in        C++, CUDA, Python    C++, CUDA, Python    C++, CUDA
Interface         Python               Python               Graphical User Interface
Differentiation   Automatic            Automatic            Analytical

From the above table, we can see that TensorFlow and PyTorch are programmed in C++ and Python, while Neural Designer is entirely programmed in C++.

Interpreted languages like Python have some advantages over compiled languages like C++, such as their ease of use.

However, the performance of Python is, in general, lower than that of C++. Indeed, Python spends significant time interpreting statements during the execution of the program.

On the other hand, TensorFlow and PyTorch use automatic differentiation, while Neural Designer uses analytical differentiation.

Likewise, automatic differentiation has some advantages over analytical differentiation. In particular, it simplifies obtaining the gradient for new architectures or loss indices.

However, the performance of automatic differentiation is, in general, lower than that of analytical differentiation: the former derives the gradient during the execution of the program, while the latter uses a pre-calculated formula.
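To make the distinction concrete, the following snippet (a minimal sketch, not taken from any of the three platforms) computes the gradient of the mean squared error of a single linear layer in both ways: automatic differentiation derives it at run time from the recorded forward pass, while the analytical version simply evaluates the pre-calculated formula dMSE/dw = (2/n) * X^T (Xw - y). The tensor shapes are illustrative assumptions.

	# minimal sketch: automatic vs analytical gradient of the MSE of a linear layer
	import torch

	x = torch.randn(8, 3)                      # batch of 8 samples, 3 inputs (illustrative sizes)
	y = torch.randn(8, 1)                      # targets
	w = torch.randn(3, 1, requires_grad=True)  # weights

	# automatic differentiation: the gradient is derived at run time from the forward pass
	loss = torch.mean((x @ w - y) ** 2)
	loss.backward()
	grad_automatic = w.grad

	# analytical differentiation: the formula (2/n) * X^T (Xw - y) is pre-calculated
	with torch.no_grad():
		grad_analytical = 2.0 / x.shape[0] * x.t() @ (x @ w - y)

	print(torch.allclose(grad_automatic, grad_analytical, atol=1e-5))  # True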

Next, we measure the training speed for a benchmark problem on a reference computer using TensorFlow, PyTorch and Neural Designer. The results produced by those platforms are then compared.

Benchmark application

The first step is to choose a benchmark application that is general enough to draw conclusions about the performance of the machine learning platforms. As previously stated, we will train a neural network that approximates a set of input-target samples.

In this regard, an approximation application is defined by a data set, a neural network and an associated training strategy. The next table uniquely defines these three components.

Data set
  • Benchmark: Rosenbrock (see the generation sketch after this table)
  • Inputs number: 1000
  • Targets number: 1
  • Samples number: 1000000
  • File size: 22 GB
Neural network
  • Layers number: 2
  • Layer 1:
    • Type: Perceptron (Dense)
    • Inputs number: 1000
    • Neurons number: 1000
    • Activation function: Hyperbolic tangent (tanh)
  • Layer 2:
    • Type: Perceptron (Dense)
    • Inputs number: 1000
    • Neurons number: 1
    • Activation function: Linear
  • Initialization: Random uniform [-0.1,0.1]
Training strategy
  • Loss index:
    • Error: Mean Squared Error (MSE)
    • Regularization: None
  • Optimization algorithm:
    • Algorithm: Adaptive Moment Estimation (Adam)
    • Batch size: 1000
    • Maximum epochs: 1000
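
In this benchmark, the target of each sample is the value of the generalized Rosenbrock function evaluated on its 1,000 inputs. The sketch below shows how a data set with the same structure could be generated; the sampling range, sample count, random seed and output file name are illustrative assumptions, not necessarily those of the 22 GB benchmark file.

	# illustrative sketch: generate a data set with the same structure (assumed range and size)
	import numpy as np
	import pandas as pd

	samples, inputs = 10_000, 1000           # the benchmark itself uses 1,000,000 samples

	rng = np.random.default_rng(0)
	x = rng.uniform(-1.0, 1.0, size=(samples, inputs)).astype(np.float32)

	# generalized Rosenbrock function:
	# f(x) = sum_{i=1}^{n-1} [ 100*(x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ]
	y = np.sum(100.0 * (x[:, 1:] - x[:, :-1]**2)**2 + (1.0 - x[:, :-1])**2, axis=1, dtype=np.float32)

	pd.DataFrame(np.column_stack([x, y])).to_csv("rosenbrock_small.csv", index=False)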


Reference computer

The next step is to choose the computer on which we will train the neural networks with TensorFlow, PyTorch and Neural Designer. For training speed tests, the most important feature of the computer is its GPU card.

To make the results easier to reproduce, all calculations have been done on an Amazon Web Services instance. The next table lists some basic information about the computer used here.

Instance type: AWS g4dn.xlarge
Operating system: Windows 10 Enterprise
Processor: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Physical RAM: 16.0 GB
Device (GPU): NVIDIA Tesla T4

Once the computer has been chosen, we install TensorFlow (2.1.0), PyTorch (1.7.0) and Neural Designer (5.0.0) on it.
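
Before running the benchmark, it can be useful to confirm that the installed versions match and that both Python frameworks see the Tesla T4. This small check is not part of the benchmark scripts:

	# quick environment check (not part of the benchmark scripts)
	import tensorflow as tf
	import torch

	print(tf.__version__)                          # expected: 2.1.0
	print(torch.__version__)                       # expected: 1.7.0
	print(tf.config.list_physical_devices('GPU'))  # should list the Tesla T4
	print(torch.cuda.get_device_name(0))           # should print 'Tesla T4'

With the environment in place, the TensorFlow implementation of this benchmark is the following Python script.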

#TENSORFLOW CODE
	import tensorflow as tf
	import pandas as pd
	import time
	import numpy as np
			
	#read data float32
	start_time = time.time() 
	filename = "C:/opennn-cuda/examples/rosenbrock/bin/R_new.csv"
	df_test = pd.read_csv(filename, nrows=100)
	float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
	float32_cols = {c: np.float32 for c in float_cols}
	data = pd.read_csv(filename, engine='c', dtype=float32_cols)
	print("Loading time: ", round(time.time() - start_time), " seconds")
			
	x = data.iloc[:,:-1].values
	y = data.iloc[:,[-1]].values
			
	initializer = tf.keras.initializers.RandomUniform(minval=-1., maxval=1.)

	#build model
	model = tf.keras.models.Sequential([
		tf.keras.layers.Dense(1000,
		                      activation='tanh',
		                      kernel_initializer=initializer,
		                      bias_initializer=initializer),
		tf.keras.layers.Dense(1,
		                      activation='linear',
		                      kernel_initializer=initializer,
		                      bias_initializer=initializer)])
			
	#compile model
	model.compile(optimizer='adam', loss = 'mean_squared_error')
					
	#train model
	start_time = time.time()
	history = model.fit(x, y, batch_size = 1000, epochs = 1000)
	print("Training time: ", round(time.time() - start_time), " seconds")
	

Building this application with PyTorch also requires some Python scripting, which is listed below.

#PYTORCH CODE	
	import pandas as pd
	import time
	import torch
	import numpy as np
	import statistics
	
	def init_weights(m):
		if type(m) == torch.nn.Linear:		
			torch.nn.init.uniform_(m.weight, a=-1.0, b=1.0)
			torch.nn.init.uniform_(m.bias.data, a=-1.0, b=1.0)
						
	epoch = 1000
	total_samples, batch_size, input_variables, hidden_neurons, output_variables = 1000000, 1000, 1000, 1000, 1
	device = torch.device("cuda:0") 
		
	# read data float32
	start_time = time.time()
	filename = "C:/opennn-cuda/examples/rosenbrock/bin/R_new.csv"
	df_test = pd.read_csv(filename, nrows=100)
	float_cols = [c for c in df_test if df_test[c].dtype == "float64"]
	float32_cols = {c: np.float32 for c in float_cols}
	dataset = pd.read_csv(filename, engine='c', dtype=float32_cols)
	print("Loading time: ", round(time.time() - start_time), " seconds")
		
	x = torch.tensor(dataset.iloc[:,:-1].values, dtype = torch.float32)
	y = torch.tensor(dataset.iloc[:,[-1]].values, dtype = torch.float32)

	# build model
	model = torch.nn.Sequential(torch.nn.Linear(input_variables, hidden_neurons),
	                            torch.nn.Tanh(),
	                            torch.nn.Linear(hidden_neurons, output_variables)).cuda()
		
	# initialize weights
	model.apply(init_weights)
	
	# compile model
	learning_rate = 0.001
	loss_fn = torch.nn.MSELoss(reduction = 'mean')
	optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
			 
	indices = np.arange(0, total_samples)

	start = time.time()

	for j in range(epoch):

		mse = []

		t0 = time.time()

		for i in range(0, total_samples, batch_size):

			batch_indices = indices[i:i+batch_size]
			batch_x, batch_y = x[batch_indices], y[batch_indices]

			# move the batch to the GPU
			batch_x = batch_x.cuda()
			batch_y = batch_y.cuda()

			# forward pass and loss
			outputs = model(batch_x)
			loss = loss_fn(outputs, batch_y)

			# backward pass and parameter update
			model.zero_grad()
			loss.backward()
			optimizer.step()

			mse.append(loss.item())

		print("Epoch:", j+1, "/", epoch, "[================================] - ", "loss: ", statistics.mean(mse))

		t1 = time.time() - t0

		print("Elapsed time: ", int(round(t1)), "sec")

	end = time.time()

	elapsed = end - start

	print("Training time: ", int(round(elapsed)), "seconds")

Once the TensorFlow, PyTorch and Neural Designer applications have been created, we need to run them.

Results

The last step is to run the benchmark application on the selected machine with TensorFlow, PyTorch and Neural Designer, and to compare the training times provided by those platforms.

The next figure shows the training results with TensorFlow.

As we can see, TensorFlow takes 3,892 seconds (01:04:52) to train the neural network for 1000 epochs. The final mean squared error is 0.075. With TensorFlow, the average GPU usage during training is approximately 40%.

Similarly, the following figure is a screenshot of PyTorch at the end of the process.

In this case, PyTorch takes 5,507 seconds (01:31:47) to train the neural network for 1000 epochs, reaching a mean squared error of 0.038. With PyTorch, the average GPU usage during training is approximately 60%.

Finally, the next figure shows the training results with Neural Designer.

Neural Designer takes 2,392 seconds (00:39:52) to train the neural network for 1000 epochs. During that time, it reaches a mean squared error of 0.00981. With Neural Designer, the average GPU usage during training is approximately 95%.

The next table summarizes the most important metrics yielded by the three machine learning platforms.

                  TensorFlow               PyTorch                  Neural Designer
Training time     01:04:52                 01:31:47                 00:39:52
Epoch time        3.892 seconds/epoch      5.507 seconds/epoch      2.392 seconds/epoch
Training speed    256,937 samples/second   181,587 samples/second   418,060 samples/second
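
The epoch times and training speeds above follow directly from the total training times, since each of the 1,000 epochs processes the 1,000,000 samples once. The short snippet below reproduces the figures in the table:

	# derive epoch time and training speed from the measured training times
	training_times = {"TensorFlow": 3892, "PyTorch": 5507, "Neural Designer": 2392}  # seconds for 1000 epochs

	for platform, seconds in training_times.items():
		epoch_time = seconds / 1000                # seconds per epoch
		speed = 1_000_000 / epoch_time             # samples per second
		print(f"{platform}: {epoch_time:.3f} s/epoch, {speed:,.0f} samples/s")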

Finally, the next chart depicts graphically the training speeds of TensorFlow, PyTorch and Neural Designer for this case.

As we can see, the training speed of Neural Designer for this application is 1.55 times higher than that of TensorFlow and 2.50 times higher than that of PyTorch.

Conclusions

Neural Designer is entirely written in C++, uses analytical differentiation, and has been optimized to minimize the number of operations during training.

As a result, for the benchmark described in this post, its training speed is 1.55 times faster than that of TensorFlow and 2.50 times faster than that of PyTorch.

If you want to reproduce these results, request a trial license of Neural Designer at info@neuraldesigner.com and we will be happy to provide it.