3. Neural network

An artificial neural network, or simply a neural network, can be defined as a biologically inspired computational model which consists of a network architecture composed of artificial neurons. This structure contains a set of parameters, which can be adjusted to perform certain tasks.

Neural networks have universal approximation properties: given enough neurons, they can approximate any continuous function, in any number of dimensions, to any desired degree of accuracy.

The layer types most commonly used in regression and classification applications are the perceptron, scaling, unscaling, bounding and probabilistic layers.

In other types of applications, such as computer vision or speech recognition, other types of layers, such as convolutional or associative, are commonly used.

**3.1. Perceptron layer**
- Linear activation function
- Hyperbolic tangent activation function
- Logistic activation function
- Rectified linear activation function

**3.2. Scaling layer**
- Minimum and maximum scaling method
- Mean and standard deviation scaling method
- Standard deviation scaling method

**3.3. Unscaling layer**
- Minimum and maximum unscaling method
- Mean and standard deviation unscaling method
- Standard deviation unscaling method
- Logarithmic unscaling method

**3.4. Bounding layer**

**3.5. Probabilistic layer**
- Binary probabilistic method
- Continuous probabilistic method
- Competitive probabilistic method
- Softmax probabilistic method

**3.6. Network architecture**

The most important layers of a neural network are the perceptron layers (also called dense layers). Indeed, they allow the neural network to learn.

The following figure shows a perceptron neuron, which is the basic unit of a perceptron layer. The perceptron neuron receives information as a set of numerical inputs \( x_1,\ldots,x_n\). This information is then combined with a bias \(b\) and a set of weights \( w_1,\ldots,w_n\) to produce a message in the form of a single numerical output \(y\). The parameters of the neuron are the bias and the weights.

The combination function transforms the set of input values to produce a single combination or net input value, $$ combination = bias + \sum_{i=1}^{n} weight_i \cdot input_i $$

The activation function defines the perceptron output in terms of its combination, $$ output = activation(combination)$$
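The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the numerical values are hypothetical, and the hyperbolic tangent is chosen as the default activation purely as an example.

```python
import math


def combination(inputs, weights, bias):
    """Net input: the bias plus the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, inputs))


def perceptron_output(inputs, weights, bias, activation=math.tanh):
    """Output of one perceptron neuron: activation(combination)."""
    return activation(combination(inputs, weights, bias))


# Hypothetical values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
bias = 0.2

net_input = combination(inputs, weights, bias)     # 0.2 + 0.5 - 0.5 + 0.3
output = perceptron_output(inputs, weights, bias)  # tanh of the net input
```

Passing the activation as an argument keeps the neuron independent of any particular activation function.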

The activation function of the perceptrons composing each layer determines the type of function that the neural network represents. Some of the most common activation functions are the linear, hyperbolic tangent, logistic and rectified linear.

The output of a perceptron with linear activation function is simply the combination of that neuron. $$ activation = combination $$

The hyperbolic tangent is one of the most used activation functions when constructing neural networks. It is a sigmoid function which varies between -1 and +1. $$ activation = tanh(combination) $$

The logistic is another type of sigmoid function. It is very similar to the hyperbolic tangent, but in this case it varies between 0 and 1. $$ activation = \frac{1}{1+e^{-combination}} $$

The rectified linear activation function, also known as ReLU, is one of the most used activation functions. It is zero when the combination is negative and equal to the combination when the combination is zero or positive. $$activation = \left\{ \begin{array}{lll} 0 &if& \textrm{$combination < 0$} \\ combination &if& \textrm{$combination \geq 0$} \end{array} \right. $$
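As a quick reference, the four activation functions described above might be written as follows. The function names are illustrative, and the logistic is evaluated in a numerically stable way, a choice made here and not taken from the text.

```python
import math


def linear(c):
    """Linear activation: the output is the combination itself."""
    return c


def hyperbolic_tangent(c):
    """Sigmoid-shaped activation with values between -1 and +1."""
    return math.tanh(c)


def logistic(c):
    """Sigmoid-shaped activation with values between 0 and 1.

    The two branches are algebraically equal to 1 / (1 + exp(-c));
    the split only avoids overflow for large negative combinations.
    """
    if c >= 0.0:
        return 1.0 / (1.0 + math.exp(-c))
    e = math.exp(c)
    return e / (1.0 + e)


def rectified_linear(c):
    """ReLU: zero for negative combinations, the combination otherwise."""
    return max(0.0, c)
```

For example, `logistic(0.0)` evaluates to 0.5, the midpoint of its range, while `hyperbolic_tangent(0.0)` evaluates to 0.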

You can read the article Perceptron: The main component of neural networks for a more detailed description of this important neuron model.

A perceptron layer is a group of perceptron neurons that share the same inputs and send their outputs to the same destinations.

In practice, it is always convenient to scale the inputs so that all of them have a proper range.

In the context of neural networks, the scaling function can be thought of as a layer connected to the inputs of the neural network. The scaling layer contains some basic statistics on the inputs. They include the mean, standard deviation, minimum and maximum values.

Some scaling methods widely used in practice are the minimum-maximum, the mean-standard deviation and the standard deviation.
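A perceptron layer can be sketched by applying the neuron formula once per neuron, with every neuron reading the same inputs. The function name, the layout of `weights` (one row of weights per neuron) and the numbers below are assumptions made for illustration.

```python
import math


def perceptron_layer(inputs, weights, biases, activation=math.tanh):
    """Outputs of a layer of perceptrons that share the same inputs.

    weights[j] holds the weights of neuron j (one per input) and
    biases[j] holds its bias.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net_input = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(net_input))
    return outputs


# Hypothetical layer with 2 inputs and 3 neurons, using a linear activation.
layer_outputs = perceptron_layer(
    inputs=[1.0, -1.0],
    weights=[[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]],
    biases=[0.0, 0.0, 1.0],
    activation=lambda c: c,
)
# layer_outputs is [0.0, 1.0, -1.0]
```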

The minimum and maximum method processes unscaled inputs in any range to produce scaled inputs which fall between -1 and 1. This method is usually applied to variables with a uniform distribution. $$ scaled\_input = 2\,\frac{input-minimum}{maximum-minimum}-1$$

The mean and standard deviation method scales the inputs so that they will have mean 0 and standard deviation 1. This method is usually applied to variables with a normal (or Gaussian) distribution. $$ scaled\_input = \frac{input-mean}{standard\_deviation}$$

The standard deviation scaling method produces inputs with standard deviation 1. This is usually applied to half-normal distributions, that is, variables which are centered at zero and have only positive values. $$ scaled\_input = \frac{input}{standard\_deviation}$$

All scaling methods are linear and, in general, produce similar results. In all cases, the scaling of the inputs in the data set must be synchronized with the scaling of the inputs in the neural network. Neural Designer does that without any intervention by the user.
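The three scaling methods can be sketched as follows. This is a minimal illustration, not Neural Designer's implementation: the function names are hypothetical, the statistics are computed from the values themselves, the minimum and maximum method targets the range from -1 to 1 stated in the text, and the sample standard deviation is used as an assumption.

```python
import statistics


def minimum_maximum_scaling(values):
    """Scale the values linearly so that they fall between -1 and 1."""
    minimum, maximum = min(values), max(values)
    if maximum == minimum:
        raise ValueError("constant variables cannot be scaled this way")
    return [2.0 * (v - minimum) / (maximum - minimum) - 1.0 for v in values]


def mean_standard_deviation_scaling(values):
    """Scale the values to have mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    standard_deviation = statistics.stdev(values)  # sample value (assumption)
    return [(v - mean) / standard_deviation for v in values]


def standard_deviation_scaling(values):
    """Scale the values to have standard deviation 1 without centering."""
    standard_deviation = statistics.stdev(values)
    return [v / standard_deviation for v in values]


# Hypothetical input variable.
values = [2.0, 4.0, 6.0, 8.0]
ranged = minimum_maximum_scaling(values)
standardized = mean_standard_deviation_scaling(values)
```

Here `ranged` starts at -1 and ends at 1, while `standardized` has mean 0 and standard deviation 1 up to rounding error.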

The scaled outputs from a neural network must be unscaled to recover the original units.

In the context of neural networks, the unscaling function can be interpreted as an unscaling layer connected to the outputs of the perceptron layers. An unscaling layer contains some basic statistics on the outputs. They include the mean, standard deviation, minimum and maximum values.

Four unscaling methods widely used in practice are the minimum-maximum, the mean-standard deviation, the standard deviation and the logarithmic methods.

The minimum and maximum method unscales variables that have been previously scaled to have minimum -1 and maximum +1, to produce outputs in the original range, $$ unscaled\_output = minimum+0.5(scaled\_output+1)(maximum-minimum)$$

The mean and standard deviation method unscales variables that have been previously scaled to have mean 0 and standard deviation 1, $$ unscaled\_output = mean + scaled\_output\cdot standard\_deviation$$

The standard deviation method unscales variables that have been previously scaled to have standard deviation 1, to produce outputs in the original range, $$ unscaled\_output = scaled\_output\cdot standard\_deviation$$

The logarithmic method unscales variables that have been previously subjected to a logarithmic transformation, $$ unscaled\_output = minimum+0.5(\exp{(scaled\_output)}+1)(maximum-minimum)$$

In all cases, the scaling of the targets in the data set must be synchronized with the unscaling of the outputs in the neural network. Neural Designer does that without any intervention by the user.
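The three linear unscaling methods are the inverses of the corresponding scaling methods. A minimal sketch follows, assuming that the statistics of the original targets are stored and passed in; the function names and numbers are hypothetical.

```python
def minimum_maximum_unscaling(scaled_output, minimum, maximum):
    """Map a value in [-1, 1] back to the range [minimum, maximum]."""
    return minimum + 0.5 * (scaled_output + 1.0) * (maximum - minimum)


def mean_standard_deviation_unscaling(scaled_output, mean, standard_deviation):
    """Invert a scaling to mean 0 and standard deviation 1."""
    return mean + scaled_output * standard_deviation


def standard_deviation_unscaling(scaled_output, standard_deviation):
    """Invert a scaling to standard deviation 1."""
    return scaled_output * standard_deviation


# Round trip with hypothetical target statistics: scale, then unscale.
minimum, maximum = 2.0, 8.0
target = 4.0
scaled_target = 2.0 * (target - minimum) / (maximum - minimum) - 1.0
recovered = minimum_maximum_unscaling(scaled_target, minimum, maximum)
```

The round trip returns the original target up to rounding error, which is what keeping the scaling of the targets synchronized with the unscaling of the outputs means in practice.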

In many cases, the output needs to be limited between two values. For instance, the quality of a product might range from 1 to 5 stars.

The bounding function can be interpreted as a bounding layer connected to the outputs of the unscaling layer. It uses the following formula:

$$bounded\_output = \left\{ \begin{array}{l} lower\_bound, \quad \textrm{$output < lower\_bound$} \\ output, \quad \textrm{$lower\_bound \leq output \leq upper\_bound$} \\ upper\_bound, \quad \textrm{$output > upper\_bound$} \end{array} \right. $$

In classification problems, outputs are usually interpreted in terms of probabilities of class membership. In this way, the probabilistic outputs will always fall in the range [0, 1], and the sum of all will always be 1.
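For the bounding layer described above, the formula amounts to clipping the output to an interval. A minimal sketch, with a hypothetical function name and the 1 to 5 stars range used as the example:

```python
def bound(output, lower_bound, upper_bound):
    """Limit the output to the interval [lower_bound, upper_bound]."""
    if lower_bound > upper_bound:
        raise ValueError("lower_bound must not exceed upper_bound")
    return max(lower_bound, min(output, upper_bound))


# Product quality limited to the range from 1 to 5 stars.
ratings = [bound(raw, 1.0, 5.0) for raw in (0.3, 3.2, 5.7)]
# ratings is [1.0, 3.2, 5.0]
```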

In the context of neural networks, the probabilistic output function can be interpreted as an additional layer connected to the last perceptron layer.

There are several probabilistic output methods. Two of the most popular are the competitive method and the softmax method.

The binary method is used in binary classification problems. Here the output can take either the value 1 (positive) or 0 (negative).

The **decision threshold** can be defined as the probability above which we consider an instance to be positive. The default value is 0.5. In this way, the probabilistic output can be calculated as: $$probabilistic\_output = \left\{ \begin{array}{lll} 0 &if& \textrm{$output < decision\_threshold$} \\ 1 &if& \textrm{$output \geq decision\_threshold$} \end{array} \right. $$

The continuous method is also used in binary classification problems. Here the output can take any value between 0 and 1. $$probabilistic\_output = \left\{ \begin{array}{lll} 0 &if& \textrm{$output < 0$} \\ output &if& \textrm{$0 \leq output \leq 1$} \\ 1 &if& \textrm{$output> 1$} \end{array} \right. $$

The competitive method is used in multiple classification problems. It assigns a probability of one to the output with the greatest value, and a probability of zero to the rest of the outputs. $$probabilistic\_output = \left\{ \begin{array}{lll} 1 &if& \textrm{$output = maximum(outputs)$} \\ 0 &if& \textrm{$output \neq maximum(outputs)$} \end{array} \right. $$

The softmax method is also used in multiple classification problems. It is a continuous probabilistic function, which ensures that the outputs always fall in the range [0, 1], and that their sum is always 1. $$ probabilistic\_output = \frac{e^{output}}{\sum e^{outputs}}$$
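The four probabilistic methods can be sketched as follows. The function names are hypothetical and several details are assumptions made here: the binary method treats an output equal to the decision threshold as positive, the competitive method gives ties to the first maximum so that the probabilities still sum to 1, and the softmax subtracts the maximum before exponentiating, which leaves the result unchanged but avoids overflow.

```python
import math


def binary(output, decision_threshold=0.5):
    """1 (positive) at or above the threshold, 0 (negative) below it."""
    return 1.0 if output >= decision_threshold else 0.0


def continuous(output):
    """Clip a single output to the range [0, 1]."""
    return min(1.0, max(0.0, output))


def competitive(outputs):
    """Probability 1 for the greatest output, 0 for the rest."""
    winner = outputs.index(max(outputs))  # first maximum wins ties (assumption)
    return [1.0 if i == winner else 0.0 for i in range(len(outputs))]


def softmax(outputs):
    """Exponentials normalized so that the results sum to 1."""
    shift = max(outputs)  # subtracting a constant does not change the ratios
    exponentials = [math.exp(o - shift) for o in outputs]
    total = sum(exponentials)
    return [e / total for e in exponentials]


# Hypothetical outputs of the last perceptron layer for three classes.
outputs = [0.2, 1.7, -0.4]
hard_probabilities = competitive(outputs)  # [0.0, 1.0, 0.0]
soft_probabilities = softmax(outputs)      # largest value for the second class
```

Both multiple classification methods pick the same class here; the softmax additionally conveys how strongly that class is preferred.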

As we have seen, a neural network might be composed of different types of layers, depending on the particular needs of the predictive model.

Next, we describe the most common neural network configurations for each application type.

A neural network can be symbolized as a graph, where nodes represent neurons and edges represent connections among neurons. An edge label represents the parameter of the neuron into which the flow goes.

Most neural networks, even biological neural networks, exhibit a layered structure. Therefore, layers are the basis to determine the architecture of a neural network.

A neural network is built up by organizing layers of neurons in a network architecture. The characteristic network architecture here is the so-called feed-forward architecture. In a feed-forward neural network, layers are grouped into a sequence, so that neurons in any layer are connected only to neurons in the next layer.

The next figure represents a neural network with 4 inputs, several layers of different types and 3 outputs.

An approximation model usually contains a scaling layer, several perceptron layers, and an unscaling layer. A neural network for approximation might also contain a bounding layer.

Most of the time, two layers of perceptrons will be enough to represent the data set. For very complex data sets, deeper architectures with three, four, or more layers of perceptrons might be required.

The following figure represents a neural network to estimate the power generated by a combined cycle power plant as a function of meteorological and plant variables. This neural network has 4 inputs and 1 output. It consists of a scaling layer (yellow), a perceptron layer with 4 neurons (blue), a perceptron layer with 1 neuron (blue) and an unscaling layer (red).
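An architecture of this shape (scaling layer, perceptron layer with 4 neurons, perceptron layer with 1 neuron, unscaling layer) can be sketched as a feed-forward pass in which each layer feeds only the next one. Everything numerical below is a placeholder and not a trained model: the weights, biases and ranges are hypothetical, and the choices of minimum and maximum scaling, a hyperbolic tangent hidden layer and a linear output layer are assumptions.

```python
import math


def scale(inputs, minimums, maximums):
    """Scaling layer: minimum and maximum method, range [-1, 1]."""
    return [2.0 * (x - lo) / (hi - lo) - 1.0
            for x, lo, hi in zip(inputs, minimums, maximums)]


def perceptron_layer(inputs, weights, biases, activation):
    """Perceptron layer: one row of weights and one bias per neuron."""
    return [activation(b + sum(w * x for w, x in zip(ws, inputs)))
            for ws, b in zip(weights, biases)]


def unscale(outputs, minimums, maximums):
    """Unscaling layer: inverse of the minimum and maximum method."""
    return [lo + 0.5 * (y + 1.0) * (hi - lo)
            for y, lo, hi in zip(outputs, minimums, maximums)]


def approximation_network(inputs, p):
    """Feed-forward pass: scaling, two perceptron layers, unscaling."""
    x = scale(inputs, p["input_minimums"], p["input_maximums"])
    h = perceptron_layer(x, p["hidden_weights"], p["hidden_biases"], math.tanh)
    y = perceptron_layer(h, p["output_weights"], p["output_biases"], lambda c: c)
    return unscale(y, p["output_minimums"], p["output_maximums"])


# Placeholder parameters for 4 inputs, 4 hidden neurons and 1 output.
parameters = {
    "input_minimums": [0.0, 0.0, 0.0, 0.0],
    "input_maximums": [10.0, 10.0, 10.0, 10.0],
    "hidden_weights": [[0.1, -0.2, 0.3, 0.0],
                       [0.0, 0.4, -0.1, 0.2],
                       [-0.3, 0.1, 0.0, 0.5],
                       [0.2, 0.2, -0.2, -0.2]],
    "hidden_biases": [0.0, 0.0, 0.0, 0.0],
    "output_weights": [[0.5, -0.5, 0.25, 0.1]],
    "output_biases": [0.0],
    "output_minimums": [400.0],
    "output_maximums": [500.0],
}

prediction = approximation_network([5.0, 5.0, 5.0, 5.0], parameters)
```

With these placeholder values, an input at the midpoint of every range is scaled to zero, so the prediction lands at the midpoint of the output range.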

A classification problem usually requires a scaling layer, two perceptron layers and a probabilistic layer. It might also contain a principal components layer.

Most of the time, two layers of perceptrons will be enough to represent the data set.

The following figure is a binary classification model for the diagnosis of breast cancer from fine-needle aspirates. This neural network has 9 inputs and 1 output. It consists of a scaling layer (yellow), a layer of 1 perceptron (blue), a layer of 1 perceptron (blue) and a probabilistic layer (red).