Introduction to Artificial Neural Networks (ANNs)

Posted by Rahmad Sadli on February 7, 2020 in Deep Learning, Machine Learning

Artificial Neural Networks (ANNs), inspired by the human brain, are built from a collection of connected units called neurons that process and pass information to one another.

The simplest neural network consists of only a single neuron, some inputs \textbf{x} = (x_1, x_2, \dots, x_n), and a bias b, as illustrated in the following figure.

Figure: the simplest neural network, a single neuron.

All the inputs and the bias are connected to this neuron. These connections are called synapses, and each synapse has a weight W_i.

The hypothesis output of this simplest neural network is written as:

(1)   \begin{equation*} h_{W,b}(\textbf{x}) = f\left(\sum_{i=1}^{n} W_i x_i + b\right) \end{equation*}

The function f is called the activation function.
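As a quick illustration of Equation 1, here is a minimal NumPy sketch of a single-neuron forward pass; the function names and the choice of the sigmoid (introduced below) for f are assumptions made only for this example.

```python
import numpy as np

def sigmoid(z):
    # One common choice for the activation function f (discussed below).
    return 1.0 / (1.0 + np.exp(-z))

def single_neuron(x, W, b):
    # Equation 1: h_{W,b}(x) = f(sum_i W_i * x_i + b)
    return sigmoid(np.dot(W, x) + b)

# Example with n = 3 inputs and arbitrary weights and bias.
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.1, 0.4, -0.2])
b = 0.3
print(single_neuron(x, W, b))
```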

There are many kinds of activation functions used in neural network implementations; the most commonly used are the step function, the sigmoid function, the hyperbolic tangent (tanh), and the Rectified Linear Unit (ReLU).

Activation Function

In the description of the simplest neural network above, the sigmoid function is used as the activation function.

The sigmoid function is one of the most popular activation functions used in neural network systems. It is written as:

(2)   \begin{equation*} f(z)=\frac{1}{1+e^{-z}} \end{equation*}

Note that there are other common choices of activation function, namely the hyperbolic tangent (tanh) and the rectified linear unit (ReLU).

The tanh function is written as:

(3)   \begin{equation*} f(z)=\tanh(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \end{equation*}

The rectified linear activation function is given by:

(4)   \begin{equation*} f(z)=\max(0,z) \end{equation*}

In practice, for deep neural networks, the rectified linear function often works better than the sigmoid and tanh functions.

The following figure shows the plots of the sigmoid, tanh and rectified linear functions (ReLU).

Figure: plots of the sigmoid, tanh, and rectified linear (ReLU) functions.
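All three activation functions translate directly into code. Below is a minimal NumPy sketch; the function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    # Equation 2: f(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Equation 3: f(z) = (e^z - e^{-z}) / (e^z + e^{-z})
    return np.tanh(z)

def relu(z):
    # Equation 4: f(z) = max(0, z), applied element-wise
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```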

Multi-Layer Neural Network

The simplest neural network described above is a very limited model. To form a multi-layer neural network, we can hook these simple neurons together so that the output of one neuron becomes the input of another.

The following figure shows a simple multi-layer neural network with two hidden layers.

Figure: a simple multi-layer neural network with two hidden layers.

This network has four layers: two inputs x_1 and x_2 in the input layer (layer L_1), one output in the output layer (layer L_4), and two hidden layers, L_2 and L_3. The circles labeled “+1” are called bias units and correspond to the intercept term.
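To make the notation concrete before moving on, the sketch below initializes the weights W^{(l)} and biases b^{(l)} for such a network with small random values; the layer sizes of 2, 3, 2, and 1 units are an assumption read off the figure (and consistent with Equations 5 and 6 below).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layer sizes: 2 inputs, hidden layers with 3 and 2 units, 1 output.
layer_sizes = [2, 3, 2, 1]

# W[l-1] in the code plays the role of W^{(l)} in the text: one row per unit
# in layer l+1 and one column per unit in layer l; b[l-1] plays the role of b^{(l)}.
W = [0.01 * rng.standard_normal((n_out, n_in))
     for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
b = [np.zeros(n_out) for n_out in layer_sizes[1:]]
```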

Feed Forward Propagation

Now, we’re going to analyze this multi-layer neural network.


We write W_{ij}^{(l)} to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l+1. The number of layers in the network is denoted by n_l. So, the above multi-layer neural network model has n_l=4.

The activation of unit i in layer l is denoted by a_i^{(l)}. For example, for l=1, a_i^{(1)} denotes the activation of unit i in layer 1.

We also use a_i^{(1)} = x_i to denote the i-th input.

Given a fixed setting of the parameters W and b, our neural network defines a hypothesis h_{W,b}(x). It outputs a real number.

Now, we are going to look particularly at the last two layers, L_3 and L_4.

Layer L_4 produces the output hypothesis and layer L_3 is the last hidden layer. Specifically, the activations in layer L_3 are computed as follows:

(5)   \begin{equation*} \begin{split} a_1^{(3)}=f(W_{11}^{(2)}a_1^{(2)}+W_{12}^{(2)}a_2^{(2)}+ W_{13}^{(2)}a_3^{(2)}+ b_1^{(2)})\\ a_2^{(3)}=f(W_{21}^{(2)}a_1^{(2)}+W_{22}^{(2)}a_2^{(2)}+ W_{23}^{(2)}a_3^{(2)}+ b_2^{(2)})\\ \end{split} \end{equation*}

The hypothesis of the output of this neural network can be written as:

(6)   \begin{equation*} h_{W,b}(x)= a_1^{(4)} =f(W_{11}^{(3)}a_1^{(3)} + W_{12}^{(3)} a_2^{(3)} + b_1^{(3)})  \end{equation*}

If we let z_i^{(l)} denote the total weighted sum of inputs to unit i in layer l, including the bias term, and s_l denote the number of units in layer l, we have:

(7)   \begin{equation*} z_i^{(l)}=\sum_{j=1}^{s_{l-1}}  W_{ij}^{(l-1)}a_j^{(l-1)}+ b_i^{(l-1)} \end{equation*}

so that: a_i^{(l)}=f(z_i^{(l)})

or, for the next layer l+1:

(8)   \begin{equation*} z_i^{(l+1)}=\sum_{j=1}^{s_l}  W_{ij}^{(l)}a_j^{(l)}+ b_i^{(l)} \end{equation*}

Using matrix-vector notation, Equation 8 above can be written as follows:

(9)   \begin{equation*} \textbf{z}^{(l+1)}=\textbf{W}^{(l)}\textbf{a}^{(l)}+\textbf{b}^{(l)} \end{equation*}

so that:

(10)   \begin{equation*} \begin{split} \textbf{a}^{(l+1)} & =f(\textbf{z}^{(l+1)}) \\ & =f(\textbf{W}^{(l)}\textbf{a}^{(l)}+\textbf{b}^{(l)})\\ \end{split}  \end{equation*}

The output in Equation 6 can then be written as:

(11)   \begin{equation*} \begin{split} h_{W,b} (x) & = \textbf{a}^{(4)} \\ & =f(\textbf{z}^{(4)}) \\ & = f(\textbf{W}^{(3)}\textbf{a}^{(3)}+\textbf{b}^{(3)})\\ \end{split}  \end{equation*}
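Equations 9 to 11 map almost line by line onto a vectorized forward pass. The sketch below assumes the sigmoid activation and the 2-3-2-1 layer sizes used in the earlier initialization sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, W, b):
    # Equations 9 and 10: z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}, a^{(l+1)} = f(z^{(l+1)}).
    a = x
    for W_l, b_l in zip(W, b):
        a = sigmoid(W_l @ a + b_l)
    # Equation 11: the final activation is the hypothesis h_{W,b}(x).
    return a

# Small random parameters for the assumed 2-3-2-1 architecture.
rng = np.random.default_rng(0)
sizes = [2, 3, 2, 1]
W = [0.01 * rng.standard_normal((n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n_out) for n_out in sizes[1:]]

print(feedforward(np.array([0.6, -0.4]), W, b))  # a single real-valued output
```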

Back Propagation

Suppose we have a fixed training set \lbrace (x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \rbrace of m training examples. We can train our neural network using batch gradient descent. For a single training example (x, y), the cost function is defined as the one-half squared error:

(12)   \begin{equation*} J(W,b;x,y)= \frac{1}{2} \Vert h_{W,b} (x)-y \Vert^2 \end{equation*}

Given a training set of m examples, we then define the overall cost function to be:

(13)   \begin{equation*} \begin{split} J(W,b) & = \left[\frac{1}{m} \sum_{i=1}^m J(W,b;x^{(i)},y^{(i)})\right] +\frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (W_{ji}^{(l)})^2\\ & = \left[\frac{1}{m} \sum_{i=1}^m \left( \frac{1}{2} \Vert h_{W,b}( x^{(i)})-y^{(i)} \Vert ^2 \right)\right] +\frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (W_{ji}^{(l)})^2\\ \end{split} \end{equation*}

The first term is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent over-fitting. The weight decay parameter \lambda controls the relative importance of the two terms.
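As a rough sketch of Equation 13, assuming a predict function that performs the forward pass of Equations 9 to 11, the overall cost with weight decay could be computed as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, W, b):
    # Forward pass h_{W,b}(x), as in Equations 9-11.
    a = x
    for W_l, b_l in zip(W, b):
        a = sigmoid(W_l @ a + b_l)
    return a

def overall_cost(W, b, X, Y, lam):
    # Equation 13: average one-half squared error plus the weight decay term.
    m = len(X)
    data_term = sum(0.5 * np.sum((predict(x, W, b) - y) ** 2) for x, y in zip(X, Y)) / m
    decay_term = 0.5 * lam * sum(np.sum(W_l ** 2) for W_l in W)
    return data_term + decay_term
```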

Gradient Descent

When we train the neural network, the goal is to minimize J(W,b) as a function of W and b. We first initialize all parameters W_{ij}^{(l)} and b_{i}^{(l)} to small random values and then apply the gradient descent algorithm to optimize them.


To implement the gradient descent algorithm, the parameters W and b must be updated as follows:

(14)   \begin{equation*} \begin{split} W_{ij}^{(l)} & = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)\\ b_i^{(l)} & =b_i^{(l)} -\alpha \frac{\partial}{\partial b_i^{(l)}} J(W,b)\\ \end{split} \end{equation*}

where \alpha is the learning rate.

To carry out back propagation, the partial derivatives of the overall cost function J(W,b) are computed as:

(15)   \begin{equation*} \begin{split} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b) & = \left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x^{(i)},y^{(i)})\right] +\lambda W_{ij}^{(l)}\\ \frac{\partial}{\partial b_i^{(l)}} J(W,b) & = \left[ \frac{1}{m} \sum_{i=1}^m \frac{\partial}{\partial b_i^{(l)}} J(W,b;x^{(i)},y^{(i)})\right]\\ \end{split} \end{equation*}

In detail, the back propagation algorithm is:

  1. Perform a feedforward pass, computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.
  2. For each output unit i in layer n_l (the output layer), set:

    (16)   \begin{equation*}\delta_i^{(n_l)}= \frac{\partial}{\partial z_i^{(n_l)}} \frac{1}{2} \Vert y - h_{W,b}(x) \Vert^2 = -(y_i - a_i^{(n_l)}) \cdot f'(z_i^{(n_l)})\end{equation*}

  3. For l=n_l-1,n_l-2,n_l-3, …, 2:
    For each node i in layer l, set:

    (17)   \begin{equation*}\delta_i^{(l)} = \left(    \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)}\right) f'(z_i^{(l)})\end{equation*}

  4. Compute the desired partial derivatives, which are given as:

    (18)   \begin{equation*} \begin{split} \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) & =a_j^{(l)} \delta_i^{(l+1)}\\ \frac{\partial}{\partial b_i^{(l)}} J(W,b;x,y) & =\delta_i^{(l+1)}\\ \end{split} \end{equation*}

In matrix-vector notation, the algorithm above can be rewritten as:

  1. Perform a feedforward pass by computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.
  2. For the output layer (layer n_l), set (here \bullet denotes the element-wise product):

    (19)   \begin{equation*}\delta^{(n_l)}= -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})\end{equation*}

  3. For l=n_l-1,n_l-2,n_l-3, …, 2, set:

    (20)   \begin{equation*}\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \bullet f'(z^{(l)}) \end{equation*}

  4. Compute the desired partial derivatives, which are given as:

    (21)   \begin{equation*} \begin{split} \nabla_{W^{(l)}} J(W,b;x,y) & =\delta^{(l+1)} (a^{(l)})^T\\ \nabla_{b^{(l)}} J(W,b;x,y) & =\delta^{(l+1)} \\ \end{split} \end{equation*}
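A per-example back propagation pass following Equations 19 to 21 might look like the sketch below; it assumes the sigmoid activation, whose derivative is f'(z) = f(z)(1 - f(z)), and it caches z^{(l)} and a^{(l)} during the forward pass:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: f'(z) = f(z) * (1 - f(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, W, b):
    # Forward pass, caching z^{(l)} and a^{(l)} for every layer.
    a = x
    zs, activations = [], [x]
    for W_l, b_l in zip(W, b):
        z = W_l @ a + b_l
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Equation 19: delta for the output layer.
    delta = -(y - activations[-1]) * sigmoid_prime(zs[-1])

    grad_W = [None] * len(W)
    grad_b = [None] * len(b)
    for l in reversed(range(len(W))):
        # Equation 21: per-example partial derivatives for this layer.
        grad_W[l] = np.outer(delta, activations[l])
        grad_b[l] = delta
        if l > 0:
            # Equation 20: propagate delta back one layer.
            delta = (W[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grad_W, grad_b
```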

Now, we are ready to derive the full Gradient Descent Algorithm.

  1. Set \triangle W^{(l)} := 0 and \triangle b^{(l)} := 0 (matrix/vector of zeros) for all l.
  2. For i=1 to m,
    (a) Use back propagation to compute \nabla_{W^{(l)}}  J(W,b;x^{(i)},y^{(i)}) and \nabla_{b^{(l)}} J(W,b;x^{(i)},y^{(i)})
    (b) \triangle W^{(l)} :=  \triangle W^{(l)} + \nabla_{W^{(l)}}  J(W,b;x^{(i)},y^{(i)})
    (c) \triangle b^{(l)} :=  \triangle b^{(l)} + \nabla_{b^{(l)}}  J(W,b;x^{(i)},y^{(i)})
  3. Update the parameters:

    (22)   \begin{equation*}    \begin{split}         W^{(l)} & = W^{(l)} - \alpha \left[ \left( \frac{1}{m}\triangle W^{(l)} \right) + \lambda W^{(l)}  \right]  \\         b^{(l)} & = b^{(l)} - \alpha \left[ \frac{1}{m}\triangle b^{(l)} \right] \\      \end{split}    \end{equation*}

Finally, the neural network can now be trained by repeating the gradient descent steps to reduce the cost function J(W,b).
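Putting the pieces together, a minimal sketch of the batch gradient descent loop above could look as follows; it reuses the backprop function from the previous sketch, and the learning rate, weight decay, and epoch count are arbitrary illustrative values.

```python
import numpy as np

def train(X, Y, W, b, alpha=0.1, lam=1e-4, epochs=1000):
    # X and Y are lists of per-example input and target arrays.
    m = len(X)
    for _ in range(epochs):
        # Step 1: set the accumulators (Delta W, Delta b) to zero.
        acc_W = [np.zeros_like(W_l) for W_l in W]
        acc_b = [np.zeros_like(b_l) for b_l in b]
        # Step 2: accumulate the per-example gradients from back propagation.
        for x, y in zip(X, Y):
            grad_W, grad_b = backprop(x, y, W, b)  # sketch from the previous section
            for l in range(len(W)):
                acc_W[l] += grad_W[l]
                acc_b[l] += grad_b[l]
        # Step 3: update the parameters (Equation 22), with weight decay on W only.
        for l in range(len(W)):
            W[l] -= alpha * (acc_W[l] / m + lam * W[l])
            b[l] -= alpha * (acc_b[l] / m)
    return W, b
```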

Reference

Andrew Ng et al., ‘Welcome to the Deep Learning Tutorial!: Multi-Layer Neural Network’, http://deeplearning.stanford.edu/tutorial
