Alan Turing, a mathematician and pioneer of artificial intelligence, built one of the first machines that could ‘learn’. This machine was used to decode the German Enigma ciphers during WW-II, which contributed to the eventual victory of the Allies in the war. It was the first organised attempt at building a machine that could learn from the inputs provided to it and predict outcomes accordingly. He also devised a test, famously known as ‘The Turing test’, to determine, based on a questionnaire, whether a judge could differentiate between a computer and a human.
If the judge could not, the computer would pass the Turing test of artificial intelligence and, for the purposes of the test, be indistinguishable from a human. But the pertinent question that remained unanswered, and which has kept several engineers, philosophers and commentators busy for decades, is whether it is appropriate to compare (and even replace) human intelligence with artificial intelligence. The answer to this question was also provided by Turing, albeit with caveats: it was not only appropriate to do so, but also inevitable. Just as we all have different ways of thinking and of approaching a problem, a computer is simply ‘different’ from us in that minor form.
Further enquiry into this ‘minor form’, and a comparison with modern deep learning architectures, reveals a striking similarity between the human brain and the convolutional neural networks (CNNs) and artificial neural networks (ANNs) used in deep learning algorithms.
Activation Functions | in simple words:
Just as the basic building block of the human brain is the neuron, which underlies various functions of the brain such as decision making, identification and anticipation, so a ‘neuron’ in ANNs and CNNs is the unit to which an activation function is applied. Not just the architecture is similar, but the functioning too. A simplified working of a neuron can thus be visualised as in the sketch below:
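As a rough sketch (the names `neuron`, `weights`, `bias` and `activation` below are illustrative choices, not a fixed API or the original article's figure), a single artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function:

```python
import numpy as np

def neuron(inputs, weights, bias, activation):
    """Simplified artificial neuron: weighted sum of inputs plus bias,
    passed through an activation function."""
    z = np.dot(weights, inputs) + bias   # pre-activation (linear combination)
    return activation(z)                 # post-activation output

# Example: a neuron with a sigmoid activation
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.2, 3.0])           # incoming signals
w = np.array([0.4, 0.1, -0.7])           # weights (learned during training)
print(neuron(x, w, bias=0.2, activation=sigmoid))
```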
Types of activation functions:
- Binary step function.
- Linear function.
- Non-linear function.
1. Binary step function:
This is a threshold function, which classifies an input into one of two classes based on a single threshold value. Hence, it cannot be used for classification into more than two classes.
One of the major disadvantages of this activation function is the vanishing gradient problem: the gradient (derivative) is zero both before and after the threshold value. This renders back-propagation in deep learning infeasible.
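A minimal sketch of the step function, assuming a threshold of 0 (the threshold value itself is an illustrative choice):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Outputs 1 where the input exceeds the threshold, 0 otherwise."""
    return np.where(x > threshold, 1.0, 0.0)

# The derivative is 0 on either side of the threshold, so gradients vanish
# and back-propagation cannot update the weights.
x = np.array([-2.0, -0.5, 0.5, 2.0])
print(binary_step(x))   # [0. 0. 1. 1.]
```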
2. Linear activation function:
This function takes the input to a neuron and produces an output proportional to it. Unlike the step function, it can therefore classify inputs into multiple categories.
Disadvantages:
- Back-propagation is effectively of no use, as the derivative is a constant that carries no information about the input.
- As the derivative is constant, subsequent layers in a deep neural network become linearly dependent on preceding layers, making the last layer linearly dependent on the first. This turns the entire neural network into a linear regression model, incapable of handling complex data (see the sketch after this list).
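A brief sketch of why stacked layers with linear activations collapse into a single linear map (the weight matrices below are arbitrary illustrative values, not from the original article):

```python
import numpy as np

def linear_activation(x, c=1.0):
    """Linear activation: output proportional to input; derivative is the constant c."""
    return c * x

# Two "layers" with linear activations...
W1 = np.array([[1.0, 2.0], [0.5, -1.0]])
W2 = np.array([[3.0, 0.0], [1.0, 1.0]])
x = np.array([1.0, -2.0])

two_layers = W2 @ linear_activation(W1 @ x)
# ...are equivalent to one layer whose weight matrix is W2 @ W1,
# i.e. the whole network behaves like a single linear (regression) model.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))   # True
```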
3. Non-linear activation functions:
1. Sigmoid or logistic:
A smooth curve with no abrupt ‘steps’ in its output values. It provides a smooth gradient, aiding effective back-propagation in deep neural nets, and normalises output values between 0 and 1, making predictions neat. The disadvantage of vanishing gradients persists at large input values. It is also computationally tedious and may cause the deep neural net to get stuck during training.
Range: (0, 1)
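A minimal sketch of the sigmoid and its gradient; the near-zero derivative at large |x| is the vanishing-gradient behaviour mentioned above:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; approaches 0 for large |x| (vanishing gradient)."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # close to 0, exactly 0.5, close to 1
print(sigmoid_grad(x))  # tiny, 0.25, tiny
```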
2. Tanh (hyperbolic tangent):
Similar in form and shape to the sigmoid function, but with the distinction of being centered around zero. This makes it more suitable for input values that lie in the range of extreme positives or negatives.
Range: (-1,1).
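A short sketch of tanh, showing the zero-centered output (the sample inputs are illustrative only):

```python
import numpy as np

x = np.array([-3.0, 0.0, 3.0])
# np.tanh squashes inputs into (-1, 1) and is centered around zero:
# negative inputs map to negative outputs, positive to positive.
print(np.tanh(x))   # approximately [-0.995, 0.0, 0.995]
```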
Both sigmoid and tanh activation functions are used in feed-forward neural nets.
3. ReLU (Rectified Linear Unit):
The function max{0, x} is called the ReLU function. Though it looks linear, it is a non-linear function. The output is 0 for input values less than 0 and proportional to the input for values greater than 0. This results in vanishing gradients for input values less than zero (the ‘dying ReLU’ problem) and faster convergence for values greater than zero. Also, for values less than zero the graph drops abruptly to zero and stays there for all negative inputs, resulting in inappropriate mapping of these negative values.
Range: [0, ∞)
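A minimal sketch of ReLU and its gradient, illustrating the dying-ReLU behaviour for negative inputs:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 0 for negative inputs (dying ReLU), 1 for positive inputs."""
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-2.0, -0.1, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```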
4. Leaky ReLU:
max{0.01*x, x}. This prevents the vanishing gradient (dying ReLU) problem of the ReLU function, but the output values for inputs less than zero can at times be inconsistent.
Range: (-∞, ∞)
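A brief sketch of leaky ReLU with the conventional 0.01 slope for negative inputs:

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: x for positive inputs, 0.01 * x for negative inputs,
    so the gradient never vanishes completely."""
    return np.where(x > 0, x, negative_slope * x)

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))   # [-0.04 -0.01  0.    2.  ]
```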
5. Parametric ReLU:
This is similar to the leaky ReLU function, except that the slope for negative input values is provided as a parameter (a), which the neural network learns for itself to ensure optimal back-propagation.
Range: (-∞, ∞)
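A sketch of parametric ReLU; here `a` stands in for the learnable negative-side slope (in a real network it would be updated during training along with the weights, which is not shown):

```python
import numpy as np

def prelu(x, a):
    """Parametric ReLU: like leaky ReLU, but the negative-side slope `a`
    is a learnable parameter rather than a fixed constant."""
    return np.where(x > 0, x, a * x)

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(prelu(x, a=0.25))   # [-1.   -0.25  0.    2.  ]
```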
6. Softmax:
This is in essence a probability function: it exponentiates each output class score and divides by the sum of those exponentials, so every score is scaled between 0 and 1 and the scores sum to 1. It is invariably used in the output layer of neural networks that classify into multiple categories.
Range: (0, 1)
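A minimal, numerically stable sketch of softmax (subtracting the maximum score before exponentiating is a common overflow-avoidance trick, not something the article specifies):

```python
import numpy as np

def softmax(z):
    """Exponentiates each score and divides by the sum of exponentials,
    yielding a probability distribution over the classes."""
    z = z - np.max(z)          # numerical stability; does not change the result
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # raw outputs (logits) of the last layer
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to 1
```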
7. Swish:
This is a self-gated activation function developed by researchers at Google. It is similar in shape to ReLU but smooth, and Google reports that it matches or outperforms ReLU on deeper models.
Range: approximately [-0.28, ∞)
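A sketch of Swish in its common form x · sigmoid(x), i.e. with β = 1 (β can also be made a learnable parameter):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish (self-gated) activation: x * sigmoid(beta * x).
    Smooth and non-monotonic, unlike ReLU."""
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
# Small negative outputs for negative inputs, roughly x for large positive inputs.
print(swish(x))
```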
Further Research:
Ongoing topics of research include not only building activation functions with better efficiency and effectiveness, but also developing deep neural nets that decide for themselves the appropriate activation function for each layer.
Written By: Sachin Shastri
Reviewed By: Vikas Bhardwaj