Activation Functions in Neural Networks


An activation function decides whether a neuron should be activated or not. In other words, it uses simple mathematical operations to decide whether the neuron’s input to the network is important for the prediction.

The role of the Activation Function is to derive output from a set of input values fed to a node (or a layer).

The primary role of the Activation Function is to transform the summed weighted input from the node into an output value to be fed to the next hidden layer or as output. 

If you want to dig deeper into ML concepts, you may go through this blog also.

In this blog, we will cover the following topics:

  • Introduction 
  • Why do neural networks need an activation function?
  • Types of activation function
  • Why are deep neural networks hard to train?
  • How to choose the right Activation Function?


A neural network is made of interconnected neurons. Each of them is characterised by its weight, bias, and activation function.

Here are other elements of this network.

Input Layer 

The input layer takes raw input from the domain. No computation is performed at this layer. Nodes here just pass on the information (features) to the hidden layer. 

Hidden Layer

As the name suggests, the nodes of this layer are not exposed. They provide an abstraction to the neural network. 

The hidden layer performs all kinds of computation on the features entered through the input layer and transfers the result to the output layer.

Output Layer

It’s the final layer of the network that brings the information learned through the hidden layer and delivers the final value as a result.

Why do Neural Networks Need an Activation Function?

Activation functions introduce an additional step at each layer during forward propagation, but the extra computation is worth it. Here is why:

Let’s suppose we have a neural network working without the activation functions. 

In that case, every neuron will only be performing a linear transformation on the inputs using the weights and biases. It’s because it doesn’t matter how many hidden layers we attach in the neural network; all layers will behave in the same way because the composition of two linear functions is a linear function itself.

Although the neural network becomes simpler, learning any complex task is impossible, and our model would be just a linear regression model.

Mathematical proof

Suppose we have a neural net with two input features and one hidden layer of two neurons.

Elements of the diagram:

Hidden layer, i.e. layer 1:

z(1) = W(1)X + b(1)

a(1) = z(1)


  • z(1) is the vectorized output of layer 1
  • W(1) is the matrix of weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
  • X is the vector of input features, i.e. i1 and i2
  • b(1) is the vector of biases assigned to the neurons of the hidden layer, i.e. b1 and b2
  • a(1) is the vectorized activation of layer 1, here simply the linear function a(1) = z(1)

(Note: We are not considering an activation function here.)

Layer 2, i.e. the output layer:

Note: The input to layer 2 is the output from layer 1.

z(2) = W(2)a(1) + b(2)  

a(2) = z(2) 

Calculation at the output layer:

Substituting the value of z(1) here (recall that a(1) = z(1)):

z(2) = (W(2) * [W(1)X + b(1)]) + b(2) 

z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]


Let

    [W(2) * W(1)] = W

    [W(2)*b(1) + b(2)] = b

Final output: z(2) = W*X + b

which is again a linear function.

So even after adding a hidden layer, the output is still a linear function of the input. We can conclude that no matter how many hidden layers we attach to a neural net, all layers behave the same way, because the composition of two linear functions is itself a linear function. A neuron cannot learn complex patterns with just a linear function attached to it. A non-linear activation function lets the network adjust its weights according to the error.

Hence we need an activation function.
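The collapse derived above can be checked numerically. The following is a minimal pure-Python sketch; the weights, biases, and inputs are made-up illustrative values (not taken from the original diagram), and the output layer is assumed to have a single neuron.

```python
# Sketch: two stacked linear layers give the same result as one linear layer.
# All numbers below are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Add two vectors element-wise."""
    return [a + b for a, b in zip(u, v)]

W1 = [[0.5, -1.0], [2.0, 0.25]]  # hidden layer weights (w1..w4)
b1 = [0.1, -0.2]                 # hidden layer biases (b1, b2)
W2 = [[1.5, -0.5]]               # output layer weights
b2 = [0.3]                       # output layer bias
X = [1.0, 2.0]                   # input features (i1, i2)

# Layer-by-layer forward pass with no activation, so a(1) = z(1)
z1 = add(matvec(W1, X), b1)
z2 = add(matvec(W2, z1), b2)

# Collapsed single layer: W = W(2) * W(1), b = W(2) * b(1) + b(2)
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
z2_collapsed = add(matvec(W, X), b)

print(z2, z2_collapsed)  # the two results agree up to floating-point rounding
```

Both paths produce the same output, which is why extra linear layers add no expressive power.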

Types of Activation Function

  • Binary Step Function
  • Linear Activation Function
  • Non-Linear Activation Functions

Binary Step Function

The binary step function depends on a threshold value that decides whether a neuron should be activated or not. 

The input fed to the activation function is compared to a certain threshold; if the input is greater than it, then the neuron is activated, else it is deactivated, meaning that its output is not passed on to the next hidden layer.

Activation functions are useful for applying weights to certain components within a system. Given an input vector, x, which contains some numerical values, the activation function will produce an output vector, y, which is subject to some useful constraints.


The binary step function, with the threshold at 0, follows the form:

f(x) = 

0 for x < 0

1 for x ≥ 0


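  • Range – the output is either 0 or 1
  • Monotonicity – the function is monotonic (non-decreasing); for a single-layer model, a monotonic activation gives a convex error surface, so optimisation can be achieved faster
  • Derivative – is 0 when x ≠ 0, and undefined when x = 0
  • Discontinuous – the output jumps from 0 to 1 at x = 0


The binary step function can be used as an activation function while creating a binary classifier.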


Here are some of the limitations of binary step function:

  • It cannot provide multi-value outputs; for example, it cannot be used for multi-class classification problems. 
  • The gradient of the step function is zero, which hinders the backpropagation process.
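As a minimal sketch of the behaviour described above, assuming a threshold of 0 and using hypothetical function names, the step function and its derivative might be written as:

```python
def binary_step(x, threshold=0.0):
    """Return 1 when x reaches the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

def binary_step_grad(x, threshold=0.0):
    """Derivative: 0 away from the threshold, undefined exactly at it."""
    if x == threshold:
        return None  # undefined at the jump; None is only a placeholder here
    return 0.0

outputs = [binary_step(x) for x in (-2.0, -0.1, 0.0, 0.5, 3.0)]
print(outputs)  # [0, 0, 1, 1, 1]
```

Because the gradient is zero wherever it is defined, a gradient-based update has nothing to work with, which is the backpropagation limitation noted in the list.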

Linear Activation Function

The linear activation function, also known as the identity function, is one where the activation is proportional to the input.

  • Equation: a linear function has an equation similar to that of a straight line,

  i.e.   y = ax

  • No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is just a linear function of the input to the first layer.
  • Range: -inf to +inf
  • Uses: the linear activation function is typically used in just one place, i.e. the output layer.




However, a linear activation function has two major problems:

  • Backpropagation gains nothing from it, as the derivative of the function is a constant and has no relation to the input x. 
  • All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer.

Non-Linear Activation Functions

Non-linear activation functions solve the following limitations of linear activation functions:

  • They allow backpropagation because the derivative now depends on the input, so it is possible to go back and work out which weights in the input neurons can provide a better prediction.
  • They allow the stacking of multiple layers of neurons, as the output is now a non-linear combination of the input passed through multiple layers. Any output can be represented as a functional computation in a neural network.

Now, let’s have a look at ten different non-linear activation functions and their characteristics. 

Sigmoid / Logistic Activation Function 

  • It is a function which is plotted as an ‘S’-shaped graph.
  • This function takes any real value as input and outputs values in the range of 0 to 1. 
  • The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.

Mathematically it can be represented as:

f(x) = 1 / (1 + e^(-x))

Here’s why sigmoid/logistic activation function is one of the most widely used functions:

  • It is commonly used for models where we have to predict a probability as an output. Since a probability exists only in the range of 0 to 1, sigmoid is the right choice because of its range.
  • The function is differentiable and provides a smooth gradient, i.e., it prevents jumps in output values. This is represented by the S-shape of the sigmoid activation function. 

The limitations of the sigmoid function are as follows:

  • The derivative of the function is f'(x) = sigmoid(x)*(1-sigmoid(x)). 

The gradient values are only significant in the range -3 to 3, and the graph gets much flatter in other regions. It implies that for values greater than 3 or less than -3, the function will have very small gradients. As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.

  • The output of the logistic function is not symmetric around zero. So the outputs of all the neurons will have the same sign. This makes the training of the neural network more difficult and unstable.
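To make the shrinking gradient concrete, here is a small sketch using only Python's standard math module. The function names are illustrative, and the sign-based branching is just one common way to avoid overflow in exp, not something prescribed by this article.

```python
import math

def sigmoid(x):
    """Logistic function; branches on the sign of x to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6, -3, 0, 3, 6):
    print(f"x={x:>2}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")
```

The gradient peaks at 0.25 for x = 0 and is already well below 0.01 at x = ±6, which matches the flat regions described above.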

Tanh ( Hyperbolic Tangent )

Tanh function is very similar to the sigmoid/logistic activation function, and even has the same S-shape with the difference in output range of -1 to 1. In Tanh, the larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

Mathematically it can be represented as:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))


Advantages of using this activation function are:

  • The output of the tanh activation function is zero-centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.
  • It is usually used in hidden layers of a neural network, as its values lie between -1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. It helps in centering the data and makes learning for the next layer much easier.


Have a look at the gradient of the tanh activation function to understand its limitations.

It also faces the problem of vanishing gradients, similar to the sigmoid activation function. In addition, the gradient of the tanh function is much steeper than that of the sigmoid function.

Note: Although both sigmoid and tanh face the vanishing gradient issue, tanh is zero-centered, and the gradients are not restricted to move in a certain direction. Therefore, in practice, the tanh nonlinearity is usually preferred to the sigmoid nonlinearity.
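The comparison with sigmoid can be sketched in a few lines. This is an illustrative snippet with hypothetical helper names, relying on the standard identity tanh'(x) = 1 - tanh(x)^2.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the sigmoid, for comparison (fine for moderate x)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:>4}  tanh={math.tanh(x):+.4f}  "
          f"tanh'={tanh_grad(x):.4f}  sigmoid'={sigmoid_grad(x):.4f}")
```

At x = 0 the tanh gradient is 1.0 versus 0.25 for sigmoid, yet both gradients shrink towards zero for large |x|, so tanh is steeper but still saturates.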

ReLU Function

ReLU stands for Rectified Linear Unit. 

Although it gives the impression of a linear function, ReLU has a derivative function and allows for backpropagation while remaining computationally efficient. 

The main catch here is that the ReLU function does not activate all the neurons at the same time. 

A neuron is deactivated only if the output of the linear transformation is less than 0.

Mathematically it can be represented as:

f(x) = max(0, x)


The negative side of the graph makes the gradient value zero. For this reason, during the backpropagation process, the weights and biases of some neurons are not updated. This can create dead neurons which never get activated. This is known as the Dying ReLU problem.

  • All the negative input values become zero immediately, which decreases the model’s ability to fit or train from the data properly. 
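A minimal sketch of ReLU and its gradient follows. Treating the gradient at exactly x = 0 as 0 is an assumption made here for illustration; implementations simply pick a convention at that point.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp the rest to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; 0 is used at x == 0 by convention (an assumption)."""
    return 1.0 if x > 0 else 0.0

inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
print([relu(x) for x in inputs])       # [0.0, 0.0, 0.0, 0.5, 3.0]
print([relu_grad(x) for x in inputs])  # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Every negative input yields both a zero output and a zero gradient, so a neuron that only ever sees negative pre-activations receives no weight updates.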

Leaky ReLU Function

Leaky ReLU is an improved version of the ReLU function that solves the Dying ReLU problem, as it has a small positive slope in the negative area.

Mathematically it can be represented as:

f(x) = max(αx, x), where α is a small positive constant such as 0.01


The advantages of Leaky ReLU are the same as those of ReLU, in addition to the fact that it enables backpropagation even for negative input values. 

By making this minor modification for negative input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that region. 

The derivative of the Leaky ReLU function is 1 for x > 0, and a small positive constant (the slope of the negative part) for x < 0.


The limitations that this function faces include:

  • The predictions may not be consistent for negative input values. 
  • The gradient for negative values is a small value that makes the learning of model parameters time-consuming.
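The following sketch shows the non-zero gradient on the negative side. The slope value 0.01 is an assumed, commonly used default rather than a value given in this article, and the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive x, small slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for positive x, alpha otherwise (alpha also used at x == 0)."""
    return 1.0 if x > 0 else alpha

for x in (-100.0, -1.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

The gradient for negative inputs is never zero, but with a slope this small each update is tiny, which is the slow-learning limitation listed above.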

Parametric ReLU Function

Parametric ReLU is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis. 

This function provides the slope of the negative part of the function as an argument a. By performing backpropagation, the most appropriate value of a is learnt.

The parameterized ReLU function is used when the leaky ReLU function still fails at solving the problem of dead neurons, and the relevant information is not successfully passed to the next layer. 

Mathematically it can be represented as:

f(x) = max(ax, x)

where “a” is the slope parameter for negative values.


This function’s limitation is that it may perform differently for different problems depending upon the value of slope parameter a.
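How the slope a can be learnt is sketched below with a toy example: a single negative input, a squared-error loss, and plain gradient descent. The starting slope, learning rate, input, and target are all made-up values for illustration.

```python
def prelu(x, a):
    """Parametric ReLU: identity for positive x, learnable slope a otherwise."""
    return x if x > 0 else a * x

def prelu_grad_a(x):
    """Derivative of prelu with respect to a: x for non-positive inputs, else 0."""
    return x if x <= 0 else 0.0

# Toy setup (all values are assumptions): learn a so that prelu(-2.0, a) is -1.0
a = 0.25
learning_rate = 0.1
x, target = -2.0, -1.0

for _ in range(50):
    error = prelu(x, a) - target                  # d(0.5 * error^2) / d(output)
    a -= learning_rate * error * prelu_grad_a(x)  # chain rule, then descend

print(round(a, 4))  # approaches 0.5, since 0.5 * -2.0 == -1.0
```

Only negative inputs contribute to the update of a, because the slope has no effect on the positive side.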

Exponential Linear Units (ELUs) Function

Exponential Linear Unit, or ELU for short, is also a variant of ReLU that modifies the slope of the negative part of the function. 

ELU uses an exponential curve to define the negative values, unlike the Leaky ReLU and Parametric ReLU functions, which use a straight line.

Mathematically it can be represented as:

f(x) = x for x ≥ 0

f(x) = α(e^x - 1) for x < 0

ELU is a strong alternative to ReLU because of the following advantages:

  • ELU bends smoothly and saturates slowly towards -α for large negative inputs, whereas ReLU has a sharp corner at zero.
  • It avoids the dead ReLU problem by introducing an exponential curve for negative values of input. It helps the network nudge weights and biases in the right direction.

The limitations of the ELU function are as follows:

  • It increases the computational time because of the exponential operation included
  • No learning of the α value takes place
  • Exploding gradient problem
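A small sketch of ELU and its gradient, assuming the commonly used value α = 1.0 (the article does not fix a value) and illustrative function names:

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (e^x - 1) for negative x."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for x >= 0, alpha * e^x for negative x."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-10.0, -1.0, 0.0, 2.0):
    print(f"x={x:>5}  elu={elu(x):+.5f}  gradient={elu_grad(x):.5f}")
```

For very negative inputs the output levels off just above -α, and the gradient stays positive, so the neuron keeps receiving small updates. The call to exp is the extra computational cost mentioned in the list.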

Softmax Function

Before exploring the ins and outs of the Softmax activation function, we should focus on its building block: the sigmoid/logistic activation function that works on calculating probability values. 

The output of the sigmoid function is in the range of 0 to 1, which can be thought of as a probability. 

However, this function faces certain problems. Let’s suppose we have five output values of 0.8, 0.9, 0.7, 0.8, and 0.6, respectively. How can we move forward with them? The answer is: we can’t. These values don’t make sense as class probabilities, because the probabilities of all the classes should sum to 1. 

The Softmax function can be described as a combination of multiple sigmoids. It calculates relative probabilities: similar to the sigmoid/logistic activation function, the Softmax function returns the probability of each class. It is most commonly used as the activation function for the last layer of the neural network in the case of multi-class classification. 

Mathematically it can be represented as:

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)

Let’s go over a simple example together.

Assume that you have three classes, meaning that there would be three neurons in the output layer. Now, suppose that your output from the neurons is [1.8, 0.9, 0.68].

Applying the softmax function over these values to give a probabilistic view will result in the following outcome: [0.58, 0.23, 0.19]. 

Picking the index of the largest probability then acts like a function that returns 1 for that index and 0 for the other two, giving full weight to index 0 and no weight to index 1 and index 2. So the output would be the class corresponding to the 1st neuron (index 0) out of three.

You can see now how the softmax activation function makes things easy for multi-class classification problems.
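The worked example can be reproduced with a short sketch. Subtracting the maximum before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.8, 0.9, 0.68])
rounded = [round(p, 2) for p in probs]
predicted_class = probs.index(max(probs))  # argmax over the probabilities

print(rounded)          # [0.58, 0.23, 0.19]
print(predicted_class)  # 0
```

Softmax itself returns the full probability vector; the final class decision comes from taking the index of the largest entry.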


Swish

Swish is a self-gated activation function developed by researchers at Google. 

Swish has been reported to consistently match or outperform the ReLU activation function on deep networks applied to various challenging domains such as image classification, machine translation, etc. 

This function is bounded below but unbounded above, i.e. Y approaches 0 as X approaches negative infinity, but Y approaches infinity as X approaches infinity.

Mathematically it can be represented as:

f(x) = x * sigmoid(x)

Here are a few advantages of the Swish activation function over ReLU:

  • Swish is a smooth function, which means that it does not abruptly change direction like ReLU does near x = 0. Rather, it smoothly bends from 0 towards values < 0 and then upwards again.
  • Small negative values are zeroed out in the ReLU activation function. However, those negative values may still be relevant for capturing patterns underlying the data. Large negative values are zeroed out for reasons of sparsity, making it a win-win situation.
  • The Swish function, being non-monotonic, enhances the expression of input data and the weights to be learnt.
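The non-monotonic dip on the negative side can be seen in a short sketch. It assumes the basic form x * sigmoid(x) with no extra scaling parameter, and the helper names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic function; branches on the sign of x to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def swish(x):
    """Swish: the input gated by its own sigmoid."""
    return x * sigmoid(x)

for x in (-10.0, -5.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:>5}  swish={swish(x):+.4f}")
```

The output dips to about -0.27 at x = -1, climbs back towards 0 for more negative inputs, and is close to x for large positive inputs.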

Gaussian Error Linear Unit (GELU)

The Gaussian Error Linear Unit (GELU) activation function is used in BERT, RoBERTa, ALBERT, and other top NLP models. This activation function is motivated by combining properties from dropout, zoneout, and ReLUs. 

ReLU and dropout together yield a neuron’s output. ReLU does it deterministically by multiplying the input by zero or one (depending upon the input value being positive or negative), and dropout does it stochastically by multiplying by zero. 

An RNN regularizer called zoneout stochastically multiplies inputs by one. 

We merge this functionality by multiplying the input by either zero or one, which is stochastically determined and dependent upon the input. We multiply the neuron input x by 

m ∼ Bernoulli(Φ(x)), where Φ(x) = P(X ≤ x), X ∼ N(0, 1) is the cumulative distribution function of the standard normal distribution. 

This distribution is chosen since neuron inputs tend to follow a normal distribution, especially with Batch Normalization.

Mathematically it can be represented as:

GELU(x) = x * Φ(x) ≈ 0.5x * (1 + tanh(√(2/π) * (x + 0.044715x^3)))

The GELU nonlinearity has been reported to perform better than ReLU and ELU activations, with improvements across the evaluated tasks in computer vision, natural language processing, and speech recognition.
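Taking the expected value of the stochastic gate gives x * Φ(x), which can be sketched with the standard library's erf. The tanh-based variant is the widely used approximation; the function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF, computed via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:>4}  gelu={gelu(x):+.5f}  approx={gelu_tanh(x):+.5f}")
```

The two versions agree closely over this range, which is why the cheaper approximation is often used in practice.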

Scaled Exponential Linear Unit (SELU)

SELU has both positive and negative values to shift the mean, which was impossible for ReLU activation function as it cannot output negative values. 

Gradients can be used to adjust the variance. The activation function needs a region with a gradient larger than one to increase it.

Mathematically it can be represented as:

f(x) = λx for x > 0

f(x) = λα(e^x - 1) for x ≤ 0

SELU has predefined values of alpha (α ≈ 1.6733) and lambda (λ ≈ 1.0507). 

Here’s the main advantage of SELU over ReLU:

  • Internal normalization is faster than external normalization, which means the network converges faster.

SELU is a relatively new activation function and needs more papers on architectures such as CNNs and RNNs, where it is comparatively less explored.
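A minimal sketch of SELU with its fixed constants is shown below. The constant values are the standard ones from the self-normalizing networks formulation, written out to full precision; the names ALPHA and LAMBDA are illustrative.

```python
import math

ALPHA = 1.6732632423543772   # predefined alpha
LAMBDA = 1.0507009873554805  # predefined lambda (scale)

def selu(x):
    """SELU: scaled identity for x > 0, scaled exponential curve otherwise."""
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

for x in (-50.0, -1.0, 0.0, 1.0):
    print(f"x={x:>5}  selu={selu(x):+.4f}")
```

Positive inputs are scaled by λ, slightly more than 1, while very negative inputs level off near -λα (about -1.758), so the function can both raise and lower the mean of its inputs.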

Why are deep neural networks hard to train?

There are two challenges you might encounter when training your deep neural networks.

Let’s discuss them in more detail.

Vanishing Gradients

Certain activation functions, like the sigmoid function, squash a large input space into a small output space between 0 and 1. 

Therefore, a large change in the input of the sigmoid function will cause a small change in the output. Hence, the derivative becomes small. For shallow networks with only a few layers that use these activations, this isn’t a big problem. 

However, when more layers are used, it can cause the gradient to be too small for training to work effectively. 
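A rough feel for the effect comes from multiplying the per-layer derivatives together. The sketch below ignores the weights and takes the best case for sigmoid, whose derivative never exceeds 0.25; the function name is hypothetical.

```python
MAX_SIGMOID_GRAD = 0.25  # peak of the sigmoid derivative, reached at x = 0

def gradient_upper_bound(num_layers):
    """Best-case product of sigmoid derivatives across layers (weights ignored)."""
    return MAX_SIGMOID_GRAD ** num_layers

for depth in (2, 5, 10, 20):
    print(depth, gradient_upper_bound(depth))
```

Even in this best case, the factor drops below one in a million by ten layers, so the earliest layers receive almost no learning signal.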

Exploding Gradients

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training. 

An unstable network can result when there are exploding gradients, and the learning cannot be completed. 

The values of the weights can also become so large that they overflow and result in NaN (not a number) values. 
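The overflow-to-NaN path can be illustrated with plain floating-point arithmetic. The per-layer factors below are made-up numbers standing in for whatever a real network would multiply its gradient by at each layer.

```python
import math

def backprop_scale(per_layer_factor, num_layers):
    """Multiply a gradient by the same factor once per layer (toy model)."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale

moderate = backprop_scale(1.5, 50)      # grows to hundreds of millions
overflowed = backprop_scale(1e10, 40)   # exceeds the float range, becomes inf
not_a_number = overflowed - overflowed  # inf - inf evaluates to nan

print(moderate)
print(overflowed, math.isinf(overflowed))
print(not_a_number, math.isnan(not_a_number))
```

Once a value has overflowed to infinity, ordinary arithmetic on it produces NaN, and every later weight update that touches it becomes NaN too.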

How to choose the right Activation Function?

You need to match the activation function of your output layer to the type of prediction problem that you are solving, specifically the type of predicted variable.

Here’s what you should keep in mind.

As a rule of thumb, you can begin with using the ReLU activation function and then move over to other activation functions if ReLU doesn’t provide optimum results.

And here are a few other guidelines to help you out.

  1. The ReLU activation function should only be used in the hidden layers.
  2. Sigmoid/Logistic and Tanh functions should generally not be used in hidden layers, as they make the model more susceptible to problems during training (due to vanishing gradients).
  3. The Swish function is typically used in neural networks having a depth greater than 40 layers.

Finally, a few rules for choosing the activation function for your output layer based on the type of prediction problem that you are solving:

  1. Regression – Linear Activation Function
  2. Binary Classification – Sigmoid/Logistic Activation Function
  3. Multiclass Classification – Softmax
  4. Multilabel Classification – Sigmoid
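These output-layer rules are simple enough to capture in a lookup table. The sketch below is purely illustrative: the dictionary keys and the function name are hypothetical and do not belong to any library.

```python
# Hypothetical lookup table mirroring the output-layer rules listed above
OUTPUT_ACTIVATIONS = {
    "regression": "linear",
    "binary_classification": "sigmoid",
    "multiclass_classification": "softmax",
    "multilabel_classification": "sigmoid",
}

def choose_output_activation(problem_type):
    """Return the suggested output-layer activation for a problem type."""
    try:
        return OUTPUT_ACTIVATIONS[problem_type]
    except KeyError:
        raise ValueError(f"Unknown problem type: {problem_type!r}") from None

print(choose_output_activation("multiclass_classification"))  # softmax
```

Multilabel classification reuses sigmoid because each label is an independent yes/no decision, so the outputs do not need to sum to 1.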

The activation function used in hidden layers is typically chosen based on the type of neural network architecture.

  1. Convolutional Neural Network (CNN): ReLU activation function.
  2. Recurrent Neural Network: Tanh and/or Sigmoid activation function.


In this blog, we have discussed the main types of activation functions used in neural networks.

Written by 

Lokesh Kumar is an intern in the AI/ML studio at Knoldus. He is passionate about Artificial Intelligence and Machine Learning, with knowledge of C, C++, Python, Data Analytics, and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner and curious to learn new technologies.