# Activation Functions: All You Need To Know

Activation functions in machine learning and neural networks are mathematical functions applied to the output of each neuron (node) in the network. An activation function takes the weighted sum of a neuron's inputs, applies a transformation that is usually nonlinear, and thereby determines whether and how strongly the neuron is activated. After reading this article you will understand:

- What is an Activation Function?
- Types of Activation Functions
  - Threshold
  - Sigmoid
  - Hard sigmoid
  - Sigmoid-Weighted Linear Units (SiLU)
  - Derivative of Sigmoid-Weighted Linear Units (DSiLU)
  - Hyperbolic Tangent Function (Tanh)
  - Softmax
  - ReLU
  - Softsign
  - Leaky ReLU (LReLU)
  - ELU
  - SELU
  - GELU
  - Swish
  - Softplus
  - Mish
- Conclusion
- Resources & References

# Introduction

An activation function decides whether a neuron should be activated. The neuron first computes the weighted sum of its inputs and adds a bias; the activation function is then applied to that value. Its purpose is to introduce non-linearity into the output of a neuron.

This non-linearity is the most important property of activation functions. A neural network without activation functions is essentially a linear regression model, no matter how many layers it has, and cannot handle the more complicated tasks that deep learning models are used for, such as language translation and image classification. Purely linear activations also make backpropagation uninformative: the derivative of a linear function is a constant, so the gradient does not depend on the input.
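To illustrate why a network without activation functions stays linear, here is a minimal sketch that stacks two linear layers and shows they are equivalent to a single one. The layer shapes and random weights are illustrative assumptions, not values from this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between (shapes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same mapping rewritten as one linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

However many linear layers are stacked, the result can always be folded into one weight matrix and one bias vector in this way.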

`y = α(∑(weight∗input) + bias)`

Where α is the activation function.
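As a rough sketch of this formula, the snippet below computes the output of a single neuron. The helper name `neuron`, the choice of sigmoid as α, and the input values are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute y = activation(sum(weight * input) + bias) for one neuron."""
    z = np.dot(weights, inputs) + bias
    return activation(z)

# Illustrative values only.
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, -0.2])
bias = 0.1

y = neuron(inputs, weights, bias, sigmoid)
print(y)  # weighted sum is -0.4, so y is roughly 0.401
```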

# Types of Activation Functions

There are many activation functions, each grounded in mathematical theory and suited to one or more specific tasks. Here is a sample of them:

## Threshold Activation Function

A threshold function is a Boolean function that determines whether its input (the weighted sum computed by the neuron) reaches a certain threshold: it outputs 1 if it does and 0 otherwise.

Python code for the **Threshold** activation function:

```
import numpy as np

def threshold_function(x, theta):
    return np.where(x >= theta, 1, 0)
```

## Sigmoid Activation Function

Sigmoid is a non-linear activation function used mostly in feedforward neural networks. It is a bounded differentiable real function, defined for real input values.

Python code for the **Sigmoid** activation function:

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

The sigmoid function appears in the output layers of deep learning (DL) architectures, where it is used to produce probability-based outputs. It has been applied successfully to binary classification problems, logistic regression tasks, and other neural network domains.

The main reason to use the sigmoid function is that its output lies between 0 and 1, so it can be read as a probability.

The sigmoid activation function (AF) has major drawbacks. Chief among them is that gradients are sharply damped during backpropagation from the deeper hidden layers to the input layers, which makes training slow. That is why it is usually used in the output layer. The hard sigmoid function, covered next, is a cheaper variant.
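The damping of gradients can be made concrete with a small sketch. It evaluates the sigmoid derivative and an upper bound on the gradient factor contributed by ten stacked sigmoid layers; the depth of ten is an arbitrary assumption, and the weights are ignored for simplicity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25 and decays quickly.
print(sigmoid_grad(0.0))  # 0.25
print(sigmoid_grad(5.0))  # about 0.0066

# Upper bound on the factor contributed by 10 stacked sigmoids.
max_factor = 0.25 ** 10
print(max_factor)         # about 9.5e-07
```

Because each sigmoid layer multiplies the gradient by at most 0.25, the signal reaching the early layers shrinks geometrically with depth.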

## Hard sigmoid Activation Function

Hard sigmoid is a non-smooth, piecewise-linear function used in place of the sigmoid function. It retains the basic shape of a sigmoid, rising from 0 to 1, but uses simpler operations and is therefore cheaper to compute.

Python code for the **Hard Sigmoid** activation function:

```
import numpy as np

def hard_sigmoid(x):
    return np.clip((x + 1) / 2, 0, 1)  # element-wise max(0, min(1, (x + 1) / 2))
```

## Sigmoid-Weighted Linear Units (SILU) Activation Function

The Sigmoid-Weighted Linear Unit (SiLU) was proposed as an approximation function for reinforcement learning: its output is the input multiplied by its sigmoid. It was introduced for use in the hidden layers of deep neural networks in reinforcement learning based systems. Swish with β = 1 is the same function.

Experiments reported for Swish show that it tends to work better than ReLU on deeper models across a number of challenging data sets.

With ReLU, f(x) = max(0, x), the consistent problem is that the derivative is 0 for half of the input values x. With stochastic gradient descent as the parameter update algorithm, a zero gradient means the parameter is never updated: the update simply assigns the parameter back to itself (θ = θ). This has been reported to leave close to 40% dead neurons in a network. Substitutes such as Leaky ReLU or SELU (Self-Normalizing Neural Networks) have tried to remove this issue with limited success, and Swish is presented as a more promising alternative.
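The effect of a zero gradient on the update can be sketched as follows. The single-neuron setup, the learning rate, and the `upstream` gradient value are hypothetical and chosen only so that the neuron is inactive.

```python
def relu_grad(z):
    """Derivative of max(0, z), taken as 0 for z <= 0."""
    return float(z > 0)

def sgd_step(theta, grad, lr=0.1):
    """Plain SGD update: theta <- theta - lr * grad."""
    return theta - lr * grad

# Hypothetical single neuron: z = theta * x + bias, output = max(0, z).
theta, bias, x = 0.5, -3.0, 2.0
upstream = 1.0  # assumed gradient of the loss w.r.t. the neuron output

z = theta * x + bias                      # -2.0, so the neuron is inactive
grad_theta = upstream * relu_grad(z) * x  # 0.0
new_theta = sgd_step(theta, grad_theta)

print(new_theta == theta)  # True
```

As long as the pre-activation stays negative for every input, the weight receives no update and the neuron remains inactive.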

In deep neural networks, Swish achieves higher test accuracy than ReLU. With respect to batch size, the performance of both activation functions decreases as the batch size increases.

Python code for the **SiLU** activation function:

```
def silu(x):
    # sigmoid() is the function defined in the Sigmoid section
    return x * sigmoid(x)
```

## Derivative of Sigmoid-Weighted Linear Units (DSiLU) Activation Function

The derivative of the Sigmoid-Weighted Linear Unit is the gradient of the SiLU function and is referred to as DSiLU. The DSiLU is used for gradient-descent learning updates of the neural network weight parameters.

Python code for the **DSiLU** activation function:

```
def sigmoid_derivative(x):
    # sigmoid() is the function defined in the Sigmoid section
    sig = sigmoid(x)
    return sig * (1 - sig)

def dsilu(x):
    # product rule applied to x * sigmoid(x)
    return sigmoid(x) + x * sigmoid_derivative(x)
```
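One way to check that the DSiLU expression really is the gradient of SiLU is to compare it with a finite-difference estimate. The sketch below redefines the helpers so it runs on its own; the sample points and step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def dsilu(x):
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

x = np.linspace(-4.0, 4.0, 9)
h = 1e-5

# Central finite difference approximates the derivative of silu.
numeric = (silu(x + h) - silu(x - h)) / (2 * h)

print(np.allclose(numeric, dsilu(x), atol=1e-6))  # True
```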

## Hyperbolic Tangent Function (Tanh) Activation Function

The hyperbolic tangent function is another type of AF used in DL, and it has some variants used in DL applications. Known as the tanh function, it has a range between -1 and 1.

The tanh function became preferred over the sigmoid function because it gives better training performance for multi-layer neural networks. However, the tanh function does not solve the vanishing gradient problem.

It is usually used in the hidden layers of a neural network.

Python code for the **Tanh** activation function:

```
import numpy as np

def tanh(x):
    return np.tanh(x)  # (e^x - e^-x) / (e^x + e^-x), without overflow for large |x|
```

## SoftMax Activation Function

The softmax function is also a sigmoid-like non-linear function, and it is handy for classification problems.

It is usually used when handling multiple classes. The softmax function squeezes the output for each class to between 0 and 1 by exponentiating it and dividing by the sum of the exponentiated outputs, so the results sum to 1.

Python code for the **Softmax** activation function:

```
import numpy as np

def softmax(x):
    # subtract the maximum before exponentiating for numerical stability
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
```
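A short usage sketch for the softmax function follows. The logits are made-up scores, and the second call is there to show why the maximum is subtracted before exponentiating.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative class scores
probs = softmax(logits)
print(probs.sum())     # 1.0 up to floating-point rounding
print(probs.argmax())  # 0, the class with the largest score

# Large scores would overflow np.exp without the max subtraction.
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(np.isfinite(big).all())  # True
```

Shifting all scores by the same constant does not change the result, which is why subtracting the maximum is safe.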

## RELU Activation Function

ReLU stands for Rectified Linear Unit. It is the most widely used non-linear activation function and is chiefly implemented in the hidden layers of a neural network.

The main advantage of rectified linear units is faster computation: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.

In simple words, networks using ReLU typically learn much faster than those using sigmoid or tanh.

ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, the gradient flowing through the unit will be zero from that point on.

Python code for the **ReLU** activation function:

```
import numpy as np

def relu(x):
    return np.maximum(0, x)  # element-wise, so it also works on arrays
```

## Softsign Activation Function

Softsign is another non-linear AF used in DL applications. It converges polynomially, unlike tanh, which converges exponentially.

Softsign has been used mostly in regression computation problems, but it has also been applied to DL-based text-to-speech systems.

Python code for the **Softsign** activation function:

```
import numpy as np

def softsign(x):
    return x / (1 + np.abs(x))
```

## Leaky ReLU (LReLU) Activation Function

Leaky ReLU is an AF that introduces a small negative slope to ReLU to keep the weight updates alive during the entire propagation process.

The alpha parameter was introduced as a solution to ReLU's dead neuron problem, so that the gradient is never zero during training.

Python code for the **Leaky ReLU** activation function:

```
import numpy as np

def leaky_relu(x, alpha=0.05):
    # alpha is the small slope applied to negative inputs
    x = np.asarray(x, dtype=float)
    return np.maximum(alpha * x, x)
```
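To make the role of alpha concrete, this sketch compares the gradients of ReLU and Leaky ReLU on a few sample inputs. The value 0.05 matches the slope used in the code above; the sample inputs are arbitrary.

```python
import numpy as np

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.05):
    # Negative inputs keep a small, non-zero slope alpha.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(x).tolist())        # [0.0, 0.0, 1.0, 1.0]
print(leaky_relu_grad(x).tolist())  # [0.05, 0.05, 1.0, 1.0]
```

For negative inputs the ReLU gradient is exactly zero, while the Leaky ReLU gradient stays at alpha, so the weights continue to receive updates.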

## ELU Activation Function

ELU stands for Exponential Linear Unit. This activation function fixes some of the problems with ReLU while keeping some of its positive properties. For this activation function, an alpha (α) value is picked; a common value is between 0.1 and 0.3 (the code below uses 1.0 as its default).

Python code for the **ELU** activation function:

```
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))
```

## SELU Activation Function

SELU is a kind of ELU with a small twist. α and λ are two fixed parameters, meaning we do not backpropagate through them and they are not hyperparameters to make decisions about. Their values (α ≈ 1.6733, λ ≈ 1.0507) are derived so that standardized inputs stay standardized. The main advantage of SELU is its self-normalizing behavior: the output tends to remain standardized, which can remove the need for Batch Normalization layers.

SELU does not achieve this on its own; it relies on a specific weight initialization technique (LeCun normal initialization in the original paper).

Python code for the **SELU** activation function:

```
import numpy as np

ALPHA = 1.6732632423543772   # α
LAMBDA = 1.0507009873554805  # λ

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1))
```
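The self-normalizing behavior described above can be checked empirically with a rough sketch. It assumes the pre-activations are already standardized (drawn from a standard normal distribution), uses the published constants for α and λ, and picks the sample size arbitrarily.

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(np.minimum(x, 0)))

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # standardized inputs (assumption)
out = selu(z)

# Mean stays close to 0 and variance close to 1.
print(out.mean(), out.var())
```

This only shows the fixed point for a single application on ideal inputs; in a real network the result also depends on the weight initialization mentioned above.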

## GELU Activation Function

Gaussian Error Linear Unit. The GELU activation function has been found to perform better than other activation functions in some tasks, especially in transformer models.

It is used in Transformer models such as Google's BERT and OpenAI's GPT-2.

Python code for the **GELU** activation function:

```
import numpy as np
from scipy.special import erf

def gelu(x):
    # x times the standard normal CDF evaluated at x
    cdf = 0.5 * (1.0 + erf(x / np.sqrt(2.0)))
    return x * cdf
```

## Softplus Activation Function

The Softplus activation function is a smooth, continuous version of ReLU. It can be incorporated, for example, into the deep neural networks defined for actors in reinforcement learning agents.

Python code for the **Softplus** activation function:

```
import numpy as np

def softplus(x):
    return np.logaddexp(0, x)  # log(1 + e^x), stable for large x
```
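To show in what sense Softplus is a smooth version of ReLU, the following sketch measures the gap between the two on a few arbitrary sample points.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)  # log(1 + e^x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
gap = softplus(x) - relu(x)

# The gap is largest at x = 0 (log 2) and nearly vanishes far from 0.
print(gap)
```

Softplus always lies slightly above ReLU, and the two curves become practically identical once the input is a few units away from zero.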

## Mish Activation Function

Mish is a smooth, continuous, self-regularized, non-monotonic activation function.

Mish tends to match or improve the performance of neural network architectures compared with Swish, ReLU, and Leaky ReLU across different tasks in computer vision. It has been shown to improve consistently over Swish across different dropout rates, and it also outperforms other activation functions under noisy input conditions.

Python code for the **Mish** activation function:

```
import numpy as np

def mish(x):
    return x * np.tanh(np.logaddexp(0, x))  # x * tanh(softplus(x))
```
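The non-monotonic shape of Mish can be seen by evaluating it at a few points. The sample inputs below are arbitrary, and the function is redefined so the sketch runs on its own.

```python
import numpy as np

def mish(x):
    return x * np.tanh(np.logaddexp(0.0, x))

x = np.array([-5.0, -1.0, 0.0, 1.0])
y = mish(x)

# The output dips below zero and then rises again:
# mish(-1) is lower than both mish(-5) and mish(0).
print(y)
print(y[1] < y[0] and y[1] < y[2])  # True
```

Unlike ReLU, Mish lets small negative values through, which is where the dip on the negative side comes from.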

# Conclusion

Activation functions are indispensable in the design and training of neural networks. Understanding their properties, advantages, and limitations helps in making informed decisions when building models. As research in neural networks advances, new activation functions continue to be proposed, enhancing the capabilities and performance of deep learning models.

Mahmoud Abdullah

August 17, 2024 @ 4:36 pm
