Activation And Loss Functions In Deep Learning

Activation and loss functions are fundamental components in deep learning architectures. They play crucial roles in both the forward propagation of signals and the backward propagation of gradients during model training. This article provides an in-depth analysis of common activation and loss functions, their mathematical foundations, and practical applications.

Activation Functions

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Let’s examine key activation functions and their properties.

1. Rectified Linear Unit (ReLU)

The ReLU function is currently the most widely used activation function in deep learning. Its popularity stems from its computational efficiency and effectiveness in addressing the vanishing gradient problem.

Mathematical Definition:

f(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}

Derivative:

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases}

Key Properties:

  • Non-saturating for positive values
  • Sparse activation (approximately 50% of neurons are typically active)
  • Computationally efficient
  • Helps mitigate vanishing gradient problem

Use Cases:

  • Default choice for hidden layers in deep neural networks
  • Particularly effective in Convolutional Neural Networks (CNNs)
  • Computer vision applications
  • Deep architectures with many layers

2. Leaky ReLU

Leaky ReLU addresses the “dying ReLU” problem by allowing a small gradient when the unit is not active.

Mathematical Definition:

f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}

where α is typically a small constant like 0.01.

Derivative:

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x < 0 \end{cases}
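
The practical difference shows up in the gradient for negative pre-activations. A quick NumPy check (α = 0.01 assumed, as above): ReLU's gradient is exactly zero there, while Leaky ReLU's is small but non-zero, so the neuron can still recover during training.

import numpy as np

x = np.array([-3.0, -0.5, 2.0])           # example pre-activations
alpha = 0.01                              # assumed leak coefficient

relu_grad = np.where(x > 0, 1.0, 0.0)     # ReLU gradient: 0 for x <= 0 ("dead" region)
leaky_grad = np.where(x > 0, 1.0, alpha)  # Leaky ReLU gradient: alpha for x <= 0

print(relu_grad)   # [0.   0.   1.  ]
print(leaky_grad)  # [0.01 0.01 1.  ]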

Key Properties:

  • Prevents dead neurons
  • Allows a small, non-zero gradient for negative inputs
  • Generally maintains the benefits of ReLU
  • Parameter α can be learned (Parametric ReLU)

Use Cases:

  • Alternative to ReLU when dead neurons are a concern
  • Deep networks where gradient flow is critical
  • Tasks requiring negative value preservation

3. Sigmoid Function

The sigmoid function maps inputs to values between 0 and 1, making it useful for probability-based predictions.

Mathematical Definition:

f(x) = \frac{1}{1 + e^{-x}}

Derivative:

f'(x) = f(x)(1 - f(x))
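
Because f'(x) = f(x)(1 − f(x)) peaks at 0.25 and decays quickly, the gradient becomes vanishingly small for large |x|. A short NumPy illustration of this saturation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.0, 2.0, 5.0, 10.0])
grad = sigmoid(x) * (1.0 - sigmoid(x))   # derivative f(x)(1 - f(x))

print(grad)  # approx [0.25, 0.105, 0.0066, 0.000045] -> gradient vanishes at the extremes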

Key Properties:

  • Output range (0,1)
  • Smooth gradient
  • Clear probabilistic interpretation
  • Suffers from vanishing gradient for extreme values

Use Cases:

  • Binary classification output layers
  • Gates in LSTM and GRU units
  • Probability estimation tasks
  • Legacy networks (historically popular)

4. Hyperbolic Tangent (tanh)

Tanh is a scaled and shifted version of the sigmoid function, mapping inputs to the range (-1,1).

Mathematical Definition:

f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}

Derivative:

f'(x) = 1 - \tanh^2(x)
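
The derivative 1 − tanh²(x) peaks at 1.0, versus 0.25 for the sigmoid, which is the basis of the "stronger gradients" property noted below. A minimal NumPy comparison:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 101)
tanh_grad = 1.0 - np.tanh(x) ** 2               # tanh derivative
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))  # sigmoid derivative

print(tanh_grad.max(), sigmoid_grad.max())      # 1.0 vs 0.25, both at x = 0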

Key Properties:

  • Output range (-1,1)
  • Zero-centered
  • Stronger gradients compared to sigmoid
  • Still suffers from vanishing gradient at extremes

Use Cases:

  • Hidden layers in shallow networks
  • NLP tasks
  • LSTM/GRU internal states
  • Cases requiring normalized outputs

5. Softmax Function

The softmax function generalizes the logistic function to handle multiple classes, converting a vector of values into a probability distribution.

Mathematical Definition: For a K-dimensional vector x:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}

Derivative: For i = j:

\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)(1 - \text{softmax}(x_i))

and for i ≠ j:

\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = -\text{softmax}(x_i)\,\text{softmax}(x_j)
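
In a direct implementation, e^{x_i} can overflow for large logits. A common numerically stable variant (not part of the formula above, but mathematically equivalent to it) subtracts the maximum logit before exponentiating:

import numpy as np

def softmax_stable(x):
    # Subtracting the max does not change the result but prevents overflow in exp()
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([1000.0, 1001.0, 1002.0])   # a naive exp() would overflow here
probs = softmax_stable(logits)
print(probs, probs.sum())                     # approx [0.090, 0.245, 0.665], sums to 1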

Key Properties:

  • Outputs sum to 1
  • Squashes values to range (0,1)
  • Preserves relative ordering
  • Differentiable
  • Emphasizes largest values while suppressing lower ones

Use Cases:

  • Multi-class classification output layers
  • Attention mechanisms in transformers
  • Policy networks in reinforcement learning
  • Probability distribution generation

6. Exponential Linear Unit (ELU)

ELU provides smoother gradients compared to ReLU variants while maintaining most of their benefits.

Mathematical Definition:

f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \leq 0 \end{cases}

Derivative:

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha e^{x} & \text{if } x \leq 0 \end{cases}

Key Properties:

  • Smooth function including negative values
  • Reduces bias shift
  • Self-regularizing properties
  • Computationally more expensive than ReLU

Use Cases:

  • Deep neural networks requiring smooth gradients
  • Tasks sensitive to negative values
  • Networks requiring strong regularization
  • Alternative to batch normalization

7. Scaled Exponential Linear Unit (SELU)

SELU enables self-normalizing properties in neural networks, automatically pushing activations toward zero mean and unit variance.

Mathematical Definition:

f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^{x} - 1) & \text{if } x \leq 0 \end{cases}

where α ≈ 1.6733 and λ ≈ 1.0507.

Key Properties:

  • Self-normalizing
  • Maintains consistent mean and variance
  • Robust to perturbations
  • Requires specific initialization (LeCun normal); see the sketch at the end of this subsection

Use Cases:

  • Deep networks without batch normalization
  • Networks requiring stable training
  • Applications with limited computational resources
  • Tasks requiring strong regularization
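
As a minimal Keras sketch (layer sizes and input dimension are illustrative), each SELU layer is paired with LeCun-normal initialization so that the self-normalizing property holds:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),                # 64 input features (illustrative)
    tf.keras.layers.Dense(128, activation='selu',
                          kernel_initializer='lecun_normal'),
    tf.keras.layers.Dense(128, activation='selu',
                          kernel_initializer='lecun_normal'),
    tf.keras.layers.Dense(10, activation='softmax'),   # e.g. a 10-class output
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')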

Loss Functions

Loss functions quantify the difference between predicted and actual values, guiding the optimization process during training.

1. Mean Squared Error (MSE)

MSE is the most common loss function for regression tasks.

Mathematical Definition:

\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Derivative:

\frac{\partial\, \text{MSE}}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i)
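
A small NumPy example that evaluates both the loss and its analytic gradient for a toy prediction (the values are arbitrary, chosen only for illustration):

import numpy as np

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
n = len(y_true)

mse = np.mean((y_true - y_pred) ** 2)    # mean of squared errors
grad = -(2.0 / n) * (y_true - y_pred)    # gradient w.r.t. each prediction

print(mse)   # 0.1666...
print(grad)  # [-0.333  0.333  0.   ]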

Key Properties:

  • Heavily penalizes large errors
  • Differentiable everywhere
  • Non-negative
  • Convex function
  • Sensitive to outliers

Use Cases:

  • Regression problems
  • When outliers are rare or meaningful
  • When large errors should be heavily penalized
  • Signal processing applications

2. Binary Cross-Entropy Loss

Binary Cross-Entropy (BCE) is fundamental for binary classification tasks, measuring the difference between predicted probabilities and true binary labels.

Mathematical Definition:

\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]

where:

  • y_i is the true label (0 or 1)
  • ŷ_i is the predicted probability
  • N is the number of samples

Derivative with respect to logits (before sigmoid):

\frac{\partial\, \text{BCE}}{\partial x} = \sigma(x) - y

where σ(x) is the sigmoid function.
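
This identity is easy to verify numerically with TensorFlow's autodiff (a sketch; the per-sample losses are summed so the gradient matches σ(x) − y element-wise):

import tensorflow as tf

x = tf.constant([-2.0, 0.5, 3.0])    # raw logits
y = tf.constant([0.0, 1.0, 1.0])     # true labels

with tf.GradientTape() as tape:
    tape.watch(x)
    loss = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=x))
grad = tape.gradient(loss, x)

print(grad.numpy())                  # equals sigmoid(x) - y
print((tf.sigmoid(x) - y).numpy())   # same values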

Key Properties:

  • Ranges from 0 to ∞ (unbounded for confident wrong predictions)
  • Provides stronger gradients than MSE for probabilities
  • Works well with probabilistic predictions
  • Natural pairing with sigmoid activation

Use Cases:

  • Binary classification
  • Generative Adversarial Networks (GANs)
  • Anomaly detection
  • Multi-label classification (per-label)

3. Focal Loss

Focal Loss addresses class imbalance by down-weighting easy examples and focusing on hard ones.

Mathematical Definition:

\text{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)

where:

  • p_t is the model’s estimated probability for the true class
  • γ is the focusing parameter (typically 2)
  • α_t is the class balancing factor
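
The modulating factor (1 − p_t)^γ is what does the down-weighting. A quick numerical check with γ = 2 shows that a confidently correct example (p_t = 0.95) contributes roughly 100× less than a borderline one (p_t = 0.5):

import numpy as np

gamma = 2.0
p_t = np.array([0.95, 0.7, 0.5, 0.1])   # probability assigned to the true class
weight = (1.0 - p_t) ** gamma           # modulating factor applied to -log(p_t)

print(weight)  # [0.0025 0.09   0.25   0.81  ]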

Key Properties:

  • Reduces impact of easy examples
  • Automatically handles class imbalance
  • Tunable focus on hard examples via γ
  • Generalizes cross-entropy loss (γ = 0)

Use Cases:

  • Object detection
  • Highly imbalanced datasets
  • Dense prediction tasks
  • Medical image segmentation

4. Huber Loss

Huber Loss combines the best properties of MSE and Mean Absolute Error (MAE), being less sensitive to outliers than MSE while maintaining differentiability.

Mathematical Definition:

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}

Derivative:

\frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} -(y - \hat{y}) & \text{for } |y - \hat{y}| \leq \delta \\ -\delta \cdot \text{sign}(y - \hat{y}) & \text{otherwise} \end{cases}
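
A direct NumPy sketch of the piecewise definition (independent of the tf.keras.losses.huber helper used in the code samples below):

import numpy as np

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                        # used where |err| <= delta
    linear = delta * np.abs(err) - 0.5 * delta ** 2   # used where |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 2.0, 8.0])   # the last point is an outlier
print(huber(y_true, y_pred))         # the outlier contributes linearly, not quadratically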

Key Properties:

  • Combines MSE and MAE benefits
  • Robust to outliers
  • Differentiable everywhere
  • Adjustable sensitivity via δ

Use Cases:

  • Regression with outliers
  • Robust optimization
  • Reinforcement learning
  • Time series prediction

5. Hinge Loss

Hinge Loss is primarily used in Support Vector Machines and margin-based learning.

Mathematical Definition:

L(y, \hat{y}) = \max(0, 1 - y\hat{y})

where y ∈ {-1, 1} and ŷ is the model’s prediction.
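
A small NumPy illustration (raw, unsquashed scores assumed): only predictions that violate the margin y·ŷ ≥ 1 incur a loss, which is where the sparse gradients come from.

import numpy as np

y_true = np.array([1.0, -1.0, 1.0, 1.0])       # labels in {-1, +1}
y_pred = np.array([2.0, -0.5, 0.3, -1.0])      # raw model scores

loss = np.maximum(0.0, 1.0 - y_true * y_pred)  # per-sample hinge loss
print(loss)         # [0.  0.5 0.7 2. ] -> only margin violations are penalized
print(loss.mean())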

Key Properties:

  • Maximum margin classification
  • Non-differentiable at hinge point
  • Sparse gradients
  • Focus on margin violations

Use Cases:

  • Support Vector Machines
  • Maximum margin classifiers
  • Structured prediction
  • Online learning algorithms

Code Samples

For Activation Functions

import numpy as np
import tensorflow as tf

# Activation Functions

# 1. ReLU
def relu(x):
    return tf.nn.relu(x)  # or np.maximum(0, x) for NumPy

# 2. Leaky ReLU
def leaky_relu(x, alpha=0.01):  # alpha is the leak coefficient
    return tf.maximum(alpha * x, x)  # or np.where(x > 0, x, x * alpha) for NumPy

# 3. Sigmoid
def sigmoid(x):
    return tf.sigmoid(x)  # or 1 / (1 + np.exp(-x)) for NumPy

# 4. Tanh
def tanh(x):
    return tf.tanh(x)  # or np.tanh(x) for NumPy

# 5. Softmax
def softmax(x):
    return tf.nn.softmax(x)  # or np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True) for NumPy

# 6. ELU (Exponential Linear Unit)
def elu(x, alpha=1.0):  # alpha is a hyperparameter
    return tf.keras.activations.elu(x, alpha=alpha)  # tf.nn.elu always uses alpha = 1.0

# 7. SELU (Scaled Exponential Linear Unit)
def selu(x):
    alpha = 1.6733
    scale = 1.0507
    return scale * tf.where(x >= 0.0, x, alpha * tf.exp(x) - alpha)

For Loss Functions

# 1. MSE (Mean Squared Error)
def mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))  # or np.mean(np.square(y_true - y_pred)) for NumPy

# 2. Binary Cross-Entropy
def binary_crossentropy(y_true, y_pred):
    return tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_true, y_pred))

# 3. Focal Loss (requires a bit more setup)
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)  # avoid numerical instability
    pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
    pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
    return (-tf.reduce_sum(alpha * tf.pow(1. - pt_1, gamma) * tf.math.log(pt_1))
            - tf.reduce_sum((1 - alpha) * tf.pow(pt_0, gamma) * tf.math.log(1. - pt_0)))

# 4. Huber Loss
def huber_loss(y_true, y_pred, delta=1.0):
    return tf.reduce_mean(tf.keras.losses.huber(y_true, y_pred, delta=delta))

# 5. Hinge Loss
def hinge_loss(y_true, y_pred):  # For multi-class, use categorical_hinge
    return tf.reduce_mean(tf.keras.losses.hinge(y_true, y_pred))

Example of Using the Functions (TensorFlow and NumPy)

x = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
print("ReLU:", relu(x))

y_true = tf.constant([0.0, 1.0, 0.0, 1.0, 1.0])
y_pred = tf.constant([0.1, 0.9, 0.2, 0.8, 0.7])
print("Binary Cross-Entropy:", binary_crossentropy(y_true, y_pred).numpy())

# Example with NumPy (replace tf functions with their NumPy equivalents)
x_np = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("NumPy ReLU:", np.maximum(0, x_np))

y_true_np = np.array([0.0, 1.0, 0.0, 1.0, 1.0])
y_pred_np = np.array([0.1, 0.9, 0.2, 0.8, 0.7])
print("NumPy MSE:", np.mean(np.square(y_true_np - y_pred_np)))

Activation-Loss Pairings

Common effective combinations (see the sketch after this list):

  • Softmax + Categorical Cross-Entropy
  • Sigmoid + Binary Cross-Entropy
  • Linear + MSE/Huber
  • ReLU/ELU + Any Loss (hidden layers)
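
A minimal Keras sketch of the second pairing (feature count and layer sizes are illustrative). The output layer is kept linear and the sigmoid is folded into the loss with from_logits=True, which is numerically more stable than applying the sigmoid first:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                    # 20 input features (illustrative)
    tf.keras.layers.Dense(64, activation='relu'),   # ReLU in the hidden layer
    tf.keras.layers.Dense(1),                       # linear output producing a logit
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])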

Conclusion

The activation and loss functions are foundational to deep learning models. Activation functions introduce non-linearity and control the signal flow, while loss functions guide the optimization process, quantifying the difference between predictions and target values. The choice of these functions directly impacts a model’s convergence rate, accuracy, and overall performance on specific tasks.