
Softmax Calculator

Convert model logits to a probability distribution instantly

T < 1 sharper · T = 1 default · T > 1 flatter


Softmax is the activation function used in the final layer of classification neural networks to convert raw scores (logits) into a probability distribution. All output probabilities sum to 1, with each value between 0 and 1. This calculator lets you paste logits, optionally set class labels, and inspect the resulting probability distribution along with entropy and the argmax class.

How to Use This Calculator

  1. Logits – paste raw model output scores, comma-separated. These can be any real numbers.
  2. Temperature – a scaling factor applied to the logits before softmax. Temperature < 1 sharpens the distribution (more confident); temperature > 1 flattens it (more uniform). Default is 1.
  3. Labels โ€” optionally enter class names, one per line or comma-separated, to label the output.

The Softmax Formula

For logits z = [z_1, z_2, ..., z_K]:

softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Where T is the temperature. This calculator uses the numerically stable variant: the max logit is subtracted before exponentiation to prevent overflow.
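The stable computation can be sketched in a few lines of NumPy (the `softmax` helper name here is illustrative, not part of the calculator):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract the max logit to avoid overflow
    e = np.exp(z)
    return e / e.sum()         # normalize so the outputs sum to 1

probs = softmax([2.0, 1.0, 0.1])   # ~[0.659, 0.242, 0.099]
```

The shift by `z.max()` leaves the result unchanged (the constant cancels in the ratio) but keeps every exponent at or below 0.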

Temperature Scaling Explained

Temperature scaling is a post-hoc calibration technique. A neural network trained with cross-entropy loss tends to produce overconfident predictions. Dividing logits by a temperature T > 1 (learned on a validation set) produces better-calibrated probabilities without changing the argmax.

  • T < 1: Sharpens the distribution – the highest logit dominates even more.
  • T = 1: Standard softmax – no scaling.
  • T > 1: Flattens the distribution – makes it more uniform, more uncertain.
  • T → ∞: Approaches a uniform distribution (maximum entropy).
  • T → 0: Approaches a one-hot distribution (argmax only).
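These regimes are easy to see by rerunning the same logits at several temperatures (a sketch assuming NumPy; the `softmax` helper is hypothetical):

```python
import numpy as np

def softmax(logits, T=1.0):
    # stable softmax with temperature, as described above
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = [3.0, 1.0, 0.2]
sharp = softmax(logits, T=0.5)   # top class dominates more
base  = softmax(logits, T=1.0)   # standard softmax
flat  = softmax(logits, T=5.0)   # closer to uniform
```

Note that the argmax is the same at every temperature; only the spread of probability mass changes.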

Entropy as a Confidence Measure

Shannon entropy H = -Σ_i p_i log(p_i) measures the uncertainty of the distribution. Minimum entropy (0 nats) means the model is completely certain. Maximum entropy is log(K) nats for K classes. High entropy often indicates the model is confused between classes and the prediction should be treated with caution.
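A minimal sketch of this measure (assuming NumPy; the 0 · log 0 = 0 convention is handled by dropping zero entries):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # drop zeros: lim p->0 of p*log(p) is 0
    return float(-(nz * np.log(nz)).sum())

K = 4
uniform = np.full(K, 1 / K)            # maximum entropy: log(K) nats
onehot = np.array([1.0, 0.0, 0.0, 0.0])  # minimum entropy: 0 nats
```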

Where Softmax Is Used

Multi-class classification: ResNet, ViT, BERT classifiers all end with a linear layer followed by softmax (or its implicit equivalent in cross-entropy loss).

Attention mechanisms: The attention weights in Transformers are computed as softmax(QK^T / √d_k) V. The 1/√d_k scaling acts as a fixed temperature T = √d_k, keeping dot-product scores in a range where the softmax does not saturate.
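A single-head sketch of scaled dot-product attention (assuming NumPy, 2-D arrays, and no masking or batching):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # sqrt(d_k) plays the role of T
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax per row
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                # convex combination of value rows
```

Because each row of weights is a probability distribution, every output row is a convex combination of the rows of V.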

Reinforcement learning: Policy networks in actor-critic methods output action probabilities via a softmax over policy logits; Boltzmann exploration likewise applies a temperature-scaled softmax over Q-values.

Frequently Asked Questions

What are logits?
Logits are the raw, unnormalized outputs of the final linear layer in a classification network. They can be any real number. Softmax converts them to probabilities that sum to 1. The term comes from the logit function (inverse of sigmoid), though in deep learning it more broadly refers to pre-activation scores.
Why does probability always sum to 1?
Softmax divides each exponentiated logit by the total sum of all exponentiated logits. This normalization guarantees that all outputs are positive and sum to exactly 1, making them valid class probabilities.
What is temperature in softmax?
Temperature T divides all logits before applying softmax. T < 1 makes the distribution sharper (more peaked) – the model becomes more confident. T > 1 flattens the distribution – the model becomes more uncertain. Temperature is used for model calibration and to control diversity in text generation.
What is the numerically stable softmax?
Directly computing exp(z) can overflow for large logits. The stable version subtracts the maximum logit first: exp(z - max(z)). This does not change the output because the extra factor cancels in the numerator and denominator, but it keeps the exponent in a safe numerical range.
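The failure mode is easy to reproduce (assuming NumPy; floating-point warnings are suppressed so the overflow can be observed directly):

```python
import numpy as np

logits = np.array([1000.0, 999.0])

# Naive softmax: exp(1000) overflows to inf, and inf/inf gives nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Stable softmax: shift by the max logit first, so exponents are <= 0.
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
```

The stable result equals the exact value softmax([1000, 999]) = [1/(1+e⁻¹), e⁻¹/(1+e⁻¹)], while the naive version produces only NaNs.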
What is the difference between softmax and sigmoid?
Sigmoid is used for binary classification or multi-label classification – each output is independent and the probabilities do not sum to 1. Softmax is used for mutually exclusive multi-class classification – exactly one class is correct and the outputs sum to 1.
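The contrast shows up immediately on the same logits (a NumPy sketch):

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])

# Sigmoid: each class scored independently (multi-label); no shared normalization.
sig = 1 / (1 + np.exp(-logits))

# Softmax: one shared normalization (multi-class); outputs sum to 1.
e = np.exp(logits - logits.max())
soft = e / e.sum()
```

The sigmoid outputs can sum to anything between 0 and 3 here; the softmax outputs always sum to exactly 1.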
What does high entropy mean for my model?
High entropy means the model distributes probability fairly evenly across classes – it is uncertain. This can indicate the input is genuinely ambiguous, or that the model is poorly calibrated or has not seen similar examples during training. Low entropy means the model is confident about one class.