Entropy is one of those concepts that shows up everywhere — information theory, thermodynamics, machine learning — and the math is surprisingly compact.

Shannon Entropy

For a discrete random variable with distribution , the entropy is:

The units are bits (when using ). It measures the average uncertainty — how surprised you are, on average, by the outcome.

Cross-Entropy

When you approximate the true distribution with a model distribution , the cross-entropy is:

This is the quantity you minimize when training a classifier with a log-loss objective. The KL divergence relates the two:

Continuous Case

For continuous distributions, replace the sum with an integral:

This is called differential entropy, and unlike discrete entropy, it can be negative. A uniform distribution on has entropy , which is negative when .

A Simple Inequality

For any two distributions and :

with equality iff almost everywhere. This follows from Jensen’s inequality and is the reason KL divergence works as a distance-like measure (though it’s not symmetric).


Nothing groundbreaking here — just a reference I keep coming back to.