Entropy is one of those concepts that shows up everywhere — information theory, thermodynamics, machine learning — and the math is surprisingly compact.
Shannon Entropy
For a discrete random variable with distribution , the entropy is:
The units are bits (when using ). It measures the average uncertainty — how surprised you are, on average, by the outcome.
Cross-Entropy
When you approximate the true distribution with a model distribution , the cross-entropy is:
This is the quantity you minimize when training a classifier with a log-loss objective. The KL divergence relates the two:
Continuous Case
For continuous distributions, replace the sum with an integral:
This is called differential entropy, and unlike discrete entropy, it can be negative. A uniform distribution on has entropy , which is negative when .
A Simple Inequality
For any two distributions and :
with equality iff almost everywhere. This follows from Jensen’s inequality and is the reason KL divergence works as a distance-like measure (though it’s not symmetric).
Nothing groundbreaking here — just a reference I keep coming back to.