A Quick Note on Entropy

Entropy is one of those concepts that shows up everywhere — information theory, thermodynamics, machine learning — and the math is surprisingly compact.

Shannon Entropy

For a discrete random variable $X$ with distribution $P$, the entropy is:

$$
H(X) = -\sum_{x \in X} P(x) \log_2 P(x)
$$

The units are bits (when using $\log_2$). It measures the average uncertainty: how surprised you are, on average, by the outcome.
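As a quick sanity check on the definition, here is a minimal Python sketch. The function name `shannon_entropy` and the example distributions are my own choices for illustration, and the code assumes the input is a list of probabilities that sums to 1.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Zero-probability outcomes are skipped (the usual convention 0 * log 0 = 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])     # maximal uncertainty for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])   # more predictable, so lower entropy
uniform_8 = shannon_entropy([1 / 8] * 8)    # eight equally likely outcomes

print(fair_coin, biased_coin, uniform_8)
```

A fair coin comes out at 1 bit and a uniform choice among eight outcomes at 3 bits, while the biased coin lands below 1 bit because its outcome is easier to guess.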

Cross-Entropy

When you approximate the true distribution $P$ with a model distribution $Q$, the cross-entropy is:

$$
H(P, Q) = -\sum_{x} P(x) \log_2 Q(x)
$$

This is the quantity you minimize when training a classifier with a log-loss objective (in practice usually with the natural log, which only rescales the value by a constant factor). The KL divergence relates the two:

$$
D_{KL}(P \parallel Q) = H(P, Q) - H(P)
$$
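The relation between cross-entropy, entropy, and KL divergence is easy to check numerically. The sketch below is illustrative only: the helper names and the two example distributions `p` and `q` are assumptions of mine, and the code assumes `q` is strictly positive wherever `p` is.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits, computed directly from its definition."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (made up for the example)
q = [0.4, 0.4, 0.2]     # model distribution (made up for the example)

gap = cross_entropy(p, q) - entropy(p)
print(gap, kl_divergence(p, q))  # the two values agree up to rounding
```

Since $H(P)$ does not depend on $Q$, minimizing the cross-entropy over $Q$ is the same as minimizing the KL divergence.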

Continuous Case

For continuous distributions, replace the sum with an integral:

$$
h(X) = -\int p(x) \ln p(x) \, dx
$$

This is called differential entropy, and unlike discrete entropy, it can be negative. A uniform distribution on $[0, a]$ has differential entropy $\ln a$ (in nats, since the formula uses the natural log), which is negative when $a < 1$.
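To see the sign change concretely, here is a rough numerical check using a midpoint-rule integral. The helpers `differential_entropy` and `uniform_pdf` are hypothetical names introduced for this sketch, and the step count `n` is an arbitrary choice.

```python
import math

def differential_entropy(pdf, lo, hi, n=10_000):
    """Midpoint-rule estimate of -integral of p(x) ln p(x) dx over [lo, hi], in nats."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px = pdf(x)
        if px > 0:  # regions with zero density contribute nothing
            total -= px * math.log(px) * dx
    return total

def uniform_pdf(a):
    """Density of the uniform distribution on [0, a]."""
    return lambda x: 1.0 / a if 0 <= x <= a else 0.0

h_half = differential_entropy(uniform_pdf(0.5), 0.0, 0.5)  # a < 1: negative
h_two = differential_entropy(uniform_pdf(2.0), 0.0, 2.0)   # a > 1: positive

print(h_half, math.log(0.5))
print(h_two, math.log(2.0))
```

Both estimates match $\ln a$ up to floating-point error, and the one for $a = 0.5$ is negative.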

A Simple Inequality

For any two distributions $P$ and $Q$ :

$$
D_{KL}(P \parallel Q) \geq 0
$$

with equality iff $P = Q$ almost everywhere. This follows from Jensen’s inequality and is the reason KL divergence works as a distance-like measure (though it’s not symmetric).
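For reference, the Jensen step can be sketched as follows for the discrete case. This is the standard textbook argument; it assumes $Q(x) > 0$ wherever $P(x) > 0$ and uses the concavity of $\log_2$:

$$
-D_{KL}(P \parallel Q)
= \sum_{x : P(x) > 0} P(x) \log_2 \frac{Q(x)}{P(x)}
\leq \log_2 \sum_{x : P(x) > 0} P(x) \, \frac{Q(x)}{P(x)}
= \log_2 \sum_{x : P(x) > 0} Q(x)
\leq \log_2 1 = 0
$$

Because $\log_2$ is strictly concave, the first inequality is tight only when $Q(x) / P(x)$ is constant on the support of $P$, and the second only when $Q$ puts all its mass on that support. Together these force $P = Q$.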

Nothing groundbreaking here — just a reference I keep coming back to.

Zihao Wang
