Information Entropy

The possibilities are the same- me living or dying, but we can all agree that the crossing of the street is a bit boring, and the Russian roulette… maybe too exciting.

This is partially because we pretty much know what will happen when I cross the street, but we don’t really know what will happen in Russian roulette.

Another way of looking at this, is to say we gain less information observing the result of crossing the street than we do from Russian roulette.

A formal way of putting that is to say the game of Russian roulette has more ‘entropy’ than crossing the street.

Entropy is defined as ‘lack of order and predictability’, which seems like an apt description of the difference between the two scenarios.

When is information useful?Information is only useful when it can be stored and/or communicated.

We have all learned this lesson the hard way when we have forgotten to save a document we were working on.

In a digital form, information is stored in ‘bits’, or a series of numbers that can either be 0 or 1.

The letters in your keyboard are stores in a ‘byte’, which is 8 bits, which allows for 2⁸ =256 combinations.

It is important to know that information storage and communication are almost the same thing, as you can think of storage as communication with a hard disk.

Examples of symbols and their 8 digit codesInformation StorageThe mathematician Claude Shannon had the insight that the more predictable some information is, the less space is required to store it.

Crossing the street is more predictable than Russian roulette, therefore you would need to store more information about the game of Russian roulette.

Shannon had a mathematical formula for the ‘entropy’ of a probability distribution, which outputs the minimum number of bits required, on average, to store its outcomes.

EntropyFormula from entropy from WikipediaAbove is the formula for calculating the entropy of a probability distribution.

It involves summing P*log(p) with base 2, for all the possible outcomes in a distribution.

Here is a function to do this in Python:import numpy as npdef entropy(dist): su=0 for p in dist: r= p/sum(dist) if r==0: su+=0 else: su+= -r*(np.

log(r)) return su/np.

log(2)Example: Russian RouletteIf we were to quantify the crossing the street example as having a 1 in a billion chance of death, and Russian roulette as 1 in 2, we’d get entropy([1, 999_999_999]) ≈ 3.

1*10^-8 bits , and entropy([50,50])=1 bit, respectively.

This means that if we repeated both experiments a trillion times, it would take at least 31,000 bits to store the results of crossing the street, and 1 trillion bits to store the results of Russian roulette, in line with our earlier intuition.

Some distributions and their entropiesExample: English LanguageThe English language has 26 letters, if you assume each letter has a probability of 1/26 of being next, the language has an entropy of 4.

7 bits.

However, some letters are more common than other letters, and some letters appear often together, so through clever ‘guessing’ (i.


not assigning probabilities of 1/26), we can be much more efficient.

Random guessing on average takes us 13.

5 guesses to get the correct letter.

Let us say we are given the first letter of every word in this sentence:H_ _ /A_ _ /Y_ _ /D_ _ _ _ / M_ /F_ _ _ _ _?It would be very bad if it took us 13.

5*16=216 guesses to fill in the 16 blanks.

It would likely take us less than an average of two guesses per blank to figure out the sentence is “How are you doing my friend?”.

So even if we exhaustively guessed the first letter and it took us 13.

5 guesses, it would take us roughly 5.

1 guesses/letter to fill in all the blanks, a huge improvement on random guessing.

Experiments by Shannon showed that English has an entropy between 0.

6 and 1.

3 bits.

To put that into perspective, a 3 sided die has an entropy of 1.

58 bits, and takes on average 2 guesses to predict.

Also, note that the encoding system on your keyboard uses 8 bits per letter.

So it could theoretically make all files in only the English language at least 6 times smaller!ApplicationsShannon’s work found uses in data storage, spaceship communication, and even communication over the internet.


Even if we are not working in any of those fields, ‘KL divergence’ is an idea derived from Shannon’s work, that is frequently used in data science.

It tells you how good one distribution is at estimating another by comparing their entropies.

Communication and storage of information is what has made humans great, and Shannon’s work revolutionised the way we do so in the digital age.


. More details

Leave a Reply