Information gain calculates the reduction in entropy or surprise from transforming a dataset in some way.
It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable.
In this slightly different usage, the calculation is referred to as mutual information between the two random variables.
In this post, you will discover information gain and mutual information in machine learning.
Let’s get started.
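This tutorial is divided into five parts: what information gain is, a worked example of calculating information gain, examples of information gain in machine learning, what mutual information is, and how information gain and mutual information are related.
Information Gain, or IG for short, measures the reduction in entropy or surprise by splitting a dataset according to a given value of a random variable.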
A larger information gain suggests a lower entropy group or groups of samples, and hence less surprise.
Information quantifies how surprising an event is from a random variable in bits.
Entropy quantifies how much information there is in a random variable, or more specifically, the probability distribution for the events of the random variable.
A larger entropy suggests lower probability events or more surprise, whereas a lower entropy suggests larger probability events with less surprise.
We can think about the entropy of a dataset in terms of the probability distribution of observations in the dataset belonging to one class or another, e.g. two classes in the case of a binary classification dataset.
One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S (i.e., a member of S drawn at random with uniform probability).
— Page 58, Machine Learning, 1997.
For example, in a binary classification problem (two classes), we can calculate the entropy of the data sample as follows:

$$H(S) = -\left(p(0) \log_2 p(0) + p(1) \log_2 p(1)\right)$$

Where p(0) and p(1) are the proportions of examples in the dataset that belong to class 0 and class 1.
A dataset with a 50/50 split of samples for the two classes would have a maximum entropy (maximum surprise) of 1 bit, whereas an imbalanced dataset with a split of 10/90 would have a smaller entropy as there would be less surprise for a randomly drawn example from the dataset.
We can demonstrate this with an example of calculating the entropy for this imbalanced dataset in Python.
The complete example is listed below.
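A minimal sketch of this calculation, assuming class proportions of 0.1 and 0.9 and base-2 logarithms so that the result is in bits (the variable names are illustrative):

```python
# calculate the entropy for a dataset with a 10/90 class split
from math import log2

# proportion of examples in each class
class0 = 10 / 100
class1 = 90 / 100

# entropy in bits: -(p(0) * log2(p(0)) + p(1) * log2(p(1)))
entropy = -(class0 * log2(class0) + class1 * log2(class1))

print('entropy: %.3f bits' % entropy)
```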
Running the example, we can see that entropy of the dataset for binary classification is less than 1 bit.
That is, less than one bit of information is required to encode the class label for an arbitrary example from the dataset.
In this way, entropy can be used as a calculation of the purity of a dataset, e.g. how balanced the distribution of classes happens to be.
An entropy of 0 bits indicates a dataset containing one class; a balanced dataset has the maximum entropy, which is 1 bit for two classes and more for a larger number of classes (log2(k) bits for k equally likely classes), with values in between indicating levels between these extremes.
Information gain provides a way to use entropy to calculate how a change to the dataset impacts the purity of the dataset, e.g. the distribution of classes.
A smaller entropy suggests more purity or less surprise.
… information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.
— Page 57, Machine Learning, 1997.
For example, we may wish to evaluate the impact on purity by splitting a dataset S by a random variable with a range of values.
This can be calculated as follows:

$$IG(S, a) = H(S) - H(S \mid a)$$

Where IG(S, a) is the information gain for the dataset S for the random variable a, H(S) is the entropy for the dataset before any change (described above) and H(S | a) is the conditional entropy for the dataset given the variable a.
This calculation describes the gain in the dataset S for the variable a.
It is the number of bits saved when transforming the dataset.
The conditional entropy can be calculated by splitting the dataset into groups for each observed value of a and calculating the sum of the ratio of examples in each group out of the entire dataset multiplied by the entropy of each group.

$$H(S \mid a) = \sum_{v} \frac{|S_a(v)|}{|S|} \, H(S_a(v))$$

Where the sum runs over each observed value v of a, |S_a(v)| / |S| is the ratio of the number of examples in the dataset for which variable a has the value v, and H(S_a(v)) is the entropy of the group of samples where variable a has the value v.
This might sound a little confusing.
We can make the calculation of information gain concrete with a worked example.
In this section, we will make the calculation of information gain concrete with a worked example.
We can define a function to calculate the entropy of a group of samples based on the ratio of samples that belong to class 0 and class 1.
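One way such a function might look is sketched below. It assumes both proportions are greater than zero and uses base-2 logarithms so the result is in bits; the function and argument names are illustrative.

```python
from math import log2

# calculate the entropy in bits for a two-class group,
# given the proportion of samples in class 0 and class 1
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))

# sanity check: a perfectly balanced group has the maximum entropy
balanced = entropy(0.5, 0.5)
print('%.3f bits' % balanced)
```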
Now, consider a dataset with 20 examples, 13 for class 0 and 7 for class 1.
We can calculate the entropy for this dataset, which will have less than 1 bit.
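As a small sketch, the dataset entropy can be computed from the two class ratios, 13/20 and 7/20. The helper is repeated so the snippet runs on its own, and its name is an assumption.

```python
from math import log2

# entropy in bits for a two-class group given the class proportions
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))

# dataset of 20 examples: 13 in class 0 and 7 in class 1
class0 = 13 / 20
class1 = 7 / 20
s_entropy = entropy(class0, class1)
print('Dataset Entropy: %.3f bits' % s_entropy)
```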
Now consider that one of the variables in the dataset has two unique values, say “value1” and “value2.” We are interested in calculating the information gain of this variable.
Let’s assume that if we split the dataset by value1, we have a group of eight samples, seven for class 0 and one for class 1.
We can then calculate the entropy of this group of samples.
Now, let’s assume that we split the dataset by value2; we have a group of 12 samples with six in each class.
We would expect this group to have an entropy of 1.
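The entropy of the two groups can be sketched in the same way, using the class counts given above (7 and 1 for the first group, 6 and 6 for the second). The variable names are illustrative.

```python
from math import log2

# entropy in bits for a two-class group given the class proportions
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))

# group for value1: 8 samples, 7 in class 0 and 1 in class 1
s1_entropy = entropy(7 / 8, 1 / 8)
print('Group1 Entropy: %.3f bits' % s1_entropy)

# group for value2: 12 samples, 6 in class 0 and 6 in class 1
s2_entropy = entropy(6 / 12, 6 / 12)
print('Group2 Entropy: %.3f bits' % s2_entropy)
```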
Finally, we can calculate the information gain for this variable based on the groups created for each value of the variable and the calculated entropy.
The first value resulted in a group of eight examples from the dataset, and the second value resulted in a group with the remaining 12 examples.
Therefore, we have everything we need to calculate the information gain.
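In this case, information gain can be calculated as:

$$IG(S, a) = H(S) - \left(\frac{|S_1|}{|S|} \, H(S_1) + \frac{|S_2|}{|S|} \, H(S_2)\right)$$

Where S_1 and S_2 are the groups of examples for value1 and value2.
Or:

$$IG(S, a) = H\left(\tfrac{13}{20}, \tfrac{7}{20}\right) - \left(\tfrac{8}{20} \, H\left(\tfrac{7}{8}, \tfrac{1}{8}\right) + \tfrac{12}{20} \, H\left(\tfrac{6}{12}, \tfrac{6}{12}\right)\right)$$

Tying this all together, we can calculate each of these quantities in turn.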
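A minimal sketch of the complete calculation, built only from the counts in the worked example (the function and variable names are assumptions):

```python
# calculate the information gain for the worked example
from math import log2

# entropy in bits for a two-class group given the class proportions
def entropy(class0, class1):
    return -(class0 * log2(class0) + class1 * log2(class1))

# whole dataset: 13 examples in class 0 and 7 in class 1
s_entropy = entropy(13 / 20, 7 / 20)
print('Dataset Entropy: %.3f bits' % s_entropy)

# group for value1: 7 in class 0 and 1 in class 1
s1_entropy = entropy(7 / 8, 1 / 8)
print('Group1 Entropy: %.3f bits' % s1_entropy)

# group for value2: 6 in class 0 and 6 in class 1
s2_entropy = entropy(6 / 12, 6 / 12)
print('Group2 Entropy: %.3f bits' % s2_entropy)

# gain = dataset entropy minus the size-weighted entropy of the groups
gain = s_entropy - (8 / 20 * s1_entropy + 12 / 20 * s2_entropy)
print('Information Gain: %.3f bits' % gain)
```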
First, the entropy of the dataset is calculated at just under 1 bit.
Then the entropies for the first and second groups are calculated at about 0.5 and 1 bits respectively.
Finally, the information gain for the variable is calculated as 0.117 bits.
That is, the gain to the dataset by splitting it via the chosen variable is 0.117 bits.
Perhaps the most popular use of information gain in machine learning is in decision trees.
An example is the Iterative Dichotomiser 3 algorithm, or ID3 for short, used to construct a decision tree.
Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree.
— Page 58, Machine Learning, 1997.
The information gain is calculated for each variable in the dataset.
The variable that has the largest information gain is selected to split the dataset.
Generally, a larger gain indicates a smaller entropy or less surprise.
Note that minimizing the entropy is equivalent to maximizing the information gain …
— Page 547, Machine Learning: A Probabilistic Perspective, 2012.
The process is then repeated on each created group, excluding the variable that was already chosen.
This stops once a desired depth to the decision tree is reached or no more splits are possible.
The process of selecting a new attribute and partitioning the training examples is now repeated for each non terminal descendant node, this time using only the training examples associated with that node.
Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree.
— Page 60, Machine Learning, 1997.
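The attribute-selection step described above can be sketched as follows. This is not a full ID3 implementation: it only scores each attribute at a single node and picks the one with the largest information gain. The toy dataset, attribute names, and helper functions are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """H(S) minus the size-weighted entropy of the groups created by the attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# hypothetical toy dataset: each row maps an attribute name to its value
rows = [
    {'outlook': 'sunny', 'windy': 'no'},
    {'outlook': 'sunny', 'windy': 'yes'},
    {'outlook': 'rainy', 'windy': 'no'},
    {'outlook': 'rainy', 'windy': 'yes'},
]
labels = [1, 1, 0, 0]

# score every attribute and select the one with the largest gain
gains = {a: information_gain(rows, labels, a) for a in ('outlook', 'windy')}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data, "outlook" separates the classes perfectly while "windy" leaves each group evenly mixed, so "outlook" would be chosen for the split. A tree builder would then repeat the same step on each resulting group.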
Information gain can be used as a split criterion in most modern implementations of decision trees, such as the implementation of the Classification and Regression Tree (CART) algorithm in the scikit-learn Python machine learning library in the DecisionTreeClassifier class for classification.
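A minimal sketch of such a configuration, assuming scikit-learn is installed. The synthetic dataset from make_classification and its parameters are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# small synthetic binary classification dataset (illustrative only)
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# use entropy (information gain) rather than the default Gini impurity to choose splits
model = DecisionTreeClassifier(criterion='entropy', random_state=1)
model.fit(X, y)
print(model.criterion)
```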
This can be achieved by setting the criterion argument to “entropy” when configuring the model.
Information gain can also be used for feature selection prior to modeling.
It involves calculating the information gain between the target variable and each input variable in the training dataset.
The Weka machine learning workbench provides an implementation of information gain for feature selection via the InfoGainAttributeEval class.
In this context of feature selection, information gain may be referred to as “mutual information” and calculates the statistical dependence between two variables.
An example of using information gain (mutual information) for feature selection is the mutual_info_classif() scikit-learn function.
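A short sketch of scoring features with this function, assuming scikit-learn is installed. The synthetic dataset is an illustrative assumption. Note that, to the best of my understanding, scikit-learn estimates these scores with a nearest-neighbor method and reports them in nats rather than bits, so they are not directly comparable to the hand calculations in this post.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# synthetic dataset with 2 informative features out of 5 (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=1)

# estimate the mutual information between each input feature and the target
scores = mutual_info_classif(X, y, random_state=1)
for i, score in enumerate(scores):
    print('feature %d: %.3f' % (i, score))
```

Features with larger scores share more information with the target and are candidates to keep.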
Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
A quantity called mutual information measures the amount of information one can obtain from one random variable given another.
— Page 310, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.
The mutual information between two random variables X and Y can be stated formally as follows:

$$I(X ; Y) = H(X) - H(X \mid Y)$$

Where I(X ; Y) is the mutual information for X and Y, H(X) is the entropy for X and H(X | Y) is the conditional entropy for X given Y.
The result has the units of bits.
Mutual information is a measure of dependence or “mutual dependence” between two random variables.
As such, the measure is symmetrical, meaning that I(X ; Y) = I(Y ; X).
It measures the average reduction in uncertainty about x that results from learning the value of y; or vice versa, the average amount of information that x conveys about y.
— Page 139, Information Theory, Inference, and Learning Algorithms, 2003.
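To make the definition and its symmetry concrete, the sketch below computes the entropy of one variable minus its conditional entropy given the other, in both directions. The joint distribution and the helper names are hypothetical.

```python
from math import log2

# hypothetical joint distribution p(x, y) for two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    """Marginal distribution of the variable at position `axis` of each key."""
    result = {}
    for key, p in joint.items():
        result[key[axis]] = result.get(key[axis], 0.0) + p
    return result

def entropy(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint, given):
    """Entropy of the other variable given the variable at position `given`."""
    p_given = marginal(joint, given)
    return -sum(p * log2(p / p_given[key[given]])
                for key, p in joint.items() if p > 0)

def mutual_information(joint, target, given):
    """H(target) - H(target | given), in bits."""
    return entropy(marginal(joint, target)) - conditional_entropy(joint, given)

i_xy = mutual_information(joint, 0, 1)  # H(X) - H(X | Y)
i_yx = mutual_information(joint, 1, 0)  # H(Y) - H(Y | X)
print('I(X;Y)=%.3f bits, I(Y;X)=%.3f bits' % (i_xy, i_yx))
```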
Kullback-Leibler, or KL, divergence is a measure that calculates the difference between two probability distributions.
The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable.
If the variables are not independent, we can gain some idea of whether they are ‘close’ to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals […] which is called the mutual information between the variables
— Page 57, Pattern Recognition and Machine Learning, 2006.
This can be stated formally as follows:

$$I(X ; Y) = KL\left(p(X, Y) \,\|\, p(X) \, p(Y)\right)$$

Mutual information is always larger than or equal to zero, where the larger the value, the greater the relationship between the two variables.
If the calculated result is zero, then the variables are independent.
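The KL divergence form can be sketched in a few lines. The two joint distributions below are hypothetical: one in which the variables are dependent and one in which they are independent, so the second should give a result of zero.

```python
from math import log2

def mutual_information_kl(joint):
    """KL divergence in bits between p(x, y) and the product p(x) * p(y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# hypothetical joint distributions over two binary variables
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

mi_dependent = mutual_information_kl(dependent)
mi_independent = mutual_information_kl(independent)
print('dependent: %.3f bits' % mi_dependent)
print('independent: %.3f bits' % mi_independent)
```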
Mutual information is often used as a general form of a correlation coefficient, e.g. a measure of the dependence between random variables.
It is also used as an aspect in some machine learning algorithms.
A common example is Independent Component Analysis, or ICA for short, which provides a projection of statistically independent components of a dataset.
Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.
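For example, the measure is usually called information gain when evaluating the effect of splitting a dataset, as in decision trees, and mutual information when evaluating the dependence between two variables, as in feature selection.
Notice the similarity in the way that the mutual information is calculated and the way that information gain is calculated; they are equivalent:

$$I(X ; Y) = H(X) - H(X \mid Y)$$

and

$$IG(S, a) = H(S) - H(S \mid a)$$

As such, mutual information is sometimes used as a synonym for information gain.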
Technically, they calculate the same quantity if applied to the same data.
We can understand the relationship between the two as follows: the greater the difference between the joint distribution and the product of the marginal distributions (mutual information), the larger the gain in information (information gain).
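As a quick check of this equivalence, the sketch below reuses the counts from the worked example (seven and one for value1, six and six for value2) and computes the quantity both ways: as entropy minus conditional entropy, and as the KL divergence between the joint distribution and the product of the marginals. The variable names are illustrative.

```python
from math import log2

# counts from the worked example: (value of the variable, class) -> number of examples
counts = {('value1', 0): 7, ('value1', 1): 1, ('value2', 0): 6, ('value2', 1): 6}
total = sum(counts.values())

def entropy(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# marginal counts for the class label and for the variable
class_totals, value_totals = {}, {}
for (value, label), n in counts.items():
    class_totals[label] = class_totals.get(label, 0) + n
    value_totals[value] = value_totals.get(value, 0) + n

# information gain: H(S) - H(S | a)
h_s = entropy(n / total for n in class_totals.values())
h_s_given_a = sum(
    value_totals[v] / total
    * entropy(n / value_totals[v] for (value, _), n in counts.items() if value == v)
    for v in value_totals
)
gain = h_s - h_s_given_a

# mutual information: KL divergence between the joint and the product of the marginals
mi = sum(
    (n / total) * log2((n / total) / ((value_totals[v] / total) * (class_totals[c] / total)))
    for (v, c), n in counts.items()
)

print('Information Gain: %.3f bits' % gain)
print('Mutual Information: %.3f bits' % mi)
```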
In this post, you discovered information gain and mutual information in machine learning.