The potential ambiguity here is that the dataset you are looking at contains both the features and the outcome variable, the outcome variable being in the last column. The question you are trying to answer is: "Do Feature 1 and Feature 2 help me predict the Outcome?"
Another way to state this is: if I split my data according to Feature 1, do I get better information on the Outcome?
In this case, without splitting, the Outcome variable is [ yes, yes, no, no, no ]. If I split on Feature 1, I get 2 groups:

Feature 1 = 0 -> Outcome is [ no, no ]
Feature 1 = 1 -> Outcome is [ yes, yes, no ]
The idea here is to see whether you are better off with that split. Initially, you had a certain amount of information, described by the Shannon entropy of [ yes, yes, no, no, no ]. After the split, you have two groups, with "better information" for the group where Feature 1 = 0: you know in that case that the Outcome is no, and that is measured by the entropy of [ no, no ], which is zero.
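To make this concrete (assuming base-2 logarithms, so entropy is measured in bits), the entropy before the split and the weighted entropy after it are:

$$H(\text{Outcome}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$$

$$H(\text{Outcome} \mid \text{Feature 1}) = \tfrac{2}{5}\cdot 0 + \tfrac{3}{5}\left(-\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.551$$

so the information gain from splitting on Feature 1 is roughly 0.971 - 0.551 ≈ 0.420 bits.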
In other words, the approach is to figure out whether, out of the features you have available, there is one which, if used, increases your information on what you care about, that is, the Outcome variable. The tree building will greedily pick the feature with the highest information gain at each step, and then see if it is worth splitting the resulting groups even further.
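If it helps to see the calculation spelled out, here is a minimal Python sketch of the entropy and information-gain computation for this toy dataset. The function names are just illustrative, and the Feature 1 values are one ordering consistent with the split described above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        split_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_entropy

# Toy data: Feature 1 and the Outcome column from the question.
feature_1 = [1, 1, 1, 0, 0]
outcome   = ["yes", "yes", "no", "no", "no"]

print(entropy(outcome))                      # ~0.971 bits before splitting
print(information_gain(feature_1, outcome))  # ~0.420 bits gained by splitting on Feature 1
```

Running the same `information_gain` function for each candidate feature and keeping the one with the largest value is exactly the greedy step the tree-building algorithm repeats on every resulting group.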