What Exactly Is an Inductive Bias?
- By Bruce Nielson
- ML & AI Specialist
In a previous post, I walked through Tom Mitchell's proof that a learner with no prior assumptions cannot generalize at all. The conjunctive restriction on our hypothesis space was doing all the real work -- without it, the algorithm was paralyzed. I promised a follow-up that would define that concept more precisely.
Mitchell calls it the learner's inductive bias. He defines this as:
"...the inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences." (Mitchell, p. 44)
Here is his formal definition:
The inductive bias of a learner L is any minimal set of assertions B such that for any target concept c and corresponding training examples D, (B + D + x) deductively entails L(x, D) for all new instances x. (Mitchell, p. 44)
Compare this to Wikipedia's definition of an 'inductive bias':
The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs of given inputs that it has not encountered (link)
In plain language: the inductive bias is whatever you'd have to add to the training data so that the learner's predictions follow by pure deduction. It is the gap between what the data says and what the learner concludes. It may be explicit, or implicit.
In the previous post, we saw that the Candidate-Elimination algorithm's inductive bias was the assertion "the target concept can be expressed as a conjunction of the attributes Genre, Mood, and Pacing." Feed that assertion to a deductive theorem prover along with the training data, and you get the same output as the so-called "inductive" learning algorithm. The "induction" was deduction plus an unstated assumption. This is why Popper was actually correct that there is no 'induction' per se. The so-called 'inductive' algorithms work just fine in real life -- but what is actually happening under the hood is always equivalent to deduction once the background knowledge is taken into consideration.
This definition is not limited to one algorithm. Every learner has an inductive bias, and identifying it tells you exactly what assumptions the learner is smuggling in.
Comparing Learners by Their Bias
As Mitchell puts it:
One advantage of viewing inductive inference systems in terms of their inductive bias is that it provides a nonprocedural means of characterizing their policy for generalizing beyond the observed data (Mitchell, p. 44)
To understand what he means, consider a "Rote-Learner" algorithm. All it does is store every training example in memory, and when you ask it to classify a new instance, it looks for an exact match. If it has seen that exact instance before, it returns the stored classification. Otherwise, it refuses to classify it at all. This algorithm has no inductive bias -- and it is misleading to even call it a "learner" because it never actually learns anything. It just stores data points. It is the "unbiased learner" from the previous post, and as we saw there, it is completely useless for generalization. (Mitchell, p. 45)
Now compare this to the Candidate-Elimination algorithm from the previous post. It classifies a new instance only when every hypothesis in the version space agrees on the answer. Its inductive bias is a single assumption: the target concept is contained in the hypothesis space H, which is the space of conjunctions between attributes. (i.e. conjunction means the attributes can be "AND"ed but not "OR"ed together.) Because it has a stronger bias than the Rote-Learner, it can classify instances the Rote-Learner cannot. But the correctness of those classifications depends entirely on whether the bias is true -- whether the target concept really is in H. If it is not, the version space may collapse entirely -- or worse, the algorithm may converge on the wrong answer. (Mitchell, p. 45).
Mitchell also describes the Find-S algorithm, which only tracks the S boundary -- the most specific hypothesis consistent with the positive examples. It ignores negative examples entirely and uses that single hypothesis to classify everything: if the hypothesis covers a new instance, it predicts positive; otherwise, it predicts negative. This gives it an even stronger bias than the Candidate-Elimination algorithm. In addition to assuming the target concept is in H, it assumes that all instances are negative unless the opposite is entailed by its other knowledge (Mitchell, p. 45). Where Candidate-Elimination would say "I don't know" when the version space is split, Find-S always has an answer: negative. The advantage is that it now always gives an answer, unlike the Candidate-Elimination algorithm. But this comes at the cost of sometimes being confidently wrong. This makes it the most aggressive of the three -- and the most dependent on its assumptions being correct.
The pattern is clear: stronger bias means more generalization, but also more risk. As Mitchell notes, "more strongly biased methods make more inductive leaps, classifying a greater proportion of unseen instances" (Mitchell, p. 45). But those leaps are only as good as the assumptions that enable them.
The pattern here generalizes. All learning algorithms can be characterized in terms of their inductive bias. Some biases are categorical restrictions that completely rule out certain concepts -- like the assumption that H contains the target concept (e.g. Candidate-Elimination algorithm). Others merely express preferences, ranking some hypotheses above others -- like "prefer simpler hypotheses over complex ones." And some inductive biases are hardwired into the algorithm's design. But it is even possible to create a learner that can modify its own inductive bias. (Mitchell, p. 45).
Restriction Bias vs. Preference Bias
Not all biases work the same way. Mitchell draws an important distinction between two kinds (Mitchell, pp. 62-64).
A restriction bias limits which hypotheses the learner can consider at all. The Candidate-Elimination algorithm has a restriction bias -- it literally cannot represent hypotheses outside its hypothesis space. If the true concept is a disjunction (i.e. "OR"s) and the space only allows conjunctions ("AND"s), the algorithm will never find it. The advantage is that within its restricted space, it searches completely -- it considers every consistent hypothesis (Mitchell, p. 64).
A preference bias does not restrict the hypothesis space but instead orders it -- the learner prefers some hypotheses over others. Decision tree learners like ID3 have a preference bias. The hypothesis space of decision trees can represent any discrete-valued function -- it is univeral! But ID3 does not search that space exhaustively. It uses a greedy, top-down strategy that favors shorter trees and places the most informative attributes near the root (Mitchell, p. 62). Its inductive bias is entirely a consequence of this search strategy, not the expressiveness of its representation.
A great example of a preference bias is genetic programming. Its hypothesis space is the set of all programs that can be composed from a given set of primitives. Since nature is computable (see the Church-Turing-Deutsch thesis), this means its search space is in principle universal -- it can represent anything. There is no restriction bias at all. But it still has an inductive bias. As Mitchell notes, the performance of genetic programming depends crucially on the choice of representation and on the choice of fitness function (Mitchell, p. 266). And realistically, the space of all programs is so vast that the algorithm will only ever explore a tiny fraction of it -- mostly relatively small programs. The evolutionary search strategy -- selection, crossover, mutation -- determines which programs get explored and which get discarded. The bias is entirely in how it searches, not in what it can represent.
Mitchell is direct about which is generally better: a preference bias is typically more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is guaranteed to contain the target function. A restriction bias always carries the risk that you have excluded the right answer from the start (Mitchell, p. 64).
Neural Networks: A Different Kind of Bias
When Mitchell turns to neural networks trained with backpropagation, the inductive bias becomes harder to pin down -- but it does not disappear (Mitchell, pp. 104-107).
The hypothesis space is now continuous rather than discrete. As Mitchell describes it, "every possible assignment of network weights represents a syntactically distinct hypothesis that in principle can be considered by the learner" -- the hypothesis space is the n-dimensional Euclidean space of all the network's weights (Mitchell, p. 106). And the search strategy is completely different from the algorithms we have been discussing. Backpropagation is essentially a hill-climbing algorithm that uses calculus to determine which direction is downhill. It follows the slope of the error surface, taking small steps in whatever direction reduces the error most. This means it can get trapped in local minima -- valleys that are not the deepest valley -- and is only guaranteed to converge toward some local minimum, not necessarily the global one (Mitchell, p. 104).
So what is the inductive bias? Mitchell is candid that "it is difficult to characterize precisely the inductive bias of BACKPROPAGATION learning, because it depends on the interplay between the gradient descent search and the way in which the weight space spans the space of representable functions." But he offers an approximate answer: "one can roughly characterize it as smooth interpolation between data points." Given two positive training examples with no negative examples between them, backpropagation will tend to label points in between as positive as well (Mitchell, pp. 106-107).
This is still an inductive bias -- an assumption that goes beyond what the data logically entails. The data does not say anything about what lies between the training examples. The network's architecture and training procedure are assuming the answer is smooth. That assumption is what makes generalization possible, and it is also what makes generalization fallible.
Why This Matters
Every time a learning algorithm says anything about an instance it has not seen, it is going beyond what the data alone can justify. It is making an assumption. The inductive bias is that assumption, named and made explicit.
This is Popper's point. You cannot get from observations to general theories without bringing something to the table. The question is never whether you have prior assumptions -- you always do. The question is whether they are good ones.
An Optimal Inductive Bias?
This does raise an interesting question. Is there an "optimal" inductive bias? One whose search space is universal -- every possible function -- but whose search strategy is still efficient and tractable for any problem? We have seen that genetic programming and decision trees are both universal in their hypothesis space -- they can in principle represent any function. But universality alone is not enough. Both still depend on their search strategy to find the right hypothesis, and that search strategy is where the bias lives. And we have seen that stronger biases enable more generalization but risk being wrong. Is there a sweet spot? That is a question for a future post.
All page references to Mitchell are from Machine Learning (McGraw-Hill, 1997).
If you need help with your Artificial Intelligence solutions, we're here to help.