Induction is a Myth: The Futility of Unbiased Learning

Karl Popper's Disproof of Induction

Karl Popper argued that it is logically impossible to derive a general theory from specific observations. You can stare at a million data points and no universal law will ever logically follow from them. The observations are always concrete and specific; the theory is always abstract and universal. You simply cannot get from one to the other by logic alone. (See Conjectures and Refutations: The Growth of Scientific Knowledge, pp. 251-253)

Thus induction is a myth. No "inductive logic" exists. And although there exists a "logical" interpretation of the probability calculus, there is no good reason to assume that this "generalized logic" (as it may be called) is a system of "inductive logic" (Unended Quest, p. 171)

Despite such a startlingly powerful disproof, most people assumed it must be wrong. They went looking for a new kind of "inductive logic" that would somehow let you start with only observations and generalize to a universal law. The reasoning went that there must be such an inductive logic because we do, in practice, induce general laws all the time in science. Probability theory, particularly in its Bayesian interpretation, was often advanced as the missing inductive logic that would bridge the gap. When it was shown that probability theory could be thought of as "extending deductive logic" (by which its advocates really meant extending propositional logic, not first-order logic), many became convinced they had found what they were looking for. Cox's theorem was often interpreted as cementing this idea.

And on top of all that, wasn't it just a fact that machine learning algorithms generalize from data every day? Your streaming service watches you binge a few movies and then somehow knows what to recommend next. That's induction, right? Isn't the machine deriving a general rule from specific observations? Popper must be wrong!

Tom Mitchell's "Futility of Bias-Free Learning"

Tom Mitchell—one of the foundational figures in machine learning—has a devastating answer to this question. In his textbook Machine Learning (McGraw-Hill, 1997), he proves something that should be far better known: a learner that makes no prior assumptions about what it's looking for cannot generalize at all. Not even a little. It can memorize what it has seen, but the moment you show it something new, it has literally no rational basis for making a classification. Mitchell calls this "The Futility of Bias-Free Learning" (Mitchell, p. 42).

What makes this so interesting is that Mitchell arrives at essentially the same conclusion as Popper, but from the completely opposite direction. Popper was a philosopher arguing against inductivism. Mitchell is a computer scientist trying to make induction work. And yet they converge on the same point: you cannot generalize from observations alone. You always need something else—some set of prior assumptions—to bridge the gap.

But Mitchell goes one step further. He shows that when you do add those prior assumptions, the resulting "inductive" algorithm is actually equivalent to a deductive theorem prover. The so-called induction was deduction all along, just as Popper claimed. It just didn't look like it.

Let me walk through how this works.

A Movie Recommendation Problem

Imagine a movie streaming service trying to figure out whether you'll enjoy a given film. To keep things tractable, suppose the system describes movies using three attributes:

  • Genre: Action, Comedy, Drama, Sci-Fi
  • Mood: Light, Dark
  • Pacing: Fast, Slow

Each movie is some combination of these attributes, and your reaction is binary: you either enjoyed it or you didn't. The service's job is to figure out the general rule—the target concept—that explains your taste.

This isn't meant to be a realistic example, but suppose you subconsciously only enjoy fast-paced movies with a dark mood, regardless of genre. The service is trying to figure that out and then recommend other such movies to you.

Now, there are 4 × 2 × 2 = 16 possible movie descriptions in this space. And the streaming service has watched you react to five of them. From those data points, it needs to learn a rule that correctly predicts your reaction to all the movies you haven't seen yet.

This is, at its core, what Mitchell calls a "concept learning" task: searching through a space of possible hypotheses for the one that fits the observed data (Mitchell, p. 23).

How Candidate-Elimination Works (a.k.a. Falsification)

The algorithm Mitchell uses to illustrate this is called the Candidate-Elimination algorithm. And despite the dry name, what it actually does is strikingly Popperian: it starts with every possible hypothesis about your taste and then systematically eliminates the ones that are contradicted by the data.

The algorithm restricts itself to hypotheses that take the form of conjunctions of attribute values. So a hypothesis might be something like "I enjoy Action movies that are Dark"—represented as (Action, Dark, ?), where the question mark means "I don't care about this attribute."

The algorithm also uses a null symbol to mean "no value is acceptable." So the hypothesis (null, null, null) means "no movie is enjoyable"—the most specific possible hypothesis that rejects everything.
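To make the representation concrete, here is a minimal Python sketch (my own illustration, not code from Mitchell's book) of one way these hypotheses could be encoded: a tuple of attribute values in the order Genre, Mood, Pacing, with the string "?" standing for "any value" and None standing for the null symbol.

```python
def matches(hypothesis, movie):
    """A conjunctive hypothesis covers a movie if every attribute is
    either '?' (don't care) or exactly the movie's value.  Any None
    slot means the hypothesis covers no movie at all."""
    return all(h == "?" or h == m for h, m in zip(hypothesis, movie))

# "I enjoy Action movies that are Dark" (pacing doesn't matter):
h = ("Action", "Dark", "?")

print(matches(h, ("Action", "Dark", "Fast")))   # True
print(matches(h, ("Action", "Dark", "Slow")))   # True
print(matches(h, ("Comedy", "Dark", "Fast")))   # False
```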

The algorithm maintains two boundaries that define a range of surviving hypotheses:

  • The G boundary (most general hypothesis consistent with the data)
  • The S boundary (most specific hypothesis consistent with the data)

Everything between these two bounds is called the "version space"—in Popper's language, it is the set of hypotheses not yet refuted by the evidence (Mitchell, pp. 32-33).

Before we see any data, G starts as broad as possible and S starts as narrow as possible:

Step 0: Initial state before any training data
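In that notation, the initial state is:

  • G = { (?, ?, ?) }, i.e. "any movie might be enjoyable"
  • S = { (null, null, null) }, i.e. "no movie is enjoyable"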

Now the training data starts coming in.

Step 1: You watch a dark, fast-paced action movie and enjoy it. This is a positive example, so the S boundary has to move—it generalizes from "no movie is enjoyable" to the most specific hypothesis that covers this movie. The G boundary doesn't need to change yet because nothing has been ruled out:

Step 1: First positive example generalizes S
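The boundaries now stand at:

  • G = { (?, ?, ?) }
  • S = { (Action, Dark, Fast) }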

Step 2: You watch a light, slow comedy and don't enjoy it. This is where the real elimination begins. The current G boundary—(?, ?, ?) or "every movie is enjoyable"—is inconsistent with this negative example. It predicted you'd enjoy this movie, but you didn't. So (?, ?, ?) gets falsified and must be replaced.

The algorithm replaces it with the minimal specializations that exclude this negative example while still being more general than S. There are exactly three ways to do this—each one changes a single attribute to rule out the movie you disliked:

Step 2: G splits into three competing hypotheses
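After this negative example the boundaries are:

  • G = { (Action, ?, ?), (?, Dark, ?), (?, ?, Fast) }
  • S = { (Action, Dark, Fast) }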

This is a critical moment. We now have three competing theories about your taste, and the data will eventually falsify two of them.

Step 3: You watch a dark, fast-paced sci-fi movie and enjoy it. Now things get interesting on both sides. On the S side, you've liked both Action and Sci-Fi, so genre can't be the deciding factor—S generalizes to (?, Dark, Fast). On the G side, the hypothesis (Action, ?, ?) predicted you'd only like Action movies, but you just liked a Sci-Fi movie. Falsified! It gets removed from G:

Step 3: S generalizes genre away, G loses the Action-only hypothesis
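The surviving boundaries:

  • G = { (?, Dark, ?), (?, ?, Fast) }
  • S = { (?, Dark, Fast) }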

Two hypotheses remain in G: maybe it's about mood (?, Dark, ?), or maybe it's about pacing (?, ?, Fast). We need more data to tell them apart.

Step 4: You watch a dark, slow-paced action movie and don't enjoy it. This movie was Dark—just like the ones you enjoyed—but you disliked it. The hypothesis (?, Dark, ?) predicted you'd enjoy it, because it says dark mood is all that matters. But you didn't. Falsified! Only (?, ?, Fast) survives in G:

Step 4: The Dark-mood-only hypothesis is falsified
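The boundaries after this fourth example:

  • G = { (?, ?, Fast) }
  • S = { (?, Dark, Fast) }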

Now G says (?, ?, Fast)—pacing is all that matters—while S says (?, Dark, Fast)—both mood and pacing matter. They haven't converged yet. We need one more piece of evidence.

Step 5: You watch a light, fast-paced comedy and don't enjoy it. This is the final falsification. This movie was fast-paced, matching G's only remaining requirement, but you still disliked it. The difference from the movies you enjoyed? It was Light, not Dark. G must specialize mood from "?" to "Dark"—and now S and G are identical:

Step 5: Final falsification forces convergence
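Both boundaries now agree:

  • G = { (?, Dark, Fast) }
  • S = { (?, Dark, Fast) }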

The version space has been squeezed from both sides until exactly one hypothesis remains:

Final: The algorithm has converged

Notice what happened here. The algorithm did not build up a theory from observations. It tore down the wrong theories. It started with a space of conjectures and refuted them using evidence. This is Popper's falsificationism implemented as a computer program. Mitchell himself frames concept learning as a search problem (Mitchell, p. 23), and as I've argued on my podcast, search is really just a form of variation and selection—which is exactly how Donald Campbell and Karl Popper described the growth of knowledge.

If the training data is error-free and the correct hypothesis is somewhere in the hypothesis space, the algorithm will converge on it. Every incorrect hypothesis gets falsified. The correct one survives.
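If you want to watch the elimination happen end to end, here is a short, self-contained Python sketch of the same example. Rather than maintaining the G and S boundaries, it uses the simpler brute-force variant (roughly what Mitchell calls the List-Then-Eliminate algorithm): enumerate every conjunctive hypothesis and strike out the ones the data refutes. The attribute values and training examples are the ones from the walkthrough above; everything else is my own illustrative code, not Mitchell's.

```python
from itertools import product

GENRES  = ["Action", "Comedy", "Drama", "Sci-Fi"]
MOODS   = ["Light", "Dark"]
PACINGS = ["Fast", "Slow"]

def matches(hypothesis, movie):
    # Same helper as in the earlier sketch: '?' means "don't care",
    # and any None slot means the hypothesis covers nothing.
    return all(h == "?" or h == m for h, m in zip(hypothesis, movie))

# Every conjunctive hypothesis: each slot is a concrete value or '?',
# plus the single "reject everything" hypothesis (None, None, None).
hypothesis_space = [(None, None, None)] + list(
    product(GENRES + ["?"], MOODS + ["?"], PACINGS + ["?"])
)

# The five training examples from the walkthrough.
training_data = [
    (("Action", "Dark",  "Fast"), True),   # enjoyed
    (("Comedy", "Light", "Slow"), False),  # not enjoyed
    (("Sci-Fi", "Dark",  "Fast"), True),
    (("Action", "Dark",  "Slow"), False),
    (("Comedy", "Light", "Fast"), False),
]

def consistent(hypothesis, movie, enjoyed):
    # A hypothesis survives if its prediction agrees with your reaction.
    return matches(hypothesis, movie) == enjoyed

version_space = hypothesis_space
for movie, enjoyed in training_data:
    version_space = [h for h in version_space
                     if consistent(h, movie, enjoyed)]
    print(movie, "+" if enjoyed else "-",
          "->", len(version_space), "hypotheses survive")

print("Unrefuted:", version_space)   # [('?', 'Dark', 'Fast')]
```

Run it and you can watch the version space shrink with every example until only (?, Dark, Fast) is left, which is exactly the conjecture-and-refutation dynamic described above.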

The Catch: A Biased Hypothesis Space

But here's the problem. We restricted the hypothesis space to conjunctions of attributes—meaning the learned rule can only take the form "attribute 1 must be X and attribute 2 must be Y and attribute 3 must be Z" (where any of those can be relaxed to "any value is fine").

That works for a rule like (?, Dark, Fast)—"dark and fast-paced, any genre." But what if your actual taste is more complicated? What if you enjoy dark action movies and also light comedies—two completely different profiles with nothing in common? There is no single conjunction of attributes that captures both. Any conjunction broad enough to include dark action movies and light comedies would also include things you don't like, such as dark comedies or light action movies.

Our conjunctive hypothesis space—including hypotheses with ? (any value) and null (no value)—contains only 46 hypotheses, a tiny fraction of the 65,536 possible concepts that could be defined over our 16 movie descriptions. As Mitchell puts it, "a very biased hypothesis space indeed!" (Mitchell, p. 41).
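For those who want the arithmetic: counting semantically distinct hypotheses, each attribute slot can hold one of its concrete values or "?", and every hypothesis containing null collapses into the single "reject everything" hypothesis. A quick check, under those assumptions:

```python
conjunctive_hypotheses = (4 + 1) * (2 + 1) * (2 + 1) + 1   # 46
all_possible_concepts  = 2 ** 16                            # 65,536
print(conjunctive_hypotheses, all_possible_concepts)
```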

If the correct hypothesis is not representable in the space, the algorithm will fail. It will eliminate every hypothesis, leaving the version space empty, and you'll know something has gone wrong.

So there is an obvious temptation: why not just expand the hypothesis space to include every possible hypothesis? Remove the bias entirely. Let the learner consider any conceivable pattern in the data.

The Power Set: An Unbiased Learner

Mitchell takes this idea seriously. He proposes expanding the hypothesis space to the power set of all possible instances—the set of all possible subsets (Mitchell, p. 40). This means the learner can now represent any target concept whatsoever. No more restrictions. No more bias. The learner can consider disjunctions, negations, arbitrary combinations—everything. If you like dark action movies and light comedies but nothing else, there's a hypothesis for that. If you like exactly seven specific movies for no discernible reason, there's a hypothesis for that too.

For our movie example with 16 possible movie descriptions, the power set contains 2^16 = 65,536 possible target concepts. Our biased conjunctive space could represent only 46 of those. The unbiased learner must now contend with all 65,536.

Problem solved, right?

Not even close. Mitchell shows that this "unbiased" learner is now completely unable to generalize beyond the observed examples (Mitchell, p. 41).

To see why, think about what the algorithm has to work with after seeing our five training examples. It knows you liked two specific movies and disliked three specific movies. In the biased version, the conjunctive restriction forced the algorithm's hand—there were only so many ways to draw the line, and most of them got falsified. But now? The hypothesis space contains every possible way to divide the 16 movies into "liked" and "disliked." And there are a staggering number of ways to do that which are perfectly consistent with our five data points.

Consider a new movie you haven't rated—say (Drama, Dark, Fast). Should the algorithm predict you'll enjoy it? In the biased version, the answer was clear: (?, Dark, Fast) covers it, so yes. But in the unbiased version, for every hypothesis in the version space that says you'll like this movie, there exists another hypothesis that is identical in every respect—agrees on all five training examples—except that it says you won't like this one (Mitchell, p. 41). Both hypotheses are equally consistent with everything the algorithm has seen.

This isn't a minor inconvenience. It's total paralysis. The version space splits exactly 50/50 on every unseen movie (Mitchell, p. 41). The algorithm can only say "I don't know" to every new movie it encounters. It has become a glorified lookup table—perfectly memorizing what it has seen, but completely powerless to predict anything it hasn't.
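This paralysis is easy to verify by brute force. The sketch below (again my own illustration, using the same five training examples) enumerates all 65,536 possible target concepts as bitmasks over the 16 movie descriptions, keeps the ones consistent with the training data, and counts how many of the survivors predict you will enjoy the unseen movie (Drama, Dark, Fast):

```python
from itertools import product

instances = list(product(["Action", "Comedy", "Drama", "Sci-Fi"],
                         ["Light", "Dark"],
                         ["Fast", "Slow"]))        # all 16 movie descriptions
index = {movie: i for i, movie in enumerate(instances)}

training_data = [
    (("Action", "Dark",  "Fast"), True),
    (("Comedy", "Light", "Slow"), False),
    (("Sci-Fi", "Dark",  "Fast"), True),
    (("Action", "Dark",  "Slow"), False),
    (("Comedy", "Light", "Fast"), False),
]
unseen = ("Drama", "Dark", "Fast")

likes = dislikes = 0
for concept in range(2 ** 16):                     # every possible target concept
    def predicts(movie):
        return bool(concept >> index[movie] & 1)
    if all(predicts(m) == label for m, label in training_data):
        if predicts(unseen):
            likes += 1
        else:
            dislikes += 1

print(likes, dislikes)   # 1024 1024 -- an exact 50/50 split
```

Half of the unrefuted concepts say yes and half say no, so the data alone gives the learner no reason to prefer either answer.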

The reason is almost embarrassingly simple once you see it. In the biased version, the conjunctive restriction was doing most of the work. It told the algorithm: "The answer has a structure—it's a rule defined by attribute values." That assumption is what made it possible to look at a dark, fast-paced sci-fi movie and say "this is similar to the dark, fast-paced action movie you liked, so you'll probably like this too." Without that structural assumption, there is no basis for calling any two movies "similar." Each movie is just an isolated point, and knowing you liked one tells you nothing about any other.

To converge on a single final hypothesis, this unbiased learner would need to see every single one of the 16 possible movies as a training example (Mitchell, p. 41). At which point it hasn't learned anything—it's just stored your complete viewing history.

Mitchell's Conclusion: The Futility of Bias-Free Learning

Mitchell states his conclusion directly:

"A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances." (Mitchell, p. 42)

Read that again. It is a remarkable statement. It means that the only reason the original Candidate-Elimination algorithm could generalize at all was because it had a built-in assumption—the bias toward conjunctive hypotheses—that constrained the space of possibilities.

Mitchell calls this assumption the algorithm's "inductive bias." We'll discuss the formal definition of inductive bias in more detail in a future post. But a short version is that the inductive bias of a learner is the minimal set of additional assertions B such that the learner's classifications follow deductively from B combined with the training data (Mitchell, pp. 42-43).
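Stated a little more formally (the precise statement is Mitchell's, pp. 42-43): if D_c is the training data, x_i is any new instance, and L(x_i, D_c) is the classification the learner outputs for x_i after training on D_c, then B is an inductive bias of the learner when

  (B ∧ D_c ∧ x_i) ⊢ L(x_i, D_c) for every instance x_i,

where ⊢ means "follows deductively from." In our example, B is the assertion that the target concept is a conjunction over Genre, Mood, and Pacing.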

That last part is the kicker. Mitchell is saying that what looks like induction is actually deduction in disguise—just as Popper claimed. The algorithm appears to be generalizing from data, but what it is really doing is deducing conclusions from the data plus an unstated set of assumptions.

The Deductive Theorem Prover

Mitchell makes this equivalence explicit with a striking thought experiment (Mitchell, pp. 43-44). Imagine two systems side by side.

On the left, the Candidate-Elimination algorithm—exactly the one we just walked through. You feed it your movie ratings and a new movie to classify. It searches the version space and outputs a prediction.

On the right, a deductive theorem prover. You feed it the same movie ratings, the same new movie, and one additional input: the explicit assertion "the target concept can be represented as a conjunction of the attributes Genre, Mood, and Pacing."

Mitchell proves that these two systems will produce identical outputs for every possible set of training examples and every possible new instance. They are functionally the same system. The only difference is that the inductive bias is implicit in the code of the learning algorithm, while it is explicit as an input to the theorem prover.

Think about what this means. When our algorithm concluded (?, Dark, Fast), it felt like it was inducing a general rule from specific examples. But Mitchell has shown that it was actually deducing a conclusion from the training data plus an unstated assumption about the structure of the answer. The assumption—"your taste can be expressed as a conjunction of attributes"—was baked into the algorithm's design. Make that assumption explicit, hand it to a theorem prover along with the data, and you get the same answer by pure deduction.

As Mitchell puts it, the inductive bias "exists only in the eye of us beholders. Nevertheless, it is a perfectly well-defined set of assertions" (Mitchell, p. 44).

The so-called induction was deduction all along.

What This Means

Let me be blunt about the implications.

Mitchell has shown that every "inductive" learning algorithm is, underneath, a deductive system operating on unstated assumptions. The assumptions are doing the real work. The data merely selects among the possibilities that the assumptions have already circumscribed. Without those assumptions, we saw what happens: the unbiased learner is paralyzed, unable to classify a single new instance.

This is precisely what Popper argued from the philosophy side. You cannot derive general theories from observations alone. You always need a prior theoretical framework—what Kant called imposing laws upon nature—to make sense of the data. The data doesn't speak for itself. It never has.

But here is the part that both the machine learning community and many Popperians seem to miss. Popper proved that you can't generalize from observations alone. He did not prove that you can't generalize at all. Mitchell's work shows exactly when and how generalization becomes possible: when you bring background knowledge—an inductive bias—to the table. That bias, combined with observations, lets you deduce conclusions you couldn't have reached with either one alone.

The learner doesn't start from a blank slate and induce its way to knowledge. It starts with a constrained space of possibilities and uses observations to falsify the wrong ones. That is not "induction" in the classical Baconian sense that Popper demolished. It is conjecture and refutation, running on silicon.

Bias-free learning is futile. But biased learning—learning with prior theoretical commitments—is not only possible, it is the only kind of learning there is.

This is a companion post to a podcast episode where I discuss these ideas in more depth, including how they relate to Karl Popper's epistemology and David Deutsch's interpretation of it. All page references to Mitchell are from Machine Learning (McGraw-Hill, 1997).
