Cox's Theorem: Is Probability Theory Universal?
- By Bruce Nielson
- ML & AI Specialist
Most of life involves reasoning using incomplete information. Or put another way, it's about reasoning under uncertainty. Will it rain tomorrow? Should you carry an umbrella? Classical (propositional) logic handles things that are exactly true or false — but it doesn’t tell you how to reason when you’re unsure.
In our last several posts, particularly this one, we talked about how probability theory is a generalization of Boolean and propositional logic. That is to say, probability theory extends ordinary deductive logic and includes it as a special case, at least for Boolean and propositional logic. (Whether the same holds for first-order logic is still an open question.) So probability theory gives us a way to reason with incomplete or uncertain information.
But is probability theory just one possible way to reason under uncertainty? Or is it the only way to do so?
Cox’s theorem (R. T. Cox, 1946) attempts to answer that question. Put simply: if you want a system for plausible reasoning that is sensible and consistent, you are inevitably led to probability theory. Edwin T. Jaynes championed and popularized this view; see his book, Probability Theory: The Logic of Science.
If you want to dive into the originals, Cox’s paper and notes are available online:
- Cox, Probability, Frequency and Reasonable Expectation (American Journal of Physics, 1946). (PDF)
- Cox, The Algebra of Probable Inference (manuscript/PDF)
- Short overview / background: Cox’s theorem (Wikipedia)
What We Demand of a Reasoner (According to Jaynes)
Jaynes summarized Cox's intuition in three crisp requirements. He calls these 'desiderata' because they aren't axioms assumed to be true, but rather desires or goals we choose to constrain ourselves to. Despite their simplicity and even apparent 'obviousness', these desiderata are the only things we ask of a system that assigns degrees of plausibility to statements:
(I) Representation — degrees of plausibility are real numbers.
You should be able to say, “this is more plausible than that” and encode that judgment with a single number:
- Everything gets a number. Any statement A has a plausibility pl(A) (think: a slider).
- Negation is linked. Knowing pl(A) should tell you pl(not A) (they're not independent).
The first requirement means you can always compare two (mutually exclusive) statements and say that one is more plausible than the other, or that they're equally plausible. On the surface this seems reasonable, almost inevitable. How else could plausibility work except by giving some way to ultimately decide between mutually exclusive options? And that implies there must be a way to weigh each option against its competitors, which isn't possible unless everything can be translated down to a single continuous value.
The second sub-point also seems intuitively correct: once you know how plausible a statement is, you should automatically know how plausible its negation (the opposite statement) is. Suppose you think “it will rain” is somewhat likely: pl(rain) = 0.8 (80% plausible). Then, under Cox-like rules, the plausibility of “not rain” is determined: pl(not rain) = 0.2. If you also learn “it’s cloudy” and that makes “rain” more plausible, the rules tell you precisely how to combine the old plausibilities and the new evidence, and that combination must behave like Bayesian updating (i.e., use conditional probabilities).
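To make this concrete, here is a minimal Python sketch of both ideas: the negation rule fixing pl(not rain) once pl(rain) is given, and a Bayesian update on learning "it's cloudy". The prior of 0.8 is the one from the example above; the likelihoods for "cloudy" are assumed numbers chosen purely for illustration.

```python
# Minimal sketch: the negation rule plus a Bayesian update.
# Prior pl(rain) = 0.8 comes from the example in the text;
# the likelihoods for "cloudy" are made-up illustrative numbers.

def normalize(weights):
    """Rescale a dict of nonnegative weights so they sum to 1."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

prior = {"rain": 0.8, "no rain": 0.2}               # pl(not rain) is forced to 0.2
likelihood_cloudy = {"rain": 0.9, "no rain": 0.4}   # assumed P(cloudy | hypothesis)

# Bayes: posterior is proportional to prior times likelihood
posterior = normalize({h: prior[h] * likelihood_cloudy[h] for h in prior})

print(posterior["rain"], posterior["no rain"])   # ~0.9 and ~0.1
print(sum(posterior.values()))                   # sums to 1: the negation rule still holds
```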
Despite the 'obviousness' of this desideratum, it is probably the most challenged requirement.
(II) Qualitative correspondence with common sense.
When information is crisp, the system should reduce to ordinary logic; small changes in evidence cause small changes in plausibility:
- Respect logic: logically equivalent statements get the same plausibility.
- Continuity: tiny changes in input → tiny changes in output.
- The AND rule (decomposability): the plausibility of “A and B” should depend only on how plausible B is, and how plausible A would be if B were true.
The continuity requirement seems sensible because you wouldn't expect a tiny piece of new information to cause a massive, sudden jump in your confidence about something.
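The AND rule is easiest to see with a worked example. The sketch below uses a scenario not in the original post (drawing two cards from a standard deck without replacement, chosen only for illustration) to compute the plausibility of "both cards are aces" from exactly the two ingredients the rule allows: how plausible the first ace is, and how plausible the second ace would be given the first.

```python
from fractions import Fraction

# The AND rule (decomposability): pl(A and B) is built from pl(B) and pl(A given B).
# Illustrative example: draw two cards without replacement,
# B = "first card is an ace", A = "second card is an ace".

p_first_ace = Fraction(4, 52)                  # pl(B)
p_second_ace_given_first = Fraction(3, 51)     # pl(A | B): one ace is already gone

p_both_aces = p_second_ace_given_first * p_first_ace   # the product rule in action
print(p_both_aces)   # 1/221
```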
(III) Consistency.
Different, valid routes to the same conclusion must give the same answer; the rules must scale and compose:
- Universality: the same rules must work in any domain.
- Non-contradiction: if a conclusion can be derived more than one way, all derivations agree.
- Scalability: rules that make sense for one case should still make sense for many repeated or combined cases.
Your system of plausible reasoning should be consistent like formal logic (for our purposes that means propositional logic). For example, if two statements are logically equivalent (they mean the same thing), they should always have the same plausibility. Likewise, if something is always true by definition (a "tautology"), its plausibility should be at the maximum possible value.
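As a small illustration (a sketch over a made-up two-proposition world, with numbers chosen only for the example), the snippet below checks both requirements: the logically equivalent statements not(A and B) and (not A) or (not B) get the same plausibility, and the tautology "A or not A" gets the maximum value, 1.

```python
# Toy joint distribution over two propositions A and B (illustrative numbers).
joint = {
    (True,  True):  0.375,
    (True,  False): 0.25,
    (False, True):  0.125,
    (False, False): 0.25,
}

def prob(event):
    """Plausibility of an event, given as a predicate over (A, B) worlds."""
    return sum(p for world, p in joint.items() if event(*world))

# Logically equivalent statements get the same plausibility (De Morgan's law):
assert prob(lambda a, b: not (a and b)) == prob(lambda a, b: (not a) or (not b))

# A tautology gets the maximum plausibility:
assert prob(lambda a, b: a or not a) == 1.0
```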
And the system should be universal – it shouldn't just apply to a specific type of problem or domain. You should be able to reason about anything, from unrelated events to complex scenarios, and the underlying rules should still hold. This is crucial for making the logic broadly applicable, just like the fundamental rules of mathematics or logic apply everywhere.
One Probability Logic to Rule Them All?
Cox showed — and later expositions (Jaynes, others) made rigorous — that those seemingly mild, intuitive demands uniquely determine the algebra of plausible reasoning. Up to a simple re-scaling, the plausibility numbers must obey the product rule and the sum/negation rule. After mapping to the usual 0–1 scale, those rules are exactly the familiar axioms of probability:
- P(A and B | C) = P(A | B, C) × P(B | C)   (product rule)
- P(A | C) + P(not A | C) = 1   (negation/sum rule)
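Here is a quick numerical check of those two rules, plus the consistency requirement that different derivation routes agree: factoring the joint plausibility as P(A|B)P(B) or as P(B|A)P(A) must give the same number, and rearranging that equality is exactly Bayes' theorem. The joint distribution is a made-up example, and the conditioning context C is left implicit.

```python
# Verify the product and sum rules on a small made-up joint distribution over A and B.
joint = {
    (True,  True):  0.12,
    (True,  False): 0.28,
    (False, True):  0.18,
    (False, False): 0.42,
}

def p(event):
    return sum(pr for world, pr in joint.items() if event(*world))

def p_given(event, cond):
    return p(lambda a, b: event(a, b) and cond(a, b)) / p(cond)

A = lambda a, b: a
B = lambda a, b: b
not_A = lambda a, b: not a

# Product rule, two different routes to the same joint plausibility:
route1 = p_given(A, B) * p(B)   # P(A|B) P(B)
route2 = p_given(B, A) * p(A)   # P(B|A) P(A)
assert abs(route1 - route2) < 1e-12                      # consistency; rearranged, this is Bayes' theorem
assert abs(route1 - p(lambda a, b: a and b)) < 1e-12     # both match P(A and B)

# Negation/sum rule:
assert abs(p(A) + p(not_A) - 1.0) < 1e-12
```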
It is possible to come up with alternatives that might on the surface look quite different. But there will always be a function that maps their results back to ordinary probability theory. The only way to avoid this is to adopt a system that violates one of those seemingly innocuous desiderata (which seems rather undesirable).
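For instance (a sketch of my own, not anything from Cox's paper), one alternative-looking scheme is to score plausibility in log-odds, where pieces of evidence add instead of multiply. But log-odds is just a monotone re-scaling of probability: the logistic function maps it straight back, so it is the same theory in a different coordinate system. The numbers below reuse the assumed rain/cloudy figures from the earlier sketch.

```python
import math

def prob_to_logodds(p):
    """Re-scale a probability to log-odds (evidence adds instead of multiplies)."""
    return math.log(p / (1.0 - p))

def logodds_to_prob(l):
    """The inverse map (logistic function) takes the alternative scale back to probability."""
    return 1.0 / (1.0 + math.exp(-l))

p = 0.8
l = prob_to_logodds(p)
print(l)                      # ~1.386
print(logodds_to_prob(l))     # 0.8 again: the "alternative" is probability in disguise

# A Bayesian update becomes simple addition on this scale:
log_likelihood_ratio = math.log(0.9 / 0.4)          # assumed numbers from the rain/cloudy sketch
print(logodds_to_prob(l + log_likelihood_ratio))     # ~0.9, matching the earlier posterior
```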
In short: if you accept Jaynes's desiderata, probability theory is the only consistent extension of Boolean logic to uncertain situations. Cox and Jaynes argue that probability theory is not merely a tool for frequencies or gambling; it is the uniquely rational way to handle degrees of belief in uncertain situations.
So Why Does This Matter?
Cox's theorem is used to argue for several important points:
- Foundational justification for Bayesian reasoning. Cox gives a principled reason to treat probabilities as degrees of belief (not only long-run frequencies).
- Objectivity and consistency. If you accept the desiderata, then any other scheme will give contradictory or “nonsensical” answers in some cases.
- Applies to single events. You don’t need repeatable trials — probability applies to unique hypotheses (e.g., “life exists on other planets”) because it codifies consistent belief, not frequency.
- Practical payoff. The rules Cox forces on us are exactly the ones used in Bayesian inference, decision theory, and much of modern machine learning.
Conclusion and A Quick Caution
There are ongoing technical discussions (and a few edge-case counterexamples) in the literature about which exact axioms are needed. There are also a few less well known alternative theories of plausibility that try out different desiderata, such as not requiring plausibilities to be expressed as a single real value.
Still, this seems like a big breakthrough. It takes the idea we argued for previously, that probability theory is an extension of propositional logic, formalizes it, and proves that under seemingly reasonable assumptions probability theory is the only way to represent logical plausibilities.
It is not an accident that Bayesian probabilities have come to dominate Machine Learning in recent years.