From Certainty to Belief: How Probability Extends Logic - Part 2
- By Bruce Nielson
- ML & AI Specialist
In our last post, we built an intuition for why probability theory might be thought of as an extension of deductive logic. In an earlier post we went over the basics of probability theory. (And in an even earlier post, we went over the basics of propositional logic.)
Let's now see if we can formalize our intuition and show that it is possible to do deductive logic using only probability theory. To do this, I'm going to use two fun examples taken from David Barber’s textbook Bayesian Reasoning and Machine Learning (his Examples 1.5 and 1.6 on pp. 11-12). We'll work through his examples step by step, explicitly noting which rule of probability theory we use at every line. The goal is to make the connection between logic and probability completely transparent.
Quick reminder of the rules of probability theory
Rules all taken from this post.
- Variables & states: a variable (e.g. A, F, T) can take the value tr (true) or fa (false).
- Joint symmetry: p(x, y) = p(y, x). The joint probability is symmetric.
- Conditional probability (definition): p(x | y) = p(x, y) / p(y), or equivalently p(x, y) = p(x | y) × p(y).
- Marginalisation (summing out): p(x) = Σ_y p(x, y); with two states, p(x) = p(x, y=tr) + p(x, y=fa).
- Bayes’ Rule (derived from the above): p(x | y) = [p(y | x) × p(x)] / p(y).
- Conditional independence (when stated): p(z | x, y) = p(z | y) means that once y is known, x gives no extra information about z.
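These rules can be sketched in a few lines of code. Below is a minimal illustration (function and variable names are my own, not from Barber's text) that stores a joint distribution over two boolean variables in a dictionary and checks that Bayes' Rule follows from the definition of conditional probability.

```python
# A minimal sketch of the rules above: a joint distribution p(x, y) over two
# boolean variables stored as a dict. All names here are illustrative only.

def marginal(joint, var_index, value):
    """Rule 4 (marginalisation): sum the joint over the other variable."""
    return sum(p for states, p in joint.items() if states[var_index] == value)

def conditional(joint, x, y):
    """Rule 3 (definition): p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / marginal(joint, 1, y)

# An arbitrary joint distribution p(x, y); the four entries sum to 1.
joint = {(True, True): 0.3, (True, False): 0.2,
         (False, True): 0.1, (False, False): 0.4}

p_x_tr = marginal(joint, 0, True)             # p(x=tr) = 0.3 + 0.2 = 0.5
p_x_given_y = conditional(joint, True, True)  # p(x=tr | y=tr) = 0.3 / 0.4

# Rule 5 (Bayes' Rule) agrees with the definition above:
p_y_given_x = joint[(True, True)] / p_x_tr    # p(y=tr | x=tr)
p_y_tr = marginal(joint, 1, True)
assert abs(p_y_given_x * p_x_tr / p_y_tr - p_x_given_y) < 1e-12
```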
Example 1 — Aristotle’s syllogism (Barber Example 1.5)
Informal statement:
- "All apples are fruit."
- "All fruit grow on trees."
- "Therefore, all apples grow on trees."
Variables:
- A = "is an apple" (A = tr means it is an apple)
- F = "is a fruit"
- T = "grows on trees"
Given (probabilistic form of the premises):
- p(F = tr | A = tr) = 1. (All apples are fruit.)
- p(T = tr | F = tr) = 1. (All fruits grow on trees.)
- Conditional independence assumption: p(T | A, F) = p(T | F).
Note that premise #3 says that once you already know F (whether something is a fruit), also learning A (whether it is an apple) tells you nothing more about T (whether it grows on trees).
Goal: show p(T = tr | A = tr) = 1.
That is to say, show that if you know something is an apple, you know it grew on a tree given your premises. (Exactly like deductive logic.)
Step-by-step derivation (every step labeled with the rule used):
1. Start with the target (goal):
p(T = tr | A = tr).
2. Use marginalisation over F (rule 4):
p(T = tr | A = tr) = Σ_{s ∈ {tr, fa}} p(T = tr, F = s | A = tr).
(We sum over the unobserved variable F to account for both states s = tr and s = fa.)
3. Use the definition of joint/conditional probability (rule 3):
p(T = tr, F = s | A = tr) = p(T = tr | F = s, A = tr) × p(F = s | A = tr).
Therefore:
p(T = tr | A = tr) = Σ_s p(T = tr | F = s, A = tr) × p(F = s | A = tr).
4. Apply the conditional independence assumption p(T | A, F) = p(T | F) (given): replace p(T = tr | F = s, A = tr) with p(T = tr | F = s). Now:
p(T = tr | A = tr) = Σ_s p(T = tr | F = s) × p(F = s | A = tr).
5. Write out the explicit two-term sum (s = fa and s = tr):
p(T = tr | A = tr) = p(T = tr | F = fa) × p(F = fa | A = tr) + p(T = tr | F = tr) × p(F = tr | A = tr).
6. Use premise 1: p(F = tr | A = tr) = 1. By normalisation, p(F = fa | A = tr) = 0.
This is because the probabilities of all possible states must sum to 1 under probability theory. So if p(F = tr | A = tr) = 1, then its complement p(F = fa | A = tr) = 0. Or, in plain English: if it is certain that anything that is an apple is also a fruit, then it must be equally certain that, given something is an apple, it is false that it is not a fruit. (Note the double negative.)
Substitute:
p(T = tr | A = tr) = p(T = tr | F = fa) × 0 + p(T = tr | F = tr) × 1.
7. Use premise 2: p(T = tr | F = tr) = 1.
Since the first term is multiplied by 0, p(T = tr | F = fa) drops out, leaving:
p(T = tr | A = tr) = 0 + 1 × 1 = 1.
Note that we never need to know p(T = tr | F = fa) (the probability that something grows on a tree given that it isn't a fruit) to solve this problem, thanks to that term dropping out.
Conclusion (Example 1): p(T = tr | A = tr) = 1.
In words: given the premises and conditional independence, probabilistic rules yield the conclusion: "all apples grow on trees" just like the classical logical syllogism.
Barber points out that this reasoning is a form of resolution, a kind of transitivity: from A ⇒ F and F ⇒ T, we can infer A ⇒ T.
Or put simply, it is possible to use the probability calculus to do regular logical resolution and work out a syllogism.
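The algebra above can also be checked numerically. Here is a small sketch (my own, not from Barber) that evaluates the two-term sum derived above; note that the unknown quantity p(T = tr | F = fa) can be set to anything at all without changing the answer, since it is multiplied by 0.

```python
# Verify Example 1: the premises force p(T=tr | A=tr) = 1, no matter what
# the unknown p(T=tr | F=fa) happens to be.

p_F_tr_given_A_tr = 1.0   # premise 1: all apples are fruit
p_T_tr_given_F_tr = 1.0   # premise 2: all fruit grow on trees
p_F_fa_given_A_tr = 1.0 - p_F_tr_given_A_tr   # = 0 by normalisation

for p_T_tr_given_F_fa in (0.0, 0.3, 1.0):     # arbitrary guesses
    # The two-term sum after marginalising over F and applying the
    # conditional independence p(T | A, F) = p(T | F):
    p_T_tr_given_A_tr = (p_T_tr_given_F_fa * p_F_fa_given_A_tr +
                         p_T_tr_given_F_tr * p_F_tr_given_A_tr)
    assert p_T_tr_given_A_tr == 1.0           # the syllogism's conclusion
```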
Example 2 — The contrapositive (Barber Example 1.6)
This example is important because Karl Popper noted that the contrapositive (the basis of modus tollens) underlies the logic of scientific reasoning: we 'falsify' our theories rather than verify them. So showing that we can do the same thing with probability theory will show that probability theory is also the logic of science. (Or at least can be.)
Informal statement:
- "If A then B" implies "If not B then not A."
Variables:
- A and B are boolean variables (tr/fa).
Given (probabilistic premise):
- p(B = tr | A = tr) = 1. (If A is true then B is certainly true.)
Goal: show p(A = fa | B = fa) = 1. (That is, show that if you know B is false, then you know for sure that A is also false.)
Step-by-step derivation:
1. Start from normalisation for A conditioned on B = fa:
p(A = fa | B = fa) = 1 - p(A = tr | B = fa),
because p(A = fa | B) + p(A = tr | B) = 1 (the total of all possibilities must sum to 1).
2. We will show p(A = tr | B = fa) = 0. If that holds, the left side equals 1.
3. From the premise p(B = tr | A = tr) = 1, take the complement to get:
p(B = fa | A = tr) = 1 - p(B = tr | A = tr) = 1 - 1 = 0.
(If A is true, B cannot be false.)
4. Now use Bayes’ Rule to express p(A = tr | B = fa) in terms of p(B = fa | A = tr):
p(A = tr | B = fa) = [ p(B = fa | A = tr) × p(A = tr) ] / p(B = fa).
The numerator contains p(B = fa | A = tr), which we just showed equals 0, so the numerator is 0 × p(A = tr) = 0.
5. The fraction therefore equals 0 / p(B = fa) = 0 (provided p(B = fa) > 0), so the denominator can be ignored. (In the edge case p(B = fa) = 0, the event B = fa never occurs, conditioning on it is undefined, and the contrapositive holds vacuously.)
6. Thus p(A = tr | B = fa) = 0. Plugging this into step 1:
p(A = fa | B = fa) = 1 - 0 = 1.
Conclusion (Example 2): p(A = fa | B = fa) = 1.
Or in other words, if B is false then A must also be false. This matches the logical contrapositive (modus tollens): from "If A then B" we infer "If not B then not A."
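As a sanity check, here is a short sketch (my own, not Barber's) that runs the Bayes' Rule computation above for several arbitrary priors p(A = tr); the posterior is 0 every time, because the numerator always carries the factor p(B = fa | A = tr) = 0.

```python
# Check Example 2: p(A=tr | B=fa) = 0 for any prior p(A=tr) < 1.

p_B_tr_given_A_tr = 1.0                       # premise: if A then B
p_B_fa_given_A_tr = 1.0 - p_B_tr_given_A_tr  # = 0, the complement

# p(B=fa | A=fa) is not fixed by the premise; pick an arbitrary value.
p_B_fa_given_A_fa = 0.7

for p_A_tr in (0.1, 0.5, 0.9):               # arbitrary priors on A
    # Denominator p(B=fa) by marginalisation (rule 4):
    p_B_fa = (p_B_fa_given_A_tr * p_A_tr +
              p_B_fa_given_A_fa * (1.0 - p_A_tr))
    # Bayes' Rule:
    p_A_tr_given_B_fa = p_B_fa_given_A_tr * p_A_tr / p_B_fa
    assert p_A_tr_given_B_fa == 0.0
    # Normalisation then gives the contrapositive:
    assert 1.0 - p_A_tr_given_B_fa == 1.0
```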
Why these worked and what they teach us
- Every step used only the standard rules of probability theory: conditional probability, marginalisation (summing), joint expansion, Bayes’ Rule, and (where stated) conditional independence. I called out which rule was used at each line so you can trace the logic directly back to the primer.
- When premises are certainties (probability = 1), probabilistic reasoning reproduces classical logic. In the examples above the premises used conditional probabilities equal to 1; that forces certain conditional and marginal terms to be 0 or 1 and collapses the algebra to the logical conclusions.
- Probability can reproduce Aristotle's syllogisms. We specifically chose to reproduce a syllogism so that we could show that probability theory is a generalization of syllogistic logic.
- Probability is strictly more general than propositional or Boolean logic. If any premise had been less than 1 (for example p(B = tr | A = tr) = 0.9), the same algebra would produce posterior probabilities strictly between 0 and 1 instead of exactly 0 or 1. That is the strength of the probabilistic approach: it handles degrees of belief as naturally as it handles certainties.
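To make that last point concrete, here is a brief sketch (my own, reusing Example 1's independence assumption along with an assumed value for p(T = tr | F = fa)) showing that weakening the premises turns the conclusion into a degree of belief.

```python
# Rerun the syllogism of Example 1 with premises weaker than certainty.

p_F_tr_given_A_tr = 0.95  # "most apples are fruit"
p_T_tr_given_F_tr = 0.9   # "most fruit grow on trees"
p_T_tr_given_F_fa = 0.2   # assumed value; it no longer drops out

p_F_fa_given_A_tr = 1.0 - p_F_tr_given_A_tr
# Same two-term sum as in Example 1:
p_T_tr_given_A_tr = (p_T_tr_given_F_tr * p_F_tr_given_A_tr +
                     p_T_tr_given_F_fa * p_F_fa_given_A_tr)
# The conclusion is now a degree of belief (about 0.865), not 0 or 1.
```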
Conclusion
This demonstration goes beyond our earlier intuition: probability theory really can reproduce classical deductive logic, at least for syllogistic reasoning. But does this amount to a formal proof? Can probability theory represent all of propositional logic? And what about first-order logic?
We will consider these questions in future blog posts.