Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University
Think of propositions as sets of possible states of the world. Thus, "the sky is blue" picks out all world states in which the color of the sky is blue. Some of these world states will have houses and cars and others will not.
An algebra is a collection of propositions such that for any two propositions p, q in the collection, the propositions
not-p, p & q, p or q
are in the collection.
T = the vacuous proposition.
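As a toy sketch of propositions-as-sets (the world states and proposition names here are invented for illustration), the closure operations of an algebra can be rendered directly on sets of world states:

```python
from itertools import product

# World states as (sky_color, has_houses) tuples; a proposition is the set
# of world states in which it holds. All names are illustrative.
worlds = frozenset(product(["blue", "grey"], [True, False]))

blue_sky = {w for w in worlds if w[0] == "blue"}
houses   = {w for w in worlds if w[1]}

def NOT(p):    return worlds - p     # not-p
def AND(p, q): return p & q          # p & q
def OR(p, q):  return p | q          # p or q

T = worlds                           # the vacuous proposition: true in every world state

# The collection of all subsets of worlds is closed under all three operations.
assert OR(blue_sky, NOT(blue_sky)) == T
assert AND(blue_sky, houses) == {("blue", True)}
```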
A probability function P on an algebra is an assignment of numbers to propositions such that:
(1) P(p) >= 0, for every proposition p;
(2) P(T) = 1;
(3) P(p or q) = P(p) + P(q), whenever p and q are mutually exclusive.
Definition of conditional probability:
P(H|E) = P(H & E)/P(E), provided P(E) > 0.
Bayes' theorem: A trivial consequence of the definition.
P(H|E) = P(H)P(E|H)/P(E).
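Both the definition and Bayes' theorem can be checked in a toy model (the world weights below are made-up numbers for illustration):

```python
# Toy probability model: each world state gets a weight; a proposition is a
# set of world states. The weights are invented for illustration.
P_world = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

def P(prop):                      # probability of a proposition
    return sum(P_world[w] for w in prop)

def cond(a, b):                   # P(a|b) = P(a & b)/P(b), provided P(b) > 0
    return P(a & b) / P(b)

H = {"w1", "w2"}                  # hypothesis
E = {"w2", "w3"}                  # evidence

# Bayes' theorem: P(H|E) = P(H) P(E|H) / P(E)
lhs = cond(H, E)
rhs = P(H) * cond(E, H) / P(E)
assert abs(lhs - rhs) < 1e-12
```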
Total probability theorem: A trivial consequence of the definition and axioms.
P(E) = SUMi P(E|Hi)P(Hi), where the Hi's are mutually exclusive and exhaustive.
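The total probability theorem can likewise be verified numerically with an invented two-cell partition:

```python
# Toy model: invented weights, with H1, H2 mutually exclusive and exhaustive.
P_world = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

def P(prop): return sum(P_world[w] for w in prop)
def cond(a, b): return P(a & b) / P(b)

H1, H2 = {"w1", "w2"}, {"w3", "w4"}   # a partition of the worlds
E = {"w2", "w3"}

# P(E) = SUM_i P(E|Hi) P(Hi)
total = cond(E, H1) * P(H1) + cond(E, H2) * P(H2)
assert abs(total - P(E)) < 1e-12
```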
A rational agent whose degrees of belief are represented by probability function P should update her degrees of belief to P(.|E) after observing E.
By Bayes' theorem, the new degree of belief in H after seeing E is
P(H|E) = P(H)P(E|H)/P(E).
This formula is so important that the individual parts have special, time-honored names:
Likelihood of E given H = P(E|H). This is usually fairly definite and objective, since a theory usually makes a determinate prediction or specifies a probability of occurrence of a given experimental outcome. Classical statisticians allow only such probabilities to enter into methodology, so theories themselves cannot be said to have probabilities. You were warned about this in your elementary statistics class. Possibly you already forgot!
Prior probability of H = P(H). This may be quite subjective, reflecting a theory's initial "plausibility" prior to scientific investigation. This plausibility depends on such factors as intelligibility, simplicity, and whether the mechanism posited by the theory has been observed to operate elsewhere in nature (e.g., uniformitarian vs. catastrophist geology). In the 19th c. it was proposed that only causes observed to operate in nature could be invoked in new theories. This reflects prior probability.
Prior probability of E = P(E). This is subjective and very hard to specify. Using total probability, P(E) = SUMi P(E|Hi)P(Hi), with the sum taken over all possible theories. Nobody knows what all possible theories are. At most they are aware of the dominant paradigm and a few competitors.
Refutation is fatal: If E is itself consistent but inconsistent with H, then P(H|E) = 0.
Proof: Note E & H = not T (the contradictory proposition), since the two are inconsistent. Also, P(not T) + P(T) = 1 by axiom (3). P(T) = 1 by axiom (2). Hence, P(not T) = 0. Now we have:
P(H|E) = P(H & E)/P(E) = P(not T)/P(E) = 0/P(E) = 0.
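In the set picture the proof is immediate: inconsistent propositions share no worlds, so their conjunction gets probability zero (weights below are invented):

```python
# Refuting evidence: E and H share no worlds, so P(H & E) = 0 and the
# posterior is 0 no matter how high the prior P(H) was. Invented weights.
P_world = {"w1": 0.5, "w2": 0.3, "w3": 0.2}

def P(prop): return sum(P_world[w] for w in prop)

H = {"w1"}                        # hypothesis with a healthy prior of 0.5
E = {"w2", "w3"}                  # evidence inconsistent with H

posterior = P(H & E) / P(E)
assert posterior == 0.0
```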
Surprising predictions are good, initial plausibilities being similar: If H entails E, then P(H|E) = P(H)/P(E), which is greater insofar as P(E) is lower.
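With made-up numbers, the effect of a surprising prediction is easy to see: the same prior gets a much bigger boost when the entailed evidence was improbable.

```python
# When H entails E, P(H|E) = P(H)/P(E). The numbers are invented.
def posterior_when_entailed(prior_H, prior_E):
    return prior_H / prior_E

boring     = posterior_when_entailed(0.1, 0.9)   # E was expected anyway
surprising = posterior_when_entailed(0.1, 0.2)   # E was a surprise

assert surprising > boring    # lower P(E) means a bigger boost for H
```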
Strong explanations are good, initial plausibilities being similar: P(H1|E)/P(H2|E) = [P(H1)/P(H2)][P(E|H1)/P(E|H2)].
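With invented numbers, the ratio form makes the point concrete: when priors are equal, the posterior ratio is just the likelihood ratio, so the better explainer wins.

```python
# Posterior ratio = (prior ratio) x (likelihood ratio). Numbers invented.
def posterior_ratio(prior1, prior2, like1, like2):
    return (prior1 / prior2) * (like1 / like2)

# Equal priors; H1 explains E three times as strongly as H2.
r = posterior_ratio(0.5, 0.5, 0.9, 0.3)
assert r > 1.0    # H1 comes out ahead
```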
Unification is good, initial plausibilities being similar: A unified theory explains some regularity that the disunified theory does not. For example, Copernicus' theory entails that the total number of years must equal the total number of synodic periods + the total number of periods of revolution.
High initial plausibility is good, explanations being similar: P(H1|E)/P(H2|E) = [P(H1)/P(H2)][P(E|H1)/P(E|H2)].
Saying more lowers probability: H entails H' ==> P(H) <= P(H').
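In the set picture, entailment is just set inclusion, so the monotonicity of P gives this at once (toy weights below are invented):

```python
# H entails H' means H picks out a subset of the worlds H' picks out,
# so P(H) <= P(H'). Invented weights.
P_world = {"w1": 0.25, "w2": 0.25, "w3": 0.5}

def P(prop): return sum(P_world[w] for w in prop)

H  = {"w1"}            # says more: true in fewer worlds
Hp = {"w1", "w2"}      # says less: H entails Hp

assert H <= Hp         # entailment as subset
assert P(H) <= P(Hp)   # saying more lowers probability
```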
Conflict turns explanatory strength into an asset: Hey, didn't we just say that strong explanations are good??? That is true if the initial plausibilities are similar. But if one theory entails the other, they won't be. Thus, unification-style arguments only work if the competing theories are mutually contradictory!
Scientific method should be objective. The method is objective. Everybody is supposed to update by calculating personal probabilities. Some of the inputs to this method (prior probabilities) are not objective.
Scientific method should not consider subjective, prior plausibilities. That's just the kind of blind, pre-paradigm science Kuhn ridicules as being sterile. Without prior plausibilities to guide inquiry, no useful experiments would ever be performed.
Priors should be flat. What is flat? If we are uncertain about the size of a cube, should we be indifferent about its side length, its face area, or its volume?
Whichever one we are unbiased about, we are strongly biased about the others!
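A quick numeric sketch of the cube problem (the edge range [0, 2] and the sample size are arbitrary choices): a prior that is flat over side length is sharply non-flat over volume.

```python
import random

# Sample side lengths from a flat prior on [0, 2]. Parameters are
# arbitrary choices for this sketch.
random.seed(0)
sides = [random.uniform(0, 2) for _ in range(100_000)]

# Flat over side length: P(side < 1) is about 1/2.
p_small = sum(s < 1 for s in sides) / len(sides)

# But "side < 1" is the very same event as "volume < 1", and a prior flat
# over volume on [0, 8] would give that event probability 1/8, not 1/2.
assert abs(p_small - 0.5) < 0.02
```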
High posterior probability doesn't mean that the theory is true. To some extent, one can show that the agent must believe that she will converge to the truth. But this doesn't mean that she will.
It isn't clear that numbers like P(E) even exist. One can respond with a protocol for eliciting such numbers, but in practice it doesn't always work. One can say that the subjects are "irrational", but the audience can always blame Bayesianism instead of the subjects.
The old evidence problem. If E is already known, then P(E) = 1 and P(E|H) = 1, so P(H|E) = P(H)P(E|H)/P(E) = P(H). So old evidence never "confirms" a hypothesis.
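The arithmetic of the problem is trivial to check (the prior 0.3 is an arbitrary choice):

```python
# If E is already known, P(E) = 1 and P(E|H) = 1, so conditioning on E
# leaves the prior untouched. The prior value is arbitrary.
def posterior(prior_H, like_E_given_H, prior_E):
    return prior_H * like_E_given_H / prior_E

assert posterior(0.3, 1.0, 1.0) == 0.3   # posterior equals prior: no confirmation
```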
Responses:
Counterfactual confirmation: Some counterfactual version of yourself who had not already learned that E would have found that P(H|E) > P(H) even though you do not because for you P(E) = 1.
Objection: Many different counterfactual persons could have turned into you after seeing E. Which one are you?
Something new is learned: that the theory entails the old data. This makes old evidence an instance of the "problem of found constraints" below.
Objection: Evidential support shouldn't depend on prior mathematical ignorance.
The problem of lost constraints arises when possibilities previously thought impossible come to be entertained.
The problem of new theories: The idea that we start out with a prior plausibility for each possible theory we will ever consider in the future flies in the face of the historical facts. We are unaware of theories until they are proposed. If they are plausible, we suddenly accord them high plausibilities at the expense of theories we are already aware of. This shift in probability mass is not accounted for by the conditioning rule. This is just what happens in a paradigm crisis.
The problem of pseudo-refutations: Sometimes refuted theories become unrefuted. For example, Newton thought that any wave theory of optics would entail that light is visible around corners. Fresnel showed that although light does go around corners, it is not visible around corners, undoing the anomaly by exposing the fallacy in Newton's argument. The redirection of probability toward the unjustly blamed theory is not governed by the conditioning rule.
The problem of found constraints arises when possibilities formerly entertained are no longer thought possible: Sometimes it is learned that a theory makes a prediction that nobody noticed before. Thus, Einstein found that general relativity implies the orbital precession of Mercury, whereas Maxwell discovered that Ampere's law is inconsistent with the other principles of electromagnetism. The movement of probability toward or away from the theory affected by the newly discovered implication is not governed by the conditioning rule.
The problem of found constraints is a bit easier than the problem of lost constraints, since adding constraints determines a new probability model, whereas relaxing constraints could lead to many very different models.
Quasi-Bayesianism:
We want to look as much like rational Bayesians as possible. But...
Normal science:
By (1), nearly all probability mass is on the current "pet" theory (no other theory we can think of comes close).
Testing tests the tester rather than the theory, since no other possibilities carry any probability mass. Conditioning cannot lower a unit probability: if P(H) = 1 then P(H|E) = P(H & E)/P(E) = P(E)/P(E) = 1.
When derivations refuting the theory are found (3), we are suspicious of them and wait for experts in the paradigm to sign off first.
When we fail to give a good explanation, we hope that we will find one later (2).
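The fixed-point observation in the testing point above is easy to verify numerically (the toy weights are invented):

```python
# If P(H) = 1, conditioning cannot dislodge it: P(H & E) = P(E), so
# P(H|E) = P(E)/P(E) = 1 for any E with P(E) > 0. Invented weights.
P_world = {"w1": 0.6, "w2": 0.4}

def P(prop): return sum(P_world[w] for w in prop)

H = {"w1", "w2"}                   # the "pet" theory: true in every world
E = {"w2"}                         # any evidence with positive probability

assert P(H) == 1.0
assert P(H & E) / P(E) == 1.0      # the unit probability is immovable
```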
Crisis:
Our experts fail to find explanations or to resolve apparent refutations.
This makes the amorphous "something else" hypothesis look better, stimulating spade-work on new theories.
Revolution:
A plausible theory is formulated that either escapes the old anomalies or unifies or explains what remained unexplained. By (1), this theory attracts significant prior probability. Then the account of unification raises its posterior probability.
Still, some holdouts will suspect the new explanations or find the postulated mechanisms too implausible to be overcome by the higher likelihood. They don't trust the new unifications or anomaly resolutions (something could still be wrong). Also, the prior plausibility of the old paradigm makes it reasonable to continue trying to fix it.
After the new theory is used for a while, its initial implausibility wears off for new students through increased familiarity. Thus, the prior probabilities of new scientists do not penalize the new theory.
Eventually, staid proponents of the old paradigm die out.
Incommensurability
On the Bayesian approach, method is objective, but method has subjective inputs. Thus, there is no need to accept Kuhn's claim that a logical approach to method would force everyone to move to a new paradigm at once.
Paradigm choice is a decision. Decision theory involves utilities as well as probabilities. These are also subjective.
It doesn't matter whether everyone shares prior probabilities. It matters only that enough people are impressed by new unifications for the probabilities of most community members to shift.
It doesn't matter that problems are lost. Probability is a matter of "all things considered". It suffices that most community members count the new solutions sufficiently heavily.
A shared observation language is not necessary. All that is required for Bayes' theorem is that the theory look good in terms of the data that it partly generates.
It is true that we can't envision all new theories or problem solutions at once. But the Bayesian approach is still explanatory if we can explain scientists' behavior as striving to satisfy the Bayesian ideal. Adding this striving, the Bayesian approach looks a lot like the normal-crisis-revolution cycle.
Reality:
If logical knowledge steadily increases and we are inventive enough, this process might converge to the truth.