Two of Barcelona’s architectural masterpieces are as different as different could be. The Sagrada Família, designed by Antoni Gaudí, is only a few miles from the German Pavilion, built by Mies van der Rohe. Gaudí’s church is flamboyant and complex. Mies’s pavilion is tranquil and simple. Mies, the apostle of minimalist architecture, used the slogan ‘less is more’ to express what he was after. Gaudí never said ‘more is more’, but his buildings suggest that this is what he had in mind.
One reaction to the contrast between Mies and Gaudí is to choose sides based on a conviction concerning what all art should be like. If all art should be simple or if all art should be complex, the choice is clear. However, both of these norms seem absurd. Isn’t it obvious that some estimable art is simple and some is complex? True, there might be extremes that are beyond the pale; we are alienated by art that is far too complex and bored by art that is far too simple. However, between these two extremes there is a vast space of possibilities. Different artists have had different goals. Artists are not in the business of trying to discover the uniquely correct degree of complexity that all artworks should have. There is no such timeless ideal.
Science is different, at least according to many scientists. Albert Einstein spoke for many when he said that ‘it can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience’. The search for simple theories, then, is a requirement of the scientific enterprise. When theories get too complex, scientists reach for Ockham’s Razor, the principle of parsimony, to do the trimming. This principle says that a theory that postulates fewer entities, processes or causes is better than a theory that postulates more, so long as the simpler theory is compatible with what we observe. But what does ‘better’ mean? It is obvious that simple theories can be beautiful and easy to understand, remember and test. The hard problem is to explain why the fact that one theory is simpler than another tells you anything about the way the world is.
One of the most famous scientific endorsements of Ockham’s Razor can be found in Isaac Newton’s Mathematical Principles of Natural Philosophy (1687), where he states four ‘Rules of Reasoning’. Here are the first two:
Rule I. No more causes of natural things should be admitted than are both true and sufficient to explain their phenomena. As the philosophers say: nature does nothing in vain, and more causes are in vain when fewer suffice. For nature is simple and does not indulge in the luxury of superfluous causes.
Rule II. Therefore, the causes assigned to natural effects of the same kind must be, so far as possible, the same. Examples are the cause of respiration in man and beast, or of the falling of stones in Europe and America, or of the light of a kitchen fire and the Sun, or of the reflection of light on our Earth and the planets.
Newton doesn’t do much to justify these rules, but in an unpublished commentary on the Book of Revelation, he says more. Here is one of his ‘Rules for methodising/construing the Apocalypse’:
To choose those constructions which without straining reduce things to the greatest simplicity. The reason of this is… [that] truth is ever to be found in simplicity, and not in the multiplicity and confusion of things. It is the perfection of God’s works that they are all done with the greatest simplicity. He is the God of order and not of confusion. And therefore as they that would understand the frame of the world must endeavour to reduce their knowledge to all possible simplicity, so it must be in seeking to understand these visions…
Newton thinks that preferring simpler theories makes sense, whether the task is to interpret the Bible or to discover the laws of physics. For him, Ockham’s Razor is right on both counts, and for the same reason: the Universe was created by a God of order.
In the 20th century, philosophers, statisticians and scientists made progress on understanding why the simplicity of a theory is relevant to assessing what the world is like. Their justifications of Ockham’s Razor do not depend on theology, nor do they invoke the grandiose thesis that nature is simple. There are at least three ‘parsimony paradigms’ within which the razor can be justified.
The first is exemplified by the advice given to medical students that they should ‘avoid chasing zebras’. If a patient’s symptoms can be explained by the hypothesis that she has common disease C, and also can be explained by the hypothesis that she has rare disease R, you should prefer the C diagnosis over the R. C is said to be more parsimonious. In this case, the more parsimonious hypothesis has the higher probability of being true.
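A quick calculation makes the point vivid. In the sketch below, the prevalences and likelihood are invented for illustration; because both diseases explain the symptoms equally well, the posterior odds reduce to the prior odds, and the common disease wins.

```python
# A minimal sketch of the first parsimony paradigm, with made-up prevalences.
# Both diagnoses confer the same probability on the symptoms, so Bayes's
# theorem reduces the comparison to the prior probabilities of the diseases.

prior_common = 0.05     # hypothetical prevalence of the common disease C
prior_rare = 0.0005     # hypothetical prevalence of the rare disease R
likelihood = 0.9        # P(symptoms | disease), assumed equal for C and R

posterior_odds = (likelihood * prior_common) / (likelihood * prior_rare)
print(f"Posterior odds favouring C over R: {posterior_odds:.0f} to 1")  # 100 to 1
```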
There is another situation in which simpler theories have higher probabilities. It involves the version of Ockham’s Razor that I call ‘the razor of silence’. If you have evidence that C1 is a cause of E, and no evidence that C2 is a cause of E, then C1 is a better explanation of E than C1&C2 is. The 19th-century philosopher John Stuart Mill was thinking of such cases when he said that the principle of parsimony is
a case of the broad practical principle, not to believe anything of which there is no evidence … The assumption of a superfluous cause is a belief without evidence; as if we were to suppose that a man who was killed by falling over a precipice must have taken poison as well.
Mill is talking about the razor of silence. The better explanation of E is silent about C2; it does not deny that C2 was a cause. The problem changes if you consider two conjunctive hypotheses. Which is the better explanation of E: C1&¬C2 or C1&C2? The razor of silence provides no guidance, but another razor, the razor of denial, does. It tells you to prefer the former. Unfortunately, it is unclear what justification there could be for this claim if you have no evidence, one way or the other, as to whether C2 is true. The razor of silence is easy to justify; justifying the razor of denial is more difficult.
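The asymmetry can be put in the language of the probability calculus; the sketch below uses nothing beyond the conjunction rule.

```latex
% The razor of silence: a conjunction is never more probable than one of its
% conjuncts, so the silent hypothesis C1 cannot be less probable than C1&C2.
P(C_1) \;\ge\; P(C_1 \,\&\, C_2)

% The razor of denial enjoys no such guarantee: which conjunction is more
% probable depends on how likely C2 is, given C1.
P(C_1 \,\&\, \neg C_2) \;\gtrless\; P(C_1 \,\&\, C_2)
```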
In the example of the rare and common diseases, the two hypotheses confer the same probability on the observations. The second parsimony paradigm focuses on situations in which a simpler hypothesis and a more complex hypothesis confer different probabilities on the observations. In many such cases, the evidence favours the simpler theory over its more complex competitor. For example, suppose that all the lights in your neighbourhood go out at the same time. You then consider two hypotheses:
(H1) something happened to the power plant at 8pm on Tuesday that influenced all the lights; or
(H2) something happened to each of the light bulbs at 8pm on Tuesday that influenced whether the light would go on.
Postulating a single common cause is more parsimonious than postulating a large number of independent, separate causes. The simultaneous darkening of all those lights is more probable if H1 is true than it would be if H2 were true. Building on ideas developed by the philosopher Hans Reichenbach, you can prove mathematically (from assumptions that flesh out what H1 and H2 are saying) that the observations favour H1 over H2. The mathematically curious could have a look at my book Ockham’s Razors: A User’s Manual (2015).
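A back-of-the-envelope version conveys the flavour of that result. Suppose, purely for illustration, that there are 200 lights and that any one ‘something happened’ event has a 1-in-100 chance of occurring on a given evening; the numbers are invented, but the asymmetry they generate is not.

```python
import math

n_lights = 200   # hypothetical number of lights in the neighbourhood
p_event = 0.01   # hypothetical chance of any one "something happened" event

# H1: a single event at the power plant darkens every light at once.
log_p_h1 = math.log(p_event)

# H2: each bulb independently suffers its own event at the same moment.
log_p_h2 = n_lights * math.log(p_event)

# Work in logarithms to avoid numerical underflow; the likelihood ratio is astronomical.
log10_ratio = (log_p_h1 - log_p_h2) / math.log(10)
print(f"H1 makes the blackout roughly 10^{log10_ratio:.0f} times more probable than H2")
```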
An important biological example in which common causes are preferred to separate causes can be found in Charles Darwin’s hypothesis that all present-day life traces back to one or a few original progenitors. Modern biologists are on the same page when they point to the near universality of the genetic code as strongly favouring the hypothesis of universal common ancestry over the hypothesis of multiple ancestors. The shared code would be a surprising coincidence if different groups of organisms stemmed from different start-ups. It would be much more probable if all current life traced back to a single origination.
According to the third parsimony paradigm, parsimony is relevant to estimating how accurately a model will predict new observations. A central result in the part of statistics called ‘model selection theory’ is due to Hirotugu Akaike, who proved a surprising theorem that demonstrated this relevance. This theorem is the basis of a model evaluation criterion that came to be called AIC (the Akaike Information Criterion). AIC says that a model’s ability to predict new data can be estimated by seeing how well it fits old data and by seeing how simple it is.
Here’s an example. You are driving down a country road late in the summer and notice that there are two huge fields of corn, one on each side of the road. You stop your car and sample 100 corn plants from each field. You find that the average height in the first sample is 52 inches and the average height in the second sample is 56 inches. Since it is late in the growing season, you assume that the average heights in the two huge fields will not change over the next few days. You plan to return to the two fields tomorrow and sample 100 corn plants from each. Which of the following two predictions do you think will be more accurate?
Prediction A: the 100 plants you sample tomorrow from the first population will average 52 inches and the 100 plants you sample tomorrow from the second will average 56 inches.
Prediction B: each of the two samples will average 54 inches.
Model selection theory says that this problem can be solved by considering the following two models of the average heights in the two populations:
DIFF: the average height in the first population = h1, and the average height in the second population = h2.
NULL: the average height in the first population = the average height in the second population = h.
Neither model says what the values are of h1, h2, and h; these are called ‘adjustable parameters’. The NULL model has that name because it says that the two populations do not differ in their average heights. The name I give to the DIFF model is a little misleading, since the model doesn’t say that the two populations differ in their average heights. DIFF allows for that possibility, but it also allows that the two populations might have the same average height.
What do DIFF and NULL predict about the data you will draw from the two fields tomorrow? The models on their own don’t provide numbers. However, you can fit each model to your old data by estimating the values of the adjustable parameters (h1, h2, and h) in the two models. The result is the following two fitted models:
f(DIFF): h1 = 52 inches, and h2 = 56 inches.
f(NULL): h = 54 inches.
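The fitted values come straight from the data: in the usual maximum-likelihood treatment, the estimates are just the sample averages, and for f(NULL) the two equal-sized samples are pooled:

```latex
\hat{h} \;=\; \frac{100 \times 52 \;+\; 100 \times 56}{200} \;=\; 54 \text{ inches}
```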
The question of which model will more accurately predict new data is interpreted to mean: which model, when fitted to the old data you have, will more accurately predict the new data that you do not yet have?
DIFF, you might be thinking, has got to be true. And NULL, you might also be thinking, must be false. What are the odds that two huge populations of corn plants should have exactly the same average heights? If your goal were to say which of the two models is true and which is false, you’d be done. But that is not the problem at hand. Rather, you want to evaluate the two models for their predictive accuracies. One of the surprising facts about models such as NULL and DIFF is that a model known to be false will sometimes make more accurate predictions than a model known to be true. NULL, though false, might be close to the truth. If it is, you might be better off using NULL to predict new data, rather than using DIFF to make your prediction. After all, the old data might be unrepresentative! NULL keeps you to the straight and narrow; DIFF invites you to stray.
The Akaike Information Criterion evaluates NULL and DIFF by taking account of two facts: f(DIFF) fits the old data better than f(NULL) does, and DIFF is more complex than NULL. Here the complexity of a model is the number of adjustable parameters the model contains. As I mentioned, AIC is based on Akaike’s theorem, which can be described informally as follows:
An unbiased estimate of the predictive accuracy of model M = [how well f(M) fits the old data] minus [the number of adjustable parameters M contains].
A mathematical result, therefore, can establish that parsimony is relevant to estimating predictive accuracy.
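Here is how that calculation goes for the corn example, in a short Python sketch. The individual plant heights are simulated; the 5-inch spread and the treatment of the variance as known are my simplifying assumptions. DIFF is fitted with the two sample means, NULL with the pooled mean, and each model’s score is its log-likelihood fit minus its number of adjustable parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "old" data: 100 plants per field. The means echo the example;
# the 5-inch spread is an invented figure, treated as known throughout.
field1 = rng.normal(52, 5, 100)
field2 = rng.normal(56, 5, 100)
sigma = 5.0  # only the means are adjustable parameters

def gaussian_loglik(data, mean):
    """Log-likelihood of the data under a normal model with the given mean and spread sigma."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - (data - mean) ** 2 / (2 * sigma**2)))

# Fit each model by plugging in its maximum-likelihood estimates.
loglik_diff = gaussian_loglik(field1, field1.mean()) + gaussian_loglik(field2, field2.mean())
pooled_mean = np.concatenate([field1, field2]).mean()
loglik_null = gaussian_loglik(field1, pooled_mean) + gaussian_loglik(field2, pooled_mean)

# Akaike's estimate of predictive accuracy: fit minus number of adjustable parameters.
score_diff = loglik_diff - 2   # DIFF has two adjustable parameters (h1 and h2)
score_null = loglik_null - 1   # NULL has one adjustable parameter (h)

print(f"DIFF: fit = {loglik_diff:.1f}, estimated predictive accuracy = {score_diff:.1f}")
print(f"NULL: fit = {loglik_null:.1f}, estimated predictive accuracy = {score_null:.1f}")
```

With sample means as far apart as 52 and 56 inches, DIFF’s extra fit swamps its one-parameter penalty and DIFF gets the higher score; had the two sample averages been nearly equal, NULL’s parsimony bonus could tip the balance the other way.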
Akaike’s theorem is a theorem, which means that it is derived from assumptions. There are three. The first is that old and new data sets are generated from the same underlying reality; this assumption is satisfied in our example if each population’s average height remains unchanged as the old and new data sets are drawn. The second assumption is that repeated estimates of each of the parameters in a model will form a bell-shaped distribution. The third assumption is that one of the competing models is true, or is close to the truth. That assumption is satisfied in the corn example, since either NULL or DIFF must be true.
Gaudí and Mies remind us that there is no disputing matters of taste when it comes to assessing the value of simplicity and complexity in works of art. Einstein and Newton say that science is different – simplicity, in science, is not a matter of taste. Reichenbach and Akaike provided some reasons for why this is so. The upshot is that there are three parsimony paradigms that explain how the simplicity of a theory can be relevant to saying what the world is like:
Paradigm 1: sometimes simpler theories have higher probabilities.
Paradigm 2: sometimes simpler theories are better supported by the observations.
Paradigm 3: sometimes the simplicity of a model is relevant to estimating its predictive accuracy.
These three paradigms have something important in common. Whether a given problem fits into any of them depends on empirical assumptions about the problem. Those assumptions might be true of some problems, but false of others. Although parsimony is demonstrably relevant to forming judgments about what the world is like, there is in the end no unconditional and presuppositionless justification for Ockham’s Razor.