# Death to the ritual of significance tests!

When most people think of what statistics is and what it offers, they think of significance tests. In fact, if you claim something to be statistically significant, you can rest assured that your audience will pay close attention. Ironically, the same people who greet statistics with the usual skeptical and often misquoted sayings are frequently firm believers that statistics can deliver a certainty that otherwise doesn't exist.

This applies to laymen and experts alike. Haller and Krauss, for example, studied what both the teachers and the students of statistics courses devoted to significance actually knew. The results were disastrous: the concept was thoroughly misunderstood, on both sides of the lectern. The textbooks, most of which present the matter inadequately, didn't help either. Even critics of significance tests rarely manage an enlightening explanation of the problem. And they phrase the essential point in such a watered-down manner that people never learn what they should actually do with significant results.

Absolutely NOTHING!

Let me try to explain the general principle of a significance test with a simple example. Let’s say that you run a supermarket chain. In each of your three stores, you tried a different approach to make buying margarine more attractive for shoppers. The following chart shows the daily margarine sales as well as the mean per store over a five-day workweek:

| | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Mean |
|---|---|---|---|---|---|---|
| Store 1 (normal placement) | 27 | 19 | 20 | 24 | 22 | 22 |
| Store 2 (cardboard display) | 46 | 44 | 42 | 39 | 40 | 42 |
| Store 3 (free tasting) | 78 | 58 | 34 | 32 | 28 | 46 |

As you can see, the means differ. Does this have anything to do with our marketing or not? The hypothesis says yes: the marketing activities cause the differences in sales.

The only problem is that significance tests don't look at this hypothesis at all. They only test the null hypothesis that there is no difference. And the three stores may differ in many other respects. The employees may be older or younger, the floor space larger or smaller, the average customer may spend more or less, the weather may differ on those days, and so on. The three stores are different. Period. And why shouldn't they be?

Yet this dull, implausible and futile assumption, namely that the stores are effectively identical, is exactly what you adopt whenever you test for significance. Significance only means that the observed differences are larger than chance alone would plausibly produce. And since the data cover just five workdays rather than a longer period, you have to expect differences in the means in any case.

It gets better. A non-significant result only means that we cannot reject the null hypothesis. The differences may be random, but they don't have to be. And the larger the sample, the easier it becomes to reach significance. In fact, with a large enough sample, practically everything is significant.
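The dependence on sample size can be sketched with a back-of-the-envelope calculation: hold a trivially small true difference fixed and watch the test statistic grow with the sample. The numbers below (a difference of 0.2 units in mean daily sales, a standard deviation of 10) are purely illustrative assumptions:

```python
import math

def z_stat(diff, sigma, n):
    """Two-sample z statistic for a mean difference, n observations per group."""
    return diff / (sigma * math.sqrt(2 / n))

# A fixed, practically irrelevant difference...
true_diff = 0.2
sigma = 10.0

# ...becomes "significant" (z > 1.96) once the sample is large enough.
for n in [100, 1_000, 10_000, 100_000]:
    z = z_stat(true_diff, sigma, n)
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n = {n:>7}: z = {z:5.2f} -> {verdict}")
```

The true difference never changes; only the sample grows. Significance therefore tells you more about how much data you collected than about whether the difference matters.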

It is not surprising that neither the fathers of significance testing (Fisher on one side, Neyman and Pearson on the other) nor other great statisticians spent much time on this ritual. The reason significance tests have become so important is that misinterpreting them has its appeal, not to mention its benefits: the misreading declares the research hypothesis proven as soon as the null hypothesis is rejected.

Opposition, however, is futile. As improbable as it may seem, significant results are a precondition for getting published in many scientific journals. Geoffrey Loftus, editor of the journal Memory and Cognition from 1994 to 1997, tried to change this practice. Instead of the useless null hypothesis ritual, he encouraged his authors to report how the data are scattered, which lets readers judge how representative a mean really is. His success was modest, and what he achieved was quickly undone by his successors.

None of this should mislead those of you who analyze business data: you have to draw your own conclusions. It cannot be said often enough. There is no statistical method that can do this for you.

For more information:

- Buttler, G., Statistisch getestet – Gütesiegel oder Etikettenschwindel?, in: Brachinger, H. W. et al. (eds.), Wirtschaftsstatistik: Festschrift zum 65. Geburtstag von Eberhard Schaich, München 2006, pp. 25–35.
- Gigerenzer, G., Mindless Statistics, The Journal of Socio-Economics 33 (2004), pp. 587–606.
- Haller, H., Krauss, S., Misinterpretations of Significance: A Problem Students Share with Their Teachers?, Methods of Psychological Research Online 7 (2002) 1.
- Krämer, W., Gigerenzer, G., How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities, Statistical Science 20 (2005) 3, pp. 223–230.
- Loftus, G. R., A Picture Is Worth a Thousand p Values: On the Irrelevance of Hypothesis Testing in the Computer Age, Behavior Research Methods, Instruments, & Computers 25 (1993) 2, pp. 250 ff.