Monday, 16 August 2010

Statistical significance explained in plain English

Warren Davies, a positive psychology MSc student at UEL, provides the latest in our ongoing series of guest features for students. Warren has just released a Psychology Study Guide, which covers information on statistics, research methods and study skills for psychology students.

Today I'm delighted to discuss an absolutely fascinating topic in psychology - statistical significance. I know you're as excited about this as I am!

Why is psychology a science? Why bother with complicated research methods and statistical analyses? The answer is that we want to be as sure as possible that our theories about the mind and behaviour are correct. These theories are important - many decisions in areas like psychotherapy, business and social policy depend on what psychologists say.

Despite the myriad rules and procedures of science, some research findings are pure flukes. Perhaps you're testing a new drug, and by chance alone, a large number of people spontaneously get better. The better your study is conducted, the lower the chance that your result was a fluke - but still, there is always a certain probability that it was.

In science we're always testing hypotheses. We never conduct a study to 'see what happens', because there's always at least one way to make any useless set of data look important. We take a risk; we put our idea on the line and expose it to potential refutation. Therefore, all statistical tests in psychology test the probability of obtaining your given set of results (and all those that are even more extreme) if your hypothesis were incorrect - that is, if the null hypothesis were true.

Say I create a loaded die that I believe will always roll a six. I’ve invited you round to my house tonight for a nice cup of tea and a spot of gambling. I plan to hustle you out of lots of money (don’t worry, we’re good friends and are always playing tricks like this on each other). Before you arrive I want to test my hypothesis that the die is loaded against my null hypothesis that it isn't.

I roll the die. A six. Success! But wait... there’s actually a 1 in 6 chance that I would have gotten this result even if the null hypothesis were correct. Not good enough. Better roll again. Another six! That’s more like it; there’s a 1 in 36 chance of getting two sixes, assuming the null hypothesis is correct.

The more sixes I roll, the lower the probability that my results came about by chance, and therefore the more confident I could be in rejecting the null hypothesis.

This is what statistical significance testing tells you - the probability that the result (and all those that are even more extreme) would have come about if the null hypothesis were true (in this case, if the die were truly random and not loaded). It's given as a value between 0 and 1, and labelled p. So p = .01 means a 1% chance of getting the results if the null hypothesis were true; p = .5 means 50% chance, p = .99 means 99%, and so on.
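
To put numbers on the die example, here is a minimal sketch in Python (assuming a fair six-sided die under the null hypothesis; in this case 'the result or anything more extreme' is just the run of sixes itself, since nothing more extreme is possible):

    import random

    def p_value_for_sixes(n_sixes):
        """Probability of rolling n_sixes sixes in a row if the die is fair."""
        return (1 / 6) ** n_sixes

    def simulate(n_sixes, trials=100_000):
        """Estimate the same probability by simulating a fair die."""
        hits = sum(
            all(random.randint(1, 6) == 6 for _ in range(n_sixes))
            for _ in range(trials)
        )
        return hits / trials

    for n in (1, 2, 3):
        print(n, "sixes: exact p =", round(p_value_for_sixes(n), 5),
              "simulated p =", simulate(n))

Two sixes already gives p = 1/36, or about .028.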

In psychology we usually look for p values lower than .05, or 5%. That's what you should look out for when reading journal papers. If there's less than a 5% chance of getting the result if the null hypothesis were true, a psychologist will be happy with that, and the result is more likely to get published.

Significance testing is not perfect, though. Remember this: 'Statistical significance is not psychological significance.' You must look at other things too; the effect size, the power, the theoretical underpinnings. Combined, they tell a story about how important the results are, and with time you'll get better and better at interpreting this story.

And that, in a nutshell, is what statistical significance is. Enthralling, isn't it?

--
Editor's note (07/09/2010): This post has been edited to correct for the fact that statistical significance pertains to the likelihood of a given set of results (and those even more extreme) being obtained if the null hypothesis were true, not to the probability that the hypothesis is correct, as was erroneously stated before. Sincere apologies for any confusion caused.

9 comments:

  1. Here is a link to my review of the book 'The Cult of Statistical Significance'. This book shows how many scientific disciplines rely way too much on the concept of statistical significance.

    Briefly, the critique is: Testing for statistical significance asks how likely it is that an effect exists. It tries to answer that question by looking at how precisely the effect can be measured. It does not answer at all how strong and important this effect is. And this latter question about the effect size is much more important from a scientific and a practical perspective. Statistical significance does not imply that an effect is important, and lack of statistical significance does not mean that an effect is unimportant.
    More explanation here: http://ow.ly/2q2Dw

  2. continuation of CV's comment:

    Furthermore, your probability value does not tell you the likelihood that your results were obtained by chance. Rather, it is the probability of obtaining your results, given the null hypothesis (see Kline, 2004). In plain English that means that we assume that the die is NOT loaded, and observe the probability of obtaining a run of sixes. Even in extreme cases, where we obtain six consecutive sixes, the result only means that we have observed an incredibly rare event (p = 0.00002) given our earlier assumption. The logic of significance testing does not permit inferences about the likelihood that observations were obtained by chance. Unfortunately, the 'obtained by chance' phrasing is the textbook definition, at least in the social sciences.

  3. Fergus Neville, 4:56 pm

    I disagree with the statement that "In science we're always testing hypotheses. We never conduct a study to 'see what happens'".

    Science is a process of hypothesis-formation as well as hypothesis-testing. Exploratory research (both quantitative and qualitative) is a vital part of the scientific process, and to label it unscientific contributes to the fetishisation of experimental method in psychology. By all means conduct experiments and test hypotheses, but you must ask the right questions to get the right answers...

  4. Anonymous, 10:30 am

    Although I applaud any effort to help students with difficult subjects, I believe psychology has been 'dumbed down' quite enough. I understand that from an ideological perspective ('yes-you-can-as-long-as-you-want-to'), as well as a financial one (e.g. in the Netherlands, universities get funding as a function of the number of students attracted vs the number of students finishing degrees), there is a hope that every single person in the world can and should study and finish psychology. In the end, however, some people just fail to understand something incredibly simple because they lack either the ability or the motivation. In my mind, they should not be helped to study, but helped to find something else to do with their life.

  5. An important distinction that's easy to miss is that "the probability I would have gotten this result, given that the null hypothesis is correct" is different from "the probability that the null hypothesis is correct, given that I got this result". There's a 1 in 36 chance of getting two sixes if the null hypothesis is correct, but this does not mean that the null hypothesis has a 1 in 36 chance of being correct if we get two sixes.

    For example, suppose that we've acquired a die from a specific manufacturer, and we know that the quality of the dice they make varies. More specifically, we know that out of every 10,000 dice they make, 100 are loaded and give only sixes. The remaining 9,900 dice in each shipment are fair.

    Let's say that we now roll a die that we know has come from one such shipment. It produces two sixes. Now, if it were one of the 100 loaded dice, then it would produce two consecutive sixes with certainty. But out of the 9,900 fair dice, 1/36 of them, or 275, will also happen to produce two sixes by chance. So if we rolled each die from the shipment twice, we would expect to see 375 series of two sixes. Of those 375 observations of two sixes, 100 would be caused by loaded dice, and 275 would be caused by fair dice.

    In other words, even though there was only a 1 in 36 chance of getting two sixes given a fair die, seeing two sixes only gives us a probability of 100/375 (about 0.27) of the die actually being loaded.
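
    A quick check of that arithmetic, as a throwaway Python sketch (the 100-in-10,000 figures are the hypothetical ones above):

        # Hypothetical shipment from the example: 100 loaded dice per 10,000.
        loaded = 100
        fair = 10_000 - loaded

        # Expected number of dice showing two sixes if each were rolled twice.
        from_loaded = loaded * 1.0     # a loaded die always shows a six
        from_fair = fair * (1 / 36)    # 9,900 / 36 = 275

        print("P(loaded | two sixes) =", from_loaded / (from_loaded + from_fair))  # ~0.27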

    It's often claimed (as it was in this post) that p values give us "the probability that the result would have come about by chance". But that's really only the case in a situation where the null hypothesis and the proposed hypothesis are equally likely a priori. This very rarely happens: most proposed hypotheses are false, so even a p value of .05 can still mean that there's a very high probability of the result being a chance event. A paper published in PLoS Medicine went as far as to claim that because of this, most published research findings are false.
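
    To put rough numbers on that last point (purely illustrative assumptions: 10% of the hypotheses tested are true, studies have 80% power, and the threshold is .05):

        prior_true = 0.10   # proportion of tested hypotheses that are actually true
        power = 0.80        # chance of detecting a real effect
        alpha = 0.05        # chance of a 'significant' result when there is no effect

        true_positives = prior_true * power
        false_positives = (1 - prior_true) * alpha
        ppv = true_positives / (true_positives + false_positives)
        print("P(hypothesis true | significant result) =", round(ppv, 2))  # 0.64

    With those assumptions, roughly a third of 'significant' results would still be false positives.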

    For more on this, see e.g. Jacob Cohen's 1994 paper.

  6. I'm glad that in this post you mention how statistical significance is not the same as actual significance to people. I think that's commonly confused.

    We want useful data after all.

  7. Apologies for the misunderstanding; in trying to make this concept as simple as possible to help people get the gist of it, I appear to have cut one too many corners. Let me clarify things here.

    Ed:

    "Editor's note (07/09/2010): This post has been edited to correct for the fact that the aim of statistical tests is to show the likelihood of a given set of results being obtained if the null hypothesis were true, not to test the probability that the hypothesis is correct, as was erroneously stated before. Sincere apologies for any confusion caused."

    This is the gist of it. To clarify though, the null hypothesis is never true. That's one of the issues with null hypothesis significance testing - it's founded on an assumption that is never correct.

    Even completely random samples will have slightly different mean values. To bring the means of two random samples closer together, you can use larger samples. The difference might be very small, but the larger the sample size you use, the more chance your results have of being statistically significant!

    So therein lies the problem. In psychokinesis research, you can read researchers claiming that tiny, tiny mean differences are psychologically significant because they are statistically significant - but they use tens of thousands of data points in their studies!
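
    Here's a rough sketch of that problem with made-up numbers (a fixed mean difference of 0.02 standard deviations, tested with a simple two-sample z-test as the samples grow):

        from math import erfc, sqrt

        def two_sided_p(mean_diff, sd, n_per_group):
            """Two-sided p-value for a difference in means (z-test, known SD)."""
            se = sd * sqrt(2 / n_per_group)   # standard error of the difference
            z = mean_diff / se
            return erfc(abs(z) / sqrt(2))

        # The same tiny difference becomes 'significant' once n is big enough.
        for n in (100, 1_000, 10_000, 100_000):
            print(f"n per group = {n:>7}: p = {two_sided_p(0.02, 1.0, n):.5f}")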


    Levi:

    "Rather, it is the probability of obtaining your results, given the null hypothesis"

    You're right, I should have phrased it this way to start with. But to clarify, it's the probability of obtaining your results - or more extreme results - assuming the null hypothesis is correct.


    Fergus:

    "Science is a process of hypothesis-formation as well as hypothesis-testing. Exploratory research (both quantitative and qualitative) is a vital part of the scientific process, and to label it unscientific contributes to the fetishisation of experimental method in psychology. By all means conduct experiments and test hypotheses, but you must ask the right questions to get the right answers..."

    Fair point. I should have said 'rarely', not never.

    Apologies again for any misunderstanding. This paper is good for more detailed information:

    http://www.uvm.edu/~bbeckage/Teaching/DataAnalysis/AssignedPapers/Cohen_1990.pdf

  8. Even now the statement isn't correct. Actually the probability of getting the exact result is almost always very small, because there are many, many possible outcomes, especially as the number of trials increases. In a binomial trial, for example, the probability of obtaining "the result observed", say k successes in n trials, is given by the binomial distribution choose(n,k) p^k (1-p)^(n-k). But that's not the number that the p-value calculates, which is rather that number plus all the probabilities more extreme than the one actually observed (the tail area). The reason that tail areas were adopted for these calculations is that the particular binomial probabilities get smaller and smaller as more and more data are observed.
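
    To make the distinction concrete, here is a small sketch with an arbitrary example (60 heads in 100 tosses of a fair coin):

        from math import comb

        def binom_pmf(k, n, p):
            """Probability of exactly k successes in n trials."""
            return comb(n, k) * p**k * (1 - p)**(n - k)

        def upper_tail(k, n, p):
            """Probability of k or more successes (the one-sided tail area)."""
            return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

        n, k, p = 100, 60, 0.5
        print("P(exactly 60 heads) =", binom_pmf(k, n, p))   # ~0.011
        print("P(60 or more heads) =", upper_tail(k, n, p))  # ~0.028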

    But this is why Bayesians object to tail-area tests of this sort, since they use not only the data that were actually observed, but also data that were not observed, but might have been (or might be observed in another run of an identical experiment which has not in fact been run). Thus it violates the Likelihood Principle.

    It is quite true that point null hypotheses are hardly ever true...even ones that are believable (such as Jim Berger and Mohan Delampady's example "I cannot improve the growth of my plants by talking to them") are likely to be affected by systematic biases introduced in the actual test situation (such as the fact that talking to your plants might involve breathing carbon dioxide onto them, which might slightly improve their growth). So the procedure of introducing point nulls only to try to get enough data to show that they are not exact is kind of foolish. As Herman Rubin has often said, you don't need any data at all to know that a point null hypothesis is false.

    A more sensible test would divide the effect sizes into one region that is practically insignificant and another that is practically significant. Bayesian tests such as those found in Berger and Delampady's paper can do this, since they apply not only to exact point nulls but also to nulls that are small but not zero. Alternatively, and possibly better, would be to reformulate the problem as a decision-theoretic one, since in practical terms the reason one often does testing is to inform actions to be taken, and that requires introducing a loss function that would depend on the actual effect size.
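
    A very rough sketch of that idea (not Berger and Delampady's actual procedure; just a normal posterior with invented numbers and a band of ±0.05 treated as practically insignificant):

        from math import erf, sqrt

        def normal_cdf(x, mean, sd):
            return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

        # Hypothetical posterior for the effect size: Normal(estimate, standard error),
        # roughly what a flat prior would give you.
        estimate, se = 0.03, 0.01
        band = 0.05   # effects smaller than this are treated as practically insignificant

        p_negligible = normal_cdf(band, estimate, se) - normal_cdf(-band, estimate, se)
        print("P(|effect| < 0.05) =", round(p_negligible, 3))    # ~0.977
        print("P(|effect| >= 0.05) =", round(1 - p_negligible, 3))

    Here the estimate sits three standard errors from zero, so a point-null test would call it 'significant', yet under these made-up numbers the effect is almost certainly too small to matter.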

  9. I've tweaked the clarification - thanks!

