A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, December 19, 2014

Observed power, and what to do if your editor asks for post-hoc power analyses

Observed power (or post-hoc power) is the statistical power of the test you have performed, based on the effect size estimate from your data. Statistical power is the probability of finding a statistical difference from 0 in your test (aka a ‘significant effect’), if there is a true difference to be found. Observed power differs from the true power of your test, because the true power depends on the true effect size you are examining. However, the true effect size is typically unknown, and therefore it is tempting to treat post-hoc power as if it is similar to the true power of your study. In this blog, I will explain why you should never calculate the observed power (except for blogs about why you should not use observed power). Observed power is a useless statistical concept, and at the end of the post, I’ll give a suggestion how to respond to editors who ask for post-hoc power analyses.

Observed (or post-hoc) power and p-values are directly related. Below, you can see a plot of observed p-values and observed power for 10000 simulated studies with approximately 50% power (the R code is included below). It looks like a curve, but the graph is basically a scatter plot of a large number of single observations that fall on a curve expressing the relation between observed power and p-values.

Below, you see a plot of p-values and observed power for 10000 simulated studies with approximately 90% power. Yes, that is exactly the same curve these observations fall on. The only difference is how often we actually observe high p-values (or have low observed power). You can see there are only a few observations with high p-values if we have high power (compared to medium power), but the curve stays exactly the same. I hope these two figures drive home the point of what it means that p-values and observed power are directly related: it means that you can directly convert your p-value to the observed power, regardless of your sample size or effect size.  

Let’s draw a vertical line at p = 0.05, and a horizontal line at 50% observed power. We can see below that the two lines meet exactly at the line visualizing the relationship between p-values and observed power. This means that anytime you observed a p-value of p = 0.05 in your data, your observed power will be 50% (in infinite sample sizes, in t-tests - Jake Westfall pointed me to this paper showing the values at smaller samples, and for F-tests with different degrees of freedom).

I noticed these facts about the relationship between observed power and p-values while playing around with simulated studies in R, but they are also explained in Hoenig& Heisey, 2001.

Some estimates (e.g., Cohen, 1962) put the average power of studies in psychology at 50%. What observed power can you expect, when you perform a lot of studies which have a true power of 50%? We know that the p-values we can expect should be split down the middle, with 50% being smaller than p = 0.05, and 50% being larger than p = 0.05. The graph below gives the p-value distribution for 100000 simulated independent t-tests:

The bar on the left are all (50.000 out of 100.000) test results with a p < 0.05. The observed power distribution is displayed below:

It is clear you can expect just about any observed power when the true power of your experiment is 50%. The distribution of observed power changes from positively skewed to negatively skewed as the true power increases (from 0 to 1), and when power is around 50% we observe a tipping point where there is a switch from a negatively skewed distribution to a positively skewed distribution. With slightly more power (e.g., 56%) the distribution becomes somewhat U-shaped, as can be seen in the figure below. I’m sure a mathematical statistician can explain the why and how of this distribution in more detail, but here I just wanted to show what it looks like, because I don’t know of any other sources of information where this distribution is reported (thanks to a reader, who in the comments points out Yuan & Maxwell, 2005 also discuss observed power distributions).

Editors asking for post-hoc power analyses

Editors sometimes ask researchers to report post-hoc power analyses when authors report a test that does not reveal a statistical difference from 0, and when authors want to conclude there is no effect. In such situations, editors would like to distinguish between true negatives (concluding there is no effect, when there is no effect) and false negatives (concluding there is no effect, when there actually is an effect, or a Type 2 error). As the preceding explanation of post-hoc power hopefully illustrates, reporting post-hoc power is nothing more than reporting the p-value in a different way, and will therefore not answer the question editors want to know.

Because you will always have low observed power when you report non-significant effects, you should never perform an observed or post-hoc power analysis, even if an editor requests it (feel free to link to this blog post). Instead, you should explain how likely it was to observe a significant effect, given your sample, and given an expected or small effect size. Perhaps this expected effect size can be derived from theoretical predictions, or you can define a smallest effect size of interest (e.g., you are interested in knowing whether an effect is larger than a ‘small’ effect of d < 0.3).

For example, if you collected 500 participants in an independent t-test, and did not observe an effect, you had more than 90% power to observe a small effect of d = 0.3. It is always possible that the true effect size is even smaller, or that your conclusion that there is no effect is a Type 2 error, and you should acknowledge this. At the same time, given your sample size, and assuming a certain true effect size, it might be most probable that there is no effect.


  1. 1. I would like to add that post-hoc power for a single statistical test is useless. However, post-hoc power provides valuable information for a set of independent statistical tests. I would not trust an article that reports 10 significant results when the median power is 60%. Even if median power is 60% and only 60% of results are significant, post-hoc power is informative. It would suggest that researchers have only a 60% chance to get a significant effect in a replication study and should increase power to make a replication effort worthwhile.

    2. The skewed distribution of observed power for true power unequal .50 was discussed in Yuan and Maxwell (2005) http://scholar.google.ca/scholar_url?url=http://irt.com.ne.kr/data/on_the_pos-ences.pdf&hl=en&sa=X&scisig=AAGBfm1W3wVWuS3MecdWMU0dEkoVal4U3A&oi=scholarr&ei=wZOUVNuXD67IsASZ-YKIDQ&ved=0CB0QgAMoADAA

    3. It was also discussed in Schimmack (2012) as a problem in the averaging of observed power as an estimate of true power and was the reason why the replicability index uses the median to estimate true (median) power of a set of studies. http://r-index.org/uploads/3/5/6/7/3567479/introduction_to_the_r-index__14-12-01.pdf

  2. You are right to focus attention on the effect size rather than power as an argument for "proving the null" (even if only suggestively); but we have a long way to go in agreeing what effect size is indeed too small to matter, when discredited examples like Rosenthal's aspirin study keep circulating.

    I would go further and say that a priori power analysis is useful for a direct replication; has only suggestive value for a conceptual replication; and is near-useless for the 90% (?) of published research that rests on finding a novel effect, even one such as a moderation by context that may include an incidental replication (the power needed for the interaction will have little to do with that needed for the main effect).

    1. Hi Roger, could you elaborate a little on "discredited examples like Rosenthal's aspirin study"? I am aware of the paper you're referring to but not clear on what is discredited.

    2. The way the effect size was calculated by Rosenthal has been discredited (well it may be harsh to say "discredited"). See: http://www.researchgate.net/publication/232494334_Is_psychological_research_really_as_good_as_medical_research_Effect_size_comparisons_between_psychology_and_medicine