A blog on statistics, methods, philosophy of science, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, May 30, 2014

The probability of p-values as a function of the statistical power of a test



I used to be really happy with any p-value smaller than .05, and very disappointed when p-values turned out to be higher than .05. Looking back, I realize I was suffering from a bipolar p-value disorder. Nowadays, I interpret p-values more evenly. Instead of a polar division between p-values above and below the .05 significance level, I use a gradual interpretation of p-values. As a consequence, p-values between .02 and .05 no longer convince me that something is going on. Let me explain.

In my previous blogpost, I explained how p-values can be calibrated to provide best-case posterior probabilities that H0 is true. High p-values leave quite something to be desired, with a p = .05 yielding, in the best-case scenario, a 71% probability that H1 is true (assuming H0 and H1 are a-priori equally likely). Here, I want to move beyond best-case scenarios. Instead of only looking at p-values, we are going to look at the likelihood that a p-value represents a true effect, given the power of the statistical test.

This blog post is again based on the paper by Sellke, Bayarri, & Berger (2001). The power of a statistical test that yields a specific p-value is determined by the size of the effect, the significance level, and the size of the sample. The more observations, and the larger the effect size, the higher the statistical power. The higher the statistical power, the higher the likelihood of observing a small p-value (e.g., p = .01) compared to a higher one (e.g., p = .04), assuming there is a true effect in the population. We can see this in the figure below. The top and bottom halves of the figure display the same information, but the scale showing the percentage of expected p-values differs (from 0-100 in the top, from 0-10 in the bottom, where the percentages for p-values between .00 and .01 are cut off). As the top pane illustrates, the probability of observing a p-value between 0.00 and 0.01 is more than twice as large if a test has 80% power, compared to when the test has only 50% power. In an extremely high powered experiment (e.g., 99% power) the p-value will be smaller than .01 in approximately 96% of the tests, and between 0.01 and 0.05 in only 3.5% of the tests.
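As a complement to the figure, here is a minimal sketch (my own illustration, not the code used to create the figure) that computes the probability of observing a p-value in a given interval for a given power; the simple two-sided z-test approximation is an assumption on my part:

```python
# Probability that a p-value falls in a given interval, for a two-sided z-test
# with a given power at alpha = .05 (a simple approximation of the figure).
from scipy.stats import norm

def p_interval_prob(p_low, p_high, power, alpha=0.05):
    """P(p_low < p < p_high | H1), given the test's power at the two-sided alpha."""
    # Noncentrality that yields the requested power at this alpha level
    ncp = norm.ppf(1 - alpha / 2) + norm.ppf(power)

    def p_smaller_than(x):
        crit = norm.ppf(1 - x / 2)
        return (1 - norm.cdf(crit - ncp)) + norm.cdf(-crit - ncp)

    return p_smaller_than(p_high) - p_smaller_than(p_low)

for power in (0.50, 0.80, 0.99):
    print(power,
          round(p_interval_prob(0.00, 0.01, power), 3),   # ~.27, ~.59, ~.96
          round(p_interval_prob(0.01, 0.05, power), 3))   # ~.23, ~.21, ~.035
```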



In general, the higher the statistical power of a test, the less likely it is to observe relatively high p-values (e.g., p > .02). As can be seen in the lower pane of the figure, in extremely high powered statistical tests (i.e., 99% power), the probability of observing a p-value between .02 and .03 is less than 1%. If there is no real effect in the population, and the power of the statistical test is 0% (i.e., there is no chance to observe a real effect), p-values are uniformly distributed. This means that every p-value is equally likely to be observed, and thus that 1% of the p-values will fall within the .02 and .03 interval. As a consequence, when a test with extremely high statistical power returns a p = .024, this outcome is more likely when the null hypothesis is true than when the alternative hypothesis is true (the bar for a p-value between .02 and .03 is higher when power = 0% than when power = 99%). In other words, such a statistical difference at the p < .05 level is surprising, assuming the null-hypothesis is true, but should nevertheless be interpreted as stronger support for the null-hypothesis than for the alternative hypothesis (we also explain this in Lakens & Evers, 2014).
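To make this concrete, here is a minimal sketch (again my own illustration, using the same two-sided z-test approximation as above) comparing the likelihood of a p-value between .02 and .03 under H0 with the likelihood under an alternative tested with 99% power:

```python
# Under H0, p-values are uniform, so P(.02 < p < .03 | H0) is exactly 1%.
# Under H1 with 99% power at alpha = .05 (two-sided z-test approximation), it is smaller.
from scipy.stats import norm

ncp = norm.ppf(1 - 0.05 / 2) + norm.ppf(0.99)   # noncentrality giving 99% power at alpha = .05

def p_smaller_than(x):
    """P(p < x | H1) for a two-sided z-test with noncentrality ncp."""
    crit = norm.ppf(1 - x / 2)
    return (1 - norm.cdf(crit - ncp)) + norm.cdf(-crit - ncp)

p_h1 = p_smaller_than(0.03) - p_smaller_than(0.02)   # below .01
p_h0 = 0.03 - 0.02                                   # exactly .01 under the uniform distribution
print(round(p_h1, 4), p_h0, round(p_h1 / p_h0, 2))   # ratio below 1: a p = .024 favors H0
```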

The fact that with increasing sample size, a result can at the same time be a statistical difference with p < .05, while also being stronger support for the null-hypothesis than for the alternative hypothesis, is known as Lindley’s paradox. This isn’t a true paradox: things just get more interesting to people if you call them a paradox. There are simply two different questions being asked. First, the probability of the data, assuming the null-hypothesis is true, or Pr(D|H0), is very low. Second, the probability of the alternative hypothesis, given the data, is lower than the probability of the null-hypothesis, or Pr(H1|D) < Pr(H0|D). Although it is often interpreted by advocates of Bayesian statistics as a demonstration of the ‘illogical status of significance testing’ (Rouder, Morey, Verhagen, Province, & Wagenmakers, in press), it is also an illustration of the consequences of using improper priors in Bayesian statistics (Robert, 2013).

An extension of these ideas is now more widely known in psychology as p-curve analysis (Simonsohn, Nelson, & Simmons, 2014, see www.p-curve.com). However, you can apply this logic (with care!) when subjectively evaluating single studies as well. In a well-powered study (with power = 80%), the odds of a statistical difference yielding a p-value smaller than .01 compared to a statistical difference between .01 and .05 are approximately 3 to 1. In general, the lower the p-value, the more the result supports the alternative hypothesis (but don't interpret p-values directly as support for H0 or H1, and always consider the prior probability of H0). Nevertheless, 'sensible p-values are related to weights of evidence' (Good, 1992), and the lower the p-value, the better. A p-value for a true effect can be higher than .03, but it is relatively unlikely that this happens often across multiple studies, especially when sample sizes are large. In small samples, there is a lot of variation in the data, and a relatively high percentage of higher p-values is possible (see the figure for 50% power). Remember that if studies have only 50% power, roughly 50% of the findings should also be non-significant.

The statistical reality explained above also means that in high-powered studies (e.g., with a power of .99, for example when you collect 400 participants, divided over 2 conditions in an independent t-test, and the effect size is d = .43), setting the significance level to .05 is not very logical. After all, p-values > .02 are not even more likely under the alternative hypothesis than under the null-hypothesis. Unlike my previous blog, where subjective priors were involved, this blog post is focused on the objective probability of observing p-values under the null hypothesis and the alternative hypothesis, as a function of power. It means that we need to stop using a fixed significance level of α = .05 for all our statistical tests, especially now that we are starting to collect larger samples. As Good (1992) remarks:
‘The real objection to p-values is not that they usually are utter nonsense, but rather that they can be highly misleading, especially if the value of N is not also taken into account and is large.’
How we can decide which significance level we should use, depending on our sample size, is a topic for a future blog post. By which I mean to say that I haven't completely figured out how it should be done. If you have, I'd appreciate a comment below.
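As an aside, the 99% power figure mentioned above is easy to check; a minimal sketch, assuming statsmodels' TTestIndPower (the specific library call is my choice, not something from the original post):

```python
# Power of an independent t-test with 200 participants per condition
# (400 in total), d = .43, and alpha = .05 (two-sided).
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.43, nobs1=200, alpha=0.05,
                              ratio=1.0, alternative='two-sided')
print(round(power, 3))  # approximately .99
```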

Tuesday, May 27, 2014

Prior probabilities and replicating 'surprising and unexpected' effects



Recently, people have wondered why researchers seem to have a special interest in replicating studies that demonstrated unexpected or surprising results. In this blog post, I will explain why, statistically speaking, this makes sense.

When we evaluate the likelihood that findings reflect real effects, we need to take the prior likelihood that the null-hypothesis is true into account. Null-hypothesis significance testing ignores this issue, because p-values give us the probability of observing the data (D), assuming H0 is true, or Pr(D|H0). If we want to know the probability that the null-hypothesis is true, given the data, or Pr(H0|D), we need Bayesian statistics. I generally like p-values, so I will not try to convince you to use Bayesian statistics (although it’s probably smart to educate yourself a little on the topic), but I will explain how you can use calibrated p-values to get a feel for the probability that H0 and H1 are true, given some data (see Sellke, Bayarri, & Berger, 2001). This nicely shows how p-values can be related to Bayes Factors (see also Good, 1992; Berger, 2003).

Everything I will talk about can be applied with the help of the nomogram below (taken from Held, 2010). On the left, we have the prior probability that H0 is true. For now, let’s assume the null hypothesis and the alternative hypothesis are equally likely (so the probability of H0 is 50%, and the probability of H1 is 50%). The middle line gives the observed p-value in a statistical test. It goes up to p = .37, and for a mathematical reason (the calibration is only defined for p-values smaller than 1/e ≈ .37) cannot be used for higher p-values. The right scale is the posterior probability of the null-hypothesis, from almost 0 (it is practically impossible that H0 is true) to a 50% probability that H0 is true (where 50% means that after we have performed a study, H0 and H1 are still equally likely to be true). By drawing straight lines between two of the scales, you can read off the corresponding value on the third scale. For example, assuming you think H0 and H1 are equally likely to be true before you begin (a prior probability of H0 of 50%), and you observe a p-value of .37, a straight line will bring us to a posterior probability for H0 of 50%, which means the likelihood that H0 or H1 is true has not changed, even though we have collected data.


If we observe a p = .049, which is a statistical difference with an alpha level of .05, the posterior likelihood that H0 is true is still a rather high 29%. The likelihood of the alternative hypothesis (H1) is 100% - 29% = 71%. This corresponds to a Bayes Factor in favor of H0 (which, with equal priors, equals the posterior odds Pr(H0|D)/Pr(H1|D)) of 0.40, or 2.5 to 1 odds against H0. Bayesians do not consider this strong enough support against H0 (instead, it should be at least 3 to 1 odds against H0). This might be a good moment to add that these calculations are a best-case scenario. The prior distribution under the alternative is chosen in a way that gives the strongest possible evidence against H0, so the real evidence against H0 is the value that follows from the nomogram, or weaker. Also, now that you’ve seen how easy it is to use the nomogram, I hope showing the Sellke et al. (2001) formula these calculations are based on won’t scare you away:
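Written out, the calibration is a lower bound on the Bayes factor in favor of H0, which (with equal priors) translates into a best-case posterior probability that H0 is true; a plain rendering of the Sellke et al. (2001) result:

```latex
% Sellke, Bayarri, & Berger (2001) calibration, valid for p < 1/e:
\underline{B}(p) = -e \, p \, \ln(p),
\qquad
\underline{\Pr}(H_0 \mid D) = \left( 1 + \frac{1}{-e \, p \, \ln(p)} \right)^{-1}.
% For p = .049 this gives B(p) ~ 0.40 and Pr(H0|D) ~ .29, as read off the nomogram.
```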


What if, a-priori, it seems the hypothesized alternative hypothesis is at least somewhat unlikely? This is a subjective judgment, and difficult to quantify, but you often see researchers themselves describe a result as ‘surprising’ or ‘unexpected’. Take a moment to think how likely H0 should be for a finding to be ‘surprising’ and ‘unexpected’. Let’s see what happens if you think the a-priori probability of H0 is 75% (or 3 to 1 odds for H0). Observing a p = .04 would in that instance lead to, at best, a 51% probability that H0 is true, and only a 49% probability that H1 is true. That means that even though the observed data are unlikely, assuming H0 is true (Pr(D|H0)), it is still more likely that H0 is true (Pr(H0|D)) than that H1 is true (Pr(H1|D)). I've made a spreadsheet you can use to perform these calculations (without any guarantees), in case you want to try out some different values of the prior probability and the observed p-value.
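For those who prefer code to spreadsheets, here is a minimal Python sketch of the same calculation (my own reconstruction of the Sellke et al. calibration, offered with the same lack of guarantees as the spreadsheet):

```python
# Best-case posterior probability that H0 is true, given a p-value and a prior
# probability for H0, using the Sellke, Bayarri, & Berger (2001) calibration.
import math

def calibrated_posterior_h0(p, prior_h0=0.5):
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration is only defined for 0 < p < 1/e")
    bf_h0 = -math.e * p * math.log(p)        # lower bound on the BF in favor of H0
    prior_odds = prior_h0 / (1 - prior_h0)   # prior odds H0 : H1
    post_odds = prior_odds * bf_h0           # posterior odds H0 : H1
    return post_odds / (1 + post_odds)

print(round(calibrated_posterior_h0(0.049, 0.50), 2))  # ~0.29, as in the nomogram example
print(round(calibrated_posterior_h0(0.04, 0.75), 2))   # ~0.51, the 'surprising finding' example
```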

With a prior probability of 50%, a p = .04 would give a posterior probability of 26%. To arrive at the same posterior probability of 26% with a prior probability for H0 of 75%, the p-value would need to be p = .009. In other words, with a decreasing a-priori likelihood, we need lower p-values to achieve a comparable posterior probability that H0 is true. This is why Lakens & Evers (2014, p. 284) stress that “When designing studies that examine an a priori unlikely hypothesis, power is even more important: Studies need large sample sizes, and significant findings should be followed by close replications.” To have a decent chance of observing a low enough p-value, you need to have a lot of statistical power. When reviewing studies that use the words 'unexpected' and 'surprising', be sure to check whether, given the a-priori likelihood of H0 (however subjective this assessment is), the p-values lead to a decent posterior probability that H1 is true. If we did this consistently and fairly, there would be a lot less complaining about effects that are 'sexy but unreliable'.

This statistical reality has as a consequence that, given two studies with equal sample sizes that yielded results with identical p-values, researchers who choose to replicate the more ‘unexpected and surprising’ finding are doing our science a favor. After all, that is the study where H0 still has the highest posterior likelihood, and thus the finding where the likelihood that H1 is true is still relatively low. Replicating the more uncertain result leads to the greatest increase in posterior likelihoods. You can disagree about which finding is subjectively judged to be a-priori less likely, but the choice to replicate a-priori less likely results (all else being equal) makes sense.

Friday, May 23, 2014

A pre-publication peer-review of the 'Feeling The Future' meta-analysis



I recently read a meta-analysis on precognition studies by Bem, Tressoldi, Rabeyron, and Duggan (available on SSRN). The authors conclude in the abstract: 'We can now report a metaanalysis of 90 experiments from 33 laboratories in 14 different countries which yielded an overall positive effect in excess of 6 sigma with an effect size (Hedges’ g) of 0.09, combined z = 6.33, p = 1.2 ×10^-10). A Bayesian analysis yielded a Bayes Factor of 1.24 × 10^9, greatly exceeding the criterion value of 100 for “decisive evidence” in favor of the experimental hypothesis (Jeffries, 1961).'

If precognition is true, this would be quite something. I think our science should be able to answer the question whether pre-cognition exists in an objective manner. My interest was drawn to the meta-analysis because the authors reported a p-curve analysis (recently developed by Simonsohn, Nelson, & Simmons, 2014), which is an excellent complement to traditional meta-analyses (see Lakens & Evers, 2014). The research area of precognition has been quite responsive to methodological and statistical improvements (see this site where parapsychologists can pre-register their experiments).

The p-curve analysis provided support of evidential value in a binomial test (no longer part of the 2.0 p-curve app), but not when the more sensitive chi-squared test was used (now the only remaining option). The authors did not provide a p-curve disclosure table, so I worked on one for a little bit to distract myself while my wife was in the hospital and I took time off from work (it's emotion regulation, researcher-style). My p-curve disclosure table is not intended to be final or complete, but it did lead me to some questions. I contacted the authors of the paper, who were just about as responsive and friendly as you could imagine, and responded incredibly fast to my questions - I jokingly said that if they wanted to respond any better, they'd have to answer my questions before I asked them ;).

First of all, I’d like to thank Patrizio Tressoldi for his extremely fast and helpful replies to my e-mails. Professor Bem will reply to some of the questions below later, but did not have time due to personal circumstances that happened to coincide with this blog post. I'd like to share the correspondence at this point, but will update it in the future with answers by Prof. Bem. I also want to thank many other scholars who have provided answers and discussed this topic with me - the internet makes scientific discussions so much better. I'm also grateful to the researchers who replied to Patrizio Tressoldi's questions (answers included below). Patrizio and I both happen to share a view on scientific communication not unlike that outlined in 'Scientific Utopia' by Brian Nosek and Yoav Bar-Anan. I think this pre-publication peer review of a paper posted on SSRN might be an example of a step in that direction.
Below are some questions concerning the manuscript ‘Feeling the Future: A Meta-analysis of 90 Experiments on the Anomalous Anticipation of Random Future Events’ by Bem, Tressoldi, Rabeyron, & Duggan. My questions below concern only a small part of the meta-analysis, primarily those studies that were included in the p-curve analysis reported in the manuscript. I did not read all unpublished studies included in the meta-analysis, but will make some more general comments about the way the meta-analysis is performed. 

The authors do not provide a p-curve disclosure table (as recommended by Simonsohn, Nelson, & Simmons, 2014). I’ve extended the database by Bem and colleagues to include a p-curve disclosure table. In total, I’ve included 18 significant p-values (Bem and colleagues mention 21 significant values). I’ve included a different p-value for study 6 by Bem (2011) and Bierman (2011). I’ve also excluded the p-value by Traxler, which accounts for the 3 different p-values (these choices are explained below).

In my p-curve analysis, the results are inconclusive, since we can neither conclude there is evidential value (χ²(36)=45.52, p=.13) nor that they lack evidential value (χ²(36)=43.5, p=.1824). It is clear from the graph there is no indication of p-hacking.
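For readers unfamiliar with where these chi-squared values come from, here is a minimal sketch of the right-skew (evidential value) test as I understand it; this is my own reconstruction, not the p-curve app's code, and the p-values in the example are hypothetical:

```python
# Sketch of the p-curve test for right skew (evidential value): each significant
# p-value is converted to a pp-value (p / .05, its probability under a uniform
# p-curve conditional on p < .05), and the pp-values are combined with Fisher's
# method into a chi-square with 2k degrees of freedom. The test against a
# 33%-power alternative works analogously, with pp-values computed under that
# alternative instead of under the uniform distribution.
import math
from scipy.stats import chi2

def p_curve_right_skew(p_values, alpha=0.05):
    pps = [p / alpha for p in p_values if p < alpha]      # pp-values under H0
    statistic = -2 * sum(math.log(pp) for pp in pps)      # Fisher's method
    df = 2 * len(pps)
    return statistic, df, chi2.sf(statistic, df)

# Hypothetical example (not the actual 18 p-values from my disclosure table):
print(p_curve_right_skew([0.004, 0.01, 0.02, 0.03, 0.045]))
```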


An inconclusive result with 18 p-values is not encouraging. The analysis has substantial power (as Simonsohn et al. note, ‘With 20 p-values, it is virtually guaranteed to detect evidential value, even when the set of studies is powered at just 50%’). The lack of evidential value is mainly due to the fact that Psi researchers have failed to perform studies of high informational value (e.g., with large samples) that, if the Psi effect exists, should return primarily p-values smaller than .01 (for an exception, discussed below, see Maier et al., in press). Running high-powered studies is a lot of effort. Perhaps psi-researchers can benefit from a more coordinated and pre-registered research effort similar to the many labs approach (https://osf.io/wx7ck/). The current approach seems not to lead to studies that can provide conclusive support for their hypothesis. The p-curve is a new tool, and although Nelson et al. (2014) argue it might be better than traditional meta-analyses, we should see comparisons of p-curve analyses and meta-analyses in the future.

Reply Tressoldi: The main use of p-curve is a  control of p-hacking procedure, certainly not a stat to summarize quantitatively the evidence available related to a given phenomenon given the well-known limitations of the p values.  The parameters obtained by the meta-analyses based on all available studies, are surely more apt to this demonstration.

For the meta-analysis, the authors simply state that ‘Effect sizes (Hedges’ g) and their standard errors were computed from t test values and sample sizes.’ However, for within-subject designs, such as the priming effects, the standard error can only be calculated when the correlation between the two means that are compared is known. This correlation is practically never provided, and thus one would expect some information about how these correlations were estimated when the raw data was not available.

Reply Tressoldi: We probably refer to different formula to  calculate the ES. That used by the Comprehensive Meta-Analysis software we used, is t/Sqr(N) multiplied by the Hedges’ correction factor. It gives the same result of using independent t-test assuming a correlation of 0.50.

(comment DL - the assumption of a correlation of 0.50 is suboptimal, I think).
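For readers following along, this is how I understand the effect-size computation described in the reply; a minimal sketch of my own, not the Comprehensive Meta-Analysis software's actual code, and the numbers in the example are made up:

```python
# Hedges' g for a within-subject design, computed from the t value and sample
# size as described above: dz = t / sqrt(N), multiplied by Hedges' small-sample
# correction factor (using df = N - 1).
import math

def hedges_g_within(t, n):
    dz = t / math.sqrt(n)               # Cohen's dz
    j = 1 - 3 / (4 * (n - 1) - 1)       # Hedges' correction factor
    return dz * j

print(round(hedges_g_within(2.0, 100), 3))  # made-up t and N, for illustration
```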

One might wonder whether the authors have really emptied the file-drawer. Most likely not. For example, a master's thesis under supervision of Lobach and Bierman at the University of Amsterdam (http://dare.uva.nl/document/214097) did not reveal a Psi effect, but is not included in the analysis.

Reply Tressoldi: this study is related to the so called predictive physiological anticipatory effects summarized and discussed in the Mossbridge et al. (2012) meta-analysis cited on pag. 5.

Similarly, Bem himself contributes only studies with effect sizes that are equal to or higher than the meta-analytic effect size, despite the fact that Schimmack (2012) has noted that it is incredibly unlikely anyone would observe such a set of studies without having additional studies in a filedrawer. That this issue is not explicitly addressed in the current meta-analysis is surprising and worrisome.

Prof. Bem will respond to this later.

If we look at which effects are included in the meta-analysis, we can see surprising inconsistencies. Boer & Bierman (2006) observed an effect for positive, but not for negative pictures. Bem finds an effect for erotic pictures, but not for positive pictures. The question is which effect is analyzed: if it’s the effect of positive non-erotic pictures, the study by Bem should be excluded; if it’s erotic pictures, the study by Boer & Bierman should be excluded. If the authors want to analyze psi effects for all positive pictures (erotic or not), both the erotic and the positive pictures by Bem should be included. Now, the authors of the meta-analysis are cherry-picking significant results that share nothing (except their significance level). This hugely increases the overall effect size in the meta-analysis, and is extremely inconsistent. A better way to perform this meta-analysis would be to include all stimuli (positive, negative, erotic, etc.) in a combined score, or to perform separate meta-analyses for each picture type. Now, it is unclear what is being meta-analyzed, but it looks like a meta-analysis of picture types for which significant effects were observed, while the meta-analysis should be about a specific picture type, or all picture types. This makes the meta-analysis flawed as it is.

Reply Tressoldi: Boer & Bierman (2006) used a retro-priming whereas Bem exp. 1 used a reward protocol. As you can see in our Table 2, we considered seven different protocols. Even if all seven protocols test a similar hypothesis, their results are quite different as expected. They ESs range  from negative using the retro-practice for reading speed, to 0.14 for the detection of reinforcement. Furthermore with only two exceptions (retroactive practice and the detection of reinforcement) all other ES show a random effect, caused by probable moderator variables, we did not analyzed further given the low number of studies, but that future investigation must take in account.

Bierman reply: The mean response time in the retro conditon for negative valence is 740 while the control gives 735. The conclusion is of course that there is no priming effect. It doesn’t make sense to run a t-test if the effect is so small and not in the predicted direction.
Of course at the time we were not so sensitive for specifying precisely where we would expect the retro-active priming effect and I think the proper way to deal with this is to correct the p-value for selecting only the positive condition. On the other hand the p-value given is two-tailed which is in retrospect rather strange because priming generally results in faster response times so it is a directionally specified effect. The quoted p-value of 0.004 two tailed should have been presented as 0.002 one-tailed but should have been corrected for the selection of the positive condition. So I would use a score corresponding to p=0.004 (this time one-tailed) (which corresponds to the present t-test value 2.98, my note)

Similarly, in the study by Starkie (2009) the authors include all 4 types of trials, and perform some (unspecified) averaging which yields a single effect size. In this case, there are 4 effects, provided by the same participants. These observations are not independent, but assuming the correct averaging function is applied, the question is whether this is desirable. For example, Bem (2011) argued neutral trials do not necessarily need to lead to an effect. The authors seem to follow this reasoning by not including the neutral trials from study 6 by Bem (2011), but ignore this reasoning by including the neutral trials in Starkie (2009). This inconsistency is again to the benefit of the authors' own hypothesis, given that excluding the neutral trials in the study by Starkie (2009) would substantially increase the effect size, which is in the direction opposite to the prediction, and would thus lower the meta-analytic Psi effect size.

Reply Tressoldi: We already clarified how we averaged multiple outcomes from a single study. As to Starkie’s data,  yes, you are right, we added the results of the neutral stimuli increasing the negative effect (with respect to the alternative hypothesis). We will correct this result in the revision of our paper.

In Bem, experiment 6, significant effects are observed for negative and erotic stimuli (in opposite directions, without any theoretical explanation for the reduction in the hit-rate for erotic stimuli). The tests for negative (51.8%, t(149) = 1.80, p = .037, d = 0.15, binomial z = 1.74, p = .041) and erotic (48.2%, t(149) = 1.77, p = .039, d = 0.14, binomial z = 1.74, p = .041) stimuli should be included (as is the case in the other experiments in the meta-analysis, where the tests of deviations against guessing average are included). Instead, the authors include the test of the difference between positive and erotic pictures. This is a different effect, and should not be included in the meta-analysis (indeed, they correctly do not use a comparable difference test reported in Study 1 in Bem, 2011). Moreover, the two separate effects have a Cohen’s dz of around 0.145 each, but the difference score has a Cohen’s dz of 0.19 – this inflates the ES.

Prof. Bem will respond to this later.

An even worse inflation of the effect size is observed in Morris (2004). Here, the effects for negative, erotic, and neutral trials with effect sizes of dz = 0.31, dz = 0.14, and dz = 0.32 are combined in some (unspecified) way to yield an effect size of dz = 0.447. That is obviously plain wrong – combining three effects meta-analytically cannot return a substantially higher effect size.

Reply Tressoldi: Yes, you are correct. We averaged erroneously the t-test values. We corrected the database and all corresponding analyses with the correct value 2.02 (averaging the combining erotic and negative trials with the boredom effect).

Bierman (2011) reported a non-significant psi-effect of t(168) = 1.41, p = 0.08 one-tailed. He reported additional exploratory analyses where all individuals who had outliers on more than 9 (out of 32) trials were removed. No justification for this ad-hoc criterion is given (note that normal outlier removal procedures, such as removing responses more than 3 SD from the mean response time, were already applied). The authors nevertheless choose to include this test (t(152) = 1.97, p = 0.026, one-tailed) in their meta-analysis, which inflates the overall ES. A better (and more conservative) approach is to include the effect size of the test that includes all individuals.

Reply Tressoldi: we emailed this comment to Bierman, and he agreed with your proposal. Consequently we will update our database.

The authors do make one inconsistent choice that is against their own hypothesis. In Traxler et al (Exp 1b) they include an analysis over items, which is strongly in the opposite direction of the predicted effect, while they should actually have included the (not significant) analysis over participants (as I have done in my p-curve analysis).

Reply Tressoldi: The choice to include item vs participants analysis is always debated. Psycholinguistics prefer the first one, others the second one. We wrote to Traxler and now we corrected the data using the by participants stats. You are correct that in our p-curve analysis we erroneously added this negative effect due to a bug of the apps that does not take in account the sign of the stats.

The authors state that ‘Like many experiments in psychology, psi experiments tend to produce effect sizes that Cohen (1988) would classify as “small” (d ≈ 0.2).’ However, Cohen’s dz cannot be interpreted following the guidelines by Cohen, and the statement by the authors is misleading. 

Reply Tressoldi: Misleading to what? We agree that Cohen’s classification is arbitrary and each ES must be interpreted within its context given that sometime “small differences = big effects" and viceversa.

(Comment DL - From my effect size primer: As Olejnik and Algina (2003) explain for eta squared (but the same is true for Cohen's dz), these benchmarks were developed for comparisons between unrestricted populations (e.g., men vs. women), and using these benchmarks when interpreting the effect size in designs that include covariates or repeated measures is not consistent with the considerations upon which the benchmarks were based.)

On the ‘Franklin’ tab of the spreadsheet, it looks like 10 studies are analyzed which all show effects in the expected direction. If so, that is incredibly unlikely. Other included studies, such as http://www.chronos.msu.ru/old/EREPORTS/polikarpov_EXPERIMENTS_OF_D.BEM.pdf, are of too low quality to be included. The authors would do well to more clearly explain which data are included in the meta-analysis, and provide some more openness about the unpublished materials.

Reply Tressoldi: all materials are available upon request. As to Franklin data, we had a rich correspondence with him and he sent us the more “conservative” results. As to Fontana et al. study, we obtained the raw data from the authors we analyzed independently.

Finally, I have been in e-mail contact with Alexander Batthyany.
[Note that an earlier version of this blog post did not accurately reflect the comments made by Alexander Batthyany. I have corrected his comments below. My sincere apologies for this miscommunication].
The meta-analysis includes several of his data-sets, and one dataset by him with the comment ‘closed-desk but controlled’. He writes: ‘I can briefly tell you that the closed desk condition meant that there was an equal amount of positive and negative targets, and this is why I suggested to the authors of the meta-analysis that they should not include the study. Either consciously or unconsciously, subjects could have sensed whether more or less negative or positive stimuli would be left in the database, in which case their “guessing” would have been altered by expectancy effects (or a normal unconscious) bias. My idea to test this possibility was to see whether there were more hits towards the end for individual participants – I let Daryl Bem run this analysis and he said that there is no such trend.

Despite the absence of such a trend, I personally believe the difference between a closed-desk and an open-desk condition is worthwhile to discuss in a meta-analysis.

Prof. Bem will respond to this later.

Indeed, there is some additional support for this confound in the literature. Maier and colleagues (in press, p 10) write: ‘A third unsuccessful replication was obtained with another web-based study that was run after the first web study. Material, design, and procedure were the same as in the initial web study with two changes. Instead of trial randomization without replacement, a replacement procedure was used, i.e., the exact equal distribution of negative pictures to the left and right response key across the 60 trials was abandoned.’ 

Note that Maier and colleagues are the only research team to perform a-priori power analyses and run high-powered studies. With such high sample sizes, the outcome of studies should be stable. In other words, it would be very surprising if several studies with more than 95% power do not reveal an effect. Nevertheless, several studies with extremely high power did fail to reveal an effect. This is a very surprising outcome (although Maier and colleagues do not discuss it, instead opting to meta-analyze all results), and makes it fruitful to consider moderators indicated by procedural differences between studies. The equal distribution of pictures is very important to consider, as it could be an important confound, and there seems to be support for this confound in the literature.

Markus Maier and Vanessa Buechner reply: We agree with Dr. Lakens’ argument that studies with a high power should have a higher likelihood of finding the effect. However, his “high-power” argument does not apply to our second web study. Although most of our studies have been performed based on a priori power analyses the two web studies reported in our paper have not (see p. 10, participants section). For both web studies a data collection window of three months was pre-defined. This led to a sample size of 1222 individuals in web study 1 but only to 640 participants in web study 2 (we think that the subject pool was exhausted after a while). If we use the effect size of web study 1 (d = .07) to perform an a posteriori power calculation (G*power 3.1.3; Faul et al., 2007) of web study 2 we reach a power of .55. This means that web study 2 is significantly under-powered, what might explain the statistical trend that we found only. Web study 1 instead had an a posteriori power of .79.
Nevertheless, we admit that unsuccessful study 2 had an a priori power of 95% and yet did not reveal any significant effect. This is surprising and constitutes from a frequentist point of view a clear empirical failure of finding evidence for retro-causation. However, from a Bayesian perspective this study does neither provide evidence for H0 nor H1.

We also agree that procedural difference such as open vs. closed deck procedures should be experimentally explored in future studies, but we think that this variation does not explain whether a study in our series was significant or not. For example, in the successful studies 1 to 3 as well as unsuccessful studies 1 and 2 a closed deck procedure was used and in successful study 4 and unsuccessful study 3 an open deck procedure was applied.  This data pattern indicates that such a procedural difference is uncorrelated with the appearance of a retroactive avoidance effect. However, we do think that such a variation might explain effect size differences found in the studies (see our discussion of Study 4).

These are only some questions, and some answers, which I hope will benefit researchers interested in the topic of precognition. I present the responses without any 'concluding' comments from my side (except 2 short clarifications), because I believe other scientists should continue this discussion, and these points are only intermittent observations. It's interactions like these, between collaborative researchers who are trying to figure something out, that make me proud to be a scientist.