Mel Slater's Presence Blog

Thoughts about research and radical new applications of virtual reality - a place to write freely without the constraints of academic publishing, and have some fun.

I still find immersive virtual reality as thrilling now as when I first tried it 20 years ago.

14 November, 2016

The fallacy of the large sample size argument

Another bureaucratic solution to problems of science is being increasingly heard. This is the idea that papers that report experiments with small sample sizes should be "desk rejected". This is a relatively new phenomenon. Recently I have had two referees on two different papers mention this point. I have seen tweets suggesting that anything less than n = 100 won't do. This is supposed to contribute to the solution of the 'crisis of reproducibility' - especially in psychology and related disciplines.


Such proposals for bureaucratic solutions with fixed norms do not take account of elementary statistics. The sample size needed for estimation or hypothesis testing is relative to the variance of the variables involved. 

Take the simplest example - where a normally distributed random variable X has unknown mean m and known variance s^2. The experiment will deliver a sample of n independent observations on X. The goal is to estimate m with a 95% confidence interval, u to v. Critically, we want to choose a sample size n such that v - u < e, where e is the widest interval we are prepared to accept.

The 95% confidence interval is xbar ± 1.96*s/sqrt(n), where xbar is the sample mean, so its width is v - u = 2*1.96*s/sqrt(n) = 3.92s/sqrt(n).

Therefore we require:
3.92s/sqrt(n) < e, leading to n > (3.92s/e)^2.

For a fixed e, the sample size needed is proportional to the variance s^2. For example, suppose that s = e. Then (3.92)^2 is about 15.4, so a sample size of 16 is good enough. Suppose that s = 0.80*e; then a sample size of around 10 would be good enough. A sample size of 100 would be needed only if s were about 2.55 times e. It is clearly the variance in relation to the required precision that is important.
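The arithmetic above is easy to check in a few lines of Python. This is only a sketch of the formula n > (3.92s/e)^2 from the example; the function name min_sample_size is mine, and e is set to 1 so that s is expressed as a multiple of e.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile, as used in the example above


def min_sample_size(s: float, e: float, z: float = Z_95) -> int:
    """Smallest integer n for which the interval width 2*z*s/sqrt(n) is below e."""
    bound = (2 * z * s / e) ** 2   # n must be strictly greater than this
    return math.floor(bound) + 1


# s expressed as a multiple of e (e = 1): the three cases from the text
for ratio in (0.8, 1.0, 2.55):
    print(f"s = {ratio} * e  ->  n >= {min_sample_size(ratio, 1.0)}")
```

Doubling s relative to e quadruples the required n, which is the whole point: the sample size cannot be fixed without reference to the variance.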

This really matters because running experiments is typically very expensive. If a sample size of 10 would do why use 100? It would only be to satisfy a slogan, not for any real contribution to more reliable statistical inference.

The supposed failure of reproducibility is built into the system and will not be overcome by adopting slogans. In classical statistical inference with 5% significance levels, the theory itself says that a true null hypothesis will be wrongly rejected about 5% of the time. No researcher can know if their experiment is one of the 5%! This is the point of repeat studies. Yet repeat studies are hard to publish (no novelty), and if they do find results at odds with the original study then the researchers on the original study may have their integrity questioned. Yet who can know whether the second study is one of the 20% that will fail to reject a false null hypothesis (assuming a power of 80%)!
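The 5% and 20% figures can be seen directly in a small simulation. This is a rough sketch under assumptions of my own choosing, not an analysis of any real study: a two-sided z-test of H0: m = 0 with known s, n = 16, and an effect size picked so that the theoretical power is about 80%.

```python
import math
import random


def rejects_h0(sample, s, z_crit=1.96):
    """Two-sided z-test of H0: mean = 0, with known standard deviation s."""
    n = len(sample)
    z = (sum(sample) / n) / (s / math.sqrt(n))
    return abs(z) > z_crit


def rejection_rate(true_mean, s, n, trials, rng):
    """Fraction of simulated experiments in which H0 is rejected."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, s) for _ in range(n)]
        hits += rejects_h0(sample, s)
    return hits / trials


rng = random.Random(0)           # fixed seed so the run is repeatable
n, s, trials = 16, 1.0, 20000    # illustrative values, not from any real study

# Effect size giving roughly 80% power: (1.96 + 0.84) * s / sqrt(n)
effect = 2.8 * s / math.sqrt(n)

false_positive_rate = rejection_rate(0.0, s, n, trials, rng)   # H0 is true
power = rejection_rate(effect, s, n, trials, rng)              # H0 is false

print(f"H0 true:  rejected in {false_positive_rate:.1%} of experiments")
print(f"H0 false: rejected in {power:.1%} of experiments")
```

Each simulated experiment is perfectly well conducted, yet roughly one in twenty rejects a true null and roughly one in five misses a real effect. Nothing inside a single experiment tells the researcher which group it belongs to.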

The Bayesian approach does not have these problems. An experiment will result in probabilities about hypotheses or probability distributions over parameters rather than fixed answers or conclusions. As more data are collected through replication studies, these probabilities are updated. Also it should be noted that an argument similar to the one in the example above will hold for finding a Bayesian credible interval.
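To make the updating idea concrete, here is a minimal sketch for the same setting as the example (normal data, known variance s^2), assuming a normal prior on m. The prior, the two data batches and the function names are all invented for illustration.

```python
import math


def update_normal_mean(prior_mean, prior_var, data, s2):
    """Posterior (mean, variance) of m for normal data with known variance s2,
    starting from a normal prior on m (standard conjugate update)."""
    n = len(data)
    post_precision = 1.0 / prior_var + n / s2
    post_mean = (prior_mean / prior_var + sum(data) / s2) / post_precision
    return post_mean, 1.0 / post_precision


def credible_interval_95(mean, var):
    """Central 95% credible interval of a normal posterior."""
    half_width = 1.96 * math.sqrt(var)
    return mean - half_width, mean + half_width


s2 = 1.0                 # known variance (assumed)
prior = (0.0, 100.0)     # vague prior on m: mean 0, variance 100 (assumed)

# Two made-up batches standing in for an original study and a replication
batch_a = [0.9, 1.4, 0.3, 1.1, 0.7, 1.6, 1.2, 0.8]
batch_b = [1.0, 0.5, 1.3, 0.9, 1.5, 0.6, 1.1, 1.0]

after_a = update_normal_mean(*prior, batch_a, s2)
after_b = update_normal_mean(*after_a, batch_b, s2)      # yesterday's posterior is today's prior
pooled = update_normal_mean(*prior, batch_a + batch_b, s2)

low, high = credible_interval_95(*after_b)
print(f"posterior after A:       mean {after_a[0]:.3f}, variance {after_a[1]:.4f}")
print(f"posterior after A and B: mean {after_b[0]:.3f}, variance {after_b[1]:.4f}")
print(f"95% credible interval:   {low:.3f} to {high:.3f}")
```

Updating on A and then B gives the same posterior as updating once on the pooled data, and with a vague prior the interval width is essentially 3.92s/sqrt(n) again. So the replication simply narrows the distribution rather than confirming or overturning a verdict.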

To paraphrase a well known saying: "It's the variance, stupid!" 

2 Comments:

Blogger Não said...

Nice post. However, I'd say that it is in the hands of the researcher to replicate her own results within her lab before publication. If she can indeed replicate them, the probability that she got p<0.05 twice by chance decreases dramatically (1 in 400).

Also, sample size estimations are problematic and incomplete, as one has to take a huge step in assuming the variance of the population, which is itself something you can reasonably estimate only by running experiments.

In conclusion, having a "slogan" might be a good idea, because it is only that - a slogan. Nobody will reject your paper if you have 40 subjects, because the slogan says 100.

1:08 pm  
Blogger Mel Slater said...

Hi, thanks for your post.

I completely agree that replication is vital. But there has to be something to replicate. So there has to be a first experiment.
Now if you do experiment A with sample size n1, and then replicate it with experiment B with sample size n2, and then publish that as your first paper, in fact you have still only done one experiment with sample size n1+n2. So this doesn't really solve the problem, since (i) that needs to be replicated and (ii) it doesn't say anything about the issue of registration.


Moreover the argument that P = 0.05 on A and P = 0.05 on B, means overall P = 0.0025 isn't quite correct, since the multiplication assumes independence.
P(reject H0 in A | H0 is 'true') * P(reject H0 in B | H0 is 'true') does not equal P(reject H0 in A and B combined | H0 is 'true').

The problem with a "slogan" is when people start to believe that what it says is true: i.e., the more it is said the more that it must be true. It strays into the world of politics unfortunately.

My main point is that imposing bureaucratic solutions to scientific problems cannot possibly be an answer.
On the other hand if people want to register their experimental designs I don't think it is a problem. However, the problem comes in registering *analyses* - since as I tried to illustrate, real data is a mess, and there are typically (and ideally) new discoveries in data that wipe out the validity of the originally planned analysis.

Also it should be noted that much of the problem is inherent in the paradigm for analysis. If experiment A rejects the null hypothesis at 5%, a certain finding has gone into the scientific literature. But then there is a replication B which does not reject the null hypothesis. So now the original finding is put into doubt. Somehow blame is then put on the people who did A. But inherent in the method is that we know that 5% of the time there will be a false positive. But how were the researchers in A supposed to know that theirs was one of the 5%? Or turning this around - suppose that the power of B was the conventional 80%. How can the researchers of B know that they weren't in the unlucky 20% that failed to reject H0 even though it is false?

I personally find the Bayesian approach more attractive, where each new batch of data simply updates your probability for the hypothesis in question, and doesn't lead to a 'decision' (reject or not) - rather it leads to information. If a decision MUST be made one way or another, then it should be put in the context of decision theory which also takes into account relative costs.

4:29 pm  
