The percent difference fallacy and a solution: the ratio t-test

September 12, 2018 – 12:26 pm

One of the most common forms of statistical malpractice I see as a reviewer and reader of scientific literature is using a null hypothesis statistical test (NHST) to declare an effect “significant” and then using this as justification for reporting completely unqualified percentage differences among treatments with no attempt to deal with the associated uncertainty. We might call this the percent difference fallacy.

It’s as if when p < 0.05 is achieved, suddenly our estimates of percent difference among treatments are accurate to 2-4 significant digits! Here’s an example I came across in a recent issue of Ecology.

Survival rates were 20.4% higher in the first cohort compared with the second (historical mean control) cohort (Dunnett’s test, p<0.001)

This is not from an article I would normally read. I just scanned recent articles for about 5 minutes till I found this one, but hopefully now that you know about it, it will annoy you as much as me when you start seeing it everywhere.

While this is not just a significant digits issue, the three digits presented in the above example (20.4%) represent what I think readers will often interpret as vastly better constraint on that value than is warranted. Even if the reader doesn’t interpret it as a value constrained to between 20.35 and 20.45%, I think most readers will underestimate how much greater the uncertainty is in most cases.

I’ll demonstrate with an R example (though you ought to still be able to follow here even if you are not an R user). Let’s say we have two treatments and the actual difference between them is that b is 50% larger than a, and there is some randomly distributed sampling error.

a <- 10 + rnorm(10)
b <- 15 + rnorm(10)
t.test(a,b)

This gives us output similar to the following:

Welch Two Sample t-test
data: a and b
t = -11.37, df = 17.997, p-value = 1.199e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.783967 -3.979782
sample estimates:
mean of x mean of y
10.05559 14.93747

Ok, so we have an absurdly significant p-value of 0.000000001199! The difference is “real” and the percent difference must be quite accurate, we might think to ourselves. So let’s calculate the percent difference: (14.93747-10.05559) / (10.05559) * 100 = 48.54892. Hm, that’s a lot of digits in both cases, so we’ll just report that treatment b is 48.5% higher than treatment a (p < 0.001). Boom, quantitative analysis is complete.
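For anyone following along in R, the arithmetic above can be written as a tiny function. This is only a sketch: the name pct_diff is made up for illustration and is not part of base R. Note that it returns a bare point estimate and says nothing about uncertainty, which is exactly the problem.

```r
# Hypothetical helper (name is illustrative, not from base R):
# percent difference of mean(y) relative to mean(x).
pct_diff <- function(x, y) {
  (mean(y) - mean(x)) / mean(x) * 100
}

# Plug in the two sample means printed by t.test() above
pct_diff(10.05559, 14.93747)
```

Passing the full vectors a and b instead of the two means gives the same kind of answer, since the function takes the means itself.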

But wait, wasn’t that difference supposed to be 50%? It was. And we got somewhere between 48 and 49% which is, let’s call it, “off by a bit” when we use the easy-on-the-eyes three significant digits system and these particular parameters. Keep in mind that in many studies, especially in ecology, variation is higher and sample sizes lower (in other words, not all p-values are 0.000000001199).

So this leads to a question: is there a good way to quantify this uncertainty that we are seeing in these percent differences? And it turns out for a simple problem like this that there is.

We can do a “ratio t-test” in which we take the logs of the data, run a t-test, and then un-log the values. This works because the difference between two logs is the log of the quotient, log(b) - log(a) = log(b/a), so un-logging a difference in log values gives a ratio on the original scale. Meta-analysis practitioners who want to synthesize a lot of treatment/control ratios often take advantage of this.

t.test(log(b), log(a))

Welch Two Sample t-test
data: log(b) and log(a)
t = 11.022, df = 16.039, p-value = 6.801e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3213321 0.4743339
sample estimates:
mean of x mean of y
2.701976 2.304143

Then we un-log and we can get a confidence interval for our percent difference:

exp(2.701976 - 2.304143) = 1.488595
exp(0.3213321) = 1.378963
exp(0.4743339) = 1.606943
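Rather than copying numbers out of the printed output by hand, the same un-logging can be done straight from the object that t.test() returns. Here is a rough sketch; the function name ratio_t_test and the shape of its return value are my own choices for illustration, and it assumes every observation is strictly positive so the logs exist.

```r
# Hypothetical helper (not from any package): ratio t-test of y relative to x,
# reported as percent differences. Assumes all values are strictly positive.
ratio_t_test <- function(x, y, conf.level = 0.95) {
  stopifnot(all(x > 0), all(y > 0))
  m <- t.test(log(y), log(x), conf.level = conf.level)
  # m$estimate holds the two means of the logs; their difference, un-logged,
  # is the estimated ratio y/x
  est <- exp(unname(m$estimate[1] - m$estimate[2]))
  # m$conf.int is the interval for the difference on the log scale
  ci <- exp(as.numeric(m$conf.int))
  c(estimate = (est - 1) * 100,
    lower    = (ci[1] - 1) * 100,
    upper    = (ci[2] - 1) * 100,
    p.value  = m$p.value)
}

set.seed(42)  # arbitrary seed, only so this sketch is reproducible
a <- 10 + rnorm(10)
b <- 15 + rnorm(10)
round(ratio_t_test(a, b), 1)
```

The numbers will differ from the ones printed above because the random draws are different, but the layout (estimate, lower, upper) is what you would want to report.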

So, that 1.488595 looks familiar, it’s close to the 48.5% we calculated before. Why is it not exactly the same? Because un-logging a mean of logs gives a geometric mean, so 1.488595 is the ratio of the two geometric means, while the 48.5% came from the ordinary arithmetic means. When the variation is small relative to the means, as it is here, you’ll find it’s always really close like this.
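A quick check shows where the small gap comes from. In the sketch below, exp(mean(log(x))) is the geometric mean of x, which is what the un-logged t-test estimate is built from, while the hand calculation used mean(x). The seed and variable names are arbitrary choices for this illustration.

```r
set.seed(42)  # arbitrary seed for this sketch
a <- 10 + rnorm(10)
b <- 15 + rnorm(10)

# Ratio of arithmetic means: what the hand-calculated percent difference uses
arith_ratio <- mean(b) / mean(a)

# Ratio of geometric means: what un-logging the difference in mean logs gives
geo_ratio <- exp(mean(log(b)) - mean(log(a)))

c(arithmetic = arith_ratio, geometric = geo_ratio)

# For positive values that are not all identical, the geometric mean
# is always smaller than the arithmetic mean
exp(mean(log(a))) < mean(a)
```

The two ratios should agree to the first couple of decimal places and then drift apart, just like 48.5% versus 1.488595 above.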

More interesting though is the 95% confidence interval: 37.8% to 60.7%. Pretty big! That third significant digit is starting to look ridiculous with what we now see is a window more than 20 percentage points wide. I would bet most scientists doing this type of analysis would not think it was so large when the p-value was so small. But it is, and we can demonstrate with 10,000 simulations:

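set.seed(1)
ans <- logical(10000)  # TRUE where the interval contains the true ratio
for (x in 1:10000) {
  a <- rnorm(10) + 10
  b <- rnorm(10) + 15
  m <- t.test(log(b), log(a))
  # back-transform the interval from the log scale to the ratio scale
  lwr <- exp(m$conf.int[1])
  upr <- exp(m$conf.int[2])
  # does the interval cover the true ratio, 15/10 = 1.5?
  ans[x] <- lwr < 1.5 & upr > 1.5
}
summary(ans)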

This gives 473 FALSE to 9527 TRUE, which is what we expect with a 95% confidence interval. Try it with other seeds and you’ll get nearly the same answer. This shows that the ratio t-test confidence intervals (e.g., the 48.9% [37.8%, 60.7%] from above) cover the true value about as often as they claim to. And thus it follows that even if you have a ludicrously low p-value, you CANNOT assume or fairly present a percentage difference with 2 or 3 significant digits and no estimate of uncertainty. And remember, if you see a p-value that is a lot closer to 0.05, that interval is probably something like 48.9% [1%, 90%]. In my view, scientists must start presenting these intervals, from ratio t-tests or similar approaches, to avoid the percent difference fallacy.
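If you want to try other seeds without editing the loop each time, the simulation can be wrapped in a function. This is a sketch only: the name coverage and its arguments are invented here, and it uses fewer simulations per seed (2,000 rather than 10,000) just to keep it quick.

```r
# Hypothetical wrapper around the simulation above (name and arguments are
# illustrative). Returns the fraction of intervals that contain the true ratio.
coverage <- function(seed, n_sim = 2000, n = 10,
                     mean_a = 10, mean_b = 15, sd = 1) {
  set.seed(seed)
  true_ratio <- mean_b / mean_a
  hits <- replicate(n_sim, {
    a <- rnorm(n, mean_a, sd)
    b <- rnorm(n, mean_b, sd)
    # un-log the confidence interval for the difference in mean logs
    ci <- exp(t.test(log(b), log(a))$conf.int)
    ci[1] < true_ratio && ci[2] > true_ratio
  })
  mean(hits)
}

# Coverage for five different seeds; each should land near 0.95
sapply(1:5, coverage)
```

Changing n or sd lets you see how the same check behaves with smaller samples or noisier data, as long as sd stays small enough relative to the means that no simulated value goes negative (the logs would fail).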
