Data snooping is a form of statistical bias that arises from manipulating data or analyses until statistically significant results appear.
Prolonged data manipulation increases your chances of observing statistically significant results, because every statistical test is probabilistic. While some of these results may reflect genuine effects, others may be significant purely by chance.
Alternative names: data dredging, data fishing, p-hacking.
In this simple simulation we will demonstrate an example of data snooping.
Suppose you study the effect of jelly beans on acne. To do so, you conduct a survey among $500$ volunteers and collect the following information: whether each volunteer regularly eats jelly beans, and a measure of their acne condition.
Let us simulate the data. We generate two independent random variables:
Here we assume that $90\%$ of the population eats jelly beans on a regular basis.
set.seed(1)
people = 500
p_eat_jelly_bean = 0.9
#draw a sample: acne condition is a severity score, uniform on [0, 1], independent of eating
acne_condition = runif(people)
eating = sample(c('eating', 'not eating'), people, replace = TRUE, prob = c(p_eat_jelly_bean, 1 - p_eat_jelly_bean))
#create data frame
data = data.frame('acne_condition' = acne_condition, 'eating' = eating)
head(data)
Before doing formal statistical testing, we visualize the data. According to the boxplot, there is no apparent difference in acne_condition between the two eating categories.
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4)
#boxplot for eating vs acne_condition
ggplot(data) +
  geom_boxplot(aes(x = eating, y = acne_condition, fill = eating)) +
  scale_fill_manual(breaks = c('eating', 'not eating'), values = c('grey', 'white')) +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  xlab('eating jelly beans') +
  ylab('acne condition') +
  ggtitle('Distribution of acne condition \nfor people eating/not eating jelly beans')
Now we state the null hypothesis: "There is no effect of eating jelly beans on acne condition." To check this hypothesis we run a t-test.
t.test(acne_condition ~ eating, data = data)
The resulting p-value is high ($p = 0.6688 > 0.05$), so we cannot reject the null. In other words, we conclude that there is not enough evidence to say that eating jelly beans causes acne. (This is not surprising, since eating and acne_condition were generated independently of each other.)
You got a non-significant p-value and feel disappointed because you cannot publish your results. A colleague suggests that you investigate the data further and check different colors of jelly beans. Maybe one of them is more dangerous than the others?
Now we add the color information to the data. We assume that there exist $20$ different jelly bean colors and that people eat jelly beans of each color with equal probability.
The code below generates one more random variable jelly_bean_color.
library(tidyverse)
set.seed(1)
#create a list of colors
colors = c('red', 'grey', 'blue', 'yellow', 'orange', 'purple', 'limegreen',
           'cyan', 'brown', 'pink', 'gold', 'salmon', 'magenta',
           'peachpuff', 'tan', 'aquamarine', 'green', 'coral', 'steelblue', 'beige')
#add color column to the data
data$jelly_bean_color = sample(colors, people, replace = TRUE)
data = data %>% mutate(jelly_bean_color = ifelse(eating == "not eating", NA, jelly_bean_color))
head(data)
We plot the distribution of acne condition for different jelly bean colors.
options(repr.plot.width = 12, repr.plot.height = 5)
ggplot(data %>% filter(!is.na(jelly_bean_color))) +
  geom_boxplot(aes(x = jelly_bean_color, y = acne_condition, fill = jelly_bean_color)) +
  scale_fill_manual(breaks = colors, values = colors) +
  xlab('jelly bean color') +
  ylab('acne condition') +
  theme(legend.position = "none")
To check the influence of a specific jelly bean color on acne condition, you again run a t-test.
#special function that runs t-test for a specific color
test_color = function(color){
  data = data %>% mutate(eating_color = ifelse(jelly_bean_color != color | is.na(jelly_bean_color), "no", "yes"))
  t.test(acne_condition ~ eating_color, data = data)$p.value
}
You check the red color first; the test is not significant.
cat('p value for red color = ', test_color('red'))
You try blue color next. Same result!
cat('p value for blue color = ', test_color('blue'))
Maybe yellow color will work? No success.
cat('p value for yellow color = ', test_color('yellow'))
So you run the t-test for all $20$ jelly bean colors and observe the following distribution of p-values.
options(repr.plot.width = 12, repr.plot.height = 4)
ttest_data = data.frame('color' = colors, 'pval' = sapply(colors, test_color))
ggplot(ttest_data) +
  geom_point(aes(x = color, y = pval, color = color), size = 5) +
  scale_color_manual(breaks = colors, values = colors) +
  geom_hline(aes(yintercept = 0.05), linetype = 'dashed') +
  ylim(0, 1) +
  xlab('jelly bean color') +
  ylab('p-value') +
  theme(legend.position = "none")
You discover that the green jelly bean color has a significant p-value.
cat('p value for green color = ', test_color('green'))
You finally publish your breaking news!
Now let's figure out what is going on here. At first glance the result looks counterintuitive: the way we generated the data makes acne condition completely independent of eating jelly beans of any color.
Actually, it is not the data that causes the problem, it is the testing procedure!
It turns out that even if there is no effect of eating jelly beans on acne, the probability of observing a significant result in any single test equals $0.05$. This is because the p-value is uniformly distributed under the null hypothesis; thus, observing a p-value below $0.05$ has exactly a $5\%$ chance.
It is not hard to derive a formal proof of this fact; however, one can also check the statement via simulation. To do so, we repeatedly generate independent data sets and compute the t-test p-value for each one.
#special function that generates data
generate_data = function(){
  acne_condition = runif(people)
  eating = sample(c('eating', 'not eating'), people, replace = TRUE, prob = c(p_eat_jelly_bean, 1 - p_eat_jelly_bean))
  data = data.frame('acne_condition' = acne_condition, 'eating' = eating)
  return(data)
}
#compute pvalues for 10000 random data sets
trial = 10000
pvals = replicate(trial, t.test(acne_condition ~ eating, data = generate_data())$p.value)
The histogram demonstrates that the p-value follows a uniform distribution. Thus, we conclude that there is a $5\%$ chance of observing a significant p-value even when the null hypothesis is true.
options(repr.plot.width = 5, repr.plot.height = 4)
ggplot() +
  geom_histogram(aes(x = pvals, y = after_stat(density)), breaks = seq(0, 1, 0.1), fill = 'orange', alpha = 0.5)
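As a sanity check, the fraction of p-values below $0.05$ should be close to the nominal $5\%$ level. A self-contained sketch (independent of the simulation above, using two samples drawn from the same normal distribution):

```r
# Under the null (both samples come from the same distribution),
# about 5% of t-test p-values should fall below 0.05
set.seed(1)
pvals_null = replicate(2000, t.test(rnorm(50), rnorm(50))$p.value)
cat('fraction of significant p-values = ', mean(pvals_null < 0.05))
```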
Next, if there is a $5\%$ chance of observing a significant result under the null, then testing $20$ hypotheses gives (assuming the tests are independent) a $$1 - 0.95^{20}\approx 64\%$$ chance of observing at least one significant test among the $20$. In other words, running many tests on the same data increased the probability of observing some significant result from $5\%$ to $64\%$! This explains why checking $20$ different jelly bean colors revealed one color that is "associated" with acne.
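The arithmetic above is easy to reproduce in R:

```r
# Family-wise error rate: probability of at least one false positive
# among m independent tests, each at significance level alpha
alpha = 0.05
m = 20
fwer = 1 - (1 - alpha)^m
cat('chance of at least one significant test = ', fwer)  # about 0.64
```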
Looking for patterns in data is legitimate. However, you should be very careful when interpreting the results. As we showed above, running multiple tests and presenting only the significant ones without the appropriate context can be very misleading.
There are some remedies for data snooping:
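One widely used remedy is to correct the p-values for the number of tests performed. A minimal sketch using R's built-in p.adjust function (the p-values below are illustrative, not taken from the study above):

```r
# Illustrative p-values from 20 hypothetical tests: one of them looks
# significant at the 0.05 level, as in the jelly bean example
pvals = c(0.02, seq(0.1, 0.95, length.out = 19))
# Bonferroni correction multiplies each p-value by the number of tests
# (capped at 1); it controls the family-wise error rate
adjusted = p.adjust(pvals, method = "bonferroni")
any(pvals < 0.05)     # TRUE before correction
any(adjusted < 0.05)  # FALSE after correction
```

With the correction applied, the single "significant" color would no longer pass the $0.05$ threshold.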