Data snooping

Data snooping is a form of statistical bias that arises from manipulating data or analysis until statistically significant results appear.

Extensive manipulation of the data increases your chances of observing statistically significant results because of the probabilistic nature of all statistical tests. Although some of these results may reflect real effects, others will appear significant purely by chance.

Alternative names: data dredging, data fishing, p-hacking.

Jelly beans experiment

In this simple simulation we will demonstrate an example of data snooping.

Part 1: does eating jelly beans cause acne?

Suppose you study the effect of jelly beans on acne. To do so, you conduct a survey among $500$ volunteers and collect the following information:

  • whether a participant eats jelly beans on a regular basis;
  • the participant's acne condition, measured as a value between $0$ and $1$.

Let us simulate the data. We generate two independent random variables:

  • acne_condition indicating the acne condition of a participant, drawn from the uniform distribution $U(0,1)$;
  • eating indicating jelly bean consumption, drawn from the Bernoulli distribution $Bern(0.9)$.

Here we assume that $90\%$ of the population eats jelly beans on a regular basis.

In [2]:
set.seed(1)
people = 500 
p_eat_jelly_bean = 0.9

#draw a sample
acne_condition = runif(people)
eating = sample(c('eating', 'not eating'), people, replace = TRUE, prob = c(p_eat_jelly_bean, 1 - p_eat_jelly_bean))

#create data frame
data = data.frame('acne_condition' = acne_condition, 'eating' = eating)
head(data)
A data.frame: 6 × 2

  acne_condition eating
  <dbl>          <chr>
1 0.2655087      eating
2 0.3721239      eating
3 0.5728534      eating
4 0.9082078      eating
5 0.2016819      eating
6 0.8983897      not eating

Before doing formal statistical testing, we visualize the data. According to the boxplot, there is no apparent difference in acne_condition between the two eating categories.

In [3]:
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4)

#boxplot for eating vs acne_condition
ggplot(data)+
geom_boxplot(aes(x = eating, y = acne_condition, fill = eating))+
scale_fill_manual(breaks =  c('eating', 'not eating'), values =  c('grey', 'white'))+
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))+
xlab('eating jelly beans') + 
ylab('acne condition') +
ggtitle('Distribution of acne condition \nfor people eating/not eating jelly beans')

Now we state the null hypothesis: "There is no effect of eating jelly beans on acne condition." To check this hypothesis we run a t-test.

In [4]:
t.test(acne_condition ~ eating, data = data)
	Welch Two Sample t-test

data:  acne_condition by eating
t = -0.42948, df = 73.664, p-value = 0.6688
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.09402103  0.06067896
sample estimates:
    mean in group eating mean in group not eating 
               0.4937211                0.5103921 

The resulting p-value is high, i.e. $p = 0.6688 > 0.05$, so we cannot reject the null. In other words, we conclude that there is not enough evidence to say that eating jelly beans causes acne. (Actually, we are not surprised, since eating and acne_condition were generated independently of each other.)

Part 2: which jelly bean color causes acne?

You got a non-significant p-value and feel disappointed because you cannot publish your results. Your colleague suggests further investigating the data and checking different colors of jelly beans. Maybe one of them is more dangerous than the others?

Now we add the color information to the data. We assume that there exist $20$ different jelly bean colors and that people eat jelly beans of each color with equal probability.

The code below generates one more random variable jelly_bean_color.

In [15]:
library(tidyverse)
set.seed(1)

#create a list of colors
colors = c('red', 'grey', 'blue', 'yellow', 'orange', 'purple', 'limegreen', 
                     'cyan', 'brown', 'pink', 'gold', 'salmon', 'magenta', 
                     'peachpuff', 'tan', 'aquamarine', 'green', 'coral', 'steelblue', 'beige')

#add color column to the data
data$jelly_bean_color = sample(colors, people, replace = TRUE)
data = data %>% mutate(jelly_bean_color = ifelse(eating == "not eating", NA, jelly_bean_color))
head(data)
A data.frame: 6 × 3

  acne_condition eating     jelly_bean_color
  <dbl>          <chr>      <chr>
1 0.2655087      eating     yellow
2 0.3721239      eating     limegreen
3 0.5728534      eating     red
4 0.9082078      eating     grey
5 0.2016819      eating     gold
6 0.8983897      not eating NA

We plot the distribution of acne condition for different jelly bean colors.

In [6]:
options(repr.plot.width = 12, repr.plot.height = 5)

ggplot(data %>% filter(!is.na(jelly_bean_color)))+
geom_boxplot(aes(x = jelly_bean_color, y = acne_condition, fill = jelly_bean_color))+
scale_fill_manual(breaks = colors, values =  colors)+
xlab('jelly bean color') + 
ylab('acne condition') +
theme(legend.position = "none")

To check the influence of a specific jelly bean color on acne condition, you again run a t-test. The helper function below compares people who eat jelly beans of the given color against everyone else.

In [7]:
#special function that runs t-test for a specific color
test_color = function(color){
    data = data %>% mutate(eating_color = ifelse(jelly_bean_color != color | is.na(jelly_bean_color), "no", "yes"))
    t.test(acne_condition ~ eating_color, data = data)$p.value
}

You check the red color first; the test is not significant.

In [8]:
cat('p value for red color = ',  test_color('red'))
p value for red color =  0.4506474

You try blue color next. Same result!

In [9]:
cat('p value for blue color = ', test_color('blue'))
p value for blue color =  0.2353636

Maybe yellow color will work? No success.

In [10]:
cat('p value for yellow color = ', test_color('yellow'))
p value for yellow color =  0.7795672

So you run the t-test for all $20$ jelly bean colors and observe the following distribution of p-values.

In [11]:
options(repr.plot.width = 12, repr.plot.height = 4)

ttest_data = data.frame('color' = colors, 'pval' = sapply(colors, test_color))
ggplot(ttest_data)+
geom_point(aes(x = color, y = pval, color = color), size = 5)+
scale_color_manual(breaks =  colors, values =  colors)+
geom_hline(aes(yintercept = 0.05), linetype = 'dashed')+
ylim(0, 1)+
xlab('jelly bean color') + 
ylab('p-value') +
theme(legend.position = "none")

You discover that the green jelly bean color has a significant p-value.

In [12]:
cat('p value for green color = ', test_color('green'))
p value for green color =  0.01605542

You finally publish your breaking news!

Explanation

Now let's figure out what is going on here. So far it looks quite counterintuitive: it seems that the way we generated the data should make the acne condition completely independent of eating jelly beans of any color.

Actually, it is not the data that causes the problem, it is the testing procedure!

It turns out that even if there is no effect of eating jelly beans on acne, the probability of observing a significant result in a single test is $0.05$. This is because the p-value is uniformly distributed under the null hypothesis; thus, a p-value below $0.05$ occurs with exactly $5\%$ probability.
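
For the interested reader, here is a short sketch of that fact for a test statistic $T$ whose null distribution has a continuous CDF $F$ (one-sided case for simplicity). Since $F(T)$ is uniformly distributed on $(0,1)$ under $H_0$, for any level $\alpha \in (0,1)$

$$P(p \le \alpha \mid H_0) = P\big(1 - F(T) \le \alpha\big) = P\big(F(T) \ge 1 - \alpha\big) = \alpha.$$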

The formal proof is sketched above; one can also check the statement via simulation. To do so, we

  • generate the jelly beans data $10000$ times;
  • run the t-test on each dataset, testing the null "there is no effect of eating jelly beans on acne condition";
  • check the distribution of the resulting p-values. Note that the way we generate the data guarantees that the null hypothesis actually holds.
In [13]:
#special function that generates data
generate_data = function(){
    acne_condition = runif(people)
    eating = sample(c('eating', 'not eating'), people, replace = TRUE, prob = c(p_eat_jelly_bean, 1 - p_eat_jelly_bean))
    data = data.frame('acne_condition' = acne_condition, 'eating' = eating)
    return(data)
}

#compute pvalues for 10000 random data sets
trial = 10000
pvals = replicate(trial, t.test(acne_condition ~ eating, data = generate_data())$p.value)

The histogram demonstrates that the p-value indeed follows the uniform distribution. Thus, we conclude that there is a $5\%$ chance of observing a significant p-value even if the null is true.

In [14]:
options(repr.plot.width = 5, repr.plot.height = 4)

ggplot()+
geom_histogram(aes(x = pvals, y = ..density..), breaks = seq(0,1, 0.1), fill = 'orange', alpha = 0.5)
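
As a quick numeric check on top of the histogram, the empirical share of significant tests in the pvals vector computed above should be close to $5\%$ (a one-line sketch, no new data involved).

In [ ]:
#empirical proportion of p-values below 0.05; expected to be close to 0.05
mean(pvals < 0.05)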

Next, if there is a $5\%$ chance of observing a significant result under the null, then testing $20$ (approximately independent) hypotheses yields a $$1 - 0.95^{20}\approx 64\%$$ chance of observing at least one significant test among these $20$. In other words, running many tests on the same data increased the probability of observing some significant result from $5\%$ to $64\%$! This explains why checking $20$ different jelly bean colors revealed one color that is "associated" with acne.
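
We can verify this number in code. The first line reproduces the analytic $64\%$; the rest is a rough simulation sketch (reusing generate_data, colors and people from the cells above, and mirroring the logic of test_color) that estimates how often at least one of the $20$ color tests comes out significant even though no color has any real effect. The exact estimate will vary a little, since the $20$ tests share the same data and are only approximately independent.

In [ ]:
#analytic chance of at least one false positive among 20 independent tests at level 0.05
alpha = 0.05
1 - (1 - alpha)^20

#rough simulation of the full snooping procedure (reuses generate_data, colors, people)
snooping_trial = function(){
    d = generate_data()
    d$jelly_bean_color = sample(colors, people, replace = TRUE)
    d$jelly_bean_color = ifelse(d$eating == 'not eating', NA, d$jelly_bean_color)
    color_pvals = sapply(colors, function(color){
        eating_color = ifelse(d$jelly_bean_color != color | is.na(d$jelly_bean_color), 'no', 'yes')
        t.test(d$acne_condition ~ eating_color)$p.value
    })
    min(color_pvals) < alpha
}

#share of simulated studies with at least one "significant" color
mean(replicate(1000, snooping_trial()))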

Conclusion

Looking for patterns in data is legitimate. However, you should be very careful when interpreting the results. As we showed above, running multiple tests and presenting only the significant ones without the appropriate context can be very misleading.

There are some remedies for data snooping:

  • clearly state the testing procedure and share the details about the number of tests you conducted;
  • conduct randomized out-of-sample tests or use cross-validation to validate your hypothesis;
  • record the number of all significance tests conducted during the study and apply a multiple-testing correction (e.g. the Bonferroni correction; a less conservative option is Benjamini and Hochberg's false discovery rate), as sketched below.
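
As a minimal sketch of the last remedy, base R's p.adjust can apply both corrections to the ttest_data data frame with the $20$ color p-values computed above. The Bonferroni-adjusted p-value for green is $20 \times 0.016 \approx 0.32 > 0.05$, so the "discovery" does not survive the correction.

In [ ]:
#adjust the 20 color p-values for multiple testing
ttest_data$pval_bonferroni = p.adjust(ttest_data$pval, method = 'bonferroni')
ttest_data$pval_bh = p.adjust(ttest_data$pval, method = 'BH')

#check whether any color remains significant after either correction
ttest_data %>% filter(pval_bonferroni < 0.05 | pval_bh < 0.05)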

Further reading