THE MOTHER-OF-ALL TEST, THE CHI SQUARE TEST!

Quick, name one statistical test!

Yep, right, it’s a scientifically proven fact that 87.3% of you replied the chi square test. (The remaining 13.7% are just smarty pants!… And quick, among those 13.7%, how many realized that we had a total of 101% here? Humhum, that’s what I thought!)

The chi square test will allow you to compare two distributions when dealing with 2 or more categories (for example, the proportions of Male/Female/Juvenile in a population, or the proportions of the population carrying the O/A/B alleles for blood types). You can compare an observed distribution to a theoretical one: this is a test of goodness of fit. Or you can compare 2 or more observed distributions: this is a test of independence.

The chi square test of goodness of fit

Let’s take a look first at the chi square test of goodness of fit. In this case, you have a population sorted into 2 or more categories, and you try to figure out whether your observations match what would be expected under a theoretical framework.

Important note: the chi square test of goodness of fit requires that none of the expected (theoretical) counts is lower than 5.

Let’s take an example presented in “R pour Statophobes” by Denis Poinsot:

We have a population of 89 individuals, obtained from cross breeding. We have observed 25 individuals with an AA genotype, 59 with Aa, but only 5 individuals with an aa genotype. Suspicious… Suspicious indeed, as under Mendelian inheritance we would have expected 25% of AA, 50% of Aa, and 25% of aa.

First things first, let’s make sure that we can use the chi square test:

Lowest theoretical sample size = Lowest expected proportion * Total sample size
0.25*89
[1] 22.25

22.25  >> 5, we’re clear to go!

We just need to input the observed counts for each category and the corresponding expected proportions (as with the binomial test):

observed_samples=c(25,59,5)
expected_proportions=c(0.25,0.5,0.25)
chisq.test(observed_samples,p=expected_proportions)
        Chi-squared test for given probabilities
data:  observed_samples
X-squared = 18.4382, df = 2, p-value = 9.913e-05
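To see where this X-squared value comes from, we can recompute it by hand as the sum of the squared deviations between observed and expected counts, scaled by the expected counts:

```r
observed <- c(25, 59, 5)
# Expected counts under Mendelian inheritance: 25% / 50% / 25% of the 89 individuals
expected <- c(0.25, 0.5, 0.25) * sum(observed)
# Chi square statistic: sum over categories of (observed - expected)^2 / expected
sum((observed - expected)^2 / expected)
# [1] 18.4382
```

This matches the X-squared value reported by chisq.test() above.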

The results of our test indicate that something funny is happening in our population if it is actually following Mendelian inheritance: the probability of observing such a strong deviation, if the population were Mendelian, would be lower than 0.0001! Among the returned results, you have the chi square value (X-squared = 18.4382) and the number of degrees of freedom (df = 2); these are what is used to calculate the corresponding p-value. The function returns the p-value directly, but if you ever need to calculate it yourself from the chi square value and the number of degrees of freedom, you can use the chi-squared distribution function “pchisq()” for the given parameters:

Xsquared = 18.4382
df = 2
1-pchisq(Xsquared,df)
[1] 9.912786e-05

Purists will forgive the shortcut in the following explanation: “pchisq(Xsquared,df)” returns the probability of observing a chi square value at most this large if the two distributions really were the same. Therefore, “1-pchisq(Xsquared,df)” returns the probability of observing a deviation at least this large by chance alone: the p-value. If this p-value is greater than 0.05, we cannot reject the hypothesis “the two distributions are the same”.
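Equivalently, pchisq() has a lower.tail argument that computes the upper tail directly, which is numerically safer for very small p-values than subtracting from 1:

```r
Xsquared <- 18.4382
df <- 2
# Upper-tail probability P(X > Xsquared): this is the p-value
pchisq(Xsquared, df, lower.tail = FALSE)
# [1] 9.912786e-05
```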


The chi square test of independence

Let’s take a look now at the chi square test of independence. We have sampled a population at several different locations (for example), and wish to determine whether the proportions of each category of individuals we observed are different between the locations. In order to do that, as previously said, we can use a chi square test… as long as no more than 20% of the theoretical/expected counts are lower than 5 (Cochran’s rule).

And it’s even simpler to do than the goodness of fit test!

We have detected bears in 3 different states: MO, MI, MS (bear with me here…). And we have counted males, females and juveniles respectively:

males=c(0,12,25)
females=c(12,16,45)
juveniles=c(15,43,1)

We need to set it up as a matrix for chisq.test():

barely_bearable=rbind(males,females,juveniles)
colnames(barely_bearable)=c("MO","MI","MS")
barely_bearable
          MO MI MS
males      0 12 25
females   12 16 45
juveniles 15 43  1
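Before running the test, we can check Cochran’s rule by asking chisq.test() for the expected counts it will use (computed, for each cell, as row total × column total / grand total):

```r
males <- c(0, 12, 25)
females <- c(12, 16, 45)
juveniles <- c(15, 43, 1)
barely_bearable <- rbind(males, females, juveniles)
colnames(barely_bearable) <- c("MO", "MI", "MS")
# Expected counts under independence of category and state
chisq.test(barely_bearable)$expected
```

Here the smallest expected count is about 5.9, so no cell falls below 5 and Cochran’s rule is satisfied.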

Now for the test itself:

chisq.test(barely_bearable)
        Pearson's Chi-squared test
data:  barely_bearable
X-squared = 65.7, df = 4, p-value = 1.832e-13

The probability of observing this table by pure chance is incredibly low (about 1.832e-13), and we can safely declare that the composition (males/females/juveniles) of the bear populations differs between those 3 states.
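To see where the imbalance comes from, we can look at the proportions within each state using prop.table() on the same table:

```r
males <- c(0, 12, 25)
females <- c(12, 16, 45)
juveniles <- c(15, 43, 1)
barely_bearable <- rbind(males, females, juveniles)
colnames(barely_bearable) <- c("MO", "MI", "MS")
# Column-wise proportions: each state's counts divided by that state's total
round(prop.table(barely_bearable, margin = 2), 2)
```

For instance, juveniles make up about 60% of the MI sample but barely 1% of the MS sample, which is exactly the kind of discrepancy the test is picking up.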

Exercise 3.2

We wish to compare 4 restaurants based on the prices on their menus.

To make things simple, we sorted the items of each menu into 4 categories: really cheap, cheap, expensive, really expensive.

The first restaurant has 12 really cheap items, 23 cheap items, 42 expensive items, and 36 really expensive items.

The second restaurant has 9 really cheap items, 21 cheap items, 22 expensive items, and 14 really expensive items.

The third restaurant has 43 really cheap items, 33 cheap items, 17 expensive items, and 6 really expensive items.

The fourth restaurant has 25 really cheap items, 10 cheap items, 16 expensive items, and 13 really expensive items.

– Create a data frame combining this information

– Perform a chi square test to see if the prices in those restaurants are different

 

Answer 3.2

rest1=c(12,23,42,36)
rest2=c(9,21,22,14)
rest3=c(43,33,17,6)
rest4=c(25,10,16,13)
restdata=data.frame(rest1,rest2,rest3,rest4)
chisq.test(restdata)
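As in the bear example, it is worth checking the expected counts before trusting the result. A quick sketch (the row names are added here purely for readability):

```r
rest1 <- c(12, 23, 42, 36)
rest2 <- c(9, 21, 22, 14)
rest3 <- c(43, 33, 17, 6)
rest4 <- c(25, 10, 16, 13)
restdata <- data.frame(rest1, rest2, rest3, rest4,
                       row.names = c("really cheap", "cheap",
                                     "expensive", "really expensive"))
# Smallest expected count: Cochran's rule wants no more than 20% of cells below 5
min(chisq.test(restdata)$expected)
```

Every expected count here is comfortably above 5, so the chi square test of independence applies.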

 


 
