Tuesday 6 September 2016

Performing ANOVA and Tukey's HSD Tests in R

Sometime last month a friend of mine approached me with a problem: he had two data sets, each with more than 50 variables and 3 observations per variable. He wanted to know whether the means of the variables were significantly different from each other.


So the problem is to investigate whether the observed difference in means is too large to be the result of random selection. The sample means do look different, but what about the population means?

The question we are asking is: is the difference in the population means great enough that we can rule out chance? Note that we will investigate this claim using the available sample data.
I will use ANOVA to test for equality of all means and then perform a post hoc analysis called Tukey's HSD (Honestly Significant Difference) in R. One-way analysis of variance (ANOVA) is used to determine whether there are any statistically significant differences between the means of three or more independent (unrelated) groups. The ANOVA test tells you whether there is an overall difference between your groups, but it does not tell you which specific groups differ; post hoc tests do. Because post hoc tests are run to find where the differences between groups occurred, they should only be run once you have shown an overall statistically significant difference in group means.

ANOVA Test for Equality of All Means
Our ANOVA procedure tests these hypotheses (H0 = null hypothesis, H1 = alternative hypothesis):
                H0: m1 = m2 = m3 = ... = mn (all the group means are the same)
                H1: at least one mean is different from the others
Let’s test these hypotheses at the α = 0.05 significance level.

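One important thing to note is that the format of the data really matters when performing ANOVA in R: the data must be stacked, that is, arranged with all the values in one column and the group each value belongs to in a second column.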
Original Data
Stacked Data
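As a rough illustration of what stacking means, here is a minimal sketch using base R's stack(). The data frame wide and its columns A, B and C are made-up values for illustration only, not my friend's data, and the renamed columns value and group are my own naming choice.

```r
# Hypothetical wide-format data: one column per sample, three observations each.
wide <- data.frame(
  A = c(10.1, 9.8, 10.4),
  B = c(12.3, 12.9, 12.5),
  C = c(15.2, 14.8, 15.5)
)

# stack() returns a data frame with two columns:
# 'values' (the observations) and 'ind' (a factor holding the source column name).
stacked <- stack(wide)

# Rename the columns to something more descriptive (assumed names).
names(stacked) <- c("value", "group")

print(stacked)
```

Each row of the stacked data now holds one observation and the label of the sample it came from, which is the layout a model formula such as value ~ group expects.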

Then we go ahead and perform the ANOVA test and view the result.
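A one-way ANOVA on stacked data can be sketched with base R's aov() and summary(). The data below are the same made-up values as before, so the numbers it prints will not match the results discussed in this post.

```r
# Hypothetical stacked data: three groups with three observations each.
stacked <- data.frame(
  value = c(10.1, 9.8, 10.4, 12.3, 12.9, 12.5, 15.2, 14.8, 15.5),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

# Fit the one-way ANOVA model: value explained by group membership.
fit <- aov(value ~ group, data = stacked)

# summary() returns a list; its first element is the ANOVA table with the
# columns Df, Sum Sq, Mean Sq, F value and Pr(>F).
anova_table <- summary(fit)[[1]]
print(anova_table)
```

The first row of the table describes the variation between groups and the second row (Residuals) the variation within groups.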




Observations:

  •  Note that the mean square between samples (63.43) is much larger than the mean square within samples (2.74).

  • The ratio of the between-groups mean square to the within-groups mean square is called the F statistic (F = 63.43/2.74 ≈ 23.15 in our case). It tells you how much more variability there is between the sample groups than within them.

      Since the F value is large, we can be more confident in rejecting the null hypothesis that all means are equal.
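The F ratio in the observations above can be checked by hand from the two mean squares reported in the output. The variable names here are my own; the p-value is not recomputed because that would also need the degrees of freedom from the ANOVA table, which are not repeated in the text.

```r
# Mean squares as reported in the ANOVA output discussed above.
ms_between <- 63.43  # between-groups mean square
ms_within  <- 2.74   # within-groups (residual) mean square

# The F statistic is the ratio of the two mean squares.
f_stat <- ms_between / ms_within
print(round(f_stat, 2))
```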
Conclusion:
The p-value, 0.0000000000000002 (very close to zero), is below our significance level of 0.05. It would be unlikely to see a p-value this low if there were no real differences among the means of our samples.
Therefore we reject H0 and accept H1, concluding that the means of the samples are not all the same.




Tukey HSD for Post-Hoc Analysis

Since our ANOVA test shows that the means aren’t all equal, our next step is to determine which means are different, again using a significance level of α = 0.05. To do so I performed a post hoc analysis called Tukey's HSD using the agricolae package.


Samples sharing the same letter are not significantly different at the chosen level (5% by default), while samples that share no letter are significantly different.
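For readers who want to try this, here is a minimal sketch of the post hoc step on the same made-up data as before. It first uses base R's TukeyHSD(), which reports pairwise differences with adjusted p-values rather than letters, and then, only if the agricolae package happens to be installed, calls HSD.test() with group = TRUE to get the letter groupings described above. The data and object names are assumptions for illustration.

```r
# Hypothetical stacked data: three groups with three observations each.
stacked <- data.frame(
  value = c(10.1, 9.8, 10.4, 12.3, 12.9, 12.5, 15.2, 14.8, 15.5),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

fit <- aov(value ~ group, data = stacked)

# Base R alternative: pairwise comparisons with family-wise 95% confidence
# intervals and adjusted p-values (columns diff, lwr, upr, p adj).
tukey <- TukeyHSD(fit, "group", conf.level = 0.95)
print(tukey)

# agricolae version: letter groupings at alpha = 0.05.
# Guarded so the sketch still runs when agricolae is not installed.
if (requireNamespace("agricolae", quietly = TRUE)) {
  hsd <- agricolae::HSD.test(fit, "group", group = TRUE, alpha = 0.05)
  print(hsd$groups)  # group means with their letters
}
```

In the TukeyHSD() output, a pair of groups whose adjusted p-value is below 0.05 corresponds to two samples that share no letter in the agricolae output.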