theBioBucket*: Comparing Two Distributions

29 Aug 2011

Comparing Two Distributions

Here I compare two distributions, flowering duration of indigenous and allochtonous plant species. The hypothesis is that alien compared to indigenous plant species exhibit longer flowering periods.

download data*

*Data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.

## comparing flowering time in indigenous and alien species by
## usage of a quantile-quantile plot (qqplot), the ks-test, by
## testing shifts in medians (wilcox test) and by testing
## difference of means (t-test, as we deal with integer and
## not a continous variable it is not the most appropiate choice)
## as well as by a chi-square test:

dat <- read.csv("E:\\R\\Data\\flowering_alien_vs_indigen.csv",
                sep = ";")

library(lattice)

histogram(~ Flowering|Status, data = dat, col = "gray60", layout = c(1, 2),
          xlab = list("Months of flowering"),
          ylab = list("Percentage of total"),
          scales = list(y = list(alternating = F)),
          strip = strip.custom(factor.levels = c("alien", "indigenous")))

qqplot(dat$Flowering[dat$Status == "indigen"],
       dat$Flowering[dat$Status == "Neophyt"])
abline(a = 0, b = 1, lty = 3)

ks.test(dat$Flowering[dat$Status == "indigen"],
        dat$Flowering[dat$Status == "Neophyt"])

wilcox.test(Flowering ~ Status, data = dat)

t.test(Flowering ~ Status, data = dat)

## Note that in the two-sample case the estimator for the
## difference in location parameters does not estimate
## the difference in medians (a common misconception) but
## rather the median of the difference between a sample
## from x and a sample from y.

## as we deal with a limited number of classes (1-12 months)
## and sample size is big enough my favourite would be a
## chi-square test:

m <- table(dat$Status, dat$Flowering)

(Xsq <- chisq.test(m))  # Prints test summary
Xsq$observed   # observed counts (same as M)
Xsq$expected   # expected counts under the null
Xsq$residuals  # Pearson residuals
Xsq$stdres     # standardized residuals

5 comments :

Anonymous12 December 2013 at 01:20
Thank you for posting this! I have a similar type of data set and these comments/codes were vastly helpful.
-JW, California, USA
ReplyDelete
Replies
Anonymous17 March 2014 at 11:51
Hi,

I am unable to access the data you uploaded. Can you check your link please?
ReplyDelete
Replies
Anonymous17 July 2014 at 21:05
Hello, and thank you for your post, it's been very helpful.

I'm also trying to compare two frequency distributions (monkey subgroup-size during dry "S_12" vs. rainy "LL_13" season):

n <- table(sg$sgs, sg$temp) #Contingency table for subgroup size (sgs) per season (temp)

S_12 LL_13
1 182 185
2 230 262
3 190 229
4 190 162
5 94 130
6 79 73
7 56 113
8 30 46
9 34 37
10 7 24
11 7 24
12 4 18
13 2 9
14 1 4

I basically want to see if subgroup size frequencies were different between seasons. After running a chi squared test usign your approach, I got a significant difference:

Xsq <- chisq.test(n)
Xsq

Pearson's Chi-squared test
data: n
X-squared = 52.7027, df = 13, p-value = 1.018e-06

However, I'm unclear on something and I would apprecieate any insight you share with me. From what I understand, when we use the chi-squared test to compare two observed distributions, one of them serves as the "expected" one, and is then compared to the other "observed" one. But if this is the case, why do we get a separate set of expected values when looking at the summary of the test?

This is what I get for expected values:

Xsq$expected # expected counts under the null

S_12 LL_13
1 167.589595 199.410405
2 224.670520 267.329480
3 191.335260 227.664740
4 160.739884 191.260116
5 102.289017 121.710983
6 69.410405 82.589595
7 77.173410 91.826590
8 34.705202 41.294798
9 32.421965 38.578035
10 14.156069 16.843931
11 14.156069 16.843931
12 10.046243 11.953757
13 5.023121 5.976879
14 2.283237 2.716763

I'm worried that whart R is actually doing is comparing both sets of distributions with a model based on the default p value for the function ( p = rep(1/length(x)). Forgive me if this is a dumb question, but I've been going through the "?chisq.test" help file and have gone through other references and still can't clear that up in my head, so any help or guidance will be appreciated.
ReplyDelete
Replies

Add comment

TABS

29 Aug 2011

Comparing Two Distributions

5 comments :