## 29 Aug 2011

### Comparing Two Distributions

Here I compare two distributions, flowering duration of indigenous and allochtonous plant species. The hypothesis is that alien compared to indigenous plant species exhibit longer flowering periods.

*Data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.

```## comparing flowering time in indigenous and alien species by
## usage of a quantile-quantile plot (qqplot), the ks-test, by
## testing shifts in medians (wilcox test) and by testing
## difference of means (t-test, as we deal with integer and
## not a continous variable it is not the most appropiate choice)
## as well as by a chi-square test:

sep = ";")

library(lattice)

histogram(~ Flowering|Status, data = dat, col = "gray60", layout = c(1, 2),
xlab = list("Months of flowering"),
ylab = list("Percentage of total"),
scales = list(y = list(alternating = F)),
strip = strip.custom(factor.levels = c("alien", "indigenous")))

qqplot(dat\$Flowering[dat\$Status == "indigen"],
dat\$Flowering[dat\$Status == "Neophyt"])
abline(a = 0, b = 1, lty = 3)

ks.test(dat\$Flowering[dat\$Status == "indigen"],
dat\$Flowering[dat\$Status == "Neophyt"])

wilcox.test(Flowering ~ Status, data = dat)

t.test(Flowering ~ Status, data = dat)

## Note that in the two-sample case the estimator for the
## difference in location parameters does not estimate
## the difference in medians (a common misconception) but
## rather the median of the difference between a sample
## from x and a sample from y.

## as we deal with a limited number of classes (1-12 months)
## and sample size is big enough my favourite would be a
## chi-square test:

m <- table(dat\$Status, dat\$Flowering)

(Xsq <- chisq.test(m))  # Prints test summary
Xsq\$observed   # observed counts (same as M)
Xsq\$expected   # expected counts under the null
Xsq\$residuals  # Pearson residuals
Xsq\$stdres     # standardized residuals
```

1. Thank you for posting this! I have a similar type of data set and these comments/codes were vastly helpful.
-JW, California, USA

2. Hi,

1. 3. I'm also trying to compare two frequency distributions (monkey subgroup-size during dry "S_12" vs. rainy "LL_13" season):

n <- table(sg\$sgs, sg\$temp) #Contingency table for subgroup size (sgs) per season (temp)

S_12 LL_13
1 182 185
2 230 262
3 190 229
4 190 162
5 94 130
6 79 73
7 56 113
8 30 46
9 34 37
10 7 24
11 7 24
12 4 18
13 2 9
14 1 4

I basically want to see if subgroup size frequencies were different between seasons. After running a chi squared test usign your approach, I got a significant difference:

Xsq <- chisq.test(n)
Xsq

Pearson's Chi-squared test
data: n
X-squared = 52.7027, df = 13, p-value = 1.018e-06

However, I'm unclear on something and I would apprecieate any insight you share with me. From what I understand, when we use the chi-squared test to compare two observed distributions, one of them serves as the "expected" one, and is then compared to the other "observed" one. But if this is the case, why do we get a separate set of expected values when looking at the summary of the test?

This is what I get for expected values:

Xsq\$expected # expected counts under the null

S_12 LL_13
1 167.589595 199.410405
2 224.670520 267.329480
3 191.335260 227.664740
4 160.739884 191.260116
5 102.289017 121.710983
6 69.410405 82.589595
7 77.173410 91.826590
8 34.705202 41.294798
9 32.421965 38.578035
10 14.156069 16.843931
11 14.156069 16.843931
12 10.046243 11.953757
13 5.023121 5.976879
14 2.283237 2.716763

I'm worried that whart R is actually doing is comparing both sets of distributions with a model based on the default p value for the function ( p = rep(1/length(x)). Forgive me if this is a dumb question, but I've been going through the "?chisq.test" help file and have gone through other references and still can't clear that up in my head, so any help or guidance will be appreciated.

1. sgs S_12 LL_13
1 182 185
2 230 262
3 190 229
4 190 162
5 94 130
6 79 73
7 56 113
8 30 46
9 34 37
10 7 24
11 7 24
12 4 18
13 2 9
14 1 4

# I'll reconstructed your dataset here, (best for sharing data is dput() which saves us much headache;)

b <- do.call(cbind, sapply(a[2:15], FUN = function(x) strsplit(x, " ") ))

S_12_sgs <- rep(1:14, as.numeric(b[2,]))
LL_13_sgs <- rep(1:14, as.numeric(b[3,]))
temp <- c(rep("LL_13", length(LL_13_sgs)), rep("S_12", length(S_12_sgs)))

sg <- data.frame(temp=temp, sgs=c(LL_13_sgs, S_12_sgs))

(n <- table(sg\$sgs, sg\$temp))

Xsq <- chisq.test(n)

# Here's how you get the expected frequencies under the null-hypothesis
# which assumes that each set of frequencies were collected from data
# with the same distribution!
# For this you simply calculate row- and column sums and calculate the
# probabilities and frequencies

sgs_margins <- margin.table(n, 1)
LL_13_sum <- margin.table(n, 2)
S_12_sum <- margin.table(n, 2)
total_sum <- margin.table(n)

(sgs_margins / total_sum * LL_13_sum / total_sum) * total_sum
(sgs_margins / total_sum * S_12_sum / total_sum) * total_sum

# the same can be grabbed from our Xsq-object
Xsq\$expected

# the "observed" frequencies are merely the ones collected
Xsq\$observed

# for detail on Chi-squared and p-values just look up any basic statistics-literature