29 Aug 2011

Comparing Two Distributions

Here I compare two distributions: the flowering duration of indigenous and of allochthonous (alien) plant species. The hypothesis is that alien species flower for longer than indigenous species.

download data*

*Data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.

## Comparing flowering duration in indigenous and alien species using
## a quantile-quantile plot (qqplot), the Kolmogorov-Smirnov test,
## a test for a shift in location (Wilcoxon test), a test for a
## difference in means (t-test; not the most appropriate choice here,
## because the variable is an integer count, not continuous),
## and a chi-square test:

dat <- read.csv("E:\\R\\Data\\flowering_alien_vs_indigen.csv",
                sep = ";")
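If the CSV file is not at hand, the code in this post can still be tried on a simulated stand-in. This is only a sketch: `dat_demo`, the sample sizes and the Poisson parameters are made up for illustration and are not the BiolFlor data. Only the column names `Status` and `Flowering` and the status codes `indigen` and `Neophyt` are taken from the code in this post.

```r
## Simulated stand-in (NOT the BiolFlor data): same column names and
## status codes as used in this post, arbitrary distributions.
set.seed(42)
n_ind <- 300   # hypothetical number of indigenous species
n_neo <- 150   # hypothetical number of alien species

dat_demo <- data.frame(
  Status    = rep(c("indigen", "Neophyt"), times = c(n_ind, n_neo)),
  ## months of flowering, truncated to the range 1-12
  Flowering = c(pmin(12, 1 + rpois(n_ind, 2)),
                pmin(12, 1 + rpois(n_neo, 3)))
)

str(dat_demo)
table(dat_demo$Status, dat_demo$Flowering)
```

Replacing `dat` by `dat_demo` in the calls that follow should be enough to run them, although the results then say nothing about real plants.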


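library(lattice)  # histogram() and strip.custom() come from lattice

## Fix the level order explicitly so the strip labels cannot be swapped
## by locale-dependent sorting of "indigen" and "Neophyt".
histogram(~ Flowering | factor(Status, levels = c("Neophyt", "indigen")),
          data = dat, col = "gray60", layout = c(1, 2),
          xlab = "Months of flowering",
          ylab = "Percentage of total",
          scales = list(y = list(alternating = FALSE)),
          strip = strip.custom(factor.levels = c("alien", "indigenous")))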

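qqplot(dat$Flowering[dat$Status == "indigen"],
       dat$Flowering[dat$Status == "Neophyt"],
       xlab = "Indigenous: months of flowering",
       ylab = "Alien: months of flowering")
abline(a = 0, b = 1, lty = 3)  # reference line: identical distributions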

ks.test(dat$Flowering[dat$Status == "indigen"],
        dat$Flowering[dat$Status == "Neophyt"])
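The Kolmogorov-Smirnov test was designed for continuous distributions, whereas months of flowering are integers with many ties. One way to sidestep that, sketched here on made-up vectors rather than the real data, is a permutation test on the same statistic. The names `ks_stat`, `x`, `y` and the number of permutations are illustrative assumptions.

```r
## Permutation version of the two-sample KS test (sketch on made-up data).
set.seed(1)
x <- sample(1:12, 60, replace = TRUE)                 # made-up "indigenous" values
y <- pmin(12, sample(1:12, 40, replace = TRUE) + 1)   # made-up "alien" values

## KS statistic: largest vertical distance between the two empirical CDFs
ks_stat <- function(a, b) {
  grid <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

d_obs  <- ks_stat(x, y)
pooled <- c(x, y)
n_x    <- length(x)

## Reshuffle the group labels and recompute the statistic each time
d_perm <- replicate(2000, {
  idx <- sample(length(pooled), n_x)
  ks_stat(pooled[idx], pooled[-idx])
})

## Share of permutations at least as extreme as the observed statistic
## (small tolerance guards against floating-point noise)
p_perm <- (sum(d_perm >= d_obs - 1e-12) + 1) / (length(d_perm) + 1)
c(D = d_obs, p = p_perm)
```

The permutation p-value relies only on the labels being exchangeable under the null hypothesis, so ties in the data are not a problem.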

wilcox.test(Flowering ~ Status, data = dat)

t.test(Flowering ~ Status, data = dat)

## Note on wilcox.test: in the two-sample case the estimator for
## the difference in location parameters (reported when
## conf.int = TRUE) does not estimate the difference in medians
## (a common misconception) but rather the median of the
## difference between a sample from x and a sample from y.
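A small made-up example may make this distinction concrete. The vectors `x` and `y` below are invented for illustration and contain no ties, so that `wilcox.test` can use its exact method.

```r
## Difference in medians vs. median of pairwise differences (toy data).
x <- c(1, 2, 3, 4, 20)
y <- c(0.5, 2.5, 5.5, 6.5, 7.5)

diff_of_medians <- median(x) - median(y)

## all 25 pairwise differences x[i] - y[j], then their median
median_of_diffs <- median(outer(x, y, "-"))

## location estimate reported by the Wilcoxon test
wt <- wilcox.test(x, y, conf.int = TRUE)
hl <- unname(wt$estimate)

c(diff_of_medians = diff_of_medians,
  median_of_diffs = median_of_diffs,
  wilcox_estimate = hl)
```

Here the estimate from `wilcox.test` agrees with the median of the pairwise differences and not with the difference in medians. With heavily tied data such as months of flowering, `wilcox.test` falls back to a normal approximation, and as far as I can tell the reported estimate may then differ slightly from the plain median of pairwise differences.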

## As we deal with a limited number of classes (1-12 months)
## and the sample size is large enough, my favourite would be
## a chi-square test:

m <- table(dat$Status, dat$Flowering)

(Xsq <- chisq.test(m))  # Prints test summary
Xsq$observed   # observed counts (same as m)
Xsq$expected   # expected counts under the null
Xsq$residuals  # Pearson residuals
Xsq$stdres     # standardized residuals
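To see where these components come from, they can be recomputed by hand. The table `obs` below is invented (including its binned month classes) and only serves to check the arithmetic against `chisq.test`.

```r
## Expected counts and residuals by hand (made-up 2 x 3 table).
obs <- matrix(c(30, 50, 20,
                10, 25, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(Status = c("indigen", "Neophyt"),
                              Months = c("1-2", "3-4", "5+")))
n <- sum(obs)

## expected count = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / n

## Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (obs - expected) / sqrt(expected)

## standardized residuals additionally account for the margins
row_prop <- rowSums(obs) / n
col_prop <- colSums(obs) / n
stdres   <- (obs - expected) /
  sqrt(expected * outer(1 - row_prop, 1 - col_prop))

fit <- chisq.test(obs)
sum(pearson^2)   # the chi-square statistic is the sum of squared Pearson residuals
fit$statistic
```

Cells with large absolute standardized residuals (roughly beyond 2) are the ones that contribute most to a significant result, which shows at which flowering durations the two groups differ.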


  1. Thank you for posting this! I have a similar type of data set and these comments/codes were vastly helpful.
    -JW, California, USA

  2. Hi,

    I am unable to access the data you uploaded. Can you check your link please?