theBioBucket*: Use of Classification Trees to Investigate Traits of Invasive Species

Which traits make an alien species invasive?
Due to what traits an alien species becomes established in a foreign flora?

This kind of questions could be analysed by the use of recursive partitioning and classification trees..
(the below example also includes some useful data manipulation techniques)...

data-download*

*data is courtesy of BiolFlor:
Klotz, S., Kühn, I. & Durka, W. [Hrsg.] (2002): BIOLFLOR - Eine Datenbank zu biologisch-ökologischen Merkmalen der Gefäßpflanzen in Deutschland. - Schriftenreihe für Vegetationskunde 38. Bonn: Bundesamt für Naturschutz.
library(coin)
library(zoo)
library(modeltools)
library(sandwich)
library(strucchange)
library(vcd)
library(colorspace)
library(Formula)
library(partykit)

## (!)note: package party should not be loaded at the same time as partykit

## read data, mind to change the path:
dat <- read.csv("E:\\R\\Data\\Alien_Plant_Traits.csv",
                sep = ";")

## delete unneeded variable:
dat$Repr_Literatur <- NULL
dat$Diasporen..bzw..Germinulentyp <- NULL
dat$Heteromorphietyp  <- NULL
dat$Ausprägung <- NULL
dat$Status <- NULL


## source data uses "," as decimal seperator, as R uses
## "." we will change this appropriately:
dat$Länge_MW_mm <- as.numeric(sub(",", ".", dat$Länge_MW_mm))

## explore dataset:
head(dat[,1:10])
str(dat)

# Name                 : species names
# Status               : floristic status; "Neophyt" = alien species, which is the whole set
# Einführung           : mode of introduction
# Einbürg_grad_Max     : level of naturalization
# Lebensform           : lifeform
# Ausprägung           : manifestation of lifeform, "immmer" = always, "manchmal" = sometimes
# Blattanatomie        : leaf anatomy
# Blühdauer_Mon        : flowering duration
# N_Blühphasen         : number of flowering phases
# Länge_MW_mm          : length of diaspore / germinule, "MW_mm" = mean value in milimeters
# Länge_Bezug          : "Germinule" = length given is that of germinule,
#                       "Diaspore" = that of diaspore
# Länge_Bezug_Einschr  : restricted knowledge, maybe erroneus data, "_FALSCH" = false, "_WAHR" = true
# Reproduktionstyp     : reproduction type, "Samen" = seed, "vegetative" = clonal proliferation,
#                       "Samen/Sporen" = both

## ad diaspores / germinules:
## there are 2 cases - (1) a diaspore (= unit of propagation)
## is at the same time the germinule (= unit of seedling germination).
## this is the case when single seeds are propagated.
## (2) the diaspores disintegrates into many germinules.
## this is the cases when fruits or parts of them
## which contain many seeds are propagated.

## length measurments ("Länge_MW_mm") were taken for either
## germinules or diaspores or both (see "Länge_Bezug") and sometimes
## data may be erronous as indicated by the variable "Länge_Bezug_Einschr"
## what i want is the surely correct length-values of each germinules and
## diaspores for each species. therefore i create new variables:

dat1 <-
reshape(direction = "wide", idvar = "Name",
        timevar = "Länge_Bezug_Einschr",
        v.names = "Länge_MW_mm",
        drop =  "Länge_Bezug",
        data = dat)


## there's a warning that tells me i aggregated several variables that
## actually were varying - i solve this by pasting those variables
## into one character string per each species
## (beforehand i have to convert them to character vectors)
dat$Einführung <- as.character(dat$Einführung)
dat$Lebensform <- as.character(dat$Lebensform)
dat$Blattanatomie <- as.character(dat$Blattanatomie)
str(dat)

## make function for pasting multiple character strings of each species
## i want only unique values to be pasted, seperated by a comma:
my.paste <- function(x) {paste(unique(x), collapse = ", ")}
test.vector <- c("a", "a", "b", "c")
my.paste(test.vector)

print(dat2 <-
aggregate(cbind(dat$Einführung, dat$Lebensform, dat$Blattanatomie) ~
          dat$Name, FUN = my.paste))

## reset column names:
colnames(dat2) <- c("Name", "Einführung", "Lebensform", "Blattanatomie")
head(dat2)
## re-bind dataframes, variables with varying entries per species
## falsely aggregated in dat1 will be discarded and instead the
## pasted character strings of dat2, containing all info will be used.
## also variables present in both sub-datasets will be discarded:
head(dat2)

dat3 <- data.frame(dat1[, - which(names(dat1)%in%
                            c("Einführung", "Lebensform", "Blattanatomie"))],
                   dat2)
head(dat3)
## check for consistency (should be 340 species):
## Name-variables of the two sub-datasets should match
head(dat3)
dim(dat3)
dat3$Name == dat3$Name.1


## now i will choose one of the length measurements in the
## dataset - actually i'm interested in propagation (shorter/lighter propagules
## will be distributed easierly than larger/heavy ones) - that said i want
## diaspore weights but i don't have the data for each species.
## as an escape i will use the largest value within the different variables.
## the logic behind this would be that if diaspore = germinule
## the lengths will be the same and a germinule as a part of a diaspore
## will never be longer.
## by choosing the largest value i thus only use the germinule's value
## if the diaspore's value does not exist. this is seldomly the case and the
## length class will very likely still be the same as it would be for the
## diaspore of that plant, if i had the data
## (subsequently i will classify lengths):

dat3$Length_Diasp_new <- rep(NA, nrow(dat3))
dat3[is.na(dat3)] <- 0
for(i in 1:nrow(dat3)) {dat3$Length_Diasp_new[i] <- max(dat3[i, c(6:9)])}

## classify by cut-function:
dat3$Length_Class <- as.ordered(cut(dat3$Length_Diasp_new, c(0, 2, 5, Inf), labels = FALSE))

## i have to convert some character strings to factors
str(dat3)
dat3$Einführung <- as.factor(dat3$Einführung)
dat3$Lebensform <- as.factor(dat3$Lebensform)
dat3$Blattanatomie <- as.factor(dat3$Blattanatomie)
str(dat3)

## number if factor levels is too high for some variables -
## good results will rather be yielded
## if there are not too many levels:

levels(dat3$Lebensform)
## (1) makrophanerophytes will be "Tree"
## (2) chamaephyts, hemiphanerophyts, nanophanerophyts &
##     pseudophanerophyts will be "Shrub"
## (3) therophytes will be "Terr_Herbs_ann" (annual herbs)
## (4) geophytes and hemikryptophytes will be "Terr_Herb_per" (perennial herbs)
## (5) hydrophyts will be "Hydr_Herb"

## sometimes the classification in the original data is unclear -
## i will assign distinct lifeforms following the subsequent rules:
## "Chamaephyt, Therophyt" --> "Terr_Herb_ann"
## "Hemikryptophyt, Therophyt" --> "Terr_Herb_ann"
## "Geophyt, Hemikryptophyt" --> "Terr_Herb_per"
## "Hydrophyt, Geophyt" --> "Hydr_Herb"
## "Hydrophyt, Hemikryptophyt" --> "Hydr_Herb"
## "Makrophanerophyt, Nanophanerophyt" --> "Tree"

dat3$Lifeform <- rep(NA, nrow(dat3))

dat3$Lifeform[dat3$Lebensform == "Makrophanerophyt"] <- "Tree"
dat3$Lifeform[dat3$Lebensform == "Pseudophanerophyt"|
              dat3$Lebensform == "Nanophanerophyt"|
              dat3$Lebensform == "Hemiphanerophyt"|
              dat3$Lebensform == "Chamaephyt"] <- "Scrub"
dat3$Lifeform[dat3$Lebensform == "Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Geophyt"|
              dat3$Lebensform == "Hemikryptophyt"] <- "Terr_Herb_per"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt"] <- "Hydr_Herb"

dat3$Lifeform[dat3$Lebensform == "Chamaephyt, Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Hemikryptophyt, Therophyt"] <- "Terr_Herb_ann"
dat3$Lifeform[dat3$Lebensform == "Geophyt, Hemikryptophyt"] <- "Terr_Herb_per"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt, Geophyt"] <- "Hydr_Herb"
dat3$Lifeform[dat3$Lebensform == "Hydrophyt, Hemikryptophyt"] <- "Hydr_Herb"
dat3$Lifeform[dat3$Lebensform == "Makrophanerophyt, Nanophanerophyt"] <- "Tree"

## convert to factor:
dat3$Lifeform <- as.factor(dat3$Lifeform)
str(dat3)

levels(dat3$Einführung)
## "Begleiter"; "Begleiter, Verwilderte Nutzpflanze";
## "Begleiter, Verwilderte Zierpflanze"; "Transportbegleiter";
## "Saatgutbegleiter"; "Saatgutbegleiter, Verwilderte Nutzpflanze";      
## "Saatgutbegleiter, Verwilderte Zierpflanze" <-- "accompanying"
## "spontan"; "unbekannt"; "spontan, Verwilderte Nutzpflanze" <-- "spontaneous"
## "Verwilderte Nutzpflanze"; "Verwilderte Nutzpflanze, Verwilderte Zierpflanze";
## "Verwilderte Zierpflanze" <-- "cultivated"

dat3$Introduction <- rep(NA, nrow(dat3))
dat3$Introduction[dat3$Einführung == "Begleiter"|
            dat3$Einführung == "Begleiter, Verwilderte Nutzpflanze"|
            dat3$Einführung == "Begleiter, Verwilderte Zierpflanze"|
            dat3$Einführung == "Transportbegleiter"|
            dat3$Einführung == "Saatgutbegleiter"|
            dat3$Einführung == "Saatgutbegleiter, Verwilderte Nutzpflanze"|
            dat3$Einführung == "Saatgutbegleiter, Verwilderte Zierpflanze"] <- "accompanying"        
dat3$Introduction[dat3$Einführung == "spontan"|
            dat3$Einführung == "unbekannt"|
            dat3$Einführung == "spontan, Verwilderte Nutzpflanze"] <- "spontaneous"
dat3$Introduction[dat3$Einführung == "Verwilderte Nutzpflanze, Verwilderte Zierpflanze"|
            dat3$Einführung == "Verwilderte Nutzpflanze"|
            dat3$Einführung == "Verwilderte Zierpflanze"] <- "agric./ornam."

## convert to factor and check:
dat3$Introduction <- as.factor(dat3$Introduction)
str(dat3)

levels(dat3$Blattanatomie)
## leaf-anatomy: i will leaf this one out in the analysis
## as it is partly correlated with lifeforms

levels(dat3$Reproduktionstyp)
## "meist Samen, selten vegetativ" = mostly seed propagation
## "meist vegetativ, selten Samen" = mostly vegetative propagation
## "Samen und vegetativ" = both to same parts
## "Samen/Sporen" = only seeds                
## "vegetativ" = only vegetative
## "meist Samen, selten vegetativ"; "Samen/Sporen" <-- "mostly seeds"
## "meist vegetativ, selten Samen"; "vegetativ"   <-- "mostly vegetative"
## "Samen und vegetativ" <-- "both equally"
dat3$Reproduction <- rep(NA, nrow(dat3))
dat3$Reproduction[dat3$Reproduktionstyp == "meist Samen, selten vegetativ"|
                  dat3$Reproduktionstyp == "Samen/Sporen"] <- "mostly seeds"
dat3$Reproduction[dat3$Reproduktionstyp == "meist vegetativ, selten Samen"|
                  dat3$Reproduktionstyp == "vegetativ"] <- "mostly vegetativ"
dat3$Reproduction[dat3$Reproduktionstyp == "Samen und vegetativ"] <- "both equally"
dat3$Reproduction <- as.factor(dat3$Reproduction)

levels(dat3$Einbürg_grad_Max)
## this will be the dependent (outcome) variable
## i'll also try to use an outcome variable with
## only two levels:
## "Agriophyt" --> "invasive"
## "Epökophyt" & "Ephemerophyt" --> "not invasive"
dat3$Invasivness <- rep(NA, nrow(dat3))
dat3$Invasivness[dat3$Einbürg_grad_Max == "Epökophyt"|
                 dat3$Einbürg_grad_Max == "Ephemerophyt"] <- "not invasive"
dat3$Invasivness[dat3$Einbürg_grad_Max == "Agriophyt"] <- "invasive"
dat3$Invasivness <- as.factor(dat3$Invasivness)
dat3$Invasivness <- relevel(dat3$Invasivness , ref = "not invasive")

## another approach to this would be to define
## alien taxa that are established (in the sense
## that its population persist independently) and
## not established taxa.

dat3$Establ <- rep(NA, nrow(dat3))
dat3$Establ[dat3$Einbürg_grad_Max == "Epökophyt"|
              dat3$Einbürg_grad_Max == "Agriophyt"] <- "establ."
dat3$Establ[dat3$Einbürg_grad_Max == "Ephemerophyt"] <- "not establ."
dat3$Establ <- as.factor(dat3$Establ)
dat3$Establ <- relevel(dat3$Establ, ref = "not establ.")
str(dat3)

print(ct <- ctree(Einbürg_grad_Max ~ Introduction + Lifeform + Length_Class +
                                     Blühdauer_Mon + Reproduction, data = dat3,
                   control = ctree_control(mincriterion = 0.8, minsplit = 10L,
                                           minbucket = 10L)))

plot(ct, gp = gpar(fontsize = 7))


## the first node divides annula herbs from the rest
## thus it might be more representive to recode lifeforms
## into a lifecycle variable:
dat3$Life_Cycle <- rep(NA, nrow(dat3))
dat3$Life_Cycle[grep("_ann", dat3$Lifeform)] <- "annual"
dat3$Life_Cycle[-grep("_ann", dat3$Lifeform)] <- "perennial"
dat3$Life_Cycle <- as.factor(dat3$Life_Cycle)

## re-run:
print(ct <- ctree(Einbürg_grad_Max ~ Introduction + Life_Cycle + Length_Class +
                                     Blühdauer_Mon + Reproduction, data = dat3,
                   control = ctree_control(mincriterion = 0.8, minsplit = 10L,
                                           minbucket = 10L)))

plot(ct, gp = gpar(fontsize = 7))

## check predictions:
data.frame(Species = dat3$Name,
           Naturalization = dat3$Einbürg_grad_Max, predict(ct),
           Correspondence = dat3$Einbürg_grad_Max == predict(ct))

table(dat3$Einbürg_grad_Max == predict(ct))

## interpreting the result:
##
## (1) there is a large probability for agriophytes (= naturalized/invasive)
##     to occure in the group of perennial plants which
##     exhibit both, vegetative and generative reproduction
##
## (2) ephemerophytes most likely are present if plants are annual
##     and plant were introduced intentionally (ornamental, agricultural
##     purpose. within annuals the group of plants that made there way
##     without intentional help of men are likely to have more potential
##     to get established (-> portion of agriophytes and epoekophytes (=
##     naturalized/not-invasive) is larger here).
##
## also the other dependent nominal variables, invasivness & establishment
## could be investigated similarly..
To cite package ‘partykit’ in publications use:

Torsten Hothorn and Achim Zeileis (2011). partykit: A Toolkit for Recursive
Partytioning. R package version 0.1-0/r179.
http://R-Forge.R-project.org/projects/partykit/
theBioBucket*

TABS

20 Sept 2011

Use of Classification Trees to Investigate Traits of Invasive Species

No comments :

Post a Comment