28 Mar 2012

Applying Same Changes to Multiple Dataframes

How to apply the same changes to several dataframes and
save them to CSV:

# a dataframe
a <- data.frame(x = 1:3, y = 4:6)

# make a list of several dataframes, then apply function (change column names, e.g.):
my.list <- list(a, a)
my.list <- lapply(my.list, function(x) {names(x) <- c("a", "b") ; return(x)})

# save dfs to csv with similar lapply-call:
n <- seq_along(my.list)
lapply(n, function(ni) {
               # [[ni]] extracts the dataframe itself, [ni] would return a list of length one
               write.table(my.list[[ni]], file = paste(ni, ".csv", sep = ""),
                           sep = ";", row.names = FALSE)
})
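
If the list elements carry names, the file names can be built from them instead of from the index. The following is only a minimal sketch of that idea: the element names `first` and `second` and the use of `tempdir()` as output folder are my own assumptions for illustration.

```r
# a dataframe, stored twice in a *named* list (names are hypothetical)
a <- data.frame(x = 1:3, y = 4:6)
my.list <- list(first = a, second = a)

# same change as before: rename the columns of every dataframe
my.list <- lapply(my.list, function(x) { names(x) <- c("a", "b"); x })

# build one file path per list element inside the session's temp folder
out.files <- file.path(tempdir(), paste0(names(my.list), ".csv"))

# Map() walks over dataframes and file paths in parallel
invisible(Map(function(df, f) {
  write.table(df, file = f, sep = ";", row.names = FALSE)
}, my.list, out.files))

# read one file back to check the result
check <- read.table(out.files[1], header = TRUE, sep = ";")
```

`Map()` keeps each dataframe paired with its own path, so there is no index to get wrong.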


I'll extend this to a script that reads several files from a directory, applies changes to the files in the same fashion and finally saves files back to the directory (as HERE)

# clean up
rm(list = ls())

# create some files in the working directory:
a <- data.frame(x = 1:3, y = 4:6)
b <- data.frame(x = 10:13, y = 14:15)
write.csv(a, "file1.csv", row.names = F)
write.csv(b, "file2.csv", row.names = F)

# now read all files to list:
# the pattern is a regular expression: escape the dot and anchor it at the end
mycsv <- dir(pattern = "\\.csv$")

n <- length(mycsv)
mylist <- vector("list", n)

for(i in 1:n) mylist[[i]] <- read.csv(mycsv[i])

# now change something in all dfs in list:
mylist <- lapply(mylist, function(x) {names(x) <- c("a", "b") ; return(x)})

# then save back dfs:
for(i in 1:n) {
   # [[i]] extracts the dataframe itself, [i] would return a list of length one
   write.csv(mylist[[i]], file = sub("\\.csv$", "_new.csv", mycsv[i]),
             row.names = FALSE)
}
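
One thing to keep in mind with this kind of round trip: on a second run, a pattern that matches all CSV files would also pick up the `_new.csv` files written the first time. The sketch below is one possible way around that, not the only one. The folder name `csv_demo` inside `tempdir()` and the example data are assumptions made for illustration.

```r
# work in a scratch folder so nothing in the working directory is touched
work.dir <- file.path(tempdir(), "csv_demo")
dir.create(work.dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3, y = 4:6),
          file.path(work.dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(x = 10:13, y = 14:17),
          file.path(work.dir, "file2.csv"), row.names = FALSE)

# list the csv files with full paths, skipping output of earlier runs
in.files <- list.files(work.dir, pattern = "\\.csv$", full.names = TRUE)
in.files <- in.files[!grepl("_new\\.csv$", in.files)]

# read, change, and write back under a new name
dfs <- lapply(in.files, read.csv)
dfs <- lapply(dfs, function(x) { names(x) <- c("a", "b"); x })
out.files <- sub("\\.csv$", "_new.csv", in.files)
for (i in seq_along(dfs)) {
  write.csv(dfs[[i]], out.files[i], row.names = FALSE)
}
```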

26 Mar 2012

How to Extract Citation from a Body of Text

Say you have a text and you want to retrieve the cited names and years of publication. You wouldn't want to do this by hand, would you?

Try the following approach:
(the text sample comes from THIS freely available publication)


(txt <- readLines("http://dl.dropbox.com/u/68286640/Test_Doc.txt"))
[1] "1  Introduction"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
[2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[3] "Climate projections of the Intergovernmental Panel on Climate Change (IPCC) forecast a general increase of seasonal temperatures in the present century across the temperate zone, aggravated by decreasing amounts of summer rainfall in certain regions at lower latitudes (Christensen et al. 2007). These changes imply serious ecological consequences, especially in biome transition zones (Fischlin et al. 2007). Due to their economic importance, as well as their major contribution to supporting, regulating and cultural ecosystem services, predicted changes and shifts in temperate forest ecosystems receive wide public attention. It’s no surprise that dominant forest tree species are frequently modelled in bioclimatic impact studies (e.g., Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006). However, most studies focus on continental-scale effects of climate change, using low resolution climatic and species distribution data. More detailed regional studies focussing on specific endangered regions are also needed (Benito Garzón et al. 2008). Such regional studies have already been prepared for several European regions, including the Swiss Alps (Bolliger et al. 2000), the British Isles (Berry et al. 2002) and the Iberian Peninsula (Benito Garzón et al. 2008)."                                                                                                                    
[4] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[5] "In this study, we aim to (1) identify the limiting macroclimatic factors and to (2) predict the future boundaries of beech (Fagus sylvatica L.) and sessile oak (Quercus petraea (Mattuschka) Liebl.) forests in a region highly vulnerable to climatic extremes. Both tree species form extensive zonal forests throughout Central Europe and reach their low altitude/low latitude, xeric (Mátyás et al. 2009) distributional limits within the forest-steppe biome transition zone of Hungary. The rise of temperature, and especially summer rainfall deficits expected for the twenty-first century, may strongly affect both species. Nevertheless, regarding the potential future distribution of these important forest tree species along their xeric boundaries in Central Europe, there has been no detailed regional analysis before. Experimental studies and field survey data suggest a strong decline in beech regeneration (Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009) and increased mortality rates following prolonged droughts (Berki et al. 2009). Mass mortality and range retraction are potential consequences, which have been already sporadically observed in field survey studies (Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009). With the study, we intend to assist in assessing overall risks, locating potentially affected regions and supporting the formulation of appropriate measures and strategies."
[6] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[7] "Beech and sessile oak forests of Hungary are to a large extent “trailing edge” populations (Hampe and Petit 2005), which should be preferably modelled using specific modelling strategies (Thuiller et al. 2008). Most modelling studies do not differentiate between leading and trailing edges and rely on assumptions and techniques which are intrinsically more appropriate for “leading edge” situations. Being aware of these challenges, we compiled a statistical methodology customized to yield inference on influential variables and providing robust and reliable predictions for climate-dependent populations near their xeric limits. We laid special emphasis on three features in the course of the modelling process: (1) screening of the occurrence data in order to limit modelling to plausible zonal (i.e. macroclimatically determined) occurrences, (2) avoiding pitfalls of statistical pseudoreplication caused by spatial autocorrelation (a problem to which regional distribution modelling studies are particularly prone; Dormann 2007) and (3) simultaneous use of several initial and boundary conditions in an ensemble modelling framework (Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007). "                                                                                                                                                                                                                         

# retrieve text in between parentheses
# (str_extract_all() and str_extract() come from the stringr package):
library(stringr)
extr1 <- unlist(str_extract_all(txt, pattern = "\\(.*?\\)"))

# keep only those elements which have four digit strings (years):
extr2 <- extr1[grep("[0-9]{4}", extr1)]

# extract partial strings starting with uppercase letter (name)
# and end in a four digit string (year):
(str_extract(extr2, "[A-Z].*[0-9]"))
 [1] "Christensen et al. 2007"                                                              
 [2] "Fischlin et al. 2007"                                                                 
 [3] "Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006"
 [4] "Benito Garzón et al. 2008"                                                            
 [5] "Bolliger et al. 2000"                                                                 
 [6] "Berry et al. 2002"                                                                    
 [7] "Benito Garzón et al. 2008"                                                            
 [8] "Mátyás et al. 2009"                                                                   
 [9] "Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009"                     
[10] "Berki et al. 2009"                                                                    
[11] "Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009"                              
[12] "Hampe and Petit 2005"                                                                 
[13] "Thuiller et al. 2008"                                                                 
[14] "Dormann 2007"                                                                         
[15] "Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007"                        

# as proposed by a commenter -
# do this if you want each citation separately:
(unlist(str_extract_all(extr2, "[A-Z].*?[0-9]{4}")))
 [1] "Christensen et al. 2007"   "Fischlin et al. 2007"     
 [3] "Sykes et al. 1996"         "Iverson, Prasad 2001"     
 [5] "Rehfeldt et al. 2003"      "Ohlemüller et al. 2006"   
 [7] "Benito Garzón et al. 2008" "Bolliger et al. 2000"     
 [9] "Berry et al. 2002"         "Benito Garzón et al. 2008"
[11] "Mátyás et al. 2009"        "Czajkowski et al. 2005"   
[13] "Penuelas et al. 2007"      "Lenoir et al. 2009"       
[15] "Berki et al. 2009"         "Jump et al. 2009"         
[17] "Allen et al. 2010"         "Mátyás et al. 2009"       
[19] "Hampe and Petit 2005"      "Thuiller et al. 2008"     
[21] "Dormann 2007"              "Araújo et al. 2005"       
[23] "Araújo and New 2007"       "Beaumont et al. 2007"
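
The same steps also work without stringr, and the single citations can be split into author and year. This is a rough sketch using base R only. The three sentences in `txt` are shortened stand-ins for the sample text, and the object names `parens`, `cites` and `citations` are made up for this example.

```r
# shortened stand-in for the sample text (assumption, not the full document)
txt <- c(
  "Summer rainfall may decrease at lower latitudes (Christensen et al. 2007).",
  "Tree species are often modelled (e.g., Sykes et al. 1996; Iverson, Prasad 2001).",
  "Projections of the Intergovernmental Panel on Climate Change (IPCC) are used."
)

# text in between parentheses (non-greedy, hence perl = TRUE)
parens <- unlist(regmatches(txt, gregexpr("\\(.*?\\)", txt, perl = TRUE)))

# keep only elements that contain a four digit string (year)
parens <- parens[grepl("[0-9]{4}", parens)]

# single citations: from an uppercase letter up to the next four digit string
cites <- unlist(regmatches(parens,
                           gregexpr("[A-Z].*?[0-9]{4}", parens, perl = TRUE)))

# split each citation into author part and year
citations <- data.frame(
  author = sub(" *[0-9]{4}$", "", cites),
  year   = as.integer(sub(".*([0-9]{4})$", "\\1", cites)),
  stringsAsFactors = FALSE
)
```

Having the year as a separate integer column makes it easy to sort the citations or to count them per year with `table(citations$year)`.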

25 Mar 2012

Classification Trees and Spatial Autocorrelation

I'm currently trying to model species presence / absence data (N = 523) that were collected over a geographic area and are possibly spatially autocorrelated. Samples come from preferential sites (elevation > 1200 m above sea level, obligatory presence of permanent waterbodies, etc.). My main goal is to infer which environmental factors determine the occurrence rate of several (amphibian) species and to rule out spatial autocorrelation.

24 Mar 2012

Custom Summary Stats as Dataframe or List

On Stack Overflow I found this useful example of how to apply custom statistics to a dataframe and return the results as a list or dataframe:
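
The general pattern can be sketched as follows. Note that this is not the Stack Overflow example itself, just an illustration of the idea: the function `my_stats`, the choice of statistics and the example data are all assumptions of mine.

```r
# example data (made up for illustration)
d <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))

# a custom set of statistics, returned as a named vector
my_stats <- function(v) c(mean = mean(v), sd = sd(v), n = length(v))

# result as a list: one named vector per column
stats_list <- lapply(d, my_stats)

# result as a dataframe: one row per column, one column per statistic
stats_df <- as.data.frame(t(sapply(d, my_stats)))
```

`sapply()` returns a matrix with the statistics in rows, so `t()` is needed to get one row per variable before converting to a dataframe.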

14 Mar 2012

Creating a Stratified Random Sample of a Dataframe

Expanding on a question on Stack Overflow I'll show how to make a stratified random sample of a certain size:
d <- expand.grid(id = 1:35000, stratum = letters[1:10])

p = 0.1

dsample <- data.frame()

for(i in levels(d$stratum)) {
  dsub <- subset(d, d$stratum == i)          # rows of the current stratum
  B = ceiling(nrow(dsub) * p)                # sample size for this stratum
  dsub <- dsub[sample(1:nrow(dsub), B), ]    # draw B rows without replacement
  dsample <- rbind(dsample, dsub)
}

# size per stratum in resulting df is 10 % of original size:
table(dsample$stratum)
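
Growing a dataframe with `rbind()` inside a loop gets slow when there are many strata. A possible alternative, sketched here with a smaller made-up dataset and a fixed seed (both my assumptions), is to split the dataframe by stratum, sample within each piece and bind the pieces once at the end:

```r
set.seed(1)  # only to make the example reproducible

# smaller example data: 5 strata with 200 rows each
d <- expand.grid(id = 1:200, stratum = letters[1:5])
p <- 0.1

# split by stratum, then sample a fraction p of the rows within each piece
pieces <- lapply(split(d, d$stratum), function(dsub) {
  dsub[sample(nrow(dsub), ceiling(nrow(dsub) * p)), , drop = FALSE]
})

# bind all pieces together in one go
dsample <- do.call(rbind, pieces)

# rows per stratum in the sample
table(dsample$stratum)
```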

13 Mar 2012

R-Function to Read Data from Google Docs Spreadsheets

I used this idea posted on Stack Overflow to plug together a function for reading data from Google Docs spreadsheets into R.