26 Mar 2012

How to Extract Citations from a Body of Text

Say you have a text and you want to retrieve the cited author names and years of publication. You wouldn't want to do this by hand, would you?

Try the following approach:
(the text sample comes from THIS freely available publication)

library(stringr)

(txt <- readLines("http://dl.dropbox.com/u/68286640/Test_Doc.txt"))
[1] "1  Introduction"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
[2] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[3] "Climate projections of the Intergovernmental Panel on Climate Change (IPCC) forecast a general increase of seasonal temperatures in the present century across the temperate zone, aggravated by decreasing amounts of summer rainfall in certain regions at lower latitudes (Christensen et al. 2007). These changes imply serious ecological consequences, especially in biome transition zones (Fischlin et al. 2007). Due to their economic importance, as well as their major contribution to supporting, regulating and cultural ecosystem services, predicted changes and shifts in temperate forest ecosystems receive wide public attention. It’s no surprise that dominant forest tree species are frequently modelled in bioclimatic impact studies (e.g., Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006). However, most studies focus on continental-scale effects of climate change, using low resolution climatic and species distribution data. More detailed regional studies focussing on specific endangered regions are also needed (Benito Garzón et al. 2008). Such regional studies have already been prepared for several European regions, including the Swiss Alps (Bolliger et al. 2000), the British Isles (Berry et al. 2002) and the Iberian Peninsula (Benito Garzón et al. 2008)."                                                                                                                    
[4] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[5] "In this study, we aim to (1) identify the limiting macroclimatic factors and to (2) predict the future boundaries of beech (Fagus sylvatica L.) and sessile oak (Quercus petraea (Mattuschka) Liebl.) forests in a region highly vulnerable to climatic extremes. Both tree species form extensive zonal forests throughout Central Europe and reach their low altitude/low latitude, xeric (Mátyás et al. 2009) distributional limits within the forest-steppe biome transition zone of Hungary. The rise of temperature, and especially summer rainfall deficits expected for the twenty-first century, may strongly affect both species. Nevertheless, regarding the potential future distribution of these important forest tree species along their xeric boundaries in Central Europe, there has been no detailed regional analysis before. Experimental studies and field survey data suggest a strong decline in beech regeneration (Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009) and increased mortality rates following prolonged droughts (Berki et al. 2009). Mass mortality and range retraction are potential consequences, which have been already sporadically observed in field survey studies (Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009). With the study, we intend to assist in assessing overall risks, locating potentially affected regions and supporting the formulation of appropriate measures and strategies."
[6] ""                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
[7] "Beech and sessile oak forests of Hungary are to a large extent “trailing edge” populations (Hampe and Petit 2005), which should be preferably modelled using specific modelling strategies (Thuiller et al. 2008). Most modelling studies do not differentiate between leading and trailing edges and rely on assumptions and techniques which are intrinsically more appropriate for “leading edge” situations. Being aware of these challenges, we compiled a statistical methodology customized to yield inference on influential variables and providing robust and reliable predictions for climate-dependent populations near their xeric limits. We laid special emphasis on three features in the course of the modelling process: (1) screening of the occurrence data in order to limit modelling to plausible zonal (i.e. macroclimatically determined) occurrences, (2) avoiding pitfalls of statistical pseudoreplication caused by spatial autocorrelation (a problem to which regional distribution modelling studies are particularly prone; Dormann 2007) and (3) simultaneous use of several initial and boundary conditions in an ensemble modelling framework (Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007). "                                                                                                                                                                                                                         

# retrieve the text in between parentheses (non-greedy match):
extr1 <- unlist(str_extract_all(txt, pattern = "\\(.*?\\)"))

# keep only those elements which contain a four-digit string (a year):
extr2 <- extr1[grep("[0-9]{4}", extr1)]

# extract the partial strings that start with an uppercase letter (name)
# and end in a four-digit string (year):
(str_extract(extr2, "[A-Z].*[0-9]{4}"))
 [1] "Christensen et al. 2007"                                                              
 [2] "Fischlin et al. 2007"                                                                 
 [3] "Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003; Ohlemüller et al. 2006"
 [4] "Benito Garzón et al. 2008"                                                            
 [5] "Bolliger et al. 2000"                                                                 
 [6] "Berry et al. 2002"                                                                    
 [7] "Benito Garzón et al. 2008"                                                            
 [8] "Mátyás et al. 2009"                                                                   
 [9] "Czajkowski et al. 2005; Penuelas et al. 2007; Lenoir et al. 2009"                     
[10] "Berki et al. 2009"                                                                    
[11] "Jump et al. 2009; Allen et al. 2010; Mátyás et al. 2009"                              
[12] "Hampe and Petit 2005"                                                                 
[13] "Thuiller et al. 2008"                                                                 
[14] "Dormann 2007"                                                                         
[15] "Araújo et al. 2005; Araújo and New 2007; Beaumont et al. 2007"                        

# as proposed by a commenter:
# do this if you want each citation separately (non-greedy match up to the year):
(unlist(str_extract_all(extr2, "[A-Z].*?[0-9]{4}")))
 [1] "Christensen et al. 2007"   "Fischlin et al. 2007"     
 [3] "Sykes et al. 1996"         "Iverson, Prasad 2001"     
 [5] "Rehfeldt et al. 2003"      "Ohlemüller et al. 2006"   
 [7] "Benito Garzón et al. 2008" "Bolliger et al. 2000"     
 [9] "Berry et al. 2002"         "Benito Garzón et al. 2008"
[11] "Mátyás et al. 2009"        "Czajkowski et al. 2005"   
[13] "Penuelas et al. 2007"      "Lenoir et al. 2009"       
[15] "Berki et al. 2009"         "Jump et al. 2009"         
[17] "Allen et al. 2010"         "Mátyás et al. 2009"       
[19] "Hampe and Petit 2005"      "Thuiller et al. 2008"     
[21] "Dormann 2007"              "Araújo et al. 2005"       
[23] "Araújo and New 2007"       "Beaumont et al. 2007"
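
Since the goal was to retrieve names and years, the single citations can be taken one step further and split into an author part and a year part. The following is only a minimal sketch: the vector `cites` is a hypothetical stand-in for the result of the step above, and it assumes every citation ends in its four-digit year.

```r
library(stringr)

# Hypothetical sample of single citations (stand-in for the output above)
cites <- c("Christensen et al. 2007",
           "Iverson, Prasad 2001",
           "Hampe and Petit 2005",
           "Christensen et al. 2007")

# The year is assumed to be the trailing four-digit string
year <- as.integer(str_extract(cites, "[0-9]{4}$"))

# The author part is whatever remains once the trailing year is removed
author <- str_trim(str_replace(cites, "[0-9]{4}$", ""))

# One row per distinct citation
refs <- unique(data.frame(author, year, stringsAsFactors = FALSE))
refs
```

Citations with a letter suffix on the year (such as "2007a") would not match the `$`-anchored pattern and would need a looser regular expression.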


2 comments :

  1. Numbers 3, 9, 11, and 15 have multiple entries. We can also split these up with str_split().
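
The str_split() suggestion could look roughly like this. It is a sketch rather than the commenter's own code: `cites` is a hypothetical stand-in for the cleaned strings from the post, and it assumes that multiple citations are always separated by a semicolon.

```r
library(stringr)

# Hypothetical sample standing in for the cleaned citation strings
cites <- c("Christensen et al. 2007",
           "Sykes et al. 1996; Iverson, Prasad 2001; Rehfeldt et al. 2003")

# Split on the semicolon plus any whitespace after it, then flatten the list
single <- unlist(str_split(cites, ";\\s*"))
single
```

Splitting on the semicolon rather than on the comma keeps entries such as "Iverson, Prasad 2001" in one piece.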
