31 Aug 2012

Follow-Up: Making a Word Cloud for a Search Result from GScholar_Scraper_3.1


Here's a short follow-up on how to produce a word cloud for a search result from GScholarScraper_3.1:






# File-Name: GScholarScraper_3.1.R
# Date: 2012-08-22
# Author: Kay Cichini
# Email: kay.cichini@gmail.com
# Purpose: Scrape Google Scholar search results
# Packages used: XML
# Licence: CC BY-NC-SA
#
# Arguments:
# (1) input:
# A search string as used in Google Scholar search dialog
#
# (2) write:
# Logical, should a table be written to the user's default directory?
# If TRUE, a CSV file with hyperlinks to the publications will be created.
#
# Differences to version 3:
# (3) added "since" argument - the year from which on publications
# should be returned; defaults to 1900.
#
# (4) added "citation" argument - if set to 0, citations are included;
# defaults to 1, i.e. citations are excluded.
# Also added the field "YEAR" to the output.
#
# Caveat: if a submitted search string yields more than 1000 hits there
# seem to be problems (I guess Google stops me for scraping the site..)
#
# And, there is an issue with this error message:
# > Error in htmlParse(URL): 
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured this one out yet.. most likely also a Google blocking
# mechanism.. Reconnecting / getting a new IP address helps.


GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

    require(XML)

    # putting together the search URL:
    URL <- paste("http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=", 
                 citation, "&as_ylo=", since, sep = "")
    cat("\nThe URL used is: ", "\n----\n", paste("* ", URL, " *", sep = ""))
    
    # get content and parse it:
    doc <- htmlParse(URL)
    
    # number of hits:
    h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
    h2 <- strsplit(h1, " ")[[1]][2]  # the second word holds the hit count
    num <- as.integer(gsub("[[:punct:]]", "", h2))  # strip all thousands separators
    cat("\n\nNumber of hits: ", num, "\n----\n", "If this number is far from the returned results\nsomething might have gone wrong..\n\n", sep = "")
    
    # If there are no results, stop and throw an error message:
    if (is.na(num) || num == 0) {
        stop("\n\n...There is no result for the submitted search string!")
    }
    
    pages.max <- ceiling(num/100)
    
    # 'start' as used in URL:
    start <- 100 * 1:pages.max - 100
    
    # Collect URLs as list:
    URLs <- paste("http://scholar.google.com/scholar?start=", start, "&q=", input, 
                  "&num=100&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")
    
    scraper_internal <- function(x) {
        
        doc <- htmlParse(x, encoding="UTF-8")
        
        # titles:
        tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
        
        # publication:
        pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)
        
        # links:
        lin <- xpathSApply(doc, "//h3[@class='gs_rt']/a", xmlAttrs)
        
        # summaries are truncated, and thus won't be used..  
        # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
        # ..to be extended for individual needs
        # suppress warnings from coercing non-year strings to NA:
        options(warn = (-1))
        dat <- data.frame(TITLES = tit, PUBLICATION = pub, 
                          YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
                          LINKS = lin)
        options(warn=0)
        return(dat)
    }

    result <- do.call("rbind", lapply(URLs, scraper_internal))
    if (write == TRUE) {
        # format links as Excel HYPERLINK formulas:
        result$LINKS <- paste("=Hyperlink(", "\"", result$LINKS, "\"", ")", sep = "")
        write.table(result, "GScholar_Output.CSV", sep = ";",
                    row.names = FALSE, quote = FALSE)
        shell.exec("GScholar_Output.CSV")  # opens the CSV; Windows-only
    } else {
        return(result)
    }
}
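
# A possible workaround (just a sketch, not part of the original script) for
# the htmlParse() error mentioned in the header: wrap the call in tryCatch()
# and retry after a pause. 'retry_parse' is a hypothetical helper; inside
# GScholar_Scraper one would call retry_parse(URL) instead of htmlParse(URL):
retry_parse <- function(url, tries = 3, pause = 5) {
    for (i in seq_len(tries)) {
        doc <- tryCatch(htmlParse(url, encoding = "UTF-8"),
                        error = function(e) NULL)
        if (!is.null(doc)) return(doc)  # parsed successfully
        Sys.sleep(pause)                # wait before retrying
    }
    stop("Could not parse ", url, " after ", tries, " attempts")
}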

# EXAMPLE:

input <- "allintitle:amphibian+diversity"
df <- GScholar_Scraper(input, since = 1980, citation = 1)
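
# quick sanity check (not in the original post): inspect the scraped
# data before building the corpus
str(df)
head(df$TITLES)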

#install.packages("tm")
library(tm)

#install.packages("wordcloud")
library(wordcloud)

corpus <- Corpus(VectorSource(df$TITLES))

# remove stopwords and the tags Google Scholar prepends to titles
# (e.g. [PDF], [BOOK], [CITATION]):
corpus <- tm_map(corpus, function(x)
    removeWords(x, c(stopwords(), "PDF", "B", "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove words containing numbers (using grepl, so nothing is dropped
# when there are no matches):
d <- d[!grepl("[0-9]", d$word), ]

# print wordcloud:
wordcloud(d$word, d$freq)
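
# optional variant (a sketch, not from the original post): a colored word
# cloud via RColorBrewer, which is installed as a dependency of wordcloud;
# min.freq = 2 drops words that occur only once
library(RColorBrewer)
wordcloud(d$word, d$freq, min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))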


5 comments:

  1. Awesome, thank you.

    Going to attempt to build a hist of authors given an intitle search.

    Endless thesis procrastination ahead!

  2. Yes, but what use is this? You put in the words you want to find, so of course they will come out the way they did.

    Replies
    1. You simply see which words / topics frequently co-occur with the topic (search string) in focus. For the example above you see that words like forest, distribution or Madagascar frequently co-occur in titles with the words amphibian & diversity. The word Europe, on the other hand, is completely missing, from which I deduce that this region has never been a focus of research on amphibian diversity.

  3. Hi,
    I tried to replicate the result of your script but I got the following error:
    "failed to load external entity "http://scholar.google.com/scholar?q=allintitle:...."
    Could it be a proxy-related issue?
    Anyway, thanks for sharing.

    Replies
    1. Sorry, I've got no idea - everything's running properly on my setup.

      sessionInfo()
      R version 2.15.2 (2012-10-26)
      Platform: i386-w64-mingw32/i386 (32-bit)

      locale:
      [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
      [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
      [5] LC_TIME=German_Austria.1252

      attached base packages:
      [1] stats graphics grDevices utils datasets methods base

      other attached packages:
      [1] wordcloud_2.0 RColorBrewer_1.0-5 Rcpp_0.9.13 tm_0.5-7.1
      [5] XML_3.9-4.1 devtools_0.8

      loaded via a namespace (and not attached):
      [1] digest_0.5.2 evaluate_0.4.2 httr_0.2 memoise_0.1
      [5] parallel_2.15.2 plyr_1.7.1 RCurl_1.95-3 slam_0.1-24
      [9] stringr_0.6.1 tools_2.15.2 whisker_0.1
