31 Aug 2012

Follow-Up: Making a Word Cloud for a Search Result from GScholar_Scraper_3.1


Here's a short follow-up on how to produce a word cloud for a search result from GScholarScraper_3.1:






# File-Name: GScholarScraper_3.1.R
# Date: 2012-08-22
# Author: Kay Cichini
# Email: kay.cichini@gmail.com
# Purpose: Scrape Google Scholar search results
# Packages used: XML
# Licence: CC BY-NC-SA
#
# Arguments:
# (1) input:
# A search string as used in Google Scholar search dialog
#
# (2) write:
# Logical, should a table be written to the user's default directory?
# If TRUE, a CSV file with hyperlinks to the publications will be created.
#
# Differences to version 3:
# (3) added "since" argument - the year from which on publications
# should be returned; defaults to 1900.
#
# (4) added "citation" argument - if set to 0, citations are included;
# defaults to 1, i.e. citations are excluded.
# Also added the field "YEAR" to the output.
#
# Caveat: if a submitted search string yields more than 1000 hits there
# seem to be problems (I guess Google stops me for scraping the site..)
#
# And, there is an issue with this error message:
# > Error in htmlParse(URL): 
# > error in creating parser for http://scholar.google.com/scholar?q
# I haven't figured this one out yet.. most likely also a Google blocking
# mechanism.. Reconnecting / getting a new IP address helps.


GScholar_Scraper <- function(input, since = 1900, write = F, citation = 1) {

    require(XML)

    # putting together the search URL:
    URL <- paste("http://scholar.google.com/scholar?q=", input, "&as_sdt=1,5&as_vis=", 
                 citation, "&as_ylo=", since, sep = "")
    cat("\nThe URL used is: ", "\n----\n", paste("* ", URL, " *", sep = ""))
    
    # get content and parse it:
    doc <- htmlParse(URL)
    
    # number of hits:
    h1 <- xpathSApply(doc, "//div[@id='gs_ab_md']", xmlValue)
    h2 <- strsplit(h1, " ")[[1]][2]  # the second word holds the hit count
    num <- as.integer(gsub("[[:punct:]]", "", h2))  # strip all thousands separators
    cat("\n\nNumber of hits: ", num, "\n----\n", "If this number is far from the returned results\nsomething might have gone wrong..\n\n", sep = "")
    
    # If there are no results, stop and throw an error message:
    if (is.na(num) || num == 0) {
        stop("\n\n...There is no result for the submitted search string!")
    }
    
    pages.max <- ceiling(num/100)
    
    # 'start' as used in URL:
    start <- 100 * 1:pages.max - 100
    
    # Collect URLs as list:
    URLs <- paste("http://scholar.google.com/scholar?start=", start, "&q=", input, 
                  "&num=100&as_sdt=1,5&as_vis=", citation, "&as_ylo=", since, sep = "")
    
    scraper_internal <- function(x) {
        
        doc <- htmlParse(x, encoding="UTF-8")
        
        # titles:
        tit <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
        
        # publication:
        pub <- xpathSApply(doc, "//div[@class='gs_a']", xmlValue)
        
        # links:
        lin <- xpathSApply(doc, "//h3[@class='gs_rt']/a", xmlAttrs)
        
        # summaries are truncated, and thus won't be used..  
        # abst <- xpathSApply(doc, "//div[@class='gs_rs']", xmlValue)
        # ..to be extended for individual needs
        # suppress warnings from coercing non-year strings to NA:
        options(warn = (-1))
        dat <- data.frame(TITLES = tit, PUBLICATION = pub, 
                          YEAR = as.integer(gsub(".*\\s(\\d{4})\\s.*", "\\1", pub)),
                          LINKS = lin)
        options(warn=0)
        return(dat)
    }

    result <- do.call("rbind", lapply(URLs, scraper_internal))
    if (write == TRUE) {
        # format links as Excel HYPERLINK formulas:
        result$LINKS <- paste("=Hyperlink(", "\"", result$LINKS, "\"", ")", sep = "")
        write.table(result, "GScholar_Output.CSV", sep = ";",
                    row.names = FALSE, quote = FALSE)
        shell.exec("GScholar_Output.CSV")  # opens the CSV; Windows-only
    } else {
        return(result)
    }
}
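
# A possible workaround (just a sketch, not part of the original script) for
# the htmlParse() error mentioned in the header: wrap the call in tryCatch()
# and retry after a pause. 'retry_parse' is a hypothetical helper; inside
# GScholar_Scraper one would call retry_parse(URL) instead of htmlParse(URL):
retry_parse <- function(url, tries = 3, pause = 5) {
    for (i in seq_len(tries)) {
        doc <- tryCatch(htmlParse(url, encoding = "UTF-8"),
                        error = function(e) NULL)
        if (!is.null(doc)) return(doc)  # parsed successfully
        Sys.sleep(pause)                # wait before retrying
    }
    stop("Could not parse ", url, " after ", tries, " attempts")
}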

# EXAMPLE:

input <- "allintitle:amphibian+diversity"
df <- GScholar_Scraper(input, since = 1980, citation = 1)
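
# quick sanity check (not in the original post): inspect the scraped
# data before building the corpus
str(df)
head(df$TITLES)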

#install.packages("tm")
library(tm)

#install.packages("wordcloud")
library(wordcloud)

corpus <- Corpus(VectorSource(df$TITLES))

# remove stopwords and the tags Google Scholar prepends to titles
# (e.g. [PDF], [BOOK], [CITATION]):
corpus <- tm_map(corpus, function(x)
    removeWords(x, c(stopwords(), "PDF", "B", "DOC", "HTML", "BOOK", "CITATION")))
corpus <- tm_map(corpus, removePunctuation)
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

# remove words containing numbers (using grepl, so nothing is dropped
# when there are no matches):
d <- d[!grepl("[0-9]", d$word), ]

# print wordcloud:
wordcloud(d$word, d$freq)
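
# optional variant (a sketch, not from the original post): a colored word
# cloud via RColorBrewer, which is installed as a dependency of wordcloud;
# min.freq = 2 drops words that occur only once
library(RColorBrewer)
wordcloud(d$word, d$freq, min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))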


5 comments:

  1. Awesome, thank you.

    Going to attempt to build a hist of authors given an intitle search.

    Endless thesis procrastination ahead!

  2. Yes, but what use is this? You put in the words you want to find, so of course they will come out the way they did.

    Replies
    1. You simply see which words / topics frequently co-occur with the topic (search string) in focus. For the example above you see that words like forest, distribution or Madagascar frequently co-occur in titles with the words amphibian & diversity. The word Europe, on the other hand, is completely missing, from which I deduce that this region has never been a focus of research on amphibian diversity.

  3. Hi,
    I tried to replicate the result of your script but I got the following error:
    "failed to load external entity "http://scholar.google.com/scholar?q=allintitle:...."
    Could it be a proxy-related issue?
    Anyway, thanks for sharing.

    Replies
    1. Sorry, I've got no idea - everything's running properly on my setup.

      sessionInfo()
      R version 2.15.2 (2012-10-26)
      Platform: i386-w64-mingw32/i386 (32-bit)

      locale:
      [1] LC_COLLATE=German_Austria.1252 LC_CTYPE=German_Austria.1252
      [3] LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
      [5] LC_TIME=German_Austria.1252

      attached base packages:
      [1] stats graphics grDevices utils datasets methods base

      other attached packages:
      [1] wordcloud_2.0 RColorBrewer_1.0-5 Rcpp_0.9.13 tm_0.5-7.1
      [5] XML_3.9-4.1 devtools_0.8

      loaded via a namespace (and not attached):
      [1] digest_0.5.2 evaluate_0.4.2 httr_0.2 memoise_0.1
      [5] parallel_2.15.2 plyr_1.7.1 RCurl_1.95-3 slam_0.1-24
      [9] stringr_0.6.1 tools_2.15.2 whisker_0.1
