20 Jan 2014

Get No. of Google Search Hits with R and XML

UPDATE: Thanks to Max Ghenis for updating the R script I wrote a while back. The script below works again for pulling the number of hits from Google Search.

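GoogleHits <- function(input)
   {
    # XML parses the result page, RCurl fetches it over HTTPS
    library(XML)
    library(RCurl)
    # Wrap the search term in double quotes for an exact-phrase search
    url <- paste("https://www.google.com/search?q=\"",
                 input, "\"", sep = "")
 
    # Path to the CA certificate bundle shipped with RCurl
    CAINFO <- paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
    doc <- htmlParse(script)
    # The hit count sits in the div with id "resultStats"
    res <- xpathSApply(doc, '//*/div[@id="resultStats"]', xmlValue)
    cat(paste("\nYour Search URL:\n", url, "\n", sep = ""))
    cat("\nNo. of Hits:\n")
    # as.numeric, not as.integer: counts above 2,147,483,647 overflow to NA as integers
    return(as.numeric(gsub("[^0-9]", "", res)))
   }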
 
# Example (spaces in the search term are written as "+"):
GoogleHits("R+Statistical+Software")
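
The final gsub() step in GoogleHits keeps every digit in the matched text. If that text also carries other numbers, such as a query time like "(0.45 seconds)", those digits would end up in the result. Below is a minimal sketch of a stricter extraction, assuming the hit count is the first number in the string. parse_hits is a hypothetical helper name, and the sample strings are made up for illustration.

```r
# Hypothetical helper: pull only the first number out of a result-stats string.
parse_hits <- function(stats_text) {
  # Nothing matched by the XPath (or more than one node): report NA
  if (length(stats_text) != 1) return(NA_real_)
  # First run of digits, allowing "," or "." as thousands separators
  m <- regmatches(stats_text, regexpr("[0-9][0-9.,]*", stats_text))
  if (length(m) == 0) return(NA_real_)
  # Drop the separators; as.numeric avoids integer overflow on large counts
  as.numeric(gsub("[^0-9]", "", m))
}

# Made-up sample strings, not real Google output
parse_hits("About 1,234,567 results (0.45 seconds)")
parse_hits("About 3,120,000,000 results")
parse_hits("")
```

Inside GoogleHits, the last line could then read return(parse_hits(res)), provided the assumption about the string layout holds for the page Google actually returns.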
P.S.: If you run this in an automated fashion, like:
lapply(list_of_search_terms, GoogleHits)
Google will block you after about the 300th request!
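
One way to be gentler on Google is to pause between requests. The sketch below is only an illustration: slow_lapply and its delay parameters are hypothetical names, the delay values are guesses, and nothing here is known to prevent a block.

```r
# Hypothetical wrapper: apply `fun` to each term, pausing between calls.
slow_lapply <- function(terms, fun, min_delay = 2, max_delay = 5) {
  results <- vector("list", length(terms))
  names(results) <- terms
  for (i in seq_along(terms)) {
    # tryCatch keeps one failed request from aborting the whole run;
    # assigning via list() preserves the slot even if fun() returns NULL
    results[i] <- list(tryCatch(fun(terms[[i]]), error = function(e) NA))
    # Random pause between calls, skipped after the last term
    if (i < length(terms)) Sys.sleep(runif(1, min_delay, max_delay))
  }
  results
}

# Offline demo with a stand-in function and no delay
demo <- slow_lapply(c("alpha", "beta"), toupper, min_delay = 0, max_delay = 0)
```

With the function from this post, the call would be slow_lapply(list_of_search_terms, GoogleHits), where list_of_search_terms is a character vector of search terms.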

9 comments:

  1. Great script! Any way to get around Google's block after the 300th request?

  2. I guess Google's output has changed.
    To make it work, I had to remove [[2]] from the following command:
    res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]

  3. Google's output changed again; the path needs to be changed to '//*/div[@id="resultStats"]'

  4. Also I don't believe stringr is needed

    Replies
    1. of course (just forgot to remove the line..)

  5. Hi, I'm new to R. Is there a way to do the same but filter the Google search by specific date intervals?

    Replies
    1. Yes, just go to Google's Advanced Search, set a time interval for your query, then take the URL Google creates and use it with the R function!

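
Since GoogleHits builds its own URL from a search term, using a URL copied from the browser needs a small variant that starts from the full URL. The sketch below assumes the same fetch-and-parse steps as the post. build_search_url and hits_from_url are hypothetical names, and the tbs date-range string is only an example of what a copied URL might contain; take the real parameters from the URL your own browser shows.

```r
# Hypothetical helper: quoted-phrase search URL plus extra parameters
# copied from a browser URL (e.g. a date filter).
build_search_url <- function(term, extra = "") {
  paste0("https://www.google.com/search?q=",
         utils::URLencode(paste0('"', term, '"'), reserved = TRUE),
         extra)
}

# Hypothetical variant of GoogleHits() that takes a complete URL.
# Needs the RCurl and XML packages when called.
hits_from_url <- function(url) {
  cainfo <- paste(system.file(package = "RCurl"),
                  "/CurlSSL/ca-bundle.crt", sep = "")
  page <- RCurl::getURL(url, followlocation = TRUE, cainfo = cainfo)
  doc <- XML::htmlParse(page)
  res <- XML::xpathSApply(doc, '//*/div[@id="resultStats"]', XML::xmlValue)
  as.numeric(gsub("[^0-9]", "", res))
}

# Assumed example of a date-range parameter; verify against your own URL
date_filter <- "&tbs=cdr:1,cd_min:1/1/2013,cd_max:12/31/2013"
url_2013 <- build_search_url("R statistical software", date_filter)
```

Calling hits_from_url(url_2013) would then fetch and parse that page, subject to the same blocking and page-layout caveats as the original function.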