9 Nov 2011

R-Function GScholarScraper to Webscrape Google Scholar Search Results

NOTE: You'll find the update HERE and HERE.

NOTE: The script is currently not working because the code of the Google Scholar site has changed...
I'll look into this as soon as I find some spare time!

NOTE: If you try to access Google Scholar programmatically, consider these words of caution:
http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/7587994#7587994
...

Based on my previous post on Web Scraping, I coded and uploaded the function "GScholarScraper" HERE for testing!
The function will pull all (!) results, processing pages in chunks of 100 results/titles, and return a file with all titles, links, etc. It will also produce a word cloud using the words in the publication titles.
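
The word-cloud step boils down to something like the sketch below (the 'titles' vector is just a placeholder for the scraped publication titles; the exact arguments used in the uploaded script may differ):

[sourcecode language="r"]
library(wordcloud)

# placeholder for the character vector of scraped publication titles:
titles <- c("Amphibian species richness in tropical ponds",
            "Diversity and richness of amphibian assemblages")

# split the titles into lowercase words and drop very short ones:
words <- tolower(unlist(strsplit(titles, "[^[:alpha:]]+")))
words <- words[nchar(words) > 3]

# tabulate word frequencies and draw the cloud:
freqs <- sort(table(words), decreasing = TRUE)
wordcloud(names(freqs), as.integer(freqs), min.freq = 1, random.order = FALSE)
[/sourcecode]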

Please try your own search strings and report errors, etc.!

Built and runs properly under:
R version 2.13.0 (2011-04-13) and R version 2.13.2 (2011-09-30)

Platform: i386-pc-mingw32/i386 (32-bit) locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringr_0.5 tm_0.5-6 wordcloud_1.2 Rcpp_0.9.7

loaded via a namespace (and not attached):
[1] plyr_1.5.1 slam_0.1-23

PS: Errors reported lately (see comments) were resolved and the source code was updated.

16 comments:

  1. Interesting, I'm going to have to find some time to try your function out! As an exercise in XPath, I used the XML package to scrape information off Google Scholar and return it as a data frame. You've gone the regular expression path, which I'll have to try and find some time to understand (looks good though!)

    Running the script, I hit an error:

    [sourcecode language="r"]
    search.str <- "allintitle:+amphibians+richness+OR+diversity"

    url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str, "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1", sep = "")

    webpage <- readLines(url, warn = F)
    html_str <- paste(webpage, collapse="\n")

    #
    [/sourcecode]

    However I think the following will do the same thing:
    [sourcecode language="r"]
    library(RCurl)
    html <- getURL(url)
    [/sourcecode]

    Tested on R 2.14.0 on Ubuntu 11.10 x64
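
    For reference, a minimal sketch of that XPath idea - note that the 'gs_rt' class name is an assumption about the result-page layout, which Google has changed over time:

    [sourcecode language="r"]
    library(RCurl)
    library(XML)

    html <- getURL(url)
    doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

    # pull the title text of each search result heading
    # ('gs_rt' is an assumed class name - adjust to the current page source)
    titles <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
    df <- data.frame(title = titles, stringsAsFactors = FALSE)
    free(doc)
    [/sourcecode]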

  2. When I try to run the script I get the message "Error in substring(string, start, end) : invalid multibyte string at ..."

  3. This is a usefool / fun project

    but simply running your code with your example I'm getting this error:


    Error in substring(string, start, end) :
    invalid multibyte string at ' * Wi<6c>dlife <

  4. Tony,
    I read your post - funny that we came up with the same thing on the very same day!
    I replaced readLines() with getURL() - thanks for reporting this!

    Sean,
    Sorry, at the moment I cannot reproduce the substring() error. I'll need to see what it means.

    Anonymous,
    it was fun for me to do - nevertheless, the purpose is evident even without the fun: in scientific research it's vital to know what is going on, that is, to know what's being published!
    My tool, or similar ones, could be very use*ful* in this regard!

  5. I am very new to this. I need to "scrape?" the names of articles, author-journal-date (is it possible to separate them?), and the number of citations into a spreadsheet (I guess XML?).
    I was not able to find any GScholarScraper, and I still wonder where I should put the code.
    Please help!

  6. lol, yeah I thought it was funny too :) Although I personally prefer the XPath approach using the XML package, I've learned quite a bit from your code about stringr which, to me at least, looks like a really cool package for manipulating text strings. Good work!

  7. Anonymous,
    I'm afraid you'll need some R first: see, e.g., the links here: http://thebiobucket.blogspot.com/p/starter_19.html

    Tony,
    XML is preferable because it is more systematic than my picking from strings. One drawback of your function is that it only retrieves results from the first page - but I guess there is a way to solve that. Maybe I will remix yours and mine when there is time for it!

  8. I also hit the substring error. Try this: Sys.setlocale(locale="C")
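
    For example, something along these lines (here restricting the switch to LC_CTYPE, the category the multibyte error comes from, and restoring the old setting afterwards):

    [sourcecode language="r"]
    # remember the current character-type locale, switch to "C" for the
    # scraping, then restore the old setting when done
    old_ctype <- Sys.getlocale("LC_CTYPE")
    Sys.setlocale("LC_CTYPE", "C")
    # ... run the scraper / readLines()/getURL() calls here ...
    Sys.setlocale("LC_CTYPE", old_ctype)
    [/sourcecode]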
    Cheers, M

  9. @Kay One way around that with my function would be to supply a vector of Google Scholar URLs as follows:

    df <- do.call("rbind", lapply(gs.urls, get_google_scholar_df))

    This would produce an aggregate data frame of the information from all the pages provided.
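
    ...where gs.urls could be built following the same URL pattern as above, e.g. for the first ten chunks of 100 results (just an illustration - the number of chunks would come from the reported result count):

    [sourcecode language="r"]
    search.str <- "allintitle:+amphibians+richness+OR+diversity"

    # one URL per chunk of 100 results (start = 0, 100, 200, ...)
    gs.urls <- paste("http://scholar.google.com/scholar?start=", seq(0, 900, by = 100),
                     "&q=", search.str,
                     "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                     sep = "")
    [/sourcecode]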

    I really like the word clouds you've produced (I had no idea how to do those before). What I might do is take your function and modify it to accept one of my scraping functions, to make it work not only with Google Scholar but also with search results from websites like bing.com, yahoo.com and google.com, just for fun :)

  10. @Kay - I think you could do this for the webpages bit, getting rid of the for loop entirely.

    get_GS_pages <- function(search.str) {
      require(RCurl)
      require(stringr)

      # Initial URL - i.e. we're using URLs like:
      # http://scholar.google.com/scholar?start=0&q=allintitle:+amphibians+richness+OR+diversity&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1
      url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str,
                   "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                   sep = "")
      html_str <- getURL(url)

      # Find the html placeholders (2 alternatives!) for the number of results
      # and pull that number.
      # (!) Strangely, Google Scholar gives different numbers of results
      # depending on the start value, e.g. a change from 900 to 980 results
      # when changing start = 0 to start = 800.
      match_no.res <- str_match(html_str, "Results 1 - (.*?) of ([0-9.,]+)")
      no.res <- match_no.res[1, max(dim(match_no.res))]

      # if that placeholder was not found, try the alternative wording:
      if (length(no.res) == 0 | is.na(no.res)) {
        match_no.res <- str_match(html_str, "Results 1 - (.*?) of about ([0-9.,]+)")
        no.res <- match_no.res[1, max(dim(match_no.res))]
      }

      # Remove punctuation (Google uses decimal commas):
      no.res <- as.integer(gsub("[[:punct:]]", "", no.res))

      # If there are no results, stop and throw an error message:
      if (length(no.res) == 0 | is.na(no.res)) {
        stop("\n\n...There is no result for the submitted search string!")
      }

      # Define the number of result pages to be retrieved subsequently.
      # pages.max = maximum number of pages (chunks of 100 results each).
      # As noted above, no.res varies depending on the start value;
      # however, we use ceiling() and the change is very unlikely to be
      # greater than 100, so we add one extra page to be safe:
      pages.max <- ceiling(no.res/100) + 1

      # "start", as used in the URL, defines the i-th result a page begins with.
      # start = 0 was already used above, so we need 100, 200, ...
      start <- 100 * 1:(pages.max - 1)

      # Collect the webpages: the first page was already retrieved above and is
      # prepended to the result; the rest are fetched with lapply() below
      # (no for loop needed):
      urls <- paste("http://scholar.google.com/scholar?start=", start,
                    "&q=", search.str,
                    "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                    sep = "")
      webpages <- lapply(urls, getURL)

      # return all webpages
      return(c(html_str, webpages))
    }

    search.str <- "allintitle:+amphibians+richness+OR+diversity"
    webpages <- get_GS_pages(search.str)

  11. Tony,
    many thanks for your worthy comments! Your solution ("supplying a vector of URLs") sounds perfect!

    You see, I'm a lousy programmer - apply is not my friend yet - but I hope it will be soon.

    Many thanks for the hint:
    webpages <- lapply(urls, getURL)
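
    ...which, if I got it right, does the same job as the for loop I would have written:

    [sourcecode language="r"]
    # explicit-loop equivalent of lapply(urls, getURL)
    webpages <- vector("list", length(urls))
    for (i in seq_along(urls)) {
      webpages[[i]] <- getURL(urls[i])
    }
    [/sourcecode]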

    @Anonymous (M),
    many thanks for the tip with the locale!!

  12. Kay,
    If it's any comfort, it took me quite a while to understand how to use the *apply family of functions, but they're quite easy once you get the hang of them!

    Have you thought about putting your code on GitHub? I've just set up an account about an hour ago and it's quite impressive.

    Here's my hack of your function so far:

    https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/get_google_scholar_webpages.R

    Eventually, if I get time, I'll incorporate everything your function does. I hope it's OK that I've basically copied and pasted it (I give credit in the file)? :)

  13. Sure, you're welcome - I'm itching to see how you pimped my function!

  14. Kay, I made an XPath version of your function, now called GScholarXScraper. Full code is here:

    https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper.R

    I'll write a blog post about it in the next few days; I just wanted to let you see what I'd done with it and ask if you have any comments. I wanted to make sure I gave credit correctly - hopefully I did!

    This was fun, cheers for making your code public :)

  15. Tony,
    Nicely done! I ran some search strings and it looks good to me - I can't really comment on the code, as all the XPath stuff is beyond my horizon.

    For the searches I tried I got the same results (as far as I can tell) as with my function.

    One thing: maybe the original commentary should be adapted in some places?

  16. Kay

    Glad it worked for you too! It also looked the same to me when I compared your 'Regular Expression' approach against my XPath approach (with stem = FALSE).

    At your suggestion, I have adapted a couple of the original comments to reflect some of the changes I made, such as replacing a for loop with a vectorised alternative (I should have done that before - thanks for pointing it out!)

    BTW, I notice that you are using Google Docs as a way to "group-edit" the R scripts you've produced. Have you thought about GitHub instead? I'm still very new to it but, as I understand it, group editing is one of its features.
