9 Nov 2011

R-Function GScholarScraper to Webscrape Google Scholar Search Results

NOTE: You'll find the update HERE and HERE.

NOTE: The script is currently not working because the code of the Google Scholar site has changed...
I'll look into this as soon as I find some spare time!

NOTE: If you try to access Google Scholar programmatically, consider these words of caution:
http://stackoverflow.com/questions/7523961/google-scholar-with-matlab/7587994#7587994
...

Based on my previous post on Web Scraping, I coded and uploaded the function "GScholarScraper" HERE for testing!
The function will pull all (!) results, processing pages in chunks of 100 results/titles, and return a file with all titles, links, etc. It will also produce a word cloud using the words in the publication titles.
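
The word-cloud step boils down to something like the sketch below (the 'titles' vector is just a placeholder for the scraped publication titles; the exact arguments used in the uploaded script may differ):

[sourcecode language="r"]
library(wordcloud)

# placeholder for the character vector of scraped publication titles:
titles <- c("Amphibian species richness in tropical ponds",
            "Diversity and richness of amphibian assemblages")

# split the titles into lowercase words and drop very short ones:
words <- tolower(unlist(strsplit(titles, "[^[:alpha:]]+")))
words <- words[nchar(words) > 3]

# tabulate word frequencies and draw the cloud:
freqs <- sort(table(words), decreasing = TRUE)
wordcloud(names(freqs), as.integer(freqs), min.freq = 1, random.order = FALSE)
[/sourcecode]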

Please try your own search strings and report errors, etc.!

Built and runs properly under:
R version 2.13.0 (2011-04-13) and R version 2.13.2 (2011-09-30)

Platform: i386-pc-mingw32/i386 (32-bit) locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringr_0.5 tm_0.5-6 wordcloud_1.2 Rcpp_0.9.7

loaded via a namespace (and not attached):
[1] plyr_1.5.1 slam_0.1-23

PS: Errors reported lately (see comments) were resolved and the source code was updated.

16 comments:

  1. Interesting, I'm going to have to find some time to try your function out! As an exercise in XPath, I used the XML package to scrape information off Google Scholar and return it as a data frame. You've gone the regular expression path, which I'll have to try and find some time to understand (looks good though!)

    Running the script, I hit an error:

    [sourcecode language="r"]
    search.str <- "allintitle:+amphibians+richness+OR+diversity"

    url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str, "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1", sep = "")

    webpage <- readLines(url, warn = F)
    html_str <- paste(webpage, collapse="\n")

    #
    [/sourcecode]

    However I think the following will do the same thing:
    [sourcecode language="r"]
    library(RCurl)
    html <- getURL(url)
    [/sourcecode]

    Tested on R 2.14.0 on Ubuntu 11.10 x64
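
    For reference, a minimal sketch of that XPath idea - note that the 'gs_rt' class name is an assumption about the result-page layout, which Google has changed over time:

    [sourcecode language="r"]
    library(RCurl)
    library(XML)

    html <- getURL(url)
    doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

    # pull the title text of each search result heading
    # ('gs_rt' is an assumed class name - adjust to the current page source)
    titles <- xpathSApply(doc, "//h3[@class='gs_rt']", xmlValue)
    df <- data.frame(title = titles, stringsAsFactors = FALSE)
    free(doc)
    [/sourcecode]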

  2. When I try to run the script I get the message "Error in substring(string, start, end) : invalid multibyte string at ..."

  3. This is a usefool / fun project

    but simply running your code with your example I'm getting this error:


    Error in substring(string, start, end) :
    invalid multibyte string at ' * Wi<6c>dlife <

  4. Tony,
    I read your post - funny that we came up with the same thing on the very same day!
    I replaced readLines() with getURL() - thanks for reporting this!

    Sean,
    Sorry, at the moment I cannot reproduce the substring() error. I'll need to see what it means.

    Anonymous,
    it was fun for me to do - nevertheless, the purpose is evident even without the fun: in scientific research it's vital to know what is going on, that is, to know what's being published!
    My tool, or similar ones, could be very use*ful* in this regard!

  5. I am very new to this. I need to "scrape?" the names of articles, author-journal-date (is it possible to separate them?), and the number of citations into a spreadsheet (I guess XML?).
    I was not able to find any GScholarScraper, and I still wonder where I should put the code.
    Please help!

  6. lol, yeah I thought it was funny too :) Although I personally prefer the XPath approach using the XML package, I've learned quite a bit from your code about stringr which, to me at least, looks like a really cool package for manipulating text strings. Good work!

  7. Anonymous,
    I'm afraid you'll need some R first: see, e.g., the links here: http://thebiobucket.blogspot.com/p/starter_19.html

    Tony,
    XML is preferable because it is more systematic than my picking from strings. One drawback of your function is that it only retrieves results from the first page - but I guess there is a way to solve that. Maybe I will remix yours and mine when there is time for it!

  8. I also hit the substring error. Try this: Sys.setlocale(locale="C")
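
    For example, something along these lines (here restricting the switch to LC_CTYPE, the category the multibyte error comes from, and restoring the old setting afterwards):

    [sourcecode language="r"]
    # remember the current character-type locale, switch to "C" for the
    # scraping, then restore the old setting when done
    old_ctype <- Sys.getlocale("LC_CTYPE")
    Sys.setlocale("LC_CTYPE", "C")
    # ... run the scraper / readLines()/getURL() calls here ...
    Sys.setlocale("LC_CTYPE", old_ctype)
    [/sourcecode]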
    Cheers, M

  9. @Kay One way around that with my function would be to supply a vector of Google Scholar URLs as follows:

    df <- do.call("rbind", lapply(gs.urls, get_google_scholar_df))

    This would produce an aggregate data frame of the information from all the pages provided.
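
    ...where gs.urls could be built following the same URL pattern as above, e.g. for the first ten chunks of 100 results (just an illustration - the number of chunks would come from the reported result count):

    [sourcecode language="r"]
    search.str <- "allintitle:+amphibians+richness+OR+diversity"

    # one URL per chunk of 100 results (start = 0, 100, 200, ...)
    gs.urls <- paste("http://scholar.google.com/scholar?start=", seq(0, 900, by = 100),
                     "&q=", search.str,
                     "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                     sep = "")
    [/sourcecode]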

    I really like the word clouds you've produced (I had no idea how to do those before). What I might do is take your function and modify it to accept one of my scraping functions, to make it work not only with Google Scholar but also with search results from websites like bing.com, yahoo.com and google.com, just for fun :)

  10. @Kay - I think you could do this for the webpages bit, getting rid of the for loop entirely.

    get_GS_pages <- function(search.str) {
      require(RCurl)
      require(stringr)

      # Initial URL - i.e. we're using URLs like:
      # http://scholar.google.com/scholar?start=0&q=allintitle:+amphibians+richness+OR+diversity&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1
      url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str,
                   "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                   sep = "")
      html_str <- getURL(url)

      # Find the html placeholders (2 alternatives!) for the number of results
      # and pull that number.
      # (!) Strangely, Google Scholar gives different numbers of results
      # depending on the start value, e.g. a change from 900 to 980 results
      # when changing start = 0 to start = 800.
      match_no.res <- str_match(html_str, "Results 1 - (.*?) of ([0-9.,]+)")
      no.res <- match_no.res[1, max(dim(match_no.res))]

      # if that placeholder was not found, try the alternative wording:
      if (length(no.res) == 0 | is.na(no.res)) {
        match_no.res <- str_match(html_str, "Results 1 - (.*?) of about ([0-9.,]+)")
        no.res <- match_no.res[1, max(dim(match_no.res))]
      }

      # Remove punctuation (Google uses decimal commas):
      no.res <- as.integer(gsub("[[:punct:]]", "", no.res))

      # If there are no results, stop and throw an error message:
      if (length(no.res) == 0 | is.na(no.res)) {
        stop("\n\n...There is no result for the submitted search string!")
      }

      # Define the number of result pages to be retrieved subsequently.
      # pages.max = maximum number of pages (chunks of 100 results each).
      # As noted above, no.res varies depending on the start value;
      # however, we use ceiling() and the change is very unlikely to be
      # greater than 100, so we add one extra page to be safe:
      pages.max <- ceiling(no.res/100) + 1

      # "start", as used in the URL, defines the i-th result a page begins with.
      # start = 0 was already used above, so we need 100, 200, ...
      start <- 100 * 1:(pages.max - 1)

      # Collect the webpages: the first page was already retrieved above and is
      # prepended to the result; the rest are fetched with lapply() below
      # (no for loop needed):
      urls <- paste("http://scholar.google.com/scholar?start=", start,
                    "&q=", search.str,
                    "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                    sep = "")
      webpages <- lapply(urls, getURL)

      # return all webpages
      return(c(html_str, webpages))
    }

    search.str <- "allintitle:+amphibians+richness+OR+diversity"
    webpages <- get_GS_pages(search.str)

  11. Tony,
    many thanks for your worthy comments! Your solution ("supplying a vector of URLs") sounds perfect!

    You see, I'm a lousy programmer - apply is not my friend yet - but I hope it will be soon.

    Many thanks for the hint:
    webpages <- lapply(urls, getURL)
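
    ...which, if I got it right, does the same job as the for loop I would have written:

    [sourcecode language="r"]
    # explicit-loop equivalent of lapply(urls, getURL)
    webpages <- vector("list", length(urls))
    for (i in seq_along(urls)) {
      webpages[[i]] <- getURL(urls[i])
    }
    [/sourcecode]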

    @Anonymous (M),
    many thanks for the tip with the locale!!

  12. Kay,
    If it's any comfort, it took me quite a while to understand how to use the *apply family of functions, but they're quite easy once you get the hang of them!

    Have you thought about putting your code on GitHub? I've just set up an account about an hour ago and it's quite impressive.

    Here's my hack of your function so far:

    https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/get_google_scholar_webpages.R

    Eventually, if I get time, I'll incorporate everything your function does. I hope it's OK that I've basically copied and pasted it (I give credit in the file)? :)

  13. Sure, you're welcome - I'm itching to see how you pimped my function!

  14. Kay, I made an XPath version of your function, now called GScholarXScraper. Full code is here:

    https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper.R

    I'll write a blog post about it in the next few days; I just wanted to let you see what I'd done with it and ask if you have any comments. I wanted to make sure I gave credit correctly - hopefully I did!

    This was fun, cheers for making your code public :)

  15. Tony,
    Nicely done! I ran some search strings and it looks good to me - I can't really comment on the code, as all the XPath stuff is beyond my horizon.

    For the searches I tried I got the same results (as far as I can tell) as with my function.

    One thing: maybe the original commentary should be adapted in some places?

  16. Kay

    Glad it worked for you too! It also looked the same to me when I compared your 'Regular Expression' approach against my XPath approach (with stem = FALSE).

    At your suggestion, I have adapted a couple of the original comments to reflect some of the changes I made, such as replacing a for loop with a vectorised alternative (I should have done that before - thanks for pointing it out!)

    BTW, I notice that you are using Google Docs as a way to "group-edit" the R scripts you've produced. Have you thought about GitHub instead? I'm still very new to it but, as I understand it, group editing is one of its features.
