Comments on theBioBucket*: R-Function GScholarScraper to Webscrape Google Scholar Search Result

Anonymous | 2011-11-14 00:24
Kay,
Glad it worked for you too! It also looked the same to me when I compared your regular-expression approach against my XPath approach (with stem = FALSE).

At your suggestion, I have adapted a couple of the original comments to reflect some of the changes I made, such as replacing a for loop with a vectorised alternative (I should have done that before - thanks for pointing it out!).

BTW, I notice that you are using Google Docs as a way to "group-edit" the R scripts you've produced. Have you thought about GitHub instead? I'm still very new to it, but as I understand it, group editing is one of its features.

Kay | 2011-11-13 22:32
Tony,
nicely done! I ran some search strings and it looks good to me - I can't really comment on the code, as all the XPath stuff is beyond my horizon.

For the searches I tried I got the same results (as far as I can tell) as with my function.

One thing: maybe the original commentary should be adapted in some places?

Anonymous | 2011-11-13 19:56
Kay, I made an XPath version of your function, now called GScholarXScraper. The full code is here:

https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper.R

I'll write a blog post about it in the next few days; I just wanted to let you see what I'd done with it and ask whether you had any comments. I wanted to make sure I gave credit correctly - hopefully I did!

This was fun, cheers for making your code public :)

Kay | 2011-11-11 21:08
Sure, you're welcome - I'm itching to see how you pimped my function.

Anonymous | 2011-11-11 15:53
Kay,
If it's any comfort, it took me quite a while to understand how to use the *apply family of functions, but they're quite easy once you get the hang of them!

Have you thought about putting your code on GitHub? I just set up an account about an hour ago and it's quite impressive.

Here's my hack of your function so far:

https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/get_google_scholar_webpages.R

Eventually, if I get time, I'll incorporate everything your function does. I hope it's OK that I've basically copied and pasted it (I give credit in the file)? :)

Kay | 2011-11-11 14:41
Tony,
many thanks for your worthy comments! Your solution ("Solving by supplying a vector of URLs") sounds perfect!

You see, I'm a lousy programmer - apply is not my friend yet - but I hope it will be soon.

Many thanks for the hint:

webpages <- lapply(urls, getURL)

@Anonymous (M),
many thanks for the tip with the locale!

Anonymous | 2011-11-11 12:35
# @Kay - I think you could do this for the webpages bit, getting rid of the for loop entirely.

get_GS_pages <- function(search.str) {
  require(RCurl)
  require(stringr)

  # Initial URL; the resulting URLs look like:
  # http://scholar.google.com/scholar?start=0&q=allintitle:+amphibians+richness+OR+diversity&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1
  url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str,
               "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
               sep = "")
  html_str <- getURL(url)

  # Find the HTML placeholders (2 alternatives!) for the number of results
  # and pull out the number.
  # (!) Strangely, Google Scholar reports different numbers of results
  # depending on the start value,
  # e.g. a change from 900 to 980 results when going from start = 0 to start = 800.
  match_no.res <- str_match(html_str, "Results <b>1</b> - <b>(.*?)</b> of <b>(.*?)</b>")
  no.res <- match_no.res[1, max(dim(match_no.res))]

  # Try the alternative placeholder if no result count was found:
  if (length(no.res) == 0 | is.na(no.res)) {
    match_no.res <- str_match(html_str, "Results <b>1</b> - <b>(.*?)</b> of about <b>(.*?)</b>")
    no.res <- match_no.res[1, max(dim(match_no.res))]
  }

  # Remove punctuation (Google uses decimal commas):
  no.res <- as.integer(gsub("[[:punct:]]", "", no.res))

  # If there are no results, stop and throw an error message:
  if (length(no.res) == 0 | is.na(no.res)) {
    stop("\n\n...There is no result for the submitted search string!")
  }

  # pages.max = maximum number of result pages (chunks of 100 results each)
  # to be requested subsequently. As noted above, no.res varies with the
  # start value, but since we use ceiling and the change is very unlikely
  # to exceed 100, we simply add one extra page to be safe:
  pages.max <- ceiling(no.res / 100) + 1

  # "start", as used in the URL, defines the i-th result a page begins with.
  # start = 0 was already used above, so we need 100, 200, ...
  start <- 100 * 1:(pages.max - 1)

  # Collect the webpages as a list; the first page was already retrieved
  # above and is prepended to the result.
  # The rest are retrieved via lapply (no for loop needed):
  urls <- paste("http://scholar.google.com/scholar?start=", start,
                "&q=", search.str,
                "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1",
                sep = "")
  webpages <- lapply(urls, getURL)

  # Return all pages:
  return(c(html_str, webpages))
}

search.str <- "allintitle:+amphibians+richness+OR+diversity"
webpages <- get_GS_pages(search.str)

Anonymous | 2011-11-11 12:31
@Kay One way around that with my function would be to supply a vector of Google Scholar URLs as follows:

df <- do.call("rbind", lapply(gs.urls, get_google_scholar_df))

This would produce an aggregate data frame of the information from all the pages provided.

I really like the word clouds you've produced (I had no idea how to do those before). What I might do is take your function and modify it to accept one of my scraping functions, so that it works not only with Google Scholar but also with search results from websites like bing.com, yahoo.com and google.com - just for fun :)

Anonymous | 2011-11-11 10:25
I also hit the substring error. Try this: Sys.setlocale(locale="C")
Cheers, M

Kay | 2011-11-11 09:42
Anonymous,
I'm afraid you'll need some R first: see, e.g., the links here: http://thebiobucket.blogspot.com/p/starter_19.html

Tony,
XML is preferable because it is more systematic than my picking from strings. One drawback of your function is that it only retrieves the results of the first page, but I guess there is a way to solve that. Maybe I will remix yours and mine when there is time for it!

Anonymous | 2011-11-10 15:33
lol, yeah I thought it was funny too :) Although I personally prefer the XPath approach using the XML package, I've learned quite a bit from your code about stringr, which, to me at least, looks like a really cool package for manipulating text strings. Good work!

Anonymous | 2011-11-10 11:36
I am very new to this. I need to "scrape?" the names of articles, author-journal-date (is it possible to separate them?) and the number of citations into a spreadsheet (I guess via XML?).
I was not able to find any GScholarScraper, and I still wonder where I should put the code.
Please help!

Kay | 2011-11-10 09:12
Tony,
I read your post - funny that we came up with the same thing on the very same day! I replaced readLines() with getURL() - thanks for reporting this!

Sean,
sorry, at the moment I cannot reproduce the substring() error. I need to see what it means.

Anonymous,
it was fun for me doing it - nevertheless, the purpose is evident even without the fun: in scientific research it's vital to know what is going on, that is, to know what's being published! My tools, or similar ones, could be very use*ful* in this regard!

Anonymous | 2011-11-09 20:29
This is a usefool / fun project,
but simply running your code with your example I'm getting this error:

Error in substring(string, start, end) :
  invalid multibyte string at ' * Wi<6c>dlife <

Anonymous | 2011-11-09 20:12
When I try to run the script I get the message "Error in substring(string, start, end) : invalid multibyte string at ..."

Anonymous | 2011-11-09 18:09
Interesting, I'm going to have to find some time to try your function out! As an exercise in XPath, I used the XML package to scrape information off Google Scholar and return it in a data frame. You've gone the regular-expression route, which I'll have to try and find some time to understand (it looks good though!).

Running the script, I hit an error:

[sourcecode language="r"]
search.str <- "allintitle:+amphibians+richness+OR+diversity"

url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str, "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1", sep = "")

webpage <- readLines(url, warn = F)
html_str <- paste(webpage, collapse="\n")
[/sourcecode]

However, I think the following will do the same thing:

[sourcecode language="r"]
library(RCurl)
html <- getURL(url)
[/sourcecode]

Tested with R 2.14.0 on Ubuntu 11.10 x64.
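Editor's note: putting the two fixes suggested in this thread together - the locale workaround for the substring() multibyte error and getURL() in place of readLines() - a minimal sketch of the fetch step might look like the following. This is not the authors' code; it only combines the snippets posted in the comments above.

[sourcecode language="r"]
# Minimal sketch combining the workarounds from the comments above.
# Assumes the RCurl package is installed.
library(RCurl)

# Workaround for "invalid multibyte string" errors in substring():
Sys.setlocale(locale = "C")

search.str <- "allintitle:+amphibians+richness+OR+diversity"
url <- paste("http://scholar.google.com/scholar?start=0&q=", search.str,
             "&hl=en&lr=lang_en&num=100&as_sdt=1&as_vis=1", sep = "")

# getURL() (instead of readLines()) returns the page as a single string:
html_str <- getURL(url)
[/sourcecode]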