theBioBucket*: Webscraping Google Scholar & Show Result as Word Cloud Using R

1 Nov 2011

NOTE: Please see the update HERE and HERE!

When reading Scott Chamberlain's latest post about web scraping, I felt it was time to pick up and complete an idea I had been brooding over for some time:

When a scientist sets out on a new project, the first thing to do is to check whether other people have already answered the very questions he is about to work on. For example, I was interested in whether any research had been done on amphibian diversity at regional/geographical scales in relation to environmental/landscape parameters. Usually I would go to Google Scholar and search for something like - intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape - and then browse through the results. But this is often tedious, and a way to examine the results quickly and visually would be of great benefit.

The code I present solves this task. It may be awkward in places, and there may be a more efficient way to get the same result, but it may serve as a starting point, and I would very much appreciate people more literate than me picking up the torch...

For my example search, the result shows that there has not been much work on amphibian diversity in relation to environment and landscape...

See code HERE.
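
Independent of the linked script, here is a minimal sketch of how a search string like the one above might be turned into a request URL. The helper name build_scholar_url and the query parameter names q and num are assumptions for illustration, not taken from the original code; nothing is fetched, the URL is only assembled.

```r
# Hypothetical helper: build a Google Scholar search URL from a query string.
# The parameter names `q` and `num` are assumptions, not taken from the post.
build_scholar_url <- function(query, n_results = 100) {
  base <- "http://scholar.google.com/scholar"
  paste0(base,
         "?q=", URLencode(query, reserved = TRUE),  # percent-encode spaces, colons, ...
         "&num=", n_results)
}

query <- "intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape"
url <- build_scholar_url(query)
print(url)
```

Encoding the query with URLencode() keeps spaces and colons from breaking the URL; whether the site accepts the request in this form is not something this sketch can guarantee.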

PS: I'd be happy about collaboration / tips / editing - so feel free to contact me and I will add you to the list of editors - you then could edit / comment / add to the script on Google Docs.

...some drawbacks need to be considered:

Maximum number of search results = 100.
Only titles are considered. Including abstracts might give more representative results, but abstracts are truncated in the search results and I don't know whether the full abstracts can be retrieved.
Long titles may also be truncated...
A more illustrative result would be achieved by dropping all words other than nouns, verbs and adjectives - I don't know how to do this, but I am sure it is possible.
More drawbacks? You tell me...
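
Filtering by word class would need a part-of-speech tagger, which is not attempted here. As a cruder stand-in, the sketch below removes a small hand-made stopword list before counting words in the titles, using base R only. The example titles and the stopword list are made up for illustration.

```r
# Made-up example titles; real input would be the scraped result titles.
titles <- c(
  "Amphibian species richness and landscape structure",
  "Environmental correlates of amphibian diversity",
  "Landscape effects on the diversity of amphibians"
)

# Small illustrative stopword list (an assumption, far from complete).
stopwords <- c("and", "of", "on", "the", "in", "a", "an", "for", "to")

# Lowercase, split on anything that is not a letter, drop empty strings.
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nzchar(words)]

# Remove stopwords and count what is left, most frequent first.
words <- words[!words %in% stopwords]
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

The resulting counts (names(freq) and as.integer(freq)) could then be handed to a plotting function such as wordcloud() from the wordcloud package. Note that "amphibian" and "amphibians" are still counted separately; merging them would need stemming.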

11 comments :

Tom O'Hara, 1 November 2011 at 23:51
Neat! That will definitely be useful.

Why do you omit the PDF entries?

Tom

p.s., [FYI for others] The PDF entries can be included along with articles, books, and citations by the following modification:
corpus <- Corpus(DataframeSource(result[, 1:3]))
=>
corpus <- Corpus(DataframeSource(result[, 1:4]))

Kay, 2 November 2011 at 10:08
Thanks Tom,

Of course PDFs should be included - my mistake.

Best,
Kay

Ben Bolker, 2 November 2011 at 15:06
This is great. I whined previously about there being no easy ways to scrape Google Scholar: http://bmb-common.blogspot.com/2011/02/does-google-scholar-suck-or-am-i-just.html -- but maybe this starting point will get me off my butt to try to help improve the tools

Kay, 2 November 2011 at 16:20
I very much appreciate your compliment and hope for improvements to the script (see the drawbacks I added recently to the post).

Best,
Kay

Tal Galili, 2 November 2011 at 22:00
Very cool idea. I hope you/others will take this further :)

BTW, have a look at this:
http://www.drewconway.com/zia/?p=2624

Kay, 3 November 2011 at 10:19
Thanks Tal,

..very interesting link! - Drew's work is really impressive: what he says about word clouds is absolutely true. His approach to beef up the word cloud concept is slick..

Best,
Kay

Tal Galili, 3 November 2011 at 12:11
Hi Kay,

I also encourage you to go through the comments on Drew's post - most interesting.

BTW, any chance you'd wrap your code as a function?

Kay, 3 November 2011 at 13:48
I'm afraid my time is too scarce these days... But I would also like to see this continued.

(..Maybe there are some helping hands?)

Anonymous, 5 November 2011 at 22:22
Running your code I get a warning:

webpage <- readLines(url)
Warning message:
In readLines(url) :
incomplete final line found on 'http://scholar.google.com/scholar?

Any idea?

Kay, 7 November 2011 at 15:51
The last line of the page does not end with a newline, which triggers the warning - but as it holds nothing we need, we can ignore it.
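
For readers who prefer not to see the warning at all, readLines() has a warn argument. The sketch below reproduces the situation offline with a temporary file that lacks a final newline (the file content is made up) and reads it with warn = FALSE; the lines returned are the same either way.

```r
# Write a small file whose last line has no trailing newline,
# mimicking a web page that ends without one.
tmp <- tempfile(fileext = ".html")
cat("<html>\n<body>no newline at end</body>", file = tmp)

# With the default warn = TRUE, R would warn about an incomplete final line.
# Setting warn = FALSE silences that warning; the content read is the same.
webpage <- readLines(tmp, warn = FALSE)
print(length(webpage))

# Clean up the temporary file.
unlink(tmp)
```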

FroggyDew, 7 November 2011 at 22:19
I wish Google could implement a word cloud for filtering searches. I'll try it building up on your script, by using the identify() function to get the word clicked and then re-fetch the Google search by adding the word with a "-" sign in front of the word in the query to exclude results containing this word. Should work...
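
The query-rewriting part of this idea could look roughly like the sketch below. The helper name exclude_word is hypothetical, and the interactive step (picking a word with identify()) and the re-fetch are left out; only the string manipulation is shown.

```r
# Hypothetical helper: append "-word" to a query so that results
# containing that word are excluded from the next search.
exclude_word <- function(query, word) {
  paste0(query, " -", word)
}

query <- "intitle:amphibians AND intitle:diversity"
refined <- exclude_word(query, "frogs")
print(refined)
```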