...When reading Scott Chamberlain's last post about web scraping, I felt it was time to pick up and complete an idea that I had been brooding over for some time:
When a scientist sets out on a new project, the first thing to do is to find out whether other people have already answered the very questions he is about to work on. For example, I was interested in whether any research had been done on amphibian diversity at regional/geographical scales in relation to environmental/landscape parameters. Usually I would go to Google Scholar and search for something like - intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape - and then browse through the results. But this is often tedious, and a way to examine the results quickly and visually would be of great benefit.
The code I present addresses this task. It may be awkward in places, and there might be a more efficient way to get the same result, but it may serve as a starting point, and I would very much appreciate people more literate than me picking up the torch...
For my example search, the result suggests that there has not been very much research on amphibian diversity in relation to environment and landscape...
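As a rough sketch, a search string like the one above could be turned into a request URL in R as follows. The helper name `build_scholar_url` and the `q` and `num` parameter names are assumptions made for illustration; they are not taken from the script.

```r
# Hypothetical helper: build a Google Scholar search URL from a query string.
# Assumption: the query goes into `q` and the number of results into `num`.
build_scholar_url <- function(query, num = 100) {
  base <- "http://scholar.google.com/scholar"
  # reserved = TRUE also escapes characters such as ":" and "&"
  paste0(base, "?q=", URLencode(query, reserved = TRUE), "&num=", num)
}

query <- "intitle:amphibians AND intitle:richness OR intitle:diversity AND environment OR landscape"
url <- build_scholar_url(query)
print(url)
```

The resulting string could then be passed to `readLines()` to fetch the result page.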
See code HERE.
PS: I'd be happy about collaboration / tips / editing - so feel free to contact me and I will add you to the list of editors - you then could edit / comment / add to the script on Google Docs.
...some drawbacks need to be considered:
- Maximum no. of search results = 100
- Only titles are considered. Also considering abstracts might yield more representative results, but abstracts are truncated in the search results and I don't know whether the full abstracts can be retrieved.
- Also, long titles may be truncated...
- A more illustrative result would be achieved if one could get rid of all words other than nouns, verbs and adjectives - I don't know how to do this, but I am sure it is possible.
- More drawbacks? You tell me.
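One simple way to get closer to the "content words only" idea is to drop common function words before counting. The sketch below is a plain stopword filter in base R, not a real part-of-speech filter, and the titles and the stopword list are made up for illustration. The tm package offers `stopwords()` and `removeWords()` for the same purpose on a corpus.

```r
# Made-up example titles; in the script these would come from the search results.
titles <- c(
  "Amphibian species richness and the landscape",
  "Diversity of amphibians in relation to the environment",
  "Landscape effects on amphibian diversity"
)

# Small hand-written stopword list (an assumption; extend as needed).
stopwords_small <- c("a", "an", "and", "in", "of", "on", "the", "to")

# Lower-case, split on anything that is not a letter, drop empty strings.
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Keep only words that are not in the stopword list.
content_words <- words[!(words %in% stopwords_small)]

# Count and sort by frequency; these counts could feed a word cloud.
freq <- sort(table(content_words), decreasing = TRUE)
print(freq)
```

Note that "amphibian" and "amphibians" are still counted separately here; merging them would need stemming on top of this.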
Neat! That will definitely be useful.
Why do you omit the PDF entries?
Tom
p.s., [FYI for others] The PDF entries can be included along with articles, books, and citations by the following modification:
corpus <- Corpus(DataframeSource(result[, 1:3]))
=>
corpus <- Corpus(DataframeSource(result[, 1:4]))
Thanks Tom,
Of course PDFs should be included - my mistake.
Best,
Kay
This is great. I whined previously about there being no easy ways to scrape Google Scholar: http://bmb-common.blogspot.com/2011/02/does-google-scholar-suck-or-am-i-just.html -- but maybe this starting point will get me off my butt to try to help improve the tools
I very much appreciate your compliment and hope for improvements to the script (see the drawbacks I added recently to the post).
Best,
Kay
Very cool idea. I hope you/others will take this further :)
BTW, have a look at this:
http://www.drewconway.com/zia/?p=2624
Thanks Tal,
Very interesting link! Drew's work is really impressive: what he says about word clouds is absolutely true. His approach to beefing up the word cloud concept is slick.
Best,
Kay
Hi Kay,
I also encourage you to go through the comments on Drew's post - most interesting.
BTW, any chance you'd wrap your code as a function?
I'm afraid my time is too scarce these days... But I would also like to see this continued.
(Maybe there are some helping hands?)
Running your code I get a warning:
webpage <- readLines(url)
Warning message:
In readLines(url) :
incomplete final line found on 'http://scholar.google.com/scholar?
Any idea?
The last line obviously causes some trouble, but as it holds nothing we need, we can ignore the warning.
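If the warning is a nuisance, `readLines()` has a `warn` argument that switches it off. The sketch below reproduces the situation offline with a temporary file that lacks a final newline; the file and its content are made up for illustration.

```r
# Write a small file whose last line has no trailing newline,
# mimicking the page that triggered the warning.
tmp <- tempfile(fileext = ".html")
cat("<html>\n<body>last line without newline</body>", file = tmp)

# With the default warn = TRUE, readLines() warns about the incomplete final line.
# warn = FALSE suppresses that warning; the lines read are the same.
webpage <- readLines(tmp, warn = FALSE)
print(webpage)

unlink(tmp)
```

The same argument should work when reading from a URL, e.g. `readLines(url, warn = FALSE)`.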
I wish Google could implement a word cloud for filtering searches. I'll try building on your script: use the identify() function to get the clicked word, then re-fetch the Google search with that word added to the query with a "-" sign in front, so that results containing it are excluded. Should work...
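The query-rewriting part of this idea might look like the sketch below. The helper `exclude_word` is hypothetical, and the clicked word is hard-coded here; in an interactive session it would be looked up from the index returned by identify().

```r
# Hypothetical helper: append "-word" to a query so that
# results containing that word are excluded from the search.
exclude_word <- function(query, word) {
  paste(query, paste0("-", word))
}

query <- "intitle:amphibians AND intitle:diversity"
# Stand-in for the word picked via identify() on the plotted cloud.
clicked <- "richness"
new_query <- exclude_word(query, clicked)
print(new_query)
```

The new query would then be URL-encoded and fetched again, and the cloud redrawn from the fresh results.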