I have updated the Google Scholar web-scraper function GScholarScraper_2 to GScholarScraper_3 (and GScholarScraper_3.1), as the old version was outdated due to changes in the Google Scholar HTML code. The new script is slimmer and faster. It returns a dataframe or, optionally, a CSV file with the titles, authors, publications & links. Feel free to report bugs, etc.
Update 11-07-2013: bug fixes due to Google Scholar code changes - https://github.com/gimoya/theBioBucket-Archives/blob/master/R/Functions/GScholarScraper_3.2.R. Note that lately Google will block your IP at about the 1000th (cumulated) search result - so there's not much fun to be had if you want to do some extensive bibliometrics..
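Basic usage, for reference (a minimal sketch, assuming you have downloaded the script from the GitHub link above; the function name and the write flag are the ones used in the comments below):

source("GScholarScraper_3.2.R")
# run a search and keep the results as a dataframe instead of writing a CSV
df <- GScholar_Scraper("allintitle:pantanal", write = FALSE)
str(df)  # titles, authors, publications & links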
Got an error message:
Error in htmlParse(url):
error in creating parser for http://scholar.google.com/scholar?q=allintitle:pantanal&num=1&as_sdt=1&as_vis=1
I could not solve the problem.
Anyway, it's an interesting function :)
Ah, I use Tinn-R, Windows 7 and R 2.15.1, if that helps you figure out the problem ^^.
Sorry, I can't reproduce the error.. As you only search for one word in the titles you could use "intitle:pantanal" - however, it also works for me with "allintitle:pantanal"..
Well, I was trying to do something like the following, to produce a figure showing how some theory, for example, accumulated citations over the years.
# one search string per year, restricting the result range with as_ylo/as_yhi
anos <- 1980:2012
input <- paste("metapopulation&as_ylo=", anos, "&as_yhi=", anos, sep = "")
resultados <- rep(NA, length(anos))
for (i in 1:length(anos)) {
  resultados[i] <- length(GScholar_Scraper(input[i], write = FALSE)$PUBLICATION)
}
It makes many searches, one per year; it works sometimes, then stops working and starts giving the error I mentioned before.
Please see the follow-up posting http://thebiobucket.blogspot.co.at/2012/08/toy-example-with-gscholarscraper31.html - maybe this will help!
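For the figure itself, once resultados is filled you can simply plot hits per year (a sketch, assuming anos and resultados from the code above):

plot(anos, resultados, type = "b",
     xlab = "Year", ylab = "Hits for 'metapopulation'")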
However, there is an issue with Google blocking automated searches, which arises for search strings giving more than 1000 results. And occasionally your IP seems to be blocked generally.. I'm afraid there is no quick solution for this (changing your IP / resetting the modem, etc. fixes the problem, though not very elegantly..).
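A crude mitigation (a sketch only; it may delay the lockout but won't lift a block that is already in place) is to throttle the query rate:

for (i in 1:length(anos)) {
  resultados[i] <- length(GScholar_Scraper(input[i], write = FALSE)$PUBLICATION)
  Sys.sleep(runif(1, 5, 15))  # pause 5-15 seconds between queries
}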
So cool!
Thanks very much!
Thank you for this. I previously spent many hours working out how to scrape data from Google Scholar. Sadly, once I got a working program, I found Google Scholar locked me out after I had retrieved around 100 records. Correspondence with them got me nowhere: they basically accuse you of unethical behavior if you try to automate searches. I can't understand their logic and they don't explain it.
It's very disappointing for those of us who want to do serious research using bibliometrics.
I haven't tried your program but assume it would hit the same snag?
..with my function, which utilizes htmlParse(url) from the XML package, it works for search strings giving fewer than 1000 hits. Beyond that it seems to be blocked.
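For reference, the parsing core is along these lines (a rough sketch, not the exact code of the function; the XPath expression is an assumption, as Google's markup changes):

library(XML)
url <- "http://scholar.google.com/scholar?q=allintitle:pantanal&num=100&as_sdt=1&as_vis=1"
doc <- htmlParse(url)  # this is the call that errors once Google blocks the IP
titles <- xpathSApply(doc, "//h3", xmlValue)  # result titles, markup permitting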
Really cool application. Could you please provide a brief example of how to produce a wordcloud with the dataframe returned by GScholar_Scraper_3.1?
I attempted following the example shown in GScholar_Scraper_2 but keep getting wordclouds of publication years, and removing numerics leaves an empty dataframe. I'm missing something simple in corpus <- Corpus(DataframeSource(df[, 1:2])) but can't see what.
Thanks again
Check the follow-up (http://thebiobucket.blogspot.com/2012/08/follow-up-making-word-cloud-for-search.html)..
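In short, the trick is to build the corpus from the title column only, not from df[, 1:2], so publication years don't end up in the cloud. A sketch (the column name df$TITLES is an assumption; adapt it to the dataframe you get back):

library(tm)
library(wordcloud)
corpus <- Corpus(VectorSource(df$TITLES))  # titles only, no year column
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freqs), freqs, min.freq = 2)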
Hi, is this still live? I read elsewhere on your site that Google had changed their code. Thanks so much
Please try version 3.1 and report if there are any issues!
I really appreciate your efforts here. However, by 1000 hits do you mean 1000 returned results? If that is the case, this code has very limited usage. I searched for "authenticity" as my keyword and 280,000+ results were returned; obviously, the code didn't work. At the least, you could add an argument enabling the user to limit the fetch to the first 1000 results.