Rankings are an inevitable tool to keep the human rat race going. With that in mind, I'll pick up the code from my last two posts (HERE & HERE) and have some fun with it by analysing R-Bloggers' web presence. I will use the number of hits in Google Search as an indicator.
I searched for URLs like this: https://www.google.com/search?q="http://www.twotorials.com" - the blog URL is wrapped in double quotes, so that only the exact blog URL is searched.
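The quoted query above can be built programmatically. The following is a minimal sketch in base R, assuming that percent-encoding the quoted URL is acceptable to the search endpoint; `build_query_url` is a hypothetical helper and not part of the script further down, which simply pastes the pieces together.

```r
# Hypothetical helper: build an exact-match search URL for each blog URL.
build_query_url <- function(blog_urls) {
  # wrap each URL in double quotes so that the exact string is searched
  quoted <- paste0('"', blog_urls, '"')
  # percent-encode quotes, colons and slashes so the query string stays valid
  encoded <- vapply(quoted, URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste0("https://www.google.com/search?q=", encoded)
}

u <- build_query_url("http://www.twotorials.com")
print(u)
```

Decoding the query part again with `URLdecode()` should give back the quoted blog URL, which is a quick way to check the encoding step.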
Blogs | NoHits |
---|---|
http://google-opensource.blogspot.com | 82300 |
http://www.programmingr.com | 73500 |
http://googleresearch.blogspot.com | 58000 |
http://dirk.eddelbuettel.com | 53000 |
http://borasky-research.net | 33100 |
http://casoilresource.lawr.ucdavis.edu | 32500 |
http://andrewgelman.com | 30000 |
http://yihui.name | 29600 |
http://xianblog.wordpress.com | 27900 |
http://nsaunders.wordpress.com | 27600 |
http://chem-bla-ics.blogspot.com | 26600 |
http://plindenbaum.blogspot.com | 24600 |
http://blog.ouseful.info | 24300 |
http://www.vcasmo.com | 24200 |
http://yz.mit.edu | 23500 |
http://romainfrancois.blog.free.fr | 22700 |
http://blog.revolutionanalytics.com | 21000 |
http://robjhyndman.com | 18400 |
http://freakonometrics.blog.free.fr | 16100 |
http://perfdynamics.blogspot.com | 15400 |
http://www.stubbornmule.net | 14800 |
http://zoonek.free.fr | 14800 |
http://jackman.stanford.edu | 13900 |
http://www.bytemining.com | 13700 |
http://learnr.wordpress.com | 12600 |
http://tommy.chheng.com | 12200 |
http://mazamascience.com | 12000 |
http://www.investuotojas.eu | 11500 |
http://www.r-statistics.com | 11300 |
http://www.franklincenterhq.org | 10800 |
http://gettinggeneticsdone.blogspot.com | 10700 |
http://mpastell.com | 9930 |
http://pineda-krch.com | 9780 |
http://blog.saush.com | 9220 |
http://www.premiersoccerstats.com | 8950 |
http://developmentality.wordpress.com | 7250 |
http://www.dataspora.com | 7200 |
http://blog.hiremebecauseimsmart.com | 7050 |
http://isomorphismes.tumblr.com | 7040 |
http://www.mathfinance.cn | 6930 |
http://blog.nguyenvq.com | 6150 |
http://www.drewconway.com | 5970 |
http://www.carlboettiger.info | 5520 |
http://www.statisticsblog.com | 5110 |
http://www.decisionsciencenews.com | 4950 |
http://www.r-chart.com | 4810 |
http://chartsgraphs.wordpress.com | 4480 |
http://www.portfolioprobe.com | 4410 |
http://procomun.wordpress.com | 4330 |
http://jeromyanglim.blogspot.com | 4080 |
http://spatialanalysis.co.uk | 4080 |
http://www.theresearchkitchen.com | 4080 |
http://www.forex-bloggers.com | 4070 |
https://www.rmetrics.org | 4050 |
http://princeofslides.blogspot.com | 3900 |
http://www.cybaea.net | 3740 |
http://www.cerebralmastication.com | 3710 |
http://ygc.name | 3670 |
http://ryouready.wordpress.com | 3450 |
http://jeffreybreen.wordpress.com | 3410 |
http://systematicinvestor.wordpress.com | 3400 |
http://sgsong.blogspot.com | 3310 |
http://industrialengineertools.blogspot.com | 3290 |
http://www.r-tutor.com | 3270 |
http://fishlab.ucdavis.edu | 3270 |
http://ggorjan.blogspot.com | 3250 |
http://blog.ynada.com | 3220 |
http://farmacokratia.blogspot.com | 3170 |
http://4dpiecharts.com | 3130 |
http://heuristically.wordpress.com | 3040 |
http://blog.rtwilson.com | 2910 |
http://www.wekaleamstudios.co.uk | 2890 |
http://www.dataists.com | 2840 |
http://ikanb.wordpress.com | 2750 |
http://shape-of-code.coding-guidelines.com | 2730 |
http://onertipaday.blogspot.com | 2710 |
http://blog.fosstrading.com | 2700 |
http://blog.echen.me | 2690 |
http://www.theusrus.de | 2670 |
http://cloudnumbers.com | 2630 |
http://paulbutler.org | 2620 |
http://biostatmatt.com | 2460 |
http://www.johnmyleswhite.com | 2430 |
http://dataninja.wordpress.com | 2360 |
http://realizationsinbiostatistics.blogspot.com | 2340 |
http://statisfaction.wordpress.com | 2300 |
http://uxblog.idvsolutions.com | 2250 |
http://timelyportfolio.blogspot.com | 2210 |
http://radfordneal.wordpress.com | 2200 |
http://sas-and-r.blogspot.com | 2200 |
http://pairach.com | 2110 |
http://yusung.blogspot.com | 2050 |
http://blog.flacso.edu.mx | 2010 |
http://www.rensenieuwenhuis.nl | 2000 |
http://michaeldhealy.com | 1990 |
http://freigeist.devmag.net | 1950 |
http://www.fernandohrosa.com.br | 1920 |
http://statbandit.wordpress.com | 1870 |
http://www.win-vector.com | 1840 |
http://lukemiller.org | 1830 |
http://ropensci.org | 1720 |
http://www.eggwall.com | 1650 |
http://benmazzotta.wordpress.com | 1620 |
http://bms.zeugner.eu | 1610 |
http://cartesianfaith.wordpress.com | 1580 |
http://linkedscience.org | 1570 |
http://stevemosher.wordpress.com | 1550 |
http://intelligenttradingtech.blogspot.com | 1520 |
http://www.imachordata.com | 1480 |
http://blog.diegovalle.net | 1470 |
http://jermdemo.blogspot.com | 1430 |
http://nortalktoowise.com | 1420 |
http://ekonometrics.blogspot.com | 1340 |
http://digitheadslabnotebook.blogspot.com | 1320 |
http://flyordie.sin.khk.be | 1310 |
http://schamberlain.github.com | 1230 |
http://gribblelab.org | 1180 |
http://www.quantf.com | 1130 |
http://offensivepolitics.net | 1020 |
http://www.markmfredrickson.com | 981 |
http://blog.mckuhn.de | 948 |
http://erehweb.wordpress.com | 889 |
http://confounding.net | 886 |
http://simplystatistics.tumblr.com | 875 |
http://www.babelgraph.org | 859 |
http://csgillespie.wordpress.com | 857 |
http://joewheatley.net | 844 |
http://helmingstay.blogspot.com | 843 |
http://theaverageinvestor.wordpress.com | 825 |
http://quantitative-ecology.blogspot.com | 785 |
http://zvfak.blogspot.com | 776 |
http://ucfagls.wordpress.com | 766 |
http://opendatagroup.com | 760 |
http://cameron.bracken.bz | 740 |
http://rtutorialseries.blogspot.com | 738 |
http://opencpu.org | 708 |
http://novicemetrics.blogspot.com | 700 |
http://lamages.blogspot.com | 680 |
http://nir-quimiometria.blogspot.com | 679 |
http://tonybreyal.wordpress.com | 677 |
http://brokeringclosure.wordpress.com | 658 |
http://socialdatablog.com | 643 |
http://dancingeconomist.blogspot.com | 629 |
http://www.rtexttools.com | 603 |
http://danganothererror.wordpress.com | 589 |
http://thebiobucket.blogspot.com | 567 |
http://holtmeier.de | 531 |
http://val-systems.blogspot.com | 519 |
http://thelogcabin.wordpress.com | 489 |
http://dcemri.blogspot.com | 484 |
http://rdatamining.wordpress.com | 477 |
http://bridgewater.wordpress.com | 460 |
http://www.rcasts.com | 444 |
http://dsparks.wordpress.com | 436 |
http://pr.cloudst.at | 422 |
http://polstat.org | 409 |
http://www.compmath.com | 401 |
http://techno-realism.blogspot.com | 399 |
http://www.backsidesmack.com | 395 |
http://geotheory.org | 393 |
http://miraisolutions.wordpress.com | 367 |
http://econometricsense.blogspot.com | 352 |
http://blog.binfalse.de | 344 |
http://rforcancer.drupalgardens.com | 317 |
http://blog.rstudio.org | 316 |
http://mcfromnz.wordpress.com | 309 |
http://www.quantumforest.com | 309 |
http://blog.quanttrader.org | 303 |
http://chrisladroue.com | 293 |
http://www.michaelbommarito.com | 289 |
http://procrun.com | 280 |
http://mikeksmith.posterous.com | 279 |
http://bio7.org | 278 |
http://kbroman.wordpress.com | 278 |
http://martynplummer.wordpress.com | 272 |
http://bryer.org | 268 |
http://www.funjackals.com | 265 |
http://www.harlan.harris.name | 252 |
http://www.milktrader.net | 248 |
http://www.surefoss.org | 241 |
http://rigorousanalytics.blogspot.com | 231 |
http://www.jameskeirstead.ca | 229 |
http://programming-r-pro-bro.blogspot.com | 225 |
http://plausibel.blogspot.com | 224 |
http://statistic-on-air.blogspot.com | 217 |
http://mintgene.wordpress.com | 212 |
http://moderntoolmaking.blogspot.com | 205 |
http://quantitativeecology.blogspot.com | 199 |
http://www.sigmafield.org | 199 |
http://www.ancienteco.com | 194 |
http://worldofrcraft.blogspot.com | 191 |
http://rappster.wordpress.com | 190 |
http://stotastic.com | 189 |
http://evolvingspaces.blogspot.com | 184 |
http://strugglingthroughproblems.blogspot.com | 166 |
http://sharpstatistics.co.uk | 161 |
http://leftcensored.skepsi.net | 160 |
http://omegahat.wordpress.com | 156 |
http://drunks-and-lampposts.com | 155 |
http://amathew.com | 152 |
http://onlinelabor.blogspot.com | 147 |
http://johnramey.net | 144 |
http://gossetsstudent.wordpress.com | 138 |
http://tomhopper.wordpress.com | 135 |
http://ggobi.blogspot.com | 134 |
http://blog.fellstat.com | 131 |
http://www.openanalytics.eu | 130 |
http://www.numbertheory.nl | 127 |
http://stats.blogoverflow.com | 127 |
http://the-praise-of-insects.blogspot.com | 122 |
http://lpenz.github.com | 118 |
http://christophergandrud.blogspot.com | 118 |
http://f.giorlando.org | 112 |
http://bayesianbiologist.com | 110 |
http://www.graphoftheweek.org | 109 |
http://oneliner.soma20.com | 109 |
http://inundata.org | 107 |
http://geokook.wordpress.com | 104 |
http://blog.datapunks.com | 102 |
http://eranraviv.com | 102 |
http://www.compbiome.com | 101 |
http://www.techpolicy.ca | 99 |
http://www.psychwire.co.uk | 97 |
http://blog.carlislerainey.com | 93 |
http://vasishth-statistics.blogspot.com | 93 |
http://www.statsravingmad.com | 93 |
http://using-r-project.blogspot.com | 93 |
http://www.nikhilgopal.com | 92 |
http://thedatamonkey.blogspot.com | 92 |
http://jeffreyhorner.tumblr.com | 90 |
http://menugget.blogspot.com | 88 |
http://www.twotorials.com | 88 |
http://dataexcursions.wordpress.com | 84 |
http://viksalgorithms.blogspot.com | 83 |
http://exploringdatablog.blogspot.com | 81 |
http://sachaepskamp.com | 81 |
http://aphysicistinwallstreet.blogspot.com | 77 |
http://lastresortsoftware.blogspot.com | 75 |
http://www.nomad.priv.at | 72 |
http://applyr.blogspot.com | 71 |
http://www.knowledgediscovery.jp | 71 |
http://weitaiyun.blogspot.com | 71 |
http://xmphforex.wordpress.com | 71 |
http://statsadventure.blogspot.com | 70 |
http://davenportspatialanalytics.squarespace.com | 70 |
http://anandram.wordpress.com | 69 |
http://rpint.wordpress.com | 68 |
http://datadebrief.blogspot.com | 66 |
http://blog.cloudstat.org | 64 |
http://www.r-podcast.org | 64 |
http://rmkrug.wordpress.com | 62 |
http://denishaine.wordpress.com | 61 |
http://expansed.com | 58 |
http://r.andrewredd.us | 57 |
http://isseing333.blogspot.com | 57 |
http://solomonmessing.wordpress.com | 57 |
http://rtricks.wordpress.com | 57 |
http://anrprogrammer.wordpress.com | 56 |
http://arungaikwad.wordpress.com | 56 |
http://geolabs.wordpress.com | 55 |
http://lookingatdata.blogspot.com | 55 |
http://factbased.blogspot.com | 54 |
http://severity.blogspot.com | 54 |
http://swordofcrom.wordpress.com | 53 |
http://librestats.wordpress.com | 51 |
http://marcinkula.wordpress.com | 51 |
http://gsoc2010r.wordpress.com | 47 |
http://psyccomputing.blogspot.com | 46 |
http://fabiomarroni.wordpress.com | 45 |
http://jedifran.com | 45 |
http://alstatr.blogspot.com | 43 |
http://r-video-tutorial.blogspot.com | 42 |
http://alexfarquhar.posterous.com | 40 |
http://bmb-common.blogspot.com | 40 |
http://rdataviz.wordpress.com | 40 |
http://mypapertrades.blogspot.com | 38 |
http://pitchrx.blogspot.com | 38 |
http://simonmueller.net | 38 |
http://statisfactions.wordpress.com | 37 |
http://nzprimarysectortrade.wordpress.com | 36 |
http://seanmulcahy.blogspot.com | 36 |
http://www.speakingstatistically.com | 35 |
http://joshpaulson.wordpress.com | 34 |
http://learningrbasic.blogspot.com | 34 |
http://mockquant.blogspot.com | 33 |
http://costaleconomist.blogspot.com | 32 |
http://rsnippets.blogspot.com | 31 |
http://statmethods.wordpress.com | 29 |
http://aviadklein.wordpress.com | 28 |
http://obeautifulcode.com | 28 |
http://blog.cloudst.at | 24 |
http://rstats.posterous.com | 23 |
http://notebookonthewebs.tumblr.com | 22 |
http://0utlier.blogspot.com | 21 |
http://gjkerns.github.com | 21 |
http://eigensomething.blogspot.com | 10 |
http://brocktibert.wordpress.com | 9 |
http://toddjobe.blogspot.com | 9 |
http://mickeymousemodels.blogspot.com | 9 |
http://forgetfulfunctor.blogspot.com | 9 |
http://rocknrblog.wordpress.com | 9 |
http://dmbates.blogspot.com | 8 |
http://blog.nextbiomotif.com | 8 |
http://indiacrunchin.wordpress.com | 8 |
http://blog.trenthauck.com | 8 |
http://mikescnc.blogspot.com | 8 |
http://jeroldhaas.blogspot.com | 8 |
http://tlevine.tumblr.com | 8 |
http://empty-moon-9726.heroku.com | 8 |
http://www.proc-x.com | 7 |
http://jointposterior.blogspot.com | 7 |
http://gastonsanchez.wordpress.com | 7 |
http://mlt-thinks.blogspot.com | 7 |
http://rstats.wordpress.com | 7 |
http://playingwithr.blogspot.com | 7 |
http://scottmutchler.blogspot.com | 6 |
http://iamdata.wordpress.com | 6 |
http://sfchaos.blogspot.com | 6 |
http://nightlordtw.wordpress.com | 5 |
http://pleasepasstheroc.blogspot.com | 5 |
http://wiekvoet.blogspot.com | 5 |
http://d7.stattler.com | 4 |
http://yetanotherrblog.blogspot.com | 4 |
http://blog.iwanluijks.nl:80 | 3 |
https://rlearner.wordpress.com | 3 |
http://margintale.blogspot.com | 1 |
When checking the results manually I discovered slight deviations in the numbers and admittedly have no clue why this is. Sorry if any blog is under- or over-represented due to such an error - please report!
Here is the R-script:
library(XML)
library(stringr)
library(RCurl)
library(xtable)
GoogleHits.1 <- function(input)
{
url <- paste("https://www.google.com/search?q=\"",
input, "\"", sep = "")
CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
doc <- htmlParse(script)
res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
return(as.integer(gsub("[^0-9]", "", res)))
}
# Example (words joined with "+", which a search URL reads as spaces):
GoogleHits.1("R+Statistical+Software")
###################### Begin get r-blogger's URLs: ###########################################
# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")
# extract sensible blog urls:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]
# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)
# replace the ones with 2 slashes:
blogs <- slash_3; blogs[id] <- slash_2
# dismiss:
blogs <- blogs[blogs != "http://domain"]
###################### End get r-blogger's URLs: #############################
###################### Begin Google Search: ##################################
# with a single lapply over all blogs Google complains about automated queries..
# I get blocked at around the 300th request..
# unlist(lapply(blogs, GoogleHits.1))
# splitting into two calls within one session doesn't help (blocked the same as before):
res1 <- unlist(lapply(blogs[1:170], GoogleHits.1))
res2 <- unlist(lapply(blogs[171:334], GoogleHits.1))
# try to do it in 2 sessions (saving the first result), or manually re-connect the host before the second try:
df1 <- data.frame(Blogs = blogs[1:170], NoHits = res1, row.names = NULL)
save(df1, file = "df1.RData")
# in the second session, restore the first result and remove the file:
load("df1.RData"); unlink("df1.RData")
# second run:
df2 <- data.frame(Blogs = blogs[171:334], NoHits = res2, row.names = NULL)
# bind dfs, sort by NoHits:
finres <- as.data.frame(rbind(df1, df2)); finres$Blogs <- as.character(finres$Blogs)
(finres <- finres[order(finres$NoHits, decreasing = TRUE), ])
htmltab <- xtable(finres)
print(htmltab, type = "html", include.rownames=FALSE, file = "Bloggers.Google.Hits.htm")
###################### End Google Search #####################################
###################### Begin Plot: ###########################################
pdf("RBloggersWebPresence.pdf")
par(mar = c(4.5, 4.5, 3, 2), ylog = FALSE)
plot(finres$NoHits, cex = 0.5, col = 3,
ylab = "No. of Hits in Google Search",
xlab = "Blogs", log = "y")
set.seed(19)
rid <- sample(13:nrow(finres), 15)
text(x = rid, y = finres$NoHits[rid],
labels = finres$Blogs[rid],
cex = 0.75, srt = 90, pos = 4, offset = -1)
title(main = "R-Bloggers' Web Presence")
dev.off()
###################### End Plot ##############################################
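The script notes that Google blocks the session after a few hundred requests. One thing that could be tried is pausing between requests and recording a failure as NA instead of aborting the whole run. The sketch below is only an illustration under that assumption: `fetch_throttled` and `fake_fetch` are hypothetical names, the delay value is arbitrary, and nothing here guarantees that a pause is enough to avoid being blocked. The demonstration runs offline with a stand-in function in place of `GoogleHits.1`.

```r
# Hypothetical helper: call fetch_fun on each input with a pause in between.
fetch_throttled <- function(inputs, fetch_fun, delay = 2) {
  results <- rep(NA_real_, length(inputs))
  for (i in seq_along(inputs)) {
    # a failed request yields NA instead of stopping the loop
    val <- tryCatch(as.numeric(fetch_fun(inputs[i])),
                    error = function(e) NA_real_)
    # guard against empty or multi-valued results
    if (length(val) != 1) val <- NA_real_
    results[i] <- val
    # wait before the next request (skipped after the last one)
    if (i < length(inputs)) Sys.sleep(delay)
  }
  results
}

# Offline stand-in for GoogleHits.1: fails for short inputs to mimic a block.
fake_fetch <- function(x) if (nchar(x) > 3) nchar(x) else stop("blocked")
demo <- fetch_throttled(c("alpha", "be", "gamma"), fake_fetch, delay = 0)
print(demo)
```

With the real function, the call would look like `fetch_throttled(blogs, GoogleHits.1)`, and entries that came back as NA could be retried in a later session.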
A quick glance shows many, many sites that are not especially R-related. Some could hardly even be described as blogs. For instance, http://www.stanford.edu ?
I don't think there's much value in the data, though the code is interesting.
I removed those that obviously are not blogs.
And, admittedly, the type of processing I used to get the final URLs for the Google search was not sensible in some cases. I'd need to check for that. (And bear in mind that this is a fun thing intended to illustrate the use of some of my latest code snippets.)
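The URL clean-up mentioned in the reply above could be made stricter with a single regular expression instead of counting slashes. This is a minimal sketch, not the processing used for the table: `extract_site` is a hypothetical helper, the example.org and example.com inputs are made up for illustration, and treating ":80" as droppable is an assumption.

```r
# Hypothetical helper: reduce blogroll links to scheme + host.
extract_site <- function(urls) {
  # keep scheme and host, drop any path, query or fragment
  site <- sub("^(https?://[^/?#]+).*$", "\\1", urls)
  # drop an explicit default HTTP port
  site <- sub(":80$", "", site)
  # anything that is not an http(s) URL is discarded
  site[!grepl("^https?://", urls)] <- NA_character_
  # remove discarded entries and repeated sites
  unique(site[!is.na(site)])
}

links <- c("http://www.twotorials.com/",
           "http://blog.example.org:80/?tag=r",
           "http://example.com/en/feed/",
           "http://example.com",
           "feed://example.org/rss")
sites <- extract_site(links)
print(sites)
```

Because the result is passed through `unique()`, a site that appears several times in the blogroll is only searched once.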
Meanwhile Google seems to have changed something. However, if you just remove the [[2]] in the following line, it still works:
res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
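Since the position of the hit count in the page can shift, one option is to stop relying on a fixed index and take the first number found in the extracted text values. The sketch below works on hard-coded sample strings, not on a live results page. `parse_hits` is a hypothetical helper, and the sample strings are assumptions about what the text might look like.

```r
# Hypothetical helper: pull the first number out of a vector of text values.
parse_hits <- function(values) {
  # first run of digits (with separators) in each value; non-matches are dropped
  m <- regmatches(values, regexpr("[0-9][0-9.,]*", values))
  if (length(m) == 0) return(NA_real_)
  # strip separators; note that "." is treated as a thousands separator here
  as.numeric(gsub("[^0-9]", "", m[1]))
}

sample_values <- c("Search tools", "About 82,300 results (0.21 seconds)")
hits <- parse_hits(sample_values)
print(hits)
```

Taking only the first run of digits keeps trailing numbers, such as a timing value, out of the count, and returning NA makes a failed extraction visible in the final table.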
Very interesting. I'm pleasantly surprised to see that ProgrammingR is the runner-up.