6 Apr 2012

R-Bloggers' Web-Presence

We love them, we hate them: RANKINGS!

Rankings are an inevitable tool to keep the human rat race going. In this regard I'll pick up my last two posts (HERE & HERE) and have some fun with it by using it to analyse R-Bloggers' web presence. I will use number of hits in Google Search as an indicator.

I searched for URLs like this: https://www.google.com/search?q="http://www.twotorials.com" - meaning that only the exact blog-URL is searched.

Blogs NoHits
http://google-opensource.blogspot.com 82300
http://www.programmingr.com 73500
http://googleresearch.blogspot.com 58000
http://dirk.eddelbuettel.com 53000
http://borasky-research.net 33100
http://casoilresource.lawr.ucdavis.edu 32500
http://andrewgelman.com 30000
http://yihui.name 29600
http://xianblog.wordpress.com 27900
http://nsaunders.wordpress.com 27600
http://chem-bla-ics.blogspot.com 26600
http://plindenbaum.blogspot.com 24600
http://blog.ouseful.info 24300
http://www.vcasmo.com 24200
http://yz.mit.edu 23500
http://romainfrancois.blog.free.fr 22700
http://blog.revolutionanalytics.com 21000
http://robjhyndman.com 18400
http://freakonometrics.blog.free.fr 16100
http://perfdynamics.blogspot.com 15400
http://www.stubbornmule.net 14800
http://zoonek.free.fr 14800
http://jackman.stanford.edu 13900
http://www.bytemining.com 13700
http://learnr.wordpress.com 12600
http://tommy.chheng.com 12200
http://mazamascience.com 12000
http://www.investuotojas.eu 11500
http://www.r-statistics.com 11300
http://www.franklincenterhq.org 10800
http://gettinggeneticsdone.blogspot.com 10700
http://mpastell.com 9930
http://pineda-krch.com 9780
http://blog.saush.com 9220
http://www.premiersoccerstats.com 8950
http://developmentality.wordpress.com 7250
http://www.dataspora.com 7200
http://blog.hiremebecauseimsmart.com 7050
http://isomorphismes.tumblr.com 7040
http://www.mathfinance.cn 6930
http://blog.nguyenvq.com 6150
http://www.drewconway.com 5970
http://www.carlboettiger.info 5520
http://www.statisticsblog.com 5110
http://www.decisionsciencenews.com 4950
http://www.r-chart.com 4810
http://chartsgraphs.wordpress.com 4480
http://www.portfolioprobe.com 4410
http://procomun.wordpress.com 4330
http://jeromyanglim.blogspot.com 4080
http://spatialanalysis.co.uk 4080
http://www.theresearchkitchen.com 4080
http://www.forex-bloggers.com 4070
https://www.rmetrics.org 4050
http://princeofslides.blogspot.com 3900
http://www.cybaea.net 3740
http://www.cerebralmastication.com 3710
http://ygc.name 3670
http://ryouready.wordpress.com 3450
http://jeffreybreen.wordpress.com 3410
http://systematicinvestor.wordpress.com 3400
http://sgsong.blogspot.com 3310
http://industrialengineertools.blogspot.com 3290
http://www.r-tutor.com 3270
http://fishlab.ucdavis.edu 3270
http://ggorjan.blogspot.com 3250
http://blog.ynada.com 3220
http://farmacokratia.blogspot.com 3170
http://4dpiecharts.com 3130
http://heuristically.wordpress.com 3040
http://blog.rtwilson.com 2910
http://www.wekaleamstudios.co.uk 2890
http://www.dataists.com 2840
http://ikanb.wordpress.com 2750
http://shape-of-code.coding-guidelines.com 2730
http://onertipaday.blogspot.com 2710
http://blog.fosstrading.com 2700
http://blog.echen.me 2690
http://www.theusrus.de 2670
http://cloudnumbers.com 2630
http://paulbutler.org 2620
http://biostatmatt.com 2460
http://www.johnmyleswhite.com 2430
http://dataninja.wordpress.com 2360
http://realizationsinbiostatistics.blogspot.com 2340
http://statisfaction.wordpress.com 2300
http://uxblog.idvsolutions.com 2250
http://timelyportfolio.blogspot.com 2210
http://radfordneal.wordpress.com 2200
http://sas-and-r.blogspot.com 2200
http://pairach.com 2110
http://yusung.blogspot.com 2050
http://blog.flacso.edu.mx 2010
http://www.rensenieuwenhuis.nl 2000
http://michaeldhealy.com 1990
http://freigeist.devmag.net 1950
http://www.fernandohrosa.com.br 1920
http://statbandit.wordpress.com 1870
http://www.win-vector.com 1840
http://lukemiller.org 1830
http://ropensci.org 1720
http://www.eggwall.com 1650
http://benmazzotta.wordpress.com 1620
http://bms.zeugner.eu 1610
http://cartesianfaith.wordpress.com 1580
http://linkedscience.org 1570
http://stevemosher.wordpress.com 1550
http://intelligenttradingtech.blogspot.com 1520
http://www.imachordata.com 1480
http://blog.diegovalle.net 1470
http://jermdemo.blogspot.com 1430
http://nortalktoowise.com 1420
http://ekonometrics.blogspot.com 1340
http://digitheadslabnotebook.blogspot.com 1320
http://flyordie.sin.khk.be 1310
http://schamberlain.github.com 1230
http://gribblelab.org 1180
http://www.quantf.com 1130
http://offensivepolitics.net 1020
http://www.markmfredrickson.com 981
http://blog.mckuhn.de 948
http://erehweb.wordpress.com 889
http://confounding.net 886
http://simplystatistics.tumblr.com 875
http://www.babelgraph.org 859
http://csgillespie.wordpress.com 857
http://joewheatley.net 844
http://helmingstay.blogspot.com 843
http://theaverageinvestor.wordpress.com 825
http://quantitative-ecology.blogspot.com 785
http://zvfak.blogspot.com 776
http://ucfagls.wordpress.com 766
http://opendatagroup.com 760
http://cameron.bracken.bz 740
http://rtutorialseries.blogspot.com 738
http://opencpu.org 708
http://novicemetrics.blogspot.com 700
http://lamages.blogspot.com 680
http://nir-quimiometria.blogspot.com 679
http://tonybreyal.wordpress.com 677
http://brokeringclosure.wordpress.com 658
http://socialdatablog.com 643
http://dancingeconomist.blogspot.com 629
http://www.rtexttools.com 603
http://danganothererror.wordpress.com 589
http://thebiobucket.blogspot.com 567
http://holtmeier.de 531
http://val-systems.blogspot.com 519
http://thelogcabin.wordpress.com 489
http://dcemri.blogspot.com 484
http://rdatamining.wordpress.com 477
http://bridgewater.wordpress.com 460
http://www.rcasts.com 444
http://dsparks.wordpress.com 436
http://pr.cloudst.at 422
http://polstat.org 409
http://www.compmath.com 401
http://techno-realism.blogspot.com 399
http://www.backsidesmack.com 395
http://geotheory.org 393
http://miraisolutions.wordpress.com 367
http://econometricsense.blogspot.com 352
http://blog.binfalse.de 344
http://rforcancer.drupalgardens.com 317
http://blog.rstudio.org 316
http://mcfromnz.wordpress.com 309
http://www.quantumforest.com 309
http://blog.quanttrader.org 303
http://chrisladroue.com 293
http://www.michaelbommarito.com 289
http://procrun.com 280
http://mikeksmith.posterous.com 279
http://bio7.org 278
http://kbroman.wordpress.com 278
http://martynplummer.wordpress.com 272
http://bryer.org 268
http://www.funjackals.com 265
http://www.harlan.harris.name 252
http://www.milktrader.net 248
http://www.surefoss.org 241
http://rigorousanalytics.blogspot.com 231
http://www.jameskeirstead.ca 229
http://programming-r-pro-bro.blogspot.com 225
http://plausibel.blogspot.com 224
http://statistic-on-air.blogspot.com 217
http://mintgene.wordpress.com 212
http://moderntoolmaking.blogspot.com 205
http://quantitativeecology.blogspot.com 199
http://www.sigmafield.org 199
http://www.ancienteco.com 194
http://worldofrcraft.blogspot.com 191
http://rappster.wordpress.com 190
http://stotastic.com 189
http://evolvingspaces.blogspot.com 184
http://strugglingthroughproblems.blogspot.com 166
http://sharpstatistics.co.uk 161
http://leftcensored.skepsi.net 160
http://omegahat.wordpress.com 156
http://drunks-and-lampposts.com 155
http://amathew.com 152
http://onlinelabor.blogspot.com 147
http://johnramey.net 144
http://gossetsstudent.wordpress.com 138
http://tomhopper.wordpress.com 135
http://ggobi.blogspot.com 134
http://blog.fellstat.com 131
http://www.openanalytics.eu 130
http://www.numbertheory.nl 127
http://stats.blogoverflow.com 127
http://the-praise-of-insects.blogspot.com 122
http://lpenz.github.com 118
http://christophergandrud.blogspot.com 118
http://f.giorlando.org 112
http://bayesianbiologist.com 110
http://www.graphoftheweek.org 109
http://oneliner.soma20.com 109
http://inundata.org 107
http://geokook.wordpress.com 104
http://blog.datapunks.com 102
http://eranraviv.com 102
http://eranraviv.com 102
http://www.compbiome.com 101
http://www.techpolicy.ca 99
http://www.psychwire.co.uk 97
http://blog.carlislerainey.com 93
http://vasishth-statistics.blogspot.com 93
http://www.statsravingmad.com 93
http://using-r-project.blogspot.com 93
http://www.nikhilgopal.com 92
http://thedatamonkey.blogspot.com 92
http://jeffreyhorner.tumblr.com 90
http://menugget.blogspot.com 88
http://www.twotorials.com 88
http://dataexcursions.wordpress.com 84
http://viksalgorithms.blogspot.com 83
http://exploringdatablog.blogspot.com 81
http://sachaepskamp.com 81
http://aphysicistinwallstreet.blogspot.com 77
http://lastresortsoftware.blogspot.com 75
http://www.nomad.priv.at 72
http://applyr.blogspot.com 71
http://www.knowledgediscovery.jp 71
http://weitaiyun.blogspot.com 71
http://xmphforex.wordpress.com 71
http://statsadventure.blogspot.com 70
http://davenportspatialanalytics.squarespace.com 70
http://anandram.wordpress.com 69
http://rpint.wordpress.com 68
http://datadebrief.blogspot.com 66
http://blog.cloudstat.org 64
http://www.r-podcast.org 64
http://rmkrug.wordpress.com 62
http://denishaine.wordpress.com 61
http://expansed.com 58
http://r.andrewredd.us 57
http://isseing333.blogspot.com 57
http://solomonmessing.wordpress.com 57
http://rtricks.wordpress.com 57
http://anrprogrammer.wordpress.com 56
http://arungaikwad.wordpress.com 56
http://geolabs.wordpress.com 55
http://lookingatdata.blogspot.com 55
http://factbased.blogspot.com 54
http://severity.blogspot.com 54
http://swordofcrom.wordpress.com 53
http://librestats.wordpress.com 51
http://marcinkula.wordpress.com 51
http://gsoc2010r.wordpress.com 47
http://psyccomputing.blogspot.com 46
http://fabiomarroni.wordpress.com 45
http://jedifran.com 45
http://alstatr.blogspot.com 43
http://r-video-tutorial.blogspot.com 42
http://alexfarquhar.posterous.com 40
http://bmb-common.blogspot.com 40
http://rdataviz.wordpress.com 40
http://mypapertrades.blogspot.com 38
http://pitchrx.blogspot.com 38
http://simonmueller.net 38
http://statisfactions.wordpress.com 37
http://nzprimarysectortrade.wordpress.com 36
http://seanmulcahy.blogspot.com 36
http://www.speakingstatistically.com 35
http://joshpaulson.wordpress.com 34
http://learningrbasic.blogspot.com 34
http://mockquant.blogspot.com 33
http://costaleconomist.blogspot.com 32
http://rsnippets.blogspot.com 31
http://statmethods.wordpress.com 29
http://aviadklein.wordpress.com 28
http://obeautifulcode.com 28
http://blog.cloudst.at 24
http://rstats.posterous.com 23
http://notebookonthewebs.tumblr.com 22
http://0utlier.blogspot.com 21
http://gjkerns.github.com 21
http://eigensomething.blogspot.com 10
http://brocktibert.wordpress.com 9
http://toddjobe.blogspot.com 9
http://mickeymousemodels.blogspot.com 9
http://forgetfulfunctor.blogspot.com 9
http://rocknrblog.wordpress.com 9
http://dmbates.blogspot.com 8
http://blog.nextbiomotif.com 8
http://indiacrunchin.wordpress.com 8
http://blog.trenthauck.com 8
http://mikescnc.blogspot.com 8
http://jeroldhaas.blogspot.com 8
http://tlevine.tumblr.com 8
http://empty-moon-9726.heroku.com 8
http://www.proc-x.com 7
http://jointposterior.blogspot.com 7
http://gastonsanchez.wordpress.com 7
http://mlt-thinks.blogspot.com 7
http://rstats.wordpress.com 7
http://playingwithr.blogspot.com 7
http://scottmutchler.blogspot.com 6
http://iamdata.wordpress.com 6
http://sfchaos.blogspot.com 6
http://nightlordtw.wordpress.com 5
http://pleasepasstheroc.blogspot.com 5
http://wiekvoet.blogspot.com 5
http://d7.stattler.com 4
http://yetanotherrblog.blogspot.com 4
http://blog.iwanluijks.nl:80 3
https://rlearner.wordpress.com 3
http://margintale.blogspot.com 1

When checking the results manually I discovered slight deviations in the numbers and admittedly have no clue why this is.. Sorry if any blog is under- overrepresented due to such an error - please report!

Here is the R-script:

require(XML)
library(stringr)
library(RCurl)
library(xtable)

GoogleHits.1 <- function(input)
   {
    url <- paste("https://www.google.com/search?q=\"",
                 input, "\"", sep = "")
 
    CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
    doc <- htmlParse(script)
    res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
    return(as.integer(gsub("[^0-9]", "", res)))
   }

# Example:
GoogleHits.1("R%Statistical%Software")

###################### Begin get r-blogger's URLs: ###########################################
# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")

# extract sensible blog urls:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# replace the ones with 2 slashes:
blogs <- slash_3; blogs[id] <- slash_2

# dismiss:
blogs <- blogs[blogs != "http://domain"]
###################### End get r-blogger's URLs: #############################

###################### Begin Google Search: ##################################
# with lapply google mocks about roboting the site..
# I'm blocked on the 300th recursion..
# unlist(lapply(blogs, GoogleHits.1))

# try splitting, doesn't work (blocked the same as before)
res1 <- unlist(lapply(blogs[1:170], GoogleHits.1))
res2 <- unlist(lapply(blogs[171:334], GoogleHits.1))

# try to do it in 2 sessions (saving first result), or manually re-connnect host before second try:
df1 <- data.frame(Blogs = blogs[1:170], NoHits = res1, row.names = NULL)
save(df1, file = "df1.R")
load("df1.RData"); unlink("df1.RData")

# second run:
df2 <- data.frame(Blogs = blogs[171:334], NoHits = res2, row.names = NULL)

# bind dfs, sort by NoHits:
finres <- as.data.frame(rbind(df1, df2)); finres$Blogs <- as.character(finres$Blogs)
(finres <- finres[order(finres$NoHits, decreasing = T), ])

htmltab <- xtable(finres)
print(htmltab, type = "html", include.rownames=FALSE, file = "Bloggers.Google.Hits.htm")
###################### End Google Search #####################################

###################### Begin Plot: ###########################################
pdf("RBloggersWebPresence.pdf")
par(mar = c(4.5, 4.5, 3, 2), ylog = F)
plot(finres$NoHits, cex = 0.5, col = 3, 
     ylab = "No. of Hits in Google Search",
     xlab = "Blogs", log = "y")
set.seed(19)
rid <- sample(13:nrow(finres), 15)
text(x = rid, y = finres$NoHits[rid], 
     labels = finres$Blogs[rid],
     cex = 0.75, srt = 90, pos = 4, offset = -1) 
title(main = "R-Bloggers' Web Presence")
dev.off()
###################### End Plot ##############################################

5 comments :

  1. A quick glance shows many, many sites that are not especially R-related. Some could hardly even be described as blogs. For instance, http://www.stanford.edu ?

    I don't think there's much value in the data, though the code is interesting.

    ReplyDelete
    Replies
    1. ..I removed those that obviously are not blogs.

      Delete
    2. And, admittedly the type of processing I used to get the final URLs for the Google search was not sensible in some cases.. I'd need to check for that. (And, bear in mind that this a fun thing intended to illustrate the use of some of my latest code snipptes).

      Delete
  2. Meanwhile Google seem to have changed something. However if just just remove the [[2]] in the following line, it still works:
    res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]

    ReplyDelete
  3. Very interesting. I'm pleasantly surprised to see that ProgrammingR is the runner-up.

    ReplyDelete