# get the page's html-code
web_page <- readLines("http://www.r-bloggers.com")
# extract relevant part of web page:
# missing line added on oct. 24th:
ul_tags <- grep("ul>", web_page)
pos_1 <- grep("Contributing Blogs", web_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
blog_list_1 <- web_page[pos_1:pos_2]
# extract 2nd element of the sublists produced by strsplit():
blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))
# exclude elements without a proper address:
blog_list_3 <- blog_list_2[grep("http:", blog_list_2)]
# plot results:
len <- length(blog_list_3)
x <- rep(1:3, ceiling(len/3))[1:len]
y <- 1:len
par(mar = c(0, 5, 0, 5), xpd = TRUE)
plot(x, y, ylab = "", xlab = "", type = "n",
     bty = "n", axes = FALSE)
text(x, y, blog_list_3, cex = 0.5)
23 Oct 2011
A Little Webscraping-Exercise...
In R it's quite easy to pull anything out of a web page, and I'll show a little exercise in doing so.
Here I retrieve all blog addresses from R-bloggers with the function readLines() and some subsequent string processing.
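To see what the string processing does without hitting the live site, here is a minimal sketch on a few made-up lines. The vector `sample_lines` and the example.org addresses are assumptions for illustration only, not actual R-bloggers content, and `vapply()` is swapped in for `lapply()` with `"[["` so that a line without a quoted attribute would yield NA instead of an error.

```r
# Hypothetical stand-in for the lines between pos_1 and pos_2
# (made-up markup, not taken from the real page)
sample_lines <- c(
  "<li><a href=\"http://example.org/blog-a\" title=\"Blog A\">Blog A</a></li>",
  "<li><a href=\"http://example.org/blog-b\">Blog B</a></li>",
  "<li><a href=\"/relative/link\">Internal link</a></li>"
)

# Splitting on the double quote puts the href value in position 2 of each piece
pieces <- strsplit(sample_lines, "\"")
second <- vapply(pieces, function(p) p[2], character(1))

# Keep only entries that start with an absolute http address
urls <- second[grepl("^http:", second)]
print(urls)
```

The relative link is dropped by the last step, which is the same job the grep("http:", ...) filter does in the script.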

I tried your code; it seems there is a line missing to define ul_tags.

> pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
Error: object 'ul_tags' not found
I think you forgot to define ul_tags in the code above.

Here's an alternative way of getting the same data using XPath:
# RSTART
library(RCurl)
library(XML)
doc <- getURL('http://www.r-bloggers.com')
html <- htmlTreeParse(doc, useInternalNodes = TRUE)
atts <- xpathApply(html, '//ul[@class="xoxo blogroll"]//a[@href]', xmlToList)
df <- data.frame(name = sapply(atts, function(x) x$text),
                 url = sapply(atts, function(x) x$.attrs[[1]]),
                 stringsAsFactors = FALSE)
> df[1:5,]
                name                            url
1        Forex blogs http://www.forex-bloggers.com/
2          SAS blogs          http://www.sas-x.com/
3     “R” you ready? http://ryouready.wordpress.com
4  4D Pie Charts » R         http://4dpiecharts.com
5 @yaaang’s blog » R           http://yz.mit.edu/wp
# REND
Tony Breyal

A line of code is missing from your script after:
web_page <- readLines("http://www.r-bloggers.com")
# missing
ul_tags <- grep("ul>", web_page)

Thanks, Christos! Indeed, the line from your comment was missing.

Blake, thanks for pointing out another approach!