A Little Webscraping-Exercise...
23 Oct 2011

In R it's quite easy to pull anything out of a webpage, and I'll show a little exercise in doing so. Here I retrieve all blog addresses from R-bloggers with the function readLines() and some subsequent data processing.

# get the page's html-code
web_page <- readLines("http://www.r-bloggers.com")

# extract the relevant part of the web page
# (missing line added on Oct. 24th):
ul_tags <- grep("ul>", web_page)
pos_1 <- grep("Contributing Blogs", web_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
blog_list_1 <- web_page[pos_1:pos_2]

# extract the 2nd element of the sublists produced by strsplit():
blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))

# exclude elements without a proper address:
blog_list_3 <- blog_list_2[grep("http:", blog_list_2)]

# plot the results:
len <- length(blog_list_3)
x <- rep(1:3, ceiling(len/3))[1:len]
y <- 1:len
par(mar = c(0, 5, 0, 5), xpd = TRUE)
plot(x, y, ylab = "", xlab = "", type = "n", bty = "n", axes = FALSE)
text(x, y, blog_list_3, cex = 0.5)
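To see how the strsplit() step works without fetching anything from the web, here is a small self-contained sketch on made-up HTML lines (the example URLs are invented for illustration): splitting each line at the double quotes puts the href value in the second element of every sublist.

```r
# toy HTML lines standing in for the scraped <li> elements:
sample_lines <- c('<li><a href="http://example-one.com">One</a></li>',
                  '<li><a href="http://example-two.com">Two</a></li>')

# split each line at the quote characters;
# element 2 of each resulting sublist is the URL:
parts <- strsplit(sample_lines, "\"")
urls <- unlist(lapply(parts, "[[", 2))
urls
# "http://example-one.com" "http://example-two.com"
```

The same pattern is what the post's blog_list_2 line does on the real page source.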
Comments:

"I tried your code: it seems like there is a line missing to define ul_tags:
> pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
Error: object 'ul_tags' not found"
"A line of code is missing from your script after
web_page <- readLines("http://www.r-bloggers.com")
# the missing line:
ul_tags <- grep("ul>", web_page)"
Thanks Christos! Indeed, the line from your comment was missing.

Blake, thanks for pointing out another approach!
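One such alternative approach (a sketch of my own, not necessarily the one Blake suggested) avoids positional indexing altogether and pulls the href attributes out with a regular expression; the HTML lines and URLs below are made up for illustration:

```r
# toy HTML lines standing in for the page source:
html_lines <- c('<li><a href="http://example-blog-one.com">Blog One</a></li>',
                '<li>no link here</li>',
                '<li><a href="http://example-blog-two.org">Blog Two</a></li>')

# find every href="..." occurrence, then strip the surrounding syntax:
matches <- regmatches(html_lines, gregexpr('href="[^"]*"', html_lines))
urls <- gsub('href="|"', "", unlist(matches))
urls
# "http://example-blog-one.com" "http://example-blog-two.org"
```

This is more robust than counting lines relative to "Contributing Blogs", since it does not break when the page layout shifts.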