theBioBucket*: A Little Webscraping-Exercise...

23 Oct 2011

A Little Webscraping-Exercise...

In R it is quite easy to pull content out of a web page, and here is a little exercise in doing so. I retrieve all blog addresses listed on R-bloggers by reading the page's HTML with readLines() and then processing the returned lines.

# get the page's html-code
web_page <- readLines("http://www.r-bloggers.com")

# extract relevant part of web page:
# find every line that contains a <ul> or </ul> tag
# (this line was missing from the original post and was added on 24 Oct):
ul_tags <- grep("ul>", web_page)

pos_1 <- grep("Contributing Blogs", web_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2

blog_list_1 <- web_page[pos_1:pos_2]

# extract the 2nd element of each sublist produced by strsplit(),
# i.e. the text between the first pair of double quotes:
blog_list_2 <- unlist(lapply(strsplit(blog_list_1, "\""), "[[", 2))

# exclude elements without a proper address (http or https):
blog_list_3 <- blog_list_2[grep("^https?://", blog_list_2)]

# plot results:
len <- length(blog_list_3)
x <- rep(1:3, ceiling(len/3))[1:len]
y <- 1:len

par(mar = c(0, 5, 0, 5), xpd = TRUE)
plot(x, y, ylab = "", xlab = "", type = "n",
     bty = "n", axes = FALSE)
text(x, y, blog_list_3, cex = 0.5)
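
The extraction steps above can be tried without a network connection. The following is only a sketch: the vector sample_page is a made-up stand-in for what readLines() returns, the example.org addresses are placeholders, and the real R-bloggers markup may well look different. It also swaps the "[[" extraction for a vapply() call that returns NA for lines without a quoted attribute, which is a small robustness change and not part of the original script.

```r
# Hypothetical sample of page lines (an assumption, not the real R-bloggers HTML)
sample_page <- c(
  "<h3>Contributing Blogs</h3>",
  "<ul>",
  "<li><a href=\"http://example.org/blog-a\">Blog A</a></li>",
  "<li><a href=\"/relative/link\">Relative</a></li>",
  "<li>no link here</li>",
  "<li><a href=\"https://example.org/blog-b\">Blog B</a></li>",
  "<!-- end of list -->",
  "</ul>"
)

# line numbers of all <ul> and </ul> tags
ul_tags <- grep("ul>", sample_page)

# first list item sits two lines below the heading,
# last list item two lines above the next ul tag
pos_1 <- grep("Contributing Blogs", sample_page) + 2
pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
blog_list_1 <- sample_page[pos_1:pos_2]

# split each line at double quotes and keep the 2nd piece;
# lines without any quoted text yield NA instead of an error
parts <- strsplit(blog_list_1, "\"")
blog_list_2 <- vapply(
  parts,
  function(p) if (length(p) >= 2) p[2] else NA_character_,
  character(1)
)

# keep only absolute http/https addresses (grepl() is FALSE for NA)
blog_list_3 <- blog_list_2[grepl("^https?://", blog_list_2)]
print(blog_list_3)
```

The fixed offsets (+2 and -2) depend entirely on the page layout, so they need to be checked against the current HTML before the script is run on the live site.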

6 comments:

xi'an, 23 October 2011 at 07:03
I tried your code: it seems like there is a line missing to define ul_tags

> pos_2 <- ul_tags[which(ul_tags > pos_1)[1]] - 2
Error: object 'ul_tags' not found
Blake Dale, 23 October 2011 at 19:14
This comment has been removed by the author.
Christos, 24 October 2011 at 05:23
A line of code is missing from your script after
web_page <- readLines("http://www.r-bloggers.com")

# missing
ul_tags <- grep("ul>", web_page)
Kay, 24 October 2011 at 10:16
Thanks Christos! Indeed, the line from your comment was missing.

Blake, thanks for pointing out another approach!
vps forex, 28 January 2014 at 07:36
great thanks for sharing
i should digg this article mate
Anu, 13 October 2017 at 12:14
Nice blog and great information about web scraping. Recently came across the tool mobito for web scraping.