Friday 17 April 2015

Webscraping with rvest

There hasn’t been much activity here in the past month or two, mainly because I moved to Amsterdam. So this post will be about something close to the heart of anyone planning a move to a new city: house prices. I’m going to show how easy it is to scrape data from the web using the rvest package.

Since I’m in Amsterdam, I’m going to scrape search results for Amsterdam from this website: http://www.funda.nl.

# Import useful libraries
library(rvest)

# Set up an HTML session for the page we want to scrape
s <- html_session("http://www.funda.nl/koop/amsterdam/+15km/sorteer-datum-af/")

# Parse the page (html() is deprecated; read_html() is its replacement)
page <- read_html(s)

The function html_nodes requires either a CSS selector or an XPath expression to locate the data we are after, and picking the right one can be kinda tricky. Luckily an extremely helpful tool exists to help us find the CSS selector we need: http://selectorgadget.com/. With it you can easily select the address and price:

# Select the address of the properties
ADDRESS <- html_text(html_nodes(page, ".object-street"), trim = TRUE)

# Select the price of the properties
PRICE <- html_text(html_nodes(page, ".price"), trim = TRUE)

These two vectors can now be stored or written to a file for whatever purposes you have. It’s as simple as that!
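For instance, the prices come back as text (something like "€ 350.000 k.k."), so before doing any analysis you’ll want them as numbers. Here’s a small helper of my own (parse_price is not part of rvest, and the file name is just an example) that strips everything but the digits, plus a sketch of writing the results to a CSV file:

```r
# Hypothetical helper (not part of rvest): keep only the digits in a
# price string like "€ 350.000 k.k." and convert the result to a number
parse_price <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

# Using the ADDRESS and PRICE vectors scraped above (shown here with
# made-up values so the snippet runs on its own)
ADDRESS <- c("Keizersgracht 1", "Prinsengracht 2")
PRICE   <- c("€ 350.000 k.k.", "€ 475.000 k.k.")

df <- data.frame(address = ADDRESS, price_eur = parse_price(PRICE))
write.csv(df, "funda_amsterdam.csv", row.names = FALSE)
```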

Another nice feature is the follow_link function. This can be used to move on to the next page of results within the session you created. It’s so easy, in fact, that it’s a one-liner:

s <- follow_link(s, css = ".next")

And there you go! Webscraping at its absolute easiest! With a little extra code you can, for example, write a function that iterates over all the result pages for Amsterdam and creates a histogram of house prices (below).
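As a rough sketch of how that iteration might look (assuming the ".next" link and the selectors above keep working on the site; the function name, page limit and plot labels are my own choices):

```r
library(rvest)

# Sketch: walk through the result pages, collecting addresses and prices,
# until there is no ".next" link left (or we hit max_pages)
scrape_funda <- function(start_url, max_pages = 25) {
  s <- html_session(start_url)
  addresses <- character(0)
  prices    <- character(0)
  for (i in seq_len(max_pages)) {
    page      <- read_html(s)
    addresses <- c(addresses, html_text(html_nodes(page, ".object-street"), trim = TRUE))
    prices    <- c(prices, html_text(html_nodes(page, ".price"), trim = TRUE))
    # follow_link errors when no matching link exists, so stop there
    s <- tryCatch(follow_link(s, css = ".next"), error = function(e) NULL)
    if (is.null(s)) break
  }
  data.frame(address = addresses, price = prices, stringsAsFactors = FALSE)
}

results <- scrape_funda("http://www.funda.nl/koop/amsterdam/+15km/sorteer-datum-af/")

# Strip the non-digits from the price strings and plot
hist(as.numeric(gsub("[^0-9]", "", results$price)),
     main = "Amsterdam asking prices", xlab = "Price (EUR)")
```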

Cheers