Wednesday, 10 September 2014

Read big data into R

Read big data into R

So this is a neat solution to a problem that I find comes up often 'in the field'. Generally the question is something like:

  • I have a really big csv, how can I view it in R?
  • How can I quickly run stats on some big data?
  • What's the sum of 'a column from a big file' that we can't open in MS Excel?

Or, a more technical question:

  • How can I read a csv into R that doesn't fit into RAM?

The latter question hits the nail on the head. The issue is that computers use RAM to open and read files, and if the data is bigger than what is available you're often stuck. What happens? MS Excel will not load properly, notepad/gedit/whatever will 'freeze' or you'll start to go a little crazy waiting for your hard-drive play catch up swapping data in and out of memory. R is no exception - you can't read in data that is greater than the available memory all at once. Just not gonna happen.

Now I use a relatively old machine at home; an Intel i3 dual core with 4gb of ram. I could use something more powerful or even one of many online solutions like elastic cloud (or whatever they're called now), but I get to play with some cool servers at work and a new computer also costs money! It's also good to be forced to consider the computational complexity of what you're doing every now and then. Where would be the fun of doing everything on a supercomputer? ...Basically I'm just too cheap to buy something new.

Anyway, back on track: there is a simple, elegant way to handle the memory issue by way of the ff package. Assuming you have R correctly installed and you have internet access, you can install the ff package very easily:

You'll need to import the library once you've installed it. The code to do both is simple:

# Install and load the ff package
install.packages("ff",repos="http://cran.rstudio.com/")
library(ff)

I'm going to read in some data from an old Kaggle competition. The file is named transactions.csv. You can find the link to it here (just be warned: it's just under 3gb compressed and 21gb uncompressed).

Now this example isn't amazingly fast - but it gets the job done. I'm going to read in the first 10 million rows in this example...

# read ff object into R with a chunk size of 2,000,000
data = read.table.ffdf(
  file = "transactions.csv",
  fileEncoding = "utf8",
  sep = ",",
  header = TRUE,
  nrows = 10000000,
  next.rows = 2000000,
  VERBOSE = TRUE)
## read.table.ffdf 1..2000000 (2000000)  csv-read=15.77sec ffdf-write=1.308sec
## read.table.ffdf 2000001..4000000 (2000000)  csv-read=10.5sec ffdf-write=0.663sec
## read.table.ffdf 4000001..6000000 (2000000)  csv-read=10.69sec ffdf-write=0.742sec
## read.table.ffdf 6000001..8000000 (2000000)  csv-read=10.88sec ffdf-write=0.727sec
## read.table.ffdf 8000001..10000000 (2000000)  csv-read=10.64sec ffdf-write=0.966sec
##  csv-read=58.49sec  ffdf-write=4.406sec  TOTAL=62.89sec

It's not lightening fast, it took me just over a minute to read in 10 million rows, but hey - everything is working fine and I'm no where near filling up my memory quotient. Awesome.

As you can see, there are a few parameters you can modify to suit you needs. I've specified the file encoding to UTF-8 here, it's not necessary for this dataset but it can often be an issue on other datasets. Notice that next.rows = 2000000 specifies that we will read in chunks of 2 million rows at a time, kinda like a buffer. You should experiment with this to work out what is best for your machine/data combination.

Now from this point on we can treat the ff object almost the same as we would a data frame.

We can run basic statistics on the dataset:

# Summary statistics of 'purchaseamount' column
summary(data[,"purchaseamount"])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -266.0     2.3     3.5     4.6     5.5  3000.0

Just remember that some things don't behave as we might expect. Let's try this:

# Summary of all columns?
summary(data)
##                  Length   Class     Mode
## id               10000000 ff_vector list
## chain            10000000 ff_vector list
## dept             10000000 ff_vector list
## category         10000000 ff_vector list
## company          10000000 ff_vector list
## brand            10000000 ff_vector list
## date             10000000 ff_vector list
## productsize      10000000 ff_vector list
## productmeasure   10000000 ff_vector list
## purchasequantity 10000000 ff_vector list
## purchaseamount   10000000 ff_vector list

Yeh... it doesn't do what we'd expect. Let's try a simple little trick: sapply.Unless you're completely new to R, you'll know for loops are horrendously slow. So yeh - just don't do it.

You can get around writing for loops (perhaps via paradigm shift/head trauma if you're totally not used to R) by calling a function within the sapply function. It looks like this (I'm just running from the 2nd to 6th column):

# Use sapply to iterate summary over every column
sapply(2:6,FUN=function(x) summary(data[,x]))
##          [,1] [,2]  [,3]     [,4]   [,5]
## Min.      2.0  0.0     0 1.00e+04      0
## 1st Qu.  14.0 18.0  1890 1.02e+08   7850
## Median   15.0 36.0  3630 1.04e+08  13300
## Mean     26.3 41.7  4180 2.20e+08  15900
## 3rd Qu.  18.0 60.0  6010 1.07e+08  17100
## Max.    217.0 99.0 10000 1.10e+10 109000

Most other basic stats run per usual. Just be wary that you're not dealing with a data-frame. Histograms are (as expected) simple:

# Produce a histogram of the productsize
hist(log(data[,"purchaseamount"]),xlab = "purchaseamount",main = "Histogram of log of Purchase Amount")

plot of chunk unnamed-chunk-7

As with barplots:

# Top 5 occuring dates using summary
dates_top5 = head(summary(data[,"date"]),5)

# Plot the top five dates
barplot(dates_top5,main = "Top five occuring dates")

plot of chunk unnamed-chunk-8

Whilst things aren't instant, within just a few minutes you can do get some vital numbers on files that seemingly wouldn't load in memory. This really gives you enough firepower to tackle most problems that you should come across. If you're looking to do this frequently then I would suggest investing in a database or server time. Nevertheless, there are plenty more advanced tricks available, and many exist within the ff package. This is just a very quick and basic introduction into a really awesome package.

Cheers!

No comments:

Post a Comment