Tuesday 23 September 2014

Beware: Coercion of variable types in Apply

I’m sure others have covered this better on http://www.stackoverflow.com, but I thought I’d write it up here as it’s something that pops up every now and then in my work. After all, it took me a few coffees and head-scratching sessions to uncover the full detail of this little nuance of R.

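apply coerces its input to a single type before handing it to the FUN parameter. When it is given a data frame, it first converts it to a matrix, and a matrix can only hold one type. What? Let’s get an example to show what I mean - I’ll use the iris data set as it should be familiar to most people.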

Here is an example of column-wise application of the mean function to the iris data set. I’m just going to apply it to the first four columns (the fifth and final column is a factor). It’s a straightforward application from the documentation.

apply(iris[,1:4],2,FUN=mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.843        3.057        3.758        1.199

It’s pretty easy to see what is going on: the means of the first, second, third and fourth columns are calculated and returned. But what if we include the fifth column?

apply(iris[,1:5],2,FUN=mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##           NA           NA           NA           NA           NA

The result is NA for every column because apply has coerced all of the columns, including the four numeric ones, to character, and the mean of a string isn’t really a sensible operation. Don’t quite believe me? Have a look.

# Variable type of first four columns using apply
apply(iris[,1:4],2,FUN=class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    "numeric"    "numeric"    "numeric"    "numeric"

They’re all numeric (as expected). Now look what happens when we run the same code with the fifth column included.

# Variable type of all columns using apply
apply(iris[,1:5],2,FUN=class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##  "character"  "character"  "character"  "character"  "character"

As we can see, the inclusion of the Species column (which is a factor) causes the coercion to take place.
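Here is a small sketch of what appears to be happening under the hood, plus one possible way around it. apply converts a data frame with as.matrix before calling FUN, so the first part calls as.matrix directly to show the type change. The second part uses sapply, which visits each column as-is. The variable names (m_num, m_all, col_classes, is_num, num_means) are mine and purely for illustration.

```r
# as.matrix() is what apply() uses to turn a data frame into a matrix.
m_num <- as.matrix(iris[, 1:4])  # numeric columns only
m_all <- as.matrix(iris[, 1:5])  # numeric columns plus the Species factor
typeof(m_num)  # storage type of the numeric-only matrix
typeof(m_all)  # storage type once the factor column is included

# sapply() works column by column, so each column keeps its own class.
col_classes <- sapply(iris, class)
col_classes

# Take the mean of the numeric columns only, whichever they happen to be.
is_num <- sapply(iris, is.numeric)
num_means <- sapply(iris[, is_num], mean)
num_means
```

Selecting the columns with is.numeric rather than by position means the code keeps working if the column order changes, although it will quietly skip any column that has turned into text.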

I find this problem crops up when you’re running some older code over new data, and variable types have unexpectedly changed (the guys and girls in reporting just decided to place ‘$’ signs in front of monthly revenue, etc.). This is not necessarily an issue within R, more something that occasionally catches you off-guard.
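One way to guard against this is to check the column types before calling apply, so the code fails loudly instead of quietly returning NA. The sketch below uses a made-up sales data frame (the column names and values are hypothetical, not from any real report) where revenue has arrived as text because of the ‘$’ signs.

```r
# Hypothetical data: revenue arrives as text because of the leading '$'.
sales <- data.frame(
  month   = c("Jan", "Feb", "Mar"),
  revenue = c("$1200", "$950", "$1430"),
  units   = c(10, 8, 12),
  stringsAsFactors = FALSE
)

# Strip the '$' and convert the column back to numeric.
sales$revenue <- as.numeric(gsub("$", "", sales$revenue, fixed = TRUE))

# Fail early if any column about to be passed to apply() is not numeric.
cols <- c("revenue", "units")
stopifnot(all(sapply(sales[, cols], is.numeric)))

# All selected columns are numeric, so apply() no longer coerces to character.
col_means <- apply(sales[, cols], 2, FUN = mean)
col_means
```

If the cleaning step is left out, the stopifnot call stops the script with an error, which is usually easier to track down than a row of NA values further along.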

Hope this helps!

Cheers
