Beware: Coercion of variable types in Apply
cb
15/09/2014
So I’m sure others have covered this better on http://www.stackoverflow.com, but I thought I’d write it up here as it’s something that pops up every now and then in my work. After all, it took me a few coffees and head-scratching sessions to uncover the full detail of this little nuance of R.
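When given a data frame, apply first converts it to a matrix, and a matrix can hold only one type, so every column is coerced to a common type before it reaches the FUN parameter. What? Let’s get an example to show what I mean - I’ll use the iris data set as it should be familiar to most people.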
Here is an example of column-wise application of the mean function to the iris data set. I’m just going to apply it to the first four columns (the fifth and final column is a factor). It’s a straightforward application from the documentation.
apply(iris[,1:4],2,FUN=mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843 3.057 3.758 1.199
It’s pretty easy to see what is going on: the means of the first, second, third and fourth columns are calculated and returned. But what if we include the fifth column?
apply(iris[,1:5],2,FUN=mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## NA NA NA NA NA
The result is NA for every column because apply has coerced all five columns to character, even though the first four are numeric, and the mean of a string isn’t really a sensible operation. Don’t quite believe me? Have a look.
# Variable type of first four columns using apply
apply(iris[,1:4],2,FUN=class)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## "numeric" "numeric" "numeric" "numeric"
They’re all numeric (as expected). Now look what happens when we run the same code with the fifth column included.
# Variable type of all columns using apply
apply(iris[,1:5],2,FUN=class)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "character" "character" "character" "character" "character"
As we can see, the inclusion of the Species column (which is a factor) causes the coercion to take place.
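To see where the coercion comes from, here is a minimal sketch that converts the same columns with as.matrix, which is what apply does to a data frame before calling FUN. It also shows one possible alternative, not the only one: sapply and vapply walk over the columns of the data frame one at a time, so each column keeps its own type. The variable names (m, col_classes, numeric_cols, col_means) are just for illustration.

```r
# A matrix can hold only one type, so mixing numeric columns with a
# factor column produces a character matrix
m <- as.matrix(iris[, 1:5])
typeof(m)

# sapply() iterates over the data frame column by column,
# so no matrix conversion (and no coercion) takes place
col_classes <- sapply(iris, class)
col_classes

# Keep only the numeric columns, then take the mean of each
numeric_cols <- vapply(iris, is.numeric, logical(1))
col_means <- vapply(iris[numeric_cols], mean, numeric(1))
col_means
```

Selecting the numeric columns first means the Species column never reaches mean, so the result matches the four-column apply call above.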
I find this problem crops up when you’re running some older code over new data, and variable types have unexpectedly changed (the guys and girls in reporting just decided to place ‘$’ signs in front of monthly revenue, etc.). This is not necessarily an issue within R, more so something that occasionally catches you off-guard.
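One way to catch this early is to check the column types before calling apply. The sketch below uses a made-up revenue data frame (the column names and values are hypothetical, not real data) where a ‘$’ prefix has turned one column into text. It finds the offending column, strips the symbol, and only then takes the column means.

```r
# Hypothetical data: the "$" prefix makes the feb column character
revenue <- data.frame(
  jan = c(100, 250, 75),
  feb = c("$120", "$300", "$80"),
  stringsAsFactors = FALSE
)

# Names of the columns that are not numeric
not_numeric <- names(revenue)[!vapply(revenue, is.numeric, logical(1))]
not_numeric

# Remove the literal "$" and convert the column back to numeric
revenue$feb <- as.numeric(gsub("$", "", revenue$feb, fixed = TRUE))

# Stop with an error if any non-numeric column remains
stopifnot(all(vapply(revenue, is.numeric, logical(1))))

# Every column is numeric now, so apply() coerces nothing to character
col_means <- apply(revenue, 2, FUN = mean)
col_means
```

The stopifnot line is the useful part for older scripts: it fails loudly when a type has changed, rather than quietly returning NA.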
Hope this helps!
Cheers