Trying to use user-defined function to populate new column in dataframe. What is going wrong?

https://stackoverflow.com/questions/7800202

22-10-2019

Question

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:

TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)

However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.

Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the apply() function, but that's beside the point. My understanding is that the above line should work on a row-by-row basis, but it doesn't.

Here are the specifics for testing:

TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3), 
                   Month=c(1,5,6,11,4,10,1,5,10), 
                   Location=c(1,5,6,7,10,3,4,2,8))

This testDF keeps track of where each of 3 employees was over the course of the year among several locations.

(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)

The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited that location. For example, EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.

EmployeeLocationNumber <- function(Site){
  CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
  LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
  LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
  return(LocationNumber)
}

I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.

So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:

1. Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
2. Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
3. I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER JOIN type query?

Solution

R's vectorization (what you are calling row-by-row) does not work by calling the function repeatedly, once for each value of the argument. It passes the entire vector in a single call, and the function is expected to operate on all of it at once. EmployeeLocationNumber returns only a single value, so that value gets repeated for the entire data set.
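
A minimal sketch of that difference, using a hypothetical function first_plus_one and a call counter that are not part of the question. The function is invoked once with the whole column, and its single return value fills every row:

```r
# Hypothetical example: count how many times the function is called.
calls <- 0
first_plus_one <- function(x) {
  calls <<- calls + 1   # record each invocation
  x[[1]] + 1            # looks only at the first element, returns one value
}

df <- data.frame(v = c(10, 20, 30))
df$scalar <- first_plus_one(df$v)  # one call; the single result fills every row
df$vec    <- df$v + 1              # `+` is vectorized: one result per element

calls       # 1
df$scalar   # 11 11 11
df$vec      # 11 21 31
```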

Also, your example for EmployeeLocationNumber does not match your description.

> EmployeeLocationNumber(8)
[1] 3

Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize():

TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)

which gives

> TestDF
  Employee Month Location ELN
1        1     1        1   1
2        1     5        5   2
3        1     6        6   3
4        1    11        7   4
5        2     4       10   1
6        2    10        3   2
7        3     1        4   1
8        3     5        2   2
9        3    10        8   3
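
Vectorize() is a wrapper around mapply(), so for a one-argument function the same per-element calls can also be written with vapply(). A sketch, with the question's data and function reproduced so that it runs on its own (the names via_vectorize and via_vapply are only for illustration):

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# The question's function, unchanged in behavior.
EmployeeLocationNumber <- function(Site) {
  CurrentEmployee <- subset(TestDF, Location == Site, select = Employee, drop = TRUE)[[1]]
  LocationDate    <- subset(TestDF, Location == Site, select = Month, drop = TRUE)[[1]]
  length(subset(TestDF, Employee == CurrentEmployee & Month <= LocationDate,
                select = Month)[[1]])
}

# One call per element, via Vectorize() ...
via_vectorize <- Vectorize(EmployeeLocationNumber)(TestDF$Location)

# ... and the same thing spelled out with vapply(), which also checks that
# every call returns exactly one integer.
via_vapply <- vapply(TestDF$Location, EmployeeLocationNumber, integer(1))

via_vapply  # 1 2 3 4 1 2 1 2 3, matching the ELN column above
```

Either form still calls the function nine times, so both are conveniences rather than true vectorization.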

As to your other questions, I would just write it as

TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)

The logic: take the months, split them into groups by employee, and return the rank of each month within its group (that is, where it falls in order).
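
One property worth seeing in action is that ave() puts each result back in the row it came from, so the rows do not need to be sorted first. A small sketch, using an arbitrary row permutation chosen only for illustration:

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Scramble the rows (hypothetical order, just for the demonstration).
shuffled <- TestDF[c(9, 2, 5, 7, 1, 4, 8, 3, 6), ]

# Rank the months within each employee; results stay aligned with their rows.
shuffled$ELN <- ave(shuffled$Month, shuffled$Employee, FUN = rank)

# Sorting afterwards shows the visit numbers counting up within each employee.
sorted <- shuffled[order(shuffled$Employee, shuffled$Month), ]
sorted$ELN  # 1 2 3 4 1 2 1 2 3
```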

OTHER TIPS

Using logical indexing, the condensed one-liner replacement for your function is:

EmployeeLocationNumber <- function(Site){
    with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}

Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
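
For readability, the same idea can be unpacked into named steps. This is a sketch rather than a drop-in replacement: the function name EmployeeLocationNumberSteps and the df argument are hypothetical, it sorts by Employee and Month only, and it assumes (as the question states) that each Location appears in exactly one row.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

EmployeeLocationNumberSteps <- function(Site, df = TestDF) {
  # Put each employee's visits in chronological order.
  sorted <- df[order(df$Employee, df$Month), ]
  # Logical indexing: which employee visited this site?
  employee <- sorted$Employee[sorted$Location == Site]
  # Logical indexing again: all locations that employee visited, in order.
  visited <- sorted$Location[sorted$Employee == employee]
  # which() turns the logical match into a position within that sequence.
  which(visited == Site)
}

EmployeeLocationNumberSteps(8)  # 3
EmployeeLocationNumberSteps(3)  # 2
```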

A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.

B) In what sense is Location:8 the "second location visited"?

C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.

D) Conditional access of a data.frame typically involves logical indexing and/or the use of which()

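If you just want the sequence of visits by employee, try this. (I changed the first argument to Month, since that is what determines the sequence of locations. It numbers the rows in their existing order within each employee, so it assumes the rows are already sorted by Month; seq_along is used rather than seq so that a group with a single row is handled correctly.)

with(TestDF, ave(Month, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq_along))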

If you wanted the second location for EE:3 it would be:

subset(TestDF, LocOrder==2 & Employee==3, select= Location)
#   Location
# 8        2

Your EmployeeLocationNumber function takes a vector in and returns a single value. The assignment to create a new data.frame column therefore just gets a single value:

EmployeeLocationNumber(TestDF$Location) # returns 1

TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere

1. Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value is 1. If the value were a vector of the same length as the number of rows, it would work as you wanted.
2. I'll get back to you on that :)
3. Ditto.
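
The point about assignment can be checked directly. A small sketch with a hypothetical four-row data frame, showing what happens for a length-one value, a full-length vector, and a vector whose length does not divide the number of rows:

```r
df <- data.frame(id = 1:4)

df$one <- 1          # a single value is repeated for every row
df$sq  <- df$id^2    # a vector with one element per row lines up row by row

# A length-3 vector cannot fill 4 rows, so the assignment raises an error.
outcome <- tryCatch({
  df$bad <- 1:3
  "assigned"
}, error = function(e) "error")

df$one   # 1 1 1 1
df$sq    # 1 4 9 16
outcome  # "error"
```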

Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(

TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))

...I guess the ave function does pretty much what the code above does. But for the record:

First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
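
One caveat: unlist() returns the ranks grouped by employee, so the one-liner relies on the rows already being grouped by Employee, as they are in TestDF. If that cannot be assumed, unsplit() is one way to put each rank back in its original row. A sketch on a deliberately scrambled copy (the row permutation is arbitrary):

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))
shuffled <- TestDF[c(9, 2, 5, 7, 1, 4, 8, 3, 6), ]

# Rank the months separately for each employee.
ranks <- lapply(split(shuffled$Month, shuffled$Employee), rank)

naive   <- unlist(ranks, use.names = FALSE)   # grouped by employee, not by row
aligned <- unsplit(ranks, shuffled$Employee)  # restored to original row order

# unsplit() agrees with ave(); the unlist() version does not on scrambled rows.
all(aligned == ave(shuffled$Month, shuffled$Employee, FUN = rank))  # TRUE
all(naive == aligned)                                               # FALSE
```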

Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":

This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:

TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
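
The same logical test can also be applied to a single column, which returns a plain vector instead of a data.frame, so nothing has to be flattened afterwards. A short sketch on the question's data:

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Index one column with a condition on another column.
emp <- TestDF$Employee[TestDF$Location == 8]   # employee who visited location 8
mon <- TestDF[TestDF$Location == 8, "Month"]   # same idea with [row, column]
months_emp3 <- TestDF$Month[TestDF$Employee == 3]

emp          # 3
mon          # 10
months_emp3  # 1 5 10
```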

Licensed under: CC-BY-SA with attribution