Question

I have a dataframe where each row is a unique user, and there is a column where each user's choice is represented along with columns for background variables. In the actual dataset, there are 800+ choices and background variables can have between 2 and 60+ levels.

What I am doing is finding total counts for each choice based on a variety of different grouping combinations. I have gotten that successfully, but I now want to name them in a meaningful way.

Source that helped for the first part of this question: How to use aggregate with a list of column names

# sample data
d <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), gender = c(rep(c("F", "M"), 6)))

# list of the different grouping combinations (choice will always be one of these terms)
c1 <- list(c("choice", "cond1"), c("choice", "cond1","gender"))  

# variable to be aggregated over
v1 <- c("user_id")    

varNames <- c("cond1Ct", "cond1_GenderCt")         # for later use

counts <- mapply(FUN= function(z,y, ...) {aggregate(d[ , y], by=d[z], ...)},
       c1, v1, MoreArgs=list(FUN=length) )

# assign each part of the output to its own dataframe for later use
cond1 <- counts[[1]]
cond1_Gender <- counts[[2]]

So this works so far.

As seen with Name columns within aggregate in R, it is possible to assign a column name to the variable you are aggregating over. I couldn't get this to work in the mapply function call. That is my primary question here --- how can I automate the naming of the tabulated variables for each grouping combination.

My brute force method of doing it:

# rename the column labeled 'x' in a meaningful way ... should be possible to do using varnames (defined above)

cond1_Gender$cond1_Gender_Ct <- cond1_Gender$x          
cond1$cond1_Ct <- cond1$x
cond1_Gender$x <- NULL
cond1$x <- NULL

cond1_Gender

# bring back the now meaningfuly named columns into the original dataframe    
dUse <- merge(merge(d, cond1), cond1_Gender)  
     # efficiency of this for large dataset and 8+ aggregated datasets???



dUse_Cond <- dUse[!duplicated(dUse[c("choice", "cond1")]),]   # later: use item from list here!
# has extra columns that just get ignored ... but that could be potentially confusing for later users

# repeat for all other grouping variable combinations

# dUse_CondGender <- dUse[!duplicated(dUse[c1[[2]],])]   # (DOES NOT WORK!)

How can I accomplish this naming in a more automated fashion so it can work across datafiles and multiple variables?

Bigger Picture: This is only part of a larger situation in which I'm wanting to graph the frequency of the top choices for each grouping combination. I have code that determines the most highly rated occupations for each grouping combination and plots those frequencies, for each of the levels in a grouping category.


ADDED 3-25-14: Clarification of the context in which this arises, although likely beyond the scope of this question.

Individuals have saved a choice (out of a long list). I want to know (graphically & numerically), which choices were the "most frequent" (and their corresponding freq) for different combinations of background grouping variables: i.e., most frequent choices for 5th grade males, or most freq choices for 10th graders in state X, or for all high school students in the treatment condition. I have code that takes me through each individual background grouping: tabulating (as worked on in this problem), finding the 'most frequent' choices, computing statistics on freq for that group, and graphing based on the variables involved in that group. Overall, I'm trying to make this more generic and vectorized so it works more easily with a subsequent user defining what combinations of background variables they want to look at. jlhoward's answer adds count columns for each of the desired combinations to the original dataframe.


Thanks! I hope this makes sense ... I've read a lot of questions and answers here, but haven't tried to post before.

Was it helpful?

Solution

Your call to mapply(...) is more complex than it needs to be:

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                      c1, "user_id")

If all you want to do is automate renaming the column x in each dataframe to something else, this will work:

# rename all "x" columns
for (i in 1:length(counts)) 
  colnames(counts[[i]])[ncol(counts[[i]])]<- varNames[i]

To address your core issue, I'd need to see an example of what you mean by "graph the frequency of the top choices for each grouping combination."

EDIT (Response to OP's comment)

If your intermediate goal is to combine everything into a singe data frame, then there is an easier way. Note that this leaves the aggregated columns named as x until the end.

counts   <- mapply(FUN= function(z,y) aggregate(d[ , y], by=d[z], FUN=length),
                   c1, "user_id")
mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])
mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))
colnames(mrg) <- varNames
result <- cbind(d,mrg)
result
#    user_id choice cond1 gender cond1Ct cond1_GenderCt
# 1        1  apple    a1      F       2              1
# 2        2 banana    a1      M       2              2
# 3        3 banana    a2      F       3              2
# 4        4  apple    a1      M       2              1
# 5        5 banana    a2      F       3              2
# 6        6 banana    a1      M       2              2
# 7        7  apple    a2      F       1              1
# 8        8 banana    a2      M       3              1
# 9        9 banana    a3      F       3              2
# 10      10  apple    a3      M       1              1
# 11      11 banana    a3      F       3              2
# 12      12 banana    a3      M       3              1

The first use of lapply(...)

mrg <- lapply(counts,function(df)merge(d,df)[,c("user_id","x")])

creates a list of data frames wherein each associates user_id with the count for the appropriate combination. Then,

mrg <- do.call(cbind,lapply(mrg,function(df)merge(d,df,by="user_id")$x))

combines the x column from each into a single data frame ordered properly by user_id. Finally,

result <- cbind(d,mrg)

combines the columns with the original data frame d, which is already in user_id order.

Again, it would be much better to understand your ultimate goal, as there is almost certainly a way to achieve it without going through all this.

OTHER TIPS

I think this can be simplified greatly using the table function (and class) with its as.data.frame method producing an object suitable for merging:

counts <- lapply(c1, function(cond) { as.data.frame( table(d[cond]))}) 
# That returns two 'Freq' vectors (named in the as.dataframe` step) in a list. 

d[order(d[2],d[3],d[4]), varNames] <- lapply(counts, function( cts) {
               merge(d[order(d[2],d[3],d[4]), ], cts )[['Freq']] })
#Could also have `cbind`-ed it. The `d[names] <-` assigned the names. 
#Could also have used `setNames` on the RHS. 

#------------

> d
   user_id choice cond1 gender cond1Ct cond1_GenderCt
1        1  apple    a1      F       2              1
2        2 banana    a1      M       2              2
3        3 banana    a2      F       3              2
4        4  apple    a1      M       2              1
5        5 banana    a2      F       3              2
6        6 banana    a1      M       2              2
7        7  apple    a2      F       1              1
8        8 banana    a2      M       3              1
9        9 banana    a3      F       3              2
10      10  apple    a3      M       1              1
11      11 banana    a3      F       3              2
12      12 banana    a3      M       3              1

I will admit that I kind of ran down a dead-end rabbit hole trying to get the ave function to deliver the count vectors, but it did not accept a list argument to its indexing arguments. I reviewed an earlier function I had developed and saw that table does accept a list. My second admission is that I did not realize that assignment to an ordered position would not reorder the original object:

> a <- 10:1
> a[order(a)][2] <-100
> a
 [1]  10   9   8   7   6   5   4   3 100   1  # surprised me anyway.

The as.data.frame method for table-objects just creates a "long" dataframe from the table entries with the Freq column holding the counts:

 as.data.frame( table(d[-(1:3)]) )
#-----------------------
   gender cond1Ct cond1_GenderCt Freq
1       F       1              1    1
2       M       1              1    1
3       F       2              1    1
4       M       2              1    1
5       F       3              1    1
6       M       3              1    1
7       F       1              2    0
8       M       1              2    0
9       F       2              2    1
10      M       2              2    1
11      F       3              2    2
12      M       3              2    2
> table(d[-(1:3)]) 
, , cond1_GenderCt = 1

      cond1Ct
gender 1 2 3
     F 1 1 1
     M 1 1 1

, , cond1_GenderCt = 2

      cond1Ct
gender 1 2 3
     F 0 1 2
     M 0 1 2

Using the package plyr seems to greatly simplify the code and handle both grouping variables that have missing values and instances where one id has multiple choices (both of these came up when I took this back to the larger dataset).

library (plyr)
d2 <- data.frame(user_id = 1:12, choice = rep(c("apple", "banana", "banana"),4), 
                 cond1 = c("a1", "a1", "a2", "a1", "a2", "a1", "a2", "a2", "a3", "a3", "a3", "a3"), 
                 gender = c(rep(c("F", "M"), 6)))

d2$user_id[7] <- 5         # modify the dataset some
d2$gender[10] <- NA

tmp1 <- ddply(d2, ~cond1 + gender + choice, summarize, cond1_GenderCt = length(choice))     
tmp2 <- ddply(d2, ~cond1 + choice, summarize, cond1Ct = length(choice))     
result2 <- merge (tmp2, merge(tmp1, d2))
result2

This creates one dataframe with named variables that has brought back in the frequencies of each choice within each set of grouping variables.

EDIT: So I apparently forgot the main point of my own question! Handling different combinations of variables.

doddply <- function(df, x){
  ddply(df,x,summarize,nChoice = length(choice))
}

lapply (c2, function (x) {doddply(d2, x)})

It would seem like a variant on the doddply function above that takes a varNames list as well as the source of "nChoice" and is called by mapply would help, but I couldn't get that to work.

So this ends up exactly the same as @jlhoward's solution ... the code there after the counts variable is still what is needed for naming and merging. (I'm leaving this here though as just another way to get to that point).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top