Counting combinations without destroying type

https://stackoverflow.com/questions/11042748

14-06-2021

Question

I wonder whether someone has an idea for how to count combinations like the following in a better way than I've thought of.

> library(lubridate)
> df <- data.frame(x=sample(now()+hours(1:3), 100, TRUE), y=sample(1:4, 100, TRUE))
> with(df, as.data.frame(table(x, y)))
                     x y Freq
1  2012-06-15 00:10:18 1    5
2  2012-06-15 01:10:18 1    9
3  2012-06-15 02:10:18 1    8
4  2012-06-15 00:10:18 2    9
5  2012-06-15 01:10:18 2   10
6  2012-06-15 02:10:18 2   12
7  2012-06-15 00:10:18 3    7
8  2012-06-15 01:10:18 3    9
9  2012-06-15 02:10:18 3    6
10 2012-06-15 00:10:18 4    5
11 2012-06-15 01:10:18 4   14
12 2012-06-15 02:10:18 4    6

I like that format, but unfortunately x and y are converted to factors when they pass through table(). They could sit quite happily in the final output as their original types, but getting them back is awkward. Currently I fix all the types manually afterward, which is messy: I have to reset the time zone, look up the percent-codes for the default date format, and so on.

It seems like an efficient solution would involve hashing the objects, or otherwise mapping integers to the unique values of x and y so we can use tabulate(), then mapping back.

Ideas?
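
The tabulate() idea described above could be sketched in base R roughly as follows. This is only an illustration under assumptions: the sample data is built with as.POSIXct() instead of lubridate so that it runs without extra packages, and the names ux, uy, ix, iy, cell and counts are made up for the example.

```r
# Sketch: count (x, y) combinations with tabulate() while keeping column types.
set.seed(1)
t0 <- as.POSIXct("2012-06-15 00:00:00", tz = "UTC")  # assumed start time
df <- data.frame(
  x = sample(t0 + 3600 * (1:3), 100, TRUE),
  y = sample(1:4, 100, TRUE)
)

# Unique values keep their original class (POSIXct and integer here)
ux <- sort(unique(df$x))
uy <- sort(unique(df$y))

# Map every observation to an integer code per column
ix <- match(df$x, ux)
iy <- match(df$y, uy)

# Combine both codes into one cell index, with x varying fastest
cell <- ix + (iy - 1L) * length(ux)
n <- tabulate(cell, nbins = length(ux) * length(uy))

# Map the cell indices back to the original values
counts <- data.frame(
  x = ux[rep(seq_along(ux), times = length(uy))],
  y = uy[rep(seq_along(uy), each = length(ux))],
  Freq = n
)
str(counts)
```

Because the codes index into ux and uy directly, the time zone and class of x survive untouched, and combinations that never occur simply get a count of 0.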

Solution

Here's a data.table version that preserves the column classes:

library(data.table)

dt <- data.table(df, key=c("x", "y"))
dt[, .N, by=key(dt)]
#                       x y  N
#  1: 2012-06-14 18:10:22 1  8
#  2: 2012-06-14 18:10:22 2 10
#  3: 2012-06-14 18:10:22 3  8
#  4: 2012-06-14 18:10:22 4  8
#  5: 2012-06-14 19:10:22 1  6
#  6: 2012-06-14 19:10:22 2  8
#  7: 2012-06-14 19:10:22 3  6
#  8: 2012-06-14 19:10:22 4  6
#  9: 2012-06-14 20:10:22 1 15
# 10: 2012-06-14 20:10:22 2  5
# 11: 2012-06-14 20:10:22 3 12
# 12: 2012-06-14 20:10:22 4  8

str(dt[, .N, by=key(dt)])
# Classes ‘data.table’ and 'data.frame':  12 obs. of  3 variables:
#  $ x: POSIXct, format: "2012-06-14 18:10:22" "2012-06-14 18:10:22" ...
#  $ y: int  1 2 3 4 1 2 3 4 1 2 ...
#  $ N: int  8 10 8 8 6 8 6 6 15 5 ...

Edit in response to follow-up question

To count the number of appearances of all possible combinations of the observed factor levels (including those which don't appear in the data), you can do something like the following:

dt <- dt[1:30, ]  # Make subset of dt in which some factor combinations don't appear

ii <- do.call("CJ", lapply(dt, unique))  # CJ() is similar to expand.grid()
dt[ii, .N, by=.EACHI]  # by=.EACHI is needed in data.table >= 1.9.4 to get one count per row of ii
#                      x y N
# 1: 2012-06-14 22:53:05 1 8
# 2: 2012-06-14 22:53:05 2 7
# 3: 2012-06-14 22:53:05 3 9
# 4: 2012-06-14 22:53:05 4 5
# 5: 2012-06-14 23:53:05 1 1
# 6: 2012-06-14 23:53:05 2 0
# 7: 2012-06-14 23:53:05 3 0
# 8: 2012-06-14 23:53:05 4 0
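
For comparison, the same "all combinations, including absent ones" count can be sketched in base R with expand.grid(). This is not part of the original answer; the tiny data set and the names t0 and grid are assumptions chosen so that some combinations are clearly missing.

```r
# Sketch: count every combination of observed x and y values, zeros included.
t0 <- as.POSIXct("2012-06-14 22:00:00", tz = "UTC")  # assumed start time
df <- data.frame(
  x = t0 + 3600 * c(1, 1, 2, 3, 3, 3),
  y = c(1L, 2L, 1L, 1L, 1L, 4L)
)

# Full grid of observed values; expand.grid() keeps POSIXct and integer classes
grid <- expand.grid(x = sort(unique(df$x)), y = sort(unique(df$y)))

# Count the matching rows for each grid row (0 when a combination is absent)
grid$N <- vapply(
  seq_len(nrow(grid)),
  function(i) sum(df$x == grid$x[i] & df$y == grid$y[i]),
  integer(1)
)
grid
```

This scans df once per grid row, so it is only meant for small inputs; the keyed data.table join above scales much better.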

Other tips

You can use ddply() from the plyr package:

library(plyr)

ddply(df, .(x, y), summarize, Freq = length(y))

If you want it arranged by y, then x:

ddply(df, .(y, x), summarize, Freq = length(y))

Or, if column ordering is important as well as row ordering:

arrange(ddply(df, .(x, y), summarize, Freq = length(y)), y)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow