Question

I have a dataset that has a set of plants. Two of these plants have multiple lines. When analyzing the data, I'd like to have a column that would have the two plants that have multiple lines put together but all others as they are. Here is my reproducible data set:

testset <- data.table(date=as.Date(c("2013-07-02","2013-08-03","2013-09-04","2013-10-05","2013-11-06")), yr = c(2013,2013,2013,2013,2013), mo = c(07,08,09,10,11), da = c(02,03,04,05,06), plant = LETTERS[1:5], PlantID = c(1,2,3,4,5,1,2,3,6,7), product = as.factor(letters[26:22]), rating = runif(25))

This is the appended column output that I'm looking for:

A1
B2
C3
D4
E5
A1
B2
C3
D6
E7

This is a simple example, but my true dataset is much, much larger, so I'd like an elegant data.table way to produce it.

Solution

You don't need to build a combined column when you use a data.table. Instead, set a key or use an ad hoc by (as in the example below). Grouping on several columns at once is one of the foundations of data.table operations.


Toy example using by:

Look at the toy example below. We sum rating by the id and grp variables. Where a combination of id and grp occurs more than once, the ratings are summed; a unique combination is treated by itself. Note the values of rating and sum_rating in the last row, which has a unique combination of grouping variables (each of the other combinations appears in two rows, as in your example):

library(data.table)

# Make this data reproducible
set.seed(1)
dt <- data.table( id = c( rep( 1:2 , 2 ) , 1 ) , grp = c( rep( 1:2 , 2 ) , 3 ) , rating = sample( 5 , 5 , TRUE ) ) 
#   id grp rating
#1:  1   1      4
#2:  2   2      1
#3:  1   1      3
#4:  2   2      4
#5:  1   3      4

# Sum by 'id' and 'grp'...
dt[ , sum_rating := sum( rating ) , by = list( id , grp ) ]
dt
#   id grp rating sum_rating
#1:  1   1      4          7
#2:  2   2      1          5
#3:  1   1      3          7
#4:  2   2      4          5
#5:  1   3      4          4  <=====  rating and sum_rating are the same because this is a unique row
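
The answer also mentions setting a key as an alternative to an ad hoc by. A minimal sketch of that variant, assuming the same toy columns; the rating values are copied literally from the table above rather than re-sampled, and the row order changes because setkey sorts the table:

```r
library(data.table)

# Same toy data as above, with the rating values written out literally
dt <- data.table(id     = c(1L, 2L, 1L, 2L, 1L),
                 grp    = c(1L, 2L, 1L, 2L, 3L),
                 rating = c(4, 1, 3, 4, 4))

# Setting a key sorts the table by id, then grp
setkey(dt, id, grp)

# Group by the key columns; key(dt) returns c("id", "grp")
dt[, sum_rating := sum(rating), by = key(dt)]
dt
```

The sums per group are the same as with by = list( id , grp ); only the row order differs.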

OTHER TIPS

I don't understand what your desired output is, but hopefully this will help you on the way. Here's a data.table solution for finding all the unique plant lines:

> testset[,unique(paste0(plant, PlantID))]
[1] "A1" "B2" "C3" "D4" "D6" "E5" "E7"
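
If the goal is the appended column shown in the question, the same paste0 call can be assigned by reference with :=. This is a minimal sketch, assuming a reduced 10-row version of the question's data with only the two identifying columns; plant_line is a made-up column name used for illustration:

```r
library(data.table)

# Reduced version of the question's data: just the two identifying columns
testset <- data.table(plant   = rep(LETTERS[1:5], 2),
                      PlantID = c(1, 2, 3, 4, 5, 1, 2, 3, 6, 7))

# Add the combined label as a new column by reference
# (plant_line is a hypothetical column name)
testset[, plant_line := paste0(plant, PlantID)]

# Group on the new column, or skip it and group on both columns directly
testset[, .N, by = plant_line]
testset[, .N, by = .(plant, PlantID)]
```

Both grouping calls produce the same groups, so the extra column is only needed if the label itself is wanted in the output.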
Licensed under: CC-BY-SA with attribution