Question

I have a dataset as -

    id       date       sector   region     price             score
1   SWECEC2 2011-12-31 Finance     R1    2.297936          -1.0000000
2   FRACPY1 2011-12-31 FMCG        R1   10.500936          -0.8571429
3   FRAACO1 2011-12-31 Finance     R1   19.228936          -0.7142857
4   GERAIM1 2011-12-31 Finance     R1    4.936936          -0.5714286
5   GERAKX1 2011-12-31 Finance     R1    8.140936          -0.5714286
6   NETAHI1 2011-12-31 FMCG        R1   -7.229064          -0.5714286
7   UKIJLK1 2011-12-31 Finance     R2   17.766936          -0.5714286
8   FRAAHJ1 2011-12-31 Finance     R2    5.840936          -0.4285714
9   FRACLB1 2011-12-31 FMCG        R2   -4.490064          -0.4285714
10  GERAIO1 2011-12-31 FMCG        R2    2.338936          -0.4285714

I need to create a new column to this dataset that will categorize each row of the dataset for a group of date, sector and region as 1 if the score is less than the average score for that group of date, sector and region else 2. i.e. average of score for date=2011-12-32, sector=Finance, Region = R1 is -0.7142857 and average of score for date=2011-12-32, sector=FMCG, Region = R2 is -0.4285714 and so on then the new column for the first group should be like -

id          date    sector  region  price     score      Rating
SWECEC2 12/31/2011  Finance R1      2.297936    -1          1
FRAACO1 12/31/2011  Finance R1      19.228936   -0.7142857  2
GERAIM1 12/31/2011  Finance R1      4.936936    -0.5714286  2
GERAKX1 12/31/2011  Finance R1      8.140936    -0.5714286  2

I thought we could do this in ddply but I can't think of how to because I am not summarizing the data here, just creating a new column based on each group in a data.

Any suggestions?

Thanks!

The sample of my data set is here -

ds <- structure(list(id = structure(c(9L, 4L, 1L, 5L, 7L, 8L, 10L, 
2L, 3L, 6L), .Label = c("FRAACO1", "FRAAHJ1", "FRACLB1", "FRACPY1", 
"GERAIM1", "GERAIO1", "GERAKX1", "NETAHI1", "SWECEC2", "UKIJLK1"
), class = "factor"), date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),     .Label = "12/31/2011", class = "factor"), 
sector = structure(c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 
2L), .Label = c("Finance", "FMCG"), class = "factor"), region = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("R1", "R2"
), class = "factor"), price = c(2.297936, 10.500936, 19.228936, 
4.936936, 8.140936, -7.229064, 17.766936, 5.840936, -4.490064, 
2.338936), score = c(-1, -0.8571429, -0.7142857, -0.5714286, 
-0.5714286, -0.5714286, -0.5714286, -0.4285714, -0.4285714, 
-0.4285714)), .Names = c("id", "date", "sector", "region", 
"price", "score"), class = "data.frame", row.names = c(NA, -10L
))
Was it helpful?

Solution

Try this:

library(plyr)
ddply(ds, .(date, sector, region), function(x) data.frame(x, rate = ifelse(x$score < mean(x$score), 1, 2)))

then you will get:

        id       date  sector region     price      score rate
1  SWECEC2 12/31/2011 Finance     R1  2.297936 -1.0000000    1
2  FRAACO1 12/31/2011 Finance     R1 19.228936 -0.7142857    2
3  GERAIM1 12/31/2011 Finance     R1  4.936936 -0.5714286    2
4  GERAKX1 12/31/2011 Finance     R1  8.140936 -0.5714286    2
5  UKIJLK1 12/31/2011 Finance     R2 17.766936 -0.5714286    1
6  FRAAHJ1 12/31/2011 Finance     R2  5.840936 -0.4285714    2
7  FRACPY1 12/31/2011    FMCG     R1 10.500936 -0.8571429    1
8  NETAHI1 12/31/2011    FMCG     R1 -7.229064 -0.5714286    2
9  FRACLB1 12/31/2011    FMCG     R2 -4.490064 -0.4285714    2
10 GERAIO1 12/31/2011    FMCG     R2  2.338936 -0.4285714    2

OTHER TIPS

Here's an approach similar to @koshske's, using dplyr:

library(dplyr)

ds.new <- merge(ds %.%
  group_by(date, sector, region) %.%
  summarise(mean = mean(score)), ds)

within(ds.new, Rating <- ifelse(score < mean, 1, 2))

#          date  sector region       mean      id     price      score Rating
# 1  12/31/2011 Finance     R1 -0.7142857 SWECEC2  2.297936 -1.0000000      1
# 2  12/31/2011 Finance     R1 -0.7142857 FRAACO1 19.228936 -0.7142857      2
# 3  12/31/2011 Finance     R1 -0.7142857 GERAIM1  4.936936 -0.5714286      2
# 4  12/31/2011 Finance     R1 -0.7142857 GERAKX1  8.140936 -0.5714286      2
# 5  12/31/2011 Finance     R2 -0.5000000 UKIJLK1 17.766936 -0.5714286      1
# 6  12/31/2011 Finance     R2 -0.5000000 FRAAHJ1  5.840936 -0.4285714      2
# 7  12/31/2011    FMCG     R1 -0.7142857 FRACPY1 10.500936 -0.8571429      1
# 8  12/31/2011    FMCG     R1 -0.7142857 NETAHI1 -7.229064 -0.5714286      2
# 9  12/31/2011    FMCG     R2 -0.4285714 FRACLB1 -4.490064 -0.4285714      2
# 10 12/31/2011    FMCG     R2 -0.4285714 GERAIO1  2.338936 -0.4285714      2

There's a good intro to dplyr here. One of the neat things about it is that you can "chain" expressions with %.%.

As @iugrina mentions in the comment below, the summarise and merge is in fact unnecessary; we can use mutate to achieve the same:

ds %.% 
  group_by(date,sector,region) %.% 
  mutate(Rating=ifelse(score < mean(score), 1, 2))

Another way without loading libraries:

  # Split data
  ds.splitted <- split(ds, list(ds$date, ds$sector, ds$region))

  # Function to create Ratings (input: df, output: vector of 1's and 2's)
  myfun <- function(x){
     x.mean.score <- mean(x$score)
     ifelse(x$score < x.mean.score, 1, 2)
  }

  # get the ratings
  lapply(ds.splitted, function(y) cbind(y, Rating = myfun(y)))

Or optionally it is possible to combine the output list into a data frame directly:

  ds.ratings <- do.call("rbind",
                        lapply(ds.splitted, function(y) cbind(y, Rating = myfun(y))))

  # to conserve the order in the original ds data frame as rowname:
  rownames(ds.ratings) <- unlist(lapply(ds.splitted, rownames))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top