Question

I have a dataset with the following variables - Bill_Number,Item_Name,Quantity. A single Bill_Number can have multiple Item_Names under it. I am trying to summarize it by Bill_Number and Quantity wherein the Item_Names are concatenated into a single variable for each bill number. The solutions here :Combine several row variables work well for a limited and pre-defined frame (like given below), but I have hundreds of Item_Names. I am sure there is a simpler way to do this. Can anyone help please?

Data (sample only):

BillN<-c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3')

Item_Name<-c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D') # going on to Product(n)

Quantity<-c(1,2,1,2,1,2,1,1,1,2,1,1)

Output: using ply::ddply

ddply(within(Dummy1, {
  Item_Name <- ifelse(Item_Name %in% c('Prod A','Prod B','Prod C'), 'Prod A + Prod B + Prod C', 'Prod D')
}), .(BillN, Item_Name), summarise, Count=sum(Quantity))

This depends on 'Prod D' being a separate row, whereas I need a combination of all the products for a given number and the overall quantity for that bill number.

The output I require is in this format:

  • BillN | Item_Name | Quantity
  • B1 | Prod A + Prod B + Prod C + Prod D | 4
  • B2 | Prod A + Prod B + Prod C + Prod D | 4
  • B3 | Prod A + Prod B + Prod C + Prod D | 4
  • Whereas the output I get from the code above is:

  • BillN | Item_Name | Quantity
  • B1 | Prod A + Prod B + Prod C | 4
  • B1 | Prod D | 2
  • B2 | Prod A + Prod B + Prod C | 4
  • B2 | Prod D | 2
  • B3 | Prod A + Prod B + Prod C | 4
  • B3 | Prod D | 2
  • Extending the scenario, what happens if I have more variables such as the Brand, Category, Sub_category for each Item_Name and the Sales_Amount as a numeric variable to further sum it by?

    Please let me know if you need more information or have questions.

    Thanks, Raoul

    Was it helpful?

    Solution

    Here is a quick solution using data.table package:

    Step1: Create the data.table

    library(data.table)
    
    DT <- data.table(
      BillN=c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3'),
      Item_Name=c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D'), # going on to Product(n)
      Quantity=c(1,2,1,2,1,2,1,1,1,2,1,1)
    )
    

    Step2: Set appropriate key:

    setkey(DT,BillN)
    

    Step3: Make sure that the string vector Item_Name is not a factor.

    DT[,Item_Name := as.character(Item_Name)]
    

    Step4: Perform the operation by key

    DT[,list(Item_Name =paste(Item_Name,collapse=" + "),
             Quantity=sum(Quantity)),
       by=key(DT)]
    

    You can use the result as it is, or collapse each row! Which is simple!!

    OTHER TIPS

    Here's an example summing both the Quantity variables by BillN and the count of BillNs as your expected output wants:

    Dummy1 <- data.frame(BillN, Item_Name, Quantity)
    
    ldply(by(Dummy1, Dummy1$BillN, simplify=TRUE, FUN=function(x) {
    
      BillN <- x[1,]$BillN
      Quantity <- sum(x$Quantity)
      Item_Name <- paste(as.character(unique(x$Item_Name)), sep="", collapse=" + ")
      BillCt <- nrow(x)
    
      return(data.frame(BillN, Item_Name, Quantity, BillCt))
    
    }))[2:5]
    
    ##  BillN                         Item_Name Quantity BillCt
    ## 1    B1 Prod A + Prod B + Prod C + Prod D        6      4
    ## 2    B2 Prod A + Prod B + Prod C + Prod D        5      4
    ## 3    B3 Prod A + Prod B + Prod C + Prod D        5      4
    
    Licensed under: CC-BY-SA with attribution
    Not affiliated with StackOverflow
    scroll top