문제

I have a dataset with the following variables - Bill_Number,Item_Name,Quantity. A single Bill_Number can have multiple Item_Names under it. I am trying to summarize it by Bill_Number and Quantity wherein the Item_Names are concatenated into a single variable for each bill number. The solutions here :Combine several row variables work well for a limited and pre-defined frame (like given below), but I have hundreds of Item_Names. I am sure there is a simpler way to do this. Can anyone help please?

Data (sample only):

BillN<-c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3')

Item_Name<-c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D') # going on to Product(n)

Quantity<-c(1,2,1,2,1,2,1,1,1,2,1,1)

Output: using ply::ddply

ddply(within(Dummy1, {
  Item_Name <- ifelse(Item_Name %in% c('Prod A','Prod B','Prod C'), 'Prod A + Prod B + Prod C', 'Prod D')
}), .(BillN, Item_Name), summarise, Count=sum(Quantity))

This depends on 'Prod D' being a separate row, whereas I need a combination of all the products for a given number and the overall quantity for that bill number.

The output I require is in this format:

  • BillN | Item_Name | Quantity
  • B1 | Prod A + Prod B + Prod C + Prod D | 4
  • B2 | Prod A + Prod B + Prod C + Prod D | 4
  • B3 | Prod A + Prod B + Prod C + Prod D | 4
  • Whereas the output I get from the code above is:

  • BillN | Item_Name | Quantity
  • B1 | Prod A + Prod B + Prod C | 4
  • B1 | Prod D | 2
  • B2 | Prod A + Prod B + Prod C | 4
  • B2 | Prod D | 2
  • B3 | Prod A + Prod B + Prod C | 4
  • B3 | Prod D | 2
  • Extending the scenario, what happens if I have more variables such as the Brand, Category, Sub_category for each Item_Name and the Sales_Amount as a numeric variable to further sum it by?

    Please let me know if you need more information or have questions.

    Thanks, Raoul

    도움이 되었습니까?

    해결책

    Here is a quick solution using data.table package:

    Step1: Create the data.table

    library(data.table)
    
    DT <- data.table(
      BillN=c('B1','B1','B1','B1','B2','B2','B2','B2','B3','B3','B3','B3'),
      Item_Name=c('Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D','Prod A','Prod B','Prod C','Prod D'), # going on to Product(n)
      Quantity=c(1,2,1,2,1,2,1,1,1,2,1,1)
    )
    

    Step2: Set appropriate key:

    setkey(DT,BillN)
    

    Step3: Make sure that the string vector Item_Name is not a factor.

    DT[,Item_Name := as.character(Item_Name)]
    

    Step4: Perform the operation by key

    DT[,list(Item_Name =paste(Item_Name,collapse=" + "),
             Quantity=sum(Quantity)),
       by=key(DT)]
    

    You can use the result as it is, or collapse each row! Which is simple!!

    다른 팁

    Here's an example summing both the Quantity variables by BillN and the count of BillNs as your expected output wants:

    Dummy1 <- data.frame(BillN, Item_Name, Quantity)
    
    ldply(by(Dummy1, Dummy1$BillN, simplify=TRUE, FUN=function(x) {
    
      BillN <- x[1,]$BillN
      Quantity <- sum(x$Quantity)
      Item_Name <- paste(as.character(unique(x$Item_Name)), sep="", collapse=" + ")
      BillCt <- nrow(x)
    
      return(data.frame(BillN, Item_Name, Quantity, BillCt))
    
    }))[2:5]
    
    ##  BillN                         Item_Name Quantity BillCt
    ## 1    B1 Prod A + Prod B + Prod C + Prod D        6      4
    ## 2    B2 Prod A + Prod B + Prod C + Prod D        5      4
    ## 3    B3 Prod A + Prod B + Prod C + Prod D        5      4
    
    라이센스 : CC-BY-SA ~와 함께 속성
    제휴하지 않습니다 StackOverflow
    scroll top