Question

I'm struggling with finding an efficient solution for the following problem:

I have a large manipulated data frame with around 8 columns and 80000 rows that generally includes multiple data types. I want to create a new data frame that includes the sum of one column if conditions from the large data frame are met.

Imagine the head of the original data frame looks like this. The column $years.raw indicates that the company measured data for x years.

> cbind(company.raw,years.raw,source,amount.inkg)
     company.raw years.raw source      amount.inkg
[1,] "C1"        "1"       "Ink"       "5"        
[2,] "C1"        "1"       "Recycling" "2"        
[3,] "C2"        "1"       "Coffee"    "10"       
[4,] "C2"        "1"       "Combusted" "15"       
[5,] "C2"        "2"       "Printer"   "14"       
[6,] "C2"        "2"       "Tea"       "18"   

What I need to do now is to create a new data frame that sums up the values of column $amount.inkg for every company and every year based on certain string elements. I saved the string elements in three vectors below. The string elements are part of the column $source from the original data frame.

> vector1 <- c("Tea","Coffee")
> vector2 <- c("Ink","Printer")
> vector3 <- c("Recycling","Combusted")

The preferred data frame would then look like this:

Company  Year  amount.vector1  amount.vector2  amount.vector3
C1       1     0               5               2
C2       1     10              0               15
C2       2     18              14              0

The general approach for $amount.vector1 would be: sum up the values of column $amount.inkg for every company and every year where the value of $source in the original data frame matches one of the string elements of vector1. The same applies to $amount.vector2 and $amount.vector3, each with its own vector.

If there are no values available, a 0 should be inserted instead of an NA. This needs to be done for the whole raw data frame, which includes around 250 companies, each with data for some of the years 1:8 (the coverage differs quite a lot).

Edit: For the data frame I need one-row-per-company-per-year.

C1 Year 1  
C1 Year 2
C1 Year 3
C2 Year 1
C2 Year 2

I tried to write a function that combines these conditions, but I failed. I'm quite new to R and didn't know how to link these conditions and apply them to the whole data frame.

Solution

Your data is in 'long form' (multiple rows of company, source, year, ...)

You want to aggregate amount.inkg over each company and year, for multiple values of source. Specifically, you want to aggregate with conditionals on the 'source' field.

Again, please give us a reproducible example. (Thanks josilber.) This is a four-liner with either split-apply-combine (ddply) or logical indexing:

df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))

# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
library(plyr) # consider dplyr if performance on a large data frame becomes an issue
ddply(df, .(company.raw,years.raw), summarize,
    amount.vector1=sum(amount.inkg[source %in% c('Tea','Coffee')]),
    amount.vector2=sum(amount.inkg[source %in% c('Ink','Printer')]),
    amount.vector3=sum(amount.inkg[source %in% c('Recycling','Combusted')])
)


# OPTION 2. sum with logical indexing on the df:
# (This predates the edit asking for one row per company and year; it stores a
# single grand total, recycled into every row.)
df$amount.vector1 <- sum(df$amount.inkg[df$source %in% c('Tea','Coffee')])
# josilber clarifies you want one row per company and year
...
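The comment in Option 1 mentions dplyr. As a rough sketch only, assuming the dplyr package (version 1.0 or later, for the .groups argument) is installed, the same per-company, per-year aggregation could look like this. The name result is just an illustrative choice.

```r
library(dplyr)

df <- data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                 years.raw = c(1, 1, 1, 1, 2, 2),
                 source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                 amount.inkg = c(5, 2, 10, 15, 14, 18))

vector1 <- c("Tea", "Coffee")
vector2 <- c("Ink", "Printer")
vector3 <- c("Recycling", "Combusted")

# One row per company/year; each sum only counts rows whose source is in the vector.
# An empty selection sums to 0, so no NA appears for missing categories.
result <- df %>%
  group_by(company.raw, years.raw) %>%
  summarise(amount.vector1 = sum(amount.inkg[source %in% vector1]),
            amount.vector2 = sum(amount.inkg[source %in% vector2]),
            amount.vector3 = sum(amount.inkg[source %in% vector3]),
            .groups = "drop")
```

The structure mirrors the ddply call, so switching between the two should mostly be a matter of syntax.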

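Option 3. You could also use aggregate() with subset(...), although aggregate for a simple sum is overkill. Be aware that company/year groups with no matching rows are dropped from the result rather than reported as 0.

aggregate(amount.inkg ~ company.raw + years.raw,
          data = subset(df, source %in% c('Tea','Coffee')), FUN = sum)

The grouping comes from the right-hand side of the formula (or from the by argument in the non-formula interface, which must be a list); the selection by criteria happens in subset().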

Note: %in% is itself vectorized (it is built on match()), so it should cope fine with 80,000 rows. If you prefer explicit boolean operations, (source=='Tea' | source=='Coffee') selects the same rows as long as source contains no NAs.
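A quick way to convince yourself that the two forms select the same rows, assuming the column has no NAs. The vector src below is a made-up stand-in for the source column.

```r
# Hypothetical stand-in for df$source
src <- c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea")

by_in <- src %in% c("Tea", "Coffee")          # membership test
by_or <- src == "Tea" | src == "Coffee"       # explicit boolean form

same_selection <- identical(by_in, by_or)
```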

As to preventing NA sums if the subset was empty: sum(c()) is 0, so don't worry about that. An NA can only appear if amount.inkg itself contains NAs; in that case either use sum(..., na.rm = TRUE), or do ifelse(is.na(x),0,x) on the final result.
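A small illustration of the two cases, using made-up vectors (amounts and sources are hypothetical, not taken from the question's data):

```r
amounts <- c(5, NA, 10)
sources <- c("Tea", "Coffee", "Ink")

# No row matches "Printer": the selection is empty and the sum is 0, not NA
empty_total <- sum(amounts[sources %in% c("Printer")])

# A matching row holds NA: the plain sum is NA
with_na <- sum(amounts[sources %in% c("Tea", "Coffee")])

# na.rm = TRUE skips the NA and sums the remaining values
without_na <- sum(amounts[sources %in% c("Tea", "Coffee")], na.rm = TRUE)
```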

Other tips

This is a good task for the split-apply-combine paradigm. First, you split your data frame by company/year pair:

data = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                  years.raw = c(1, 1, 1, 1, 2, 2),
                  source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                  amount.inkg = c(5, 2, 10, 15, 14, 18))
spl = split(data, paste(data$company.raw, data$years.raw))

Now, you compute the rolled-up data frame for each element in the split-up data:

spl2 = lapply(spl, function(x) {
  data.frame(Company=x$company.raw[1],
             Year=x$years.raw[1],
             amount.vector1 = sum(x$amount.inkg[x$source %in% vector1]),
             amount.vector2 = sum(x$amount.inkg[x$source %in% vector2]),
             amount.vector3 = sum(x$amount.inkg[x$source %in% vector3]))
})

And finally, combine everything together:

do.call(rbind, spl2)
#      Company Year amount.vector1 amount.vector2 amount.vector3
# C1 1      C1    1              0              5              2
# C2 1      C2    1             10              0             15
# C2 2      C2    2             18             14              0
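If the number of source vectors grows, writing one sum() line per vector gets repetitive. As a sketch in base R, assuming amount.inkg has no NAs, the vectors can be kept in a named list and the columns built in one go. The names groups, members, amounts and result are illustrative only.

```r
data <- data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                   years.raw = c(1, 1, 1, 1, 2, 2),
                   source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                   amount.inkg = c(5, 2, 10, 15, 14, 18))

# One named entry per output column
groups <- list(amount.vector1 = c("Tea", "Coffee"),
               amount.vector2 = c("Ink", "Printer"),
               amount.vector3 = c("Recycling", "Combusted"))

# Logical matrix: one row per data row, one column per group
members <- sapply(groups, function(g) data$source %in% g)

# Keep the amount where the row belongs to the group, 0 elsewhere
amounts <- members * data$amount.inkg

# Sum each column within every company/year pair that occurs in the data
result <- aggregate(amounts,
                    by = list(Company = data$company.raw, Year = data$years.raw),
                    FUN = sum)
```

Because every row contributes to the aggregation (with 0 where it does not belong to a group), each company/year pair present in the data gets a row, and missing categories show up as 0.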
License: CC-BY-SA with attribution
Not affiliated with StackOverflow