Question

I have a data frame with annual exports of firms to different countries in different years. I need to create a variable that gives, for each year, the number of firms exporting to each country. I can do this with a "tapply" command, like

incumbents <- tapply(id, destination-year, function(x) length(unique(x)))

and it works just fine. My problem is that incumbents has length length(destination-year), and I need it to have length length(id) (there are many firms serving each destination each year), so that I can use it in a subsequent regression, with each value matched to the right year and destination. A "for" loop can do this, but it is very time-consuming because the data set is quite large.

Any suggestions?


Solution

You don't provide a reproducible example, so I can't test this, but you should be able to use ave:

incumbents <- ave(id, destination-year, FUN=function(x) length(unique(x)))
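Since the answer above is untested, here is a minimal sketch on made-up data showing that ave returns one value per row rather than one per group. The data frame `firms` and its values are hypothetical, and the grouping variables are passed separately (destination and year) instead of as a combined key, which is an assumption about how the original data is laid out.

```r
# Hypothetical toy data: one row per firm-destination-year export record
firms <- data.frame(
  id          = c(1L, 2L, 2L, 3L, 1L, 4L),
  destination = c("A", "A", "A", "A", "B", "B"),
  year        = c(2000L, 2000L, 2000L, 2001L, 2000L, 2000L)
)

# ave() applies FUN within each destination/year group and returns a vector
# with one entry per row, in the original row order
firms$incumbents <- ave(
  firms$id, firms$destination, firms$year,
  FUN = function(x) length(unique(x))
)

print(firms)
```

Note that ave returns a vector of the same type as its first argument, so with character firm ids the counts would come back as character strings and need as.integer().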

Other tips

Compute the summary with tapply as you already do, then join it back onto the original data frame with merge.

Since you didn't provide example data, I made some. Modify accordingly.

n           = 1000
id          = sample(1:10, n, replace=TRUE)
year        = sample(2000:2011, n, replace=TRUE)
destination = sample(LETTERS[1:6], n, replace=TRUE)

`destination-year` = paste(destination, year, sep='-')

# Note: data.frame() converts the non-syntactic name `destination-year` to destination.year
dat = data.frame(id, year, destination, `destination-year`)

Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.

incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)

Finally, merge back in with the original data:

merge(dat, incumbents)
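One thing to keep in mind is that merge sorts the result by the key column by default, so the rows no longer come back in their original order. If the order matters, a possible alternative is to index the named tapply result by the key of each row. This is only a sketch on hypothetical toy data (`firms`, `key` and `counts` are illustrative names, not part of the original question):

```r
# Hypothetical toy data: one row per firm-destination-year export record
firms <- data.frame(
  id          = c(1L, 2L, 2L, 3L, 1L, 4L),
  destination = c("A", "A", "A", "A", "B", "B"),
  year        = c(2000L, 2000L, 2000L, 2001L, 2000L, 2000L)
)

# Combined grouping key, one per row
key <- paste(firms$destination, firms$year, sep = "-")

# One count per key; the result is named by key
counts <- tapply(firms$id, key, function(x) length(unique(x)))

# Look up each row's key by name: one value per row, original order preserved
firms$incumbents <- as.vector(counts[key])

print(firms)
```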

By the way, instead of combining destination and year into a third variable, as it seems you've done, you can pass both variables to tapply directly as a list:

library(reshape2)  # provides melt()
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
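To make the list form more concrete: with two grouping variables, tapply returns a destination-by-year matrix, with NA in cells that have no observations. As a rough sketch that avoids melt altogether, that matrix can be indexed with a two-column character matrix of (destination, year) pairs to pull out one count per row. The data and the names `firms` and `counts` are hypothetical.

```r
# Hypothetical toy data: one row per firm-destination-year export record
firms <- data.frame(
  id          = c(1L, 2L, 2L, 3L, 1L, 4L),
  destination = c("A", "A", "A", "A", "B", "B"),
  year        = c(2000L, 2000L, 2000L, 2001L, 2000L, 2000L)
)

# Two-way table of unique firm counts: rows are destinations, columns are years
counts <- tapply(
  firms$id,
  list(destination = firms$destination, year = firms$year),
  function(x) length(unique(x))
)
print(counts)

# Character matrix indexing: each row of idx names one (destination, year) cell
idx <- cbind(as.character(firms$destination), as.character(firms$year))
firms$incumbents <- as.vector(counts[idx])

print(firms)
```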

Using @JohnColby's excellent example data, I was thinking of something more along the lines of this:

#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')

dat = data.frame(id, year, destination, destinationYear)

library(plyr)  # provides ddply()
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))

#Or if more speed is required, use data.table
library(data.table)
datTable <- data.table(dat)

#Add the per-group count by reference with :=
datTable[, newCol := length(unique(id)), by = destinationYear]
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow