How to find balanced panel data in R (aka, how to find which entries in a panel are complete over a given window)
Question
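I have a large panel of data from Compustat. To this I'm adding some hand-collected data (seriously hand-collected, from a bunch of old books). But I don't want to hand-collect for the entire panel, only for a randomly chosen subset. To find the larger set (from which I'll randomly choose), I'd like to start with a balanced panel from Compustat.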
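I see that the plm library is built for working with unbalanced panels, but I'd like to keep mine balanced. Is there a clean way to do this, short of searching through and throwing out the firms (individuals, in panelspeak) that don't run the whole sample period? Thanks!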
Solution
On second thought, there is a simpler way to do this.
Look at this:
# Keep only the subjects that have the expected number of observations
data.with.only.complete.subjects.data <- function(xx, subject.column, number.of.observation.a.subject.should.have)
{
  subjects <- xx[, subject.column]
  # count the rows available for each subject
  num.of.observations.per.subject <- table(subjects)
  subjects.to.keep <- names(num.of.observations.per.subject)[num.of.observations.per.subject == number.of.observation.a.subject.should.have]
  subset.by.me <- subjects %in% subjects.to.keep
  new.xx <- xx[subset.by.me, ]
  return(new.xx)
}

# Example: four subjects with three observations each
xx <- data.frame(subject = rep(1:4, each = 3),
                 observation.per.subject = rep(rep(1:3), 4))
# drop one observation from subject 1 and one from subject 2
xx.mis <- xx[-c(2,5),]
# only subjects 3 and 4 remain
data.with.only.complete.subjects.data(xx.mis, 1, 3)
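One caveat: counting rows per subject does not check that those rows cover the same periods. Here is a minimal sketch of a window-aware variant, assuming the panel has an id column and a period column. The helper name keep.balanced and the toy data are made up for illustration.

```r
# Hypothetical helper: keep only subjects observed in every period of `window`.
# `id.col` and `time.col` are column names; `window` is the vector of required periods.
keep.balanced <- function(df, id.col, time.col, window) {
  # drop rows that fall outside the window
  in.window <- df[df[[time.col]] %in% window, , drop = FALSE]
  # number of distinct window periods observed for each subject
  periods.per.id <- tapply(in.window[[time.col]], in.window[[id.col]],
                           function(t) length(unique(t)))
  complete.ids <- names(periods.per.id)[which(periods.per.id == length(unique(window)))]
  in.window[as.character(in.window[[id.col]]) %in% complete.ids, , drop = FALSE]
}

# Toy panel (made-up values): firm 2 is missing 2001, firm 3 starts late
toy <- data.frame(gvkey = c(1, 1, 1, 2, 2, 3, 3),
                  fyear = c(2000, 2001, 2002, 2000, 2002, 2001, 2002))
bal <- keep.balanced(toy, "gvkey", "fyear", 2000:2002)
bal
```

Passing the window explicitly avoids guessing the expected number of observations from the data, at the cost of having to know the sample period up front.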
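Other tips
Looking at it now, I lost the formatting on some of the data, but I can figure that out later. Here is my attempt at the panel-balancing part: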
> data <- read.csv("223601533.csv")
> head(data)
gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin
1 2721 INDL HIST_STD C I 2000 12 20001231 264 JP3242800005
2 2721 INDL HIST_STD C I 2001 12 20011231 264 JP3242800005
3 2721 INDL HIST_STD C I 2002 12 20021231 264 JP3242800005
4 2721 INDL HIST_STD C I 2003 12 20031231 264 JP3242800005
5 2721 INDL HIST_STD C I 2004 12 20041231 264 JP3242800005
6 2721 INDL HIST_STD C I 2005 12 20051231 264 JP3242800005
sedol conm costat fic
1 6172323 CANON INC A JPN
2 6172323 CANON INC A JPN
3 6172323 CANON INC A JPN
4 6172323 CANON INC A JPN
5 6172323 CANON INC A JPN
6 6172323 CANON INC A JPN
>
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
> data.bal <- as.data.frame(matrix(NA, nrow=nt.bal, ncol=ncol(data)))
> colnames(data.bal) <- colnames(data)
>
> for(i in 1:length(pot.obs)) {
+ last.row <- i * mode.num.obs
+ first.row <- last.row - (mode.num.obs - 1)
+ data.bal[first.row:last.row, ] <- subset(data, gvkey == pot.obs[i])
+ }
>
> head(data.bal)
gvkey indfmt datafmt consol popsrc fyear fyr datadate exchg isin sedol conm
1 2721 2 1 1 1 2000 12 20001231 264 875 359 331
2 2721 2 1 1 1 2001 12 20011231 264 875 359 331
3 2721 2 1 1 1 2002 12 20021231 264 875 359 331
4 2721 2 1 1 1 2003 12 20031231 264 875 359 331
5 2721 2 1 1 1 2004 12 20041231 264 875 359 331
6 2721 2 1 1 1 2005 12 20051231 264 875 359 331
costat fic
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
>
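The lost formatting in the output above appears to come from filling a data frame that was pre-allocated from an NA matrix: factor columns get written in as their integer codes. Here is a small sketch on made-up data that compares this with subsetting by a logical index, which keeps the column types. The toy data frame is hypothetical, and conm is created as a factor explicitly because read.csv stopped doing that by default in R 4.0.

```r
# Toy stand-in for the Compustat extract (made-up rows)
toy <- data.frame(gvkey = c(1, 1, 2),
                  conm  = factor(c("CANON INC", "CANON INC", "OTHER CO")))

# Approach 1: pre-allocate from an NA matrix, then fill rows
filled <- as.data.frame(matrix(NA, nrow = 2, ncol = ncol(toy)))
colnames(filled) <- colnames(toy)
filled[1:2, ] <- subset(toy, gvkey == 1)
class(filled$conm)   # no longer a factor

# Approach 2: subset the original data frame directly
kept <- toy[toy$gvkey %in% 1, ]
class(kept$conm)     # still a factor
```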
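Update: I think this solution is not as good as the other one I posted above, but I'm leaving it here as an example of a solution that isn't so good :)
Hi Richard,
Without some sample data it is a bit difficult to help you.
But it sounds like you could use "melt" and "cast" from the "reshape" package to reshape your data. Doing so will let you find which subjects have too few observations, and you can then use that information to subset your data.
Here is some example code for how to do this: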
xx <- data.frame(subject = rep(1:4, each = 3),
observation.per.subject = rep(rep(1:3), 4))
xx.mis <- xx[-c(2,5),]
require(reshape)
# count the observations per subject
num.of.obs.per.subject <- cast(xx.mis, subject ~ .)
the.number <- num.of.obs.per.subject[,2]
# keep the subjects that have all 3 observations
subjects.to.keep <- num.of.obs.per.subject[,1][the.number == 3]
ss.index.of.who.to.keep <- xx.mis$subject %in% subjects.to.keep
xx.to.work.with <- xx.mis[ss.index.of.who.to.keep, ]
xx.to.work.with
Cheers,
Tal
> # read data
> file.in <- "243815928.csv"
> data <- read.csv(file.in)
>
> # find which gvkeys run the entire sample period
> obs.all <- tabulate(data$gvkey) # incl lots of zeros for unused gvkey
> num.obs <- tabulate(obs.all)
> mode.num.obs <- which(num.obs == max(num.obs))
> nt.bal <- num.obs[mode.num.obs] * mode.num.obs
> pot.obs <- which(obs.all == mode.num.obs)
>
> # create new df w/o firms that don't run the whole sample period
> pot.obs.index <- which(data$gvkey %in% pot.obs)
> data.bal <- data[pot.obs.index, ]
>
> # write data to csv file
> file.out <- paste(substr(file.in, 1, (nchar(file.in)-4)), "sorted.csv", sep="")
> write.csv(data.bal, file.out)
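To double-check the result, a quick balance test can help. This is only a sketch, assuming gvkey and fyear identify the firm and the period. is.balanced is a hypothetical helper written for this example (plm ships its own is.pbalanced), and the second gvkey in the toy data is made up.

```r
# Hypothetical helper: a panel is balanced when every id/period pair
# shows up exactly once
is.balanced <- function(df, id.col, time.col) {
  cell.counts <- table(df[[id.col]], df[[time.col]])
  all(cell.counts == 1)
}

# Toy data: two firms over three fiscal years
balanced.toy <- data.frame(gvkey = rep(c(2721, 3000), each = 3),
                           fyear = rep(2000:2002, 2))
# dropping one firm-year breaks the balance
unbalanced.toy <- balanced.toy[-2, ]

is.balanced(balanced.toy, "gvkey", "fyear")
is.balanced(unbalanced.toy, "gvkey", "fyear")
```

Unlike the mode-based count in the transcript, this also catches firms that have the right number of rows but in the wrong years, as well as duplicated firm-years.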