如何重写“ sapply”命令以提高性能？

https://stackoverflow.com/questions/5303807

24-10-2019
|

题

我有一个data.frame名为〜1,300,000行和4列的“ D”列为“ D”，另一个数据名为〜12,000行和2列的“ GC”帧（但请参见下面的较小示例）。

d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("a","b","c"), chr=c("c1","c2","c3") )

这是“ D”的样子：

   gene         val ind exp
1     a  1.38711902  i1  e1
2     b -0.25578496  i1  e1
3     c  0.49331256  i1  e1
4     a -1.38015272  i1  e2
5     b  1.46779219  i1  e2
6     c -0.84946320  i1  e2
7     a  0.01188061  i2  e1
8     b -0.13225808  i2  e1
9     c  0.16508404  i2  e1
10    a  0.70949804  i2  e2
11    b -0.64950167  i2  e2
12    c  0.12472479  i2  e2

这是“ GC”：

  gene chr
1    a  c1
2    b  c2
3    c  c3

我想通过合并与“ D”的第一列的“ GC”数据中的数据，将第五列添加到“ D”中。目前我正在使用 sapply.

d$chr <- sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )

但是在真实数据上，这需要一个“很长的”时间（我正在使用“ system.time（）”运行命令，因为超过30分钟，但仍未完成）。

您是否知道如何以巧妙的方式重写它？还是我应该考虑使用 plyr, ，也许有“并行”选项（我的计算机上有四个内核）？在这种情况下，最好的语法是什么？

提前致谢。

解决方案

我认为您可以将因子用作索引：

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

做与：

 sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

但是要快得多：

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
   5.03    0.00    5.02 
> 
> system.time(replicate(1000,gc[ d[,1], 2]))
   user  system elapsed 
   0.12    0.00    0.13

编辑：

扩展我的评论。这 gc 数据帧需要每个级别的一行 gene 按照级别的顺序起作用：

 d <- data.frame( gene=rep(c("a","b","c"),4), val=rnorm(12), ind=c( rep(rep("i1",3),2), rep(rep("i2",3),2) ), exp=c( rep("e1",3), rep("e2",3), rep("e1",3), rep("e2",3) ) )
gc <- data.frame( gene=c("c","a","b"), chr=c("c1","c2","c3") )

gc[ d[,1], 2]
 [1] c1 c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

但是并不难解决：

levels(gc$gene) <- levels(d$gene) # Seems redundant as this is done right quite often automatically
gc <- gc[order(gc$gene),]


gc[ d[,1], 2]
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )
 [1] c2 c3 c1 c2 c3 c1 c2 c3 c1 c2 c3 c1
Levels: c1 c2 c3

其他提示

一种不击败萨沙（Sasha）的方法时机的替代解决方案，但更具普遍性和可读性为就是简单 merge 两个数据帧：

d <- merge(d, gc)

我的系统较慢，所以这是我的时间：

> system.time(replicate(1000,sapply( 1:nrow(d), function(x) gc[ gc$gene==d[x,1], ]$chr )))
   user  system elapsed 
  11.22    0.12   11.86 
> system.time(replicate(1000,gc[ d[,1], 2])) 
   user  system elapsed 
   0.34    0.00    0.35 
> system.time(replicate(1000, merge(d, gc, by="gene"))) 
   user  system elapsed 
   3.35    0.02    3.40

好处是，您可以拥有多个键，对非匹配项的精细控制等。

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow