How do I vectorize strsplit in R?

https://stackoverflow.com/questions/3054612

27-09-2019

Question

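When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is because strsplit returns a list. Is there a way to vectorize the process, that is, to have the function produce the correct list element for each element of the input?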

For example, to count the lengths of the words in a character vector:

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, I would like something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant element of the input vector.

Solution

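In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so avoid it if possible. In your example, you should use nchar instead: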

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
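As a further sketch along the same lines (not part of the original answer), base R also offers vapply, which checks that each result is a single integer, and lengths, which returns the length of every list element in one call. This assumes R 3.2.0 or later for lengths; the variable names are illustrative only.

```r
# Illustrative sketch: two base-R ways to get per-element lengths from
# the list that strsplit returns. Variable names are hypothetical.
words <- c("a", "quick", "brown", "fox")

# strsplit returns one character vector per input element.
chars <- strsplit(words, "")

# vapply declares the expected result type (one integer per element).
counts_vapply <- vapply(chars, length, integer(1))

# lengths() (base R >= 3.2.0) computes the same thing without an explicit loop.
counts_lengths <- lengths(chars)

print(counts_vapply)
print(counts_lengths)
```

Both calls return a plain integer vector, so no as.numeric conversion is needed afterwards.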

Or else use a function from the l*ply family in plyr. For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, we can do our counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30 
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03 

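The vectorized function (nchar) and the lapply version are considerably faster than the original sapply version. All of the solutions return the same answer, as the summary output shows.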

Apparently the latest version of plyr is faster (this was run with a slightly older version).
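To repeat the comparison without downloading the text, a rough sketch using synthetic words could look like the following. The word list, its size, and all variable names here are assumptions made for illustration, not the data used above, and the timings will differ from machine to machine.

```r
# Illustrative sketch: time three of the approaches on synthetic words.
# The data below is randomly generated, not the Ulysses text.
set.seed(1)
n <- 20000
word_lengths <- sample(1:10, n, replace = TRUE)
synthetic <- vapply(
  word_lengths,
  function(k) paste(sample(letters, k, replace = TRUE), collapse = ""),
  character(1)
)

# Original approach: split and count one word at a time.
t_sapply <- system.time(
  r1 <- sapply(synthetic, function(x) length(strsplit(x, "")[[1]]),
               USE.NAMES = FALSE)
)

# Vectorized approach: nchar works on the whole vector at once.
t_nchar <- system.time(r2 <- nchar(synthetic))

# Split once, then take the length of each list element.
t_lapply <- system.time(
  r3 <- as.numeric(lapply(strsplit(synthetic, ""), length))
)

# Elapsed seconds for each approach.
print(c(sapply = t_sapply[["elapsed"]],
        nchar  = t_nchar[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))

# All three approaches should agree on the counts.
stopifnot(all(r1 == r2), all(r2 == r3))
```

On small inputs the differences may be too small to measure reliably; increasing n makes the gap between the per-word sapply call and the vectorized nchar call easier to see.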

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow