My "splitstackshape" package was written to deal with these types of tasks. You can explore the concat.split
family of functions.
Here are a few examples:
As a list. (But the function sorts the output, so you'd be better off with strsplit until I add an option to not sort the output.)
library(splitstackshape)
x1 <- concat.split.list(mydf, split.col="features", sep=",", drop = TRUE)
x1
# id features_list
# 1. 131 FeatureA, FeatureB, FeatureC
# 2. 132 FeatureA, FeatureD, FeatureE, FeatureF
# 3. 135 FeatureD, FeatureE, FeatureC
# 4. 139 FeatureF, FeatureB
str(x1)
# 'data.frame': 4 obs. of 2 variables:
# $ id : int 131 132 135 139
# $ features_list:List of 4
# ..$ : chr "FeatureA" "FeatureB" "FeatureC"
# ..$ : chr "FeatureA" "FeatureD" "FeatureE" "FeatureF"
# ..$ : chr "FeatureD" "FeatureE" "FeatureC"
# ..$ : chr "FeatureF" "FeatureB"
As a "wide" data.frame:
x2 <- concat.split.multiple(mydf, split.cols="features", seps=",")
x2
# id features_1 features_2 features_3 features_4
# 1. 131 FeatureA FeatureB FeatureC <NA>
# 2. 132 FeatureA FeatureD FeatureE FeatureF
# 3. 135 FeatureD FeatureE FeatureC <NA>
# 4. 139 FeatureF FeatureB <NA> <NA>
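If you'd rather not depend on the package, roughly the same "wide" shape can be sketched in base R. This is not how concat.split.multiple works internally; it is just strsplit plus NA padding. The mydf here is my reconstruction of the question's data from the output shown in this answer, and the names parts, n, wide and out are only for illustration.

```r
# Assumed input, reconstructed from the printed output in this answer
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# Split each string on literal commas; original order is kept
parts <- strsplit(mydf$features, ",", fixed = TRUE)

# Width of the widest row
n <- max(lengths(parts))

# Indexing past the end of a vector pads with NA, so every row gets n values
wide <- t(vapply(parts, function(p) p[seq_len(n)], character(n)))
colnames(wide) <- paste0("features_", seq_len(n))

out <- data.frame(id = mydf$id, wide, stringsAsFactors = FALSE)
out
```

If features is a factor in your data, wrap it in as.character() before calling strsplit.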
As a "long" data.frame:
x3 <- concat.split.multiple(mydf, split.cols="features", seps=",", direction="long")
x3
# id time features
# 1 131 1 FeatureA
# 2 132 1 FeatureA
# 3 135 1 FeatureD
# 4 139 1 FeatureF
# 5 131 2 FeatureB
# 6 132 2 FeatureD
# 7 135 2 FeatureE
# 8 139 2 FeatureB
# 9 131 3 FeatureC
# 10 132 3 FeatureE
# 11 135 3 FeatureC
# 12 139 3 <NA>
# 13 131 4 <NA>
# 14 132 4 FeatureF
# 15 135 4 <NA>
# 16 139 4 <NA>
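A "long" version without the NA padding can also be sketched in base R by repeating each id once per feature. Again, this is an alternative to the package function rather than its implementation, the mydf is reconstructed from the output shown in this answer, and parts and long are illustrative names. Note that the rows come out grouped by id rather than by position, and there is no time column.

```r
# Assumed input, reconstructed from the printed output in this answer
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# One character vector of features per row
parts <- strsplit(mydf$features, ",", fixed = TRUE)

# Repeat each id as many times as it has features, then flatten the list
long <- data.frame(
  id = rep(mydf$id, lengths(parts)),
  features = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

lengths() needs R 3.2.0 or later; on older versions, sapply(parts, length) does the same job.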
Update, based on your comment:
Here's the result of using strsplit directly, as I mentioned in the comment. Note the approach to extraction.
> mydf$featuresList <- strsplit(mydf$features, ",")
> mydf
id features featuresList
1. 131 FeatureA,FeatureB,FeatureC, FeatureA, FeatureB, FeatureC
2. 132 FeatureA,FeatureD,FeatureE,FeatureF FeatureA, FeatureD, FeatureE, FeatureF
3. 135 FeatureD,FeatureE,FeatureC FeatureD, FeatureE, FeatureC
4. 139 FeatureF,FeatureB FeatureF, FeatureB
> mydf[, "featuresList"][[2]]
[1] "FeatureA" "FeatureD" "FeatureE" "FeatureF"
> mydf[, "featuresList"][[2]][2]
[1] "FeatureD"
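The same kind of extraction can be applied to every row at once with vapply. This is a small sketch on top of the list column created above; the mydf is reconstructed from the output shown in this answer, and second and has_b are just illustrative names.

```r
# Assumed input, reconstructed from the printed output in this answer
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# List column: one character vector per row
mydf$featuresList <- strsplit(mydf$features, ",", fixed = TRUE)

# Second feature of every row (NA if a row has fewer than two)
second <- vapply(mydf$featuresList, function(x) x[2], character(1))

# Which rows contain "FeatureB"?
has_b <- vapply(mydf$featuresList, function(x) "FeatureB" %in% x, logical(1))

second
mydf$id[has_b]
```

vapply is used instead of sapply so that the result type is checked: you always get a character or logical vector back, even for an empty input.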