Question

I'm quite new to R and not sure how to do this. Suppose I have a TSV (tab-separated values) file and read it into a data frame with something like:

> table <- read.delim(file='test.tsv',sep='\t',header=TRUE,stringsAsFactors=FALSE)

    id              features
1. 131  FeatureA,FeatureB,FeatureC,
2. 132  FeatureA,FeatureD,FeatureE,FeatureF
3. 135  FeatureD,FeatureE,FeatureC
4. 139  FeatureF,FeatureB

I'd like to visualize the clustering of features, but to do this in R I'd need to change the type of the column named features into a list.

What is the best way to do this?

Solution

My "splitstackshape" package was written to deal with these types of tasks. You can explore the concat.split family of functions.

Here are a few examples:

As a list. (Note that the function sorts the output, so you'd be better off with strsplit until I add an option to not sort the output.)

library(splitstackshape)
x1 <- concat.split.list(mydf, split.col="features", sep=",", drop = TRUE)
x1
#     id                          features_list
# 1. 131           FeatureA, FeatureB, FeatureC
# 2. 132 FeatureA, FeatureD, FeatureE, FeatureF
# 3. 135           FeatureD, FeatureE, FeatureC
# 4. 139                     FeatureF, FeatureB
str(x1)
# 'data.frame':  4 obs. of  2 variables:
#  $ id           : int  131 132 135 139
#  $ features_list:List of 4
#   ..$ : chr  "FeatureA" "FeatureB" "FeatureC"
#   ..$ : chr  "FeatureA" "FeatureD" "FeatureE" "FeatureF"
#   ..$ : chr  "FeatureD" "FeatureE" "FeatureC"
#   ..$ : chr  "FeatureF" "FeatureB"

As a "wide" data.frame:

x2 <- concat.split.multiple(mydf, split.cols="features", seps=",")
x2
#     id features_1 features_2 features_3 features_4
# 1. 131   FeatureA   FeatureB   FeatureC       <NA>
# 2. 132   FeatureA   FeatureD   FeatureE   FeatureF
# 3. 135   FeatureD   FeatureE   FeatureC       <NA>
# 4. 139   FeatureF   FeatureB       <NA>       <NA>

As a "long" data.frame:

x3 <- concat.split.multiple(mydf, split.cols="features", seps=",", direction="long")
x3
#     id time features
# 1  131    1 FeatureA
# 2  132    1 FeatureA
# 3  135    1 FeatureD
# 4  139    1 FeatureF
# 5  131    2 FeatureB
# 6  132    2 FeatureD
# 7  135    2 FeatureE
# 8  139    2 FeatureB
# 9  131    3 FeatureC
# 10 132    3 FeatureE
# 11 135    3 FeatureC
# 12 139    3     <NA>
# 13 131    4     <NA>
# 14 132    4 FeatureF
# 15 135    4     <NA>
# 16 139    4     <NA>
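
For comparison, a similar "long" layout can be sketched in base R without the package. This is only an illustration: the data frame name `feat_df` and the result name `long_df` are made up here, and the sample data is retyped from the question. Unlike the output above, this version has no `time` column, adds no NA padding rows, and keeps the rows grouped by id.

```r
# Hypothetical sample data, retyped from the table in the question
feat_df <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# Split each string on commas; strsplit() does not return a trailing
# empty element, so the trailing comma in the first row is harmless
parts <- strsplit(feat_df$features, ",", fixed = TRUE)

# Repeat each id once per feature it owns, then flatten the list
long_df <- data.frame(
  id = rep(feat_df$id, times = lengths(parts)),
  features = unlist(parts),
  stringsAsFactors = FALSE
)
long_df
```

Note that `lengths()` requires R 3.2.0 or later; on older versions `sapply(parts, length)` should serve the same purpose.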

Update, based on your comment:

Here's the result of strsplit directly, as I mentioned in the comment. Note the approach to extraction.

> mydf$featuresList <- strsplit(mydf$features, ",")
> mydf
    id                            features                           featuresList
1. 131         FeatureA,FeatureB,FeatureC,           FeatureA, FeatureB, FeatureC
2. 132 FeatureA,FeatureD,FeatureE,FeatureF FeatureA, FeatureD, FeatureE, FeatureF
3. 135          FeatureD,FeatureE,FeatureC           FeatureD, FeatureE, FeatureC
4. 139                   FeatureF,FeatureB                     FeatureF, FeatureB
> mydf[, "featuresList"][[2]]
[1] "FeatureA" "FeatureD" "FeatureE" "FeatureF"
> mydf[, "featuresList"][[2]][2]
[1] "FeatureD"
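
Once the column is a list, it can be queried per row with the usual apply-style helpers. The following is a small sketch under the assumption that the data looks like the sample in the question; `feat_df`, `n_features` and `has_b` are names invented for this illustration.

```r
# Hypothetical sample data, retyped from the table in the question
feat_df <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# Store the split features as a list column, one character vector per row
feat_df$featuresList <- strsplit(feat_df$features, ",", fixed = TRUE)

# Number of features in each row
n_features <- lengths(feat_df$featuresList)

# Logical vector: does the row contain "FeatureB"?
has_b <- vapply(feat_df$featuresList,
                function(x) "FeatureB" %in% x,
                logical(1))

# ids of the rows that contain "FeatureB"
feat_df$id[has_b]
```

`vapply()` is used instead of `sapply()` so that the result is guaranteed to be a logical vector even if the data frame has no rows.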

OTHER TIPS

You could use strsplit:

table$list.features = strsplit(table$features,",")

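You might also want to create indicator variables for these features:

# Add one column per distinct feature, initialised to 0
table[unique(unlist(table$list.features))]=0
# Set the indicator to 1 for every feature present in row i
for (i in seq_len(nrow(table))) table[i,table$list.features[[i]]]=1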
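
The same indicator variables can also be built as a plain 0/1 matrix without a loop, which may be a convenient starting point for the clustering mentioned in the question. This is a minimal sketch, not a recommendation of a particular method: the names `feat_df`, `parts`, `all_features` and `ind` are invented here, and the choice of the "binary" (Jaccard-style) distance with the default `hclust()` linkage is an assumption about what kind of clustering is wanted.

```r
# Hypothetical sample data, retyped from the table in the question
feat_df <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

parts <- strsplit(feat_df$features, ",", fixed = TRUE)
all_features <- sort(unique(unlist(parts)))

# One row per id, one column per feature, 1 where the feature is present
ind <- t(vapply(parts,
                function(x) as.integer(all_features %in% x),
                integer(length(all_features))))
dimnames(ind) <- list(as.character(feat_df$id), all_features)
ind

# Assumed clustering step: pairwise binary distance between ids,
# followed by hierarchical clustering
d  <- dist(ind, method = "binary")
hc <- hclust(d)
# plot(hc)  # uncomment to draw the dendrogram
```

To cluster the features rather than the ids, the same calls could be applied to `t(ind)`.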
Licensed under: CC-BY-SA with attribution