Question

I'm quite new to the R language and not sure how to do this. Suppose I have a TSV (tab-separated values) file and read it into a data frame with something like:

> table <- read.delim(file='test.tsv',sep='\t',header=TRUE,stringsAsFactors=FALSE)

    id              features
1. 131  FeatureA,FeatureB,FeatureC,
2. 132  FeatureA,FeatureD,FeatureE,FeatureF
3. 135  FeatureD,FeatureE,FeatureC
4. 139  FeatureF,FeatureB

I'd like to visualize how the features cluster, but to do that in R I need to change the type of the column named features into a list.

What is the best way to do this?


Solution

My "splitstackshape" package was written to deal with these types of tasks. You can explore the concat.split family of functions.

Here are a few examples:

As a list. (But the function sorts the output, so you'd be better off with strsplit until I add an option to not sort the output.)

library(splitstackshape)
x1 <- concat.split.list(mydf, split.col="features", sep=",", drop = TRUE)
x1
#     id                          features_list
# 1. 131           FeatureA, FeatureB, FeatureC
# 2. 132 FeatureA, FeatureD, FeatureE, FeatureF
# 3. 135           FeatureD, FeatureE, FeatureC
# 4. 139                     FeatureF, FeatureB
str(x1)
# 'data.frame':  4 obs. of  2 variables:
#  $ id           : int  131 132 135 139
#  $ features_list:List of 4
#   ..$ : chr  "FeatureA" "FeatureB" "FeatureC"
#   ..$ : chr  "FeatureA" "FeatureD" "FeatureE" "FeatureF"
#   ..$ : chr  "FeatureD" "FeatureE" "FeatureC"
#   ..$ : chr  "FeatureF" "FeatureB"

As a "wide" data.frame:

x2 <- concat.split.multiple(mydf, split.cols="features", seps=",")
x2
#     id features_1 features_2 features_3 features_4
# 1. 131   FeatureA   FeatureB   FeatureC       <NA>
# 2. 132   FeatureA   FeatureD   FeatureE   FeatureF
# 3. 135   FeatureD   FeatureE   FeatureC       <NA>
# 4. 139   FeatureF   FeatureB       <NA>       <NA>

As a "long" data.frame:

x3 <- concat.split.multiple(mydf, split.cols="features", seps=",", direction="long")
x3
#     id time features
# 1  131    1 FeatureA
# 2  132    1 FeatureA
# 3  135    1 FeatureD
# 4  139    1 FeatureF
# 5  131    2 FeatureB
# 6  132    2 FeatureD
# 7  135    2 FeatureE
# 8  139    2 FeatureB
# 9  131    3 FeatureC
# 10 132    3 FeatureE
# 11 135    3 FeatureC
# 12 139    3     <NA>
# 13 131    4     <NA>
# 14 132    4 FeatureF
# 15 135    4     <NA>
# 16 139    4     <NA>
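If the package is not available, a similar "long" shape can be sketched in base R. This is not what concat.split.multiple does internally, and the result differs: there is no time column and no NA padding, just one row per id/feature pair. The sample data is retyped from the question; the names parts and long are placeholders chosen for this sketch.

```r
# Sample data retyped from the question (note the trailing comma in row 1).
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# Split on literal commas; strsplit() does not return a trailing empty piece.
parts <- strsplit(mydf$features, ",", fixed = TRUE)

# Repeat each id once per feature it has, then flatten the list of features.
long <- data.frame(
  id = rep(mydf$id, lengths(parts)),
  features = unlist(parts),
  stringsAsFactors = FALSE
)
long
```

Because the ids are repeated by lengths(parts), rows with fewer features simply contribute fewer rows.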

Update, based on your comment:

Here's the result of strsplit directly, as I mentioned in the comment. Note the approach to extraction.

> mydf$featuresList <- strsplit(mydf$features, ",")
> mydf
    id                            features                           featuresList
1. 131         FeatureA,FeatureB,FeatureC,           FeatureA, FeatureB, FeatureC
2. 132 FeatureA,FeatureD,FeatureE,FeatureF FeatureA, FeatureD, FeatureE, FeatureF
3. 135          FeatureD,FeatureE,FeatureC           FeatureD, FeatureE, FeatureC
4. 139                   FeatureF,FeatureB                     FeatureF, FeatureB
> mydf[, "featuresList"][[2]]
[1] "FeatureA" "FeatureD" "FeatureE" "FeatureF"
> mydf[, "featuresList"][[2]][2]
[1] "FeatureD"
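Once the column is a list, it can be queried row by row. The following is a minimal sketch, assuming the same sample data as in the question; the names n_features, has_b and ids_with_b are made up for illustration.

```r
# Sample data retyped from the question.
mydf <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)

# Store the split result as a list column, one character vector per row.
mydf$featuresList <- strsplit(mydf$features, ",", fixed = TRUE)

# Number of features in each row.
n_features <- lengths(mydf$featuresList)

# For each row, does its feature vector contain "FeatureB"?
has_b <- vapply(mydf$featuresList, function(f) "FeatureB" %in% f, logical(1))

# ids of the rows that contain "FeatureB".
ids_with_b <- mydf$id[has_b]
ids_with_b
```

Note that strsplit needs a character vector, which is why the column is read with stringsAsFactors=FALSE; mydf$featuresList[[2]] is equivalent to the mydf[, "featuresList"][[2]] extraction used above.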

Other tips

You could use strsplit:

table$list.features = strsplit(table$features,",")

You might also want to create indicator variables for these features:

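# Add one column per distinct feature, initialised to 0
table[unique(unlist(table$list.features))] <- 0
# For each row, set the columns named after its own features to 1
for (i in seq_len(nrow(table))) table[i, table$list.features[[i]]] <- 1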
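The same indicator variables can also be built without a loop by cross-tabulating ids against features, and the resulting matrix can feed a clustering step, which is what the question was ultimately after. This is only a sketch under assumptions: the data frame is named feat here (a hypothetical name, chosen so the base table() function is not shadowed), and binary distance with the default hclust linkage is one plausible choice among many, not something the answers above prescribe.

```r
# Sample data retyped from the question.
feat <- data.frame(
  id = c(131L, 132L, 135L, 139L),
  features = c("FeatureA,FeatureB,FeatureC,",
               "FeatureA,FeatureD,FeatureE,FeatureF",
               "FeatureD,FeatureE,FeatureC",
               "FeatureF,FeatureB"),
  stringsAsFactors = FALSE
)
parts <- strsplit(feat$features, ",", fixed = TRUE)

# Cross-tabulate: one row per id, one column per distinct feature.
ind <- table(id = rep(feat$id, lengths(parts)), feature = unlist(parts))

# Drop the "table" class to get a plain integer matrix, and cap counts at 1
# in case a feature is listed twice for the same id.
ind_mat <- unclass(ind)
ind_mat[ind_mat > 0L] <- 1L

# Cluster the features (columns) by which ids they occur in.
# method = "binary" treats each column as a presence/absence vector.
hc <- hclust(dist(t(ind_mat), method = "binary"))
# plot(hc) would draw the dendrogram.
```

Transposing with t() clusters features; dropping the t() would cluster ids instead.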
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow