Question

I have a .csv output for two samples, with a few 'calculator' statistics computed for each sample. Some 'calculators' have associated lower and upper confidence interval values. Eventually, I want to plot boxplots for all calculators, with error bars showing the confidence intervals for those calculators that have them. But first, I need to reshape the data into an R-friendly format.

How do I take this input:

df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59), 
                 invsimpson_hci =c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02), 
                 shannon_hci = c(3.77, 3.06))

Which looks like this:

  sample nseqs coverage invsimpson invsimpson_lci invsimpson_hci shannon shannon_lci shannon_hci
1 0.22um 29445     0.96      20.36          19.99          20.76    3.75        3.73        3.77
2    3um 30212     0.99       8.76           8.59           8.95    3.04        3.02        3.06

And convert it to this:

  sample calculator value  lci  hci
1 0.22um      nseqs   num <NA> <NA>
2 0.22um   coverage   num <NA> <NA>
3 0.22um invsimpson   num  num  num
4 0.22um    shannon   num  num  num
5    3um      nseqs   num <NA> <NA>
6    3um   coverage   num <NA> <NA>
7    3um invsimpson   num  num  num
8    3um    shannon   num  num  num

where num stands for the corresponding value from df. The result should have NA in lci and hci wherever the original df had no confidence interval for that calculator.

library(reshape2)
temp <- melt(df, id.vars = c("sample"), measure.vars = c("nseqs", "coverage", "invsimpson", "shannon"), variable.name = "calculator")
# sort the melted rows by sample
partial.solution <- temp[with(temp, order(sample)), ]

will get values for all calculators but getting lci and hci to fall in line is a bit tricky.

A generic solution would be awesome. I expect matrices with hundreds of samples and a variable number of calculators.

Thanks for all your help!


Solution 2

You may try this:

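library(reshape2)

# long format: one row per sample and original column
temp <- melt(df, id.vars = "sample")

# split each column name at "_" into calculator name and statistic (lci/hci);
# columns without a suffix have no statistic
df2 <- cbind(temp, colsplit(string = temp$variable, pattern = "_",
                            names = c("calc", "stat")))

# back to wide: one row per sample and calculator, one column per statistic
df3 <- dcast(df2, sample + calc ~ stat, value.var = "value")
df3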

#   sample       calc    Var.3   hci   lci
# 1 0.22um   coverage     0.96    NA    NA
# 2 0.22um invsimpson    20.36 20.76 19.99
# 3 0.22um      nseqs 29445.00    NA    NA
# 4 0.22um    shannon     3.75  3.77  3.73
# 5    3um   coverage     0.99    NA    NA
# 6    3um invsimpson     8.76  8.95  8.59
# 7    3um      nseqs 30212.00    NA    NA
# 8    3um    shannon     3.04  3.06  3.02

Possibly rename and reorder variables:

names(df3) <- c("sample", "calculator", "value", "hci",  "lci")
df3[ , c("sample", "calculator", "value", "lci",  "hci")]

#   sample calculator    value   lci   hci
# 1 0.22um   coverage     0.96    NA    NA
# 2 0.22um invsimpson    20.36 19.99 20.76
# 3 0.22um      nseqs 29445.00    NA    NA
# 4 0.22um    shannon     3.75  3.73  3.77
# 5    3um   coverage     0.99    NA    NA
# 6    3um invsimpson     8.76  8.59  8.95
# 7    3um      nseqs 30212.00    NA    NA
# 8    3um    shannon     3.04  3.02  3.06
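
Since the question asks for something generic, here is a minimal base R sketch of the same idea without reshape2. The helper name `to_long` and its arguments are hypothetical, and it assumes that interval columns are always named `<calculator>_lci` and `<calculator>_hci`, as in the example data:

```r
df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59),
                 invsimpson_hci = c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02),
                 shannon_hci = c(3.77, 3.06))

# Hypothetical helper, not part of any package.
to_long <- function(df, id = "sample", lo = "_lci", hi = "_hci") {
  cols <- setdiff(names(df), id)
  # calculators are the columns that are not themselves interval bounds
  is_bound <- grepl(paste0("(", lo, "|", hi, ")$"), cols)
  calcs <- cols[!is_bound]
  # return a bound column, or NA when the calculator has no such column
  pick <- function(name) {
    if (name %in% names(df)) df[[name]] else rep(NA_real_, nrow(df))
  }
  # one small data frame per calculator, stacked afterwards
  pieces <- lapply(calcs, function(calc) {
    data.frame(df[id],
               calculator = calc,
               value = df[[calc]],
               lci = pick(paste0(calc, lo)),
               hci = pick(paste0(calc, hi)),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pieces)
  # group rows by sample, keeping calculators in their original column order
  out <- out[order(out[[id]]), ]
  rownames(out) <- NULL
  out
}

long <- to_long(df)
long
```

Because the calculators are discovered from the column names, adding more samples or more calculators should not require changing the function, as long as the naming assumption holds.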

Other tips

I would do it in 2 steps:

## melt comes from reshape2
library(reshape2)

## put the plain value columns in long format using melt
## (only sample, nseqs, coverage, invsimpson and shannon are needed here)
xx = melt(df[,c(1,2,3,4,7)])

## use reshape to put the lci and hci columns in long format;
## after subsetting, columns 2:3 are the *_lci pair and 4:5 the *_hci pair
yy = reshape(df[,c(1,5,8,6,9)],direction='long',
        varying=list(c(2,3),c(4,5)),
        times=c('invsimpson','shannon'),
        sep="_", v.names=c("lci", "hci"))[,c('sample','time','lci','hci')]

Then merge the 2 results

 merge(xx,yy,by=1:2,all.x=TRUE)

 sample   variable    value   lci   hci
1 0.22um      nseqs 29445.00    NA    NA
2 0.22um   coverage     0.96    NA    NA
3 0.22um invsimpson    20.36 19.99 20.76
4 0.22um    shannon     3.75  3.73  3.77
5    3um      nseqs 30212.00    NA    NA
6    3um   coverage     0.99    NA    NA
7    3um invsimpson     8.76  8.59  8.95
8    3um    shannon     3.04  3.02  3.06
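
Since `varying` in `reshape` pairs columns by position, one cheap safeguard is to select the columns by name and assert that every lower bound does not exceed its upper bound. A sketch in base R; the `timevar = "calculator"` renaming and the `stopifnot` check are additions for illustration, not part of the answer above:

```r
df <- data.frame(sample = as.factor(c("0.22um", "3um")),
                 nseqs = c(29445, 30212), coverage = c(0.96, 0.99),
                 invsimpson = c(20.36, 8.76), invsimpson_lci = c(19.99, 8.59),
                 invsimpson_hci = c(20.76, 8.95),
                 shannon = c(3.75, 3.04), shannon_lci = c(3.73, 3.02),
                 shannon_hci = c(3.77, 3.06))

lci_cols <- c("invsimpson_lci", "shannon_lci")
hci_cols <- c("invsimpson_hci", "shannon_hci")

# name the column groups explicitly instead of relying on positions
yy <- reshape(df[, c("sample", lci_cols, hci_cols)],
              direction = "long",
              varying = list(lci_cols, hci_cols),
              v.names = c("lci", "hci"),
              times = c("invsimpson", "shannon"),
              timevar = "calculator")

# a mismatched lci/hci pairing would typically trip this check
stopifnot(all(yy$lci <= yy$hci))

yy[, c("sample", "calculator", "lci", "hci")]
```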

Licensed under: CC-BY-SA with attribution