Question

I have this dataset and would like to recast it so that the ID.name values are the rows, the Canonical_Hugo_Symbol values are the column names, and the Canonical_Protein_Change values fill the cells. Ideally the remaining cells would contain 0 rather than NA.

mydata.df <- data.frame(ID.name = c("1000", "1000", "1000", "1001","1001","1001","1002","1002" ), Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs","p.S5383fs","p.D519V","p.S51A", "p.K183_splice" ), Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1","gene3","gene4","gene1", "gene2" ))

I melted it:

ff.melt <- melt(mydata.df, id.var = c("ID.name", "Canonical_Hugo_Symbol"))

ff.melt
  ID.name Canonical_Hugo_Symbol                 variable         value
1    1000                 gene1 Canonical_Protein_Change      p.Y1467H
2    1000                 gene3 Canonical_Protein_Change      p.R1466W
3    1000                 gene1 Canonical_Protein_Change       p.*427Q
4    1001                 gene1 Canonical_Protein_Change      p.V320fs
5    1001                 gene3 Canonical_Protein_Change     p.S5383fs
6    1001                 gene4 Canonical_Protein_Change       p.D519V
7    1002                 gene1 Canonical_Protein_Change        p.S51A
8    1002                 gene2 Canonical_Protein_Change p.K183_splice

Then I recast it:

ff.cast <- dcast(ff.melt, ID.name ~ Canonical_Hugo_Symbol + value)

And I get this data frame:

ff.cast
  ID.name gene1_p.*427Q gene1_p.S51A gene1_p.V320fs gene1_p.Y1467H gene2_p.K183_splice gene3_p.R1466W gene3_p.S5383fs
1    1000       p.*427Q         <NA>           <NA>       p.Y1467H                <NA>       p.R1466W            <NA>
2    1001          <NA>         <NA>       p.V320fs           <NA>                <NA>           <NA>       p.S5383fs
3    1002          <NA>       p.S51A           <NA>           <NA>       p.K183_splice           <NA>            <NA>
  gene4_p.D519V
1          <NA>
2       p.D519V
3          <NA>

It is close to what I want, but now each gene has many columns with different names. For example, I want gene1_p.*427Q, gene1_p.S51A, gene1_p.V320fs and gene1_p.Y1467H all in one column.

I also tried:

dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value_var = "Canonical_Protein_Change" )

but I get this error message:

Error in .fun(.value[0], ...) : 2 arguments passed to 'length' which requires 1

This is the table I would like to get, or something close to it. Thanks!

  ID.name    gene1  gene2     gene3   gene4
1    1000  p.*427Q      0  p.R1466W       0
2    1001 p.V320fs      0 p.S5383fs p.D519V
3    1002   p.S51A p.K183         0       0

With reshape() I get closer, but the column names are wrong:

mydata.reshape <- reshape(mydata.df, direction = 'wide', idvar = 'ID.name', timevar = 'Canonical_Hugo_Symbol')

I fixed the column names:

colnames(mydata.reshape) <- sub("^Canonical_Protein_Change\\.", "", colnames(mydata.reshape))

But the NAs are still there.


Solution

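The argument is spelled value.var, not value_var. The misspelled argument is passed on to the aggregation function (length by default), which is what triggers the "2 arguments passed to 'length'" error. Because some ID.name/gene combinations have more than one value, you also need to tell dcast how to aggregate them. You may try this: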

# concatenate values in cells with more than one value  
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
      fun.aggregate = function(x) paste(x, collapse = "; "), fill = "0")

#   ID.name             gene1         gene2     gene3   gene4
# 1    1000 p.Y1467H; p.*427Q             0  p.R1466W       0
# 2    1001          p.V320fs             0 p.S5383fs p.D519V
# 3    1002            p.S51A p.K183_splice         0       0

# ...or pick the first value in cells with more than one value
dcast(mydata.df, ID.name ~ Canonical_Hugo_Symbol, value.var = "Canonical_Protein_Change",
      fun.aggregate = head, 1, fill = "0")
#   ID.name    gene1         gene2     gene3   gene4
# 1    1000 p.Y1467H             0  p.R1466W       0
# 2    1001 p.V320fs             0 p.S5383fs p.D519V
# 3    1002   p.S51A p.K183_splice         0       0
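
If reshape2 is not available, roughly the same result can be sketched in base R with tapply. This is only an illustration of the concatenating variant above, not a replacement for it; the object names wide and result are made up for the example, and stringsAsFactors = FALSE is an assumption added so that the columns are plain character vectors.

```r
# Same data as in the question; stringsAsFactors = FALSE keeps the columns as character
mydata.df <- data.frame(
  ID.name = c("1000", "1000", "1000", "1001", "1001", "1001", "1002", "1002"),
  Canonical_Protein_Change = c("p.Y1467H", "p.R1466W", "p.*427Q", "p.V320fs",
                               "p.S5383fs", "p.D519V", "p.S51A", "p.K183_splice"),
  Canonical_Hugo_Symbol = c("gene1", "gene3", "gene1", "gene1",
                            "gene3", "gene4", "gene1", "gene2"),
  stringsAsFactors = FALSE
)

# One cell per ID/gene pair; several values in the same cell are pasted together
wide <- with(mydata.df,
             tapply(Canonical_Protein_Change,
                    list(ID.name, Canonical_Hugo_Symbol),
                    paste, collapse = "; "))

# ID/gene combinations that never occur come back as NA; replace them with "0"
wide[is.na(wide)] <- "0"

# Turn the character matrix into a data frame with ID.name as a regular column
result <- data.frame(ID.name = rownames(wide), wide,
                     row.names = NULL, stringsAsFactors = FALSE)
result
```

The is.na() replacement in the middle is the same idea you could apply to the output of reshape() to get rid of the remaining NAs.
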
Licensed under: CC-BY-SA with attribution