Pregunta

NOTE: I'm not asking a Bioconductor-specific question, but I need Bioconductor in the example code. Bear with me please.

Hi,

I have a number of tab delimited files containing various types of information about specific genes. One or more of the columns can be Aliases to Gene Symbols that I need to upgrade to the latest Gene Symbol annotation.

I'm using Bioconductor's org.Hs.eg.db library to do so (the org.Hs.egALIAS2EG and org.Hs.egSYMBOL objects in particular).

The code reported does the job but is very slow, I guess because of the nested for loops that query the org.Hs.eg.db database at each iteration. Is there a quicker/simpler/smarter way to achieve the same result?

library(org.Hs.eg.db)

myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

for (i in 1:nrow(myTable)) {
    for (j in 1:ncol(myTable)) {
        repl <- org.Hs.egALIAS2EG[[myTable[i,j]]][1]
        if (!is.null(repl)) {
            repl <- org.Hs.egSYMBOL[[repl]][1]
            if (!is.null(repl)) {
                myTable[i,j] <- repl
            }
        }
    }
}

write.table(myTable, file="new_tab_delimited_file", quote=FALSE, sep="\t", row.names=FALSE, col.names=TRUE)

I'm thinking to use one of the apply function, but bear in mind that org.Hs.egALIAS2EG and org.Hs.egSYMBOL are objects, and not functions.

Thank you!

¿Fue útil?

Solución 4

This is the best I could come up with.

First write a function:

alias2GS <- function(x) {
    for (i in 1:length(x)) {
        if (!is.na(x[i])) {
            repl <- org.Hs.egALIAS2EG[[x[i]]][1]
            if (!is.null(repl)) {
                repl <- org.Hs.egSYMBOL[[repl]][1]
                if (!is.null(repl)) {
                    x[i] <- repl
                }
            }
        }
    }
    return(x)
}

And then call the function for each column of the data frame where the conversion is needed, like so:

df$GeneSymbols <- alias2GS(df$GeneSymbols)

Otros consejos

Use the mget function.

eg[i,] <- mget( myTable[i,],  org.Hs.egALIAS2EG )
symbol[i, ] <- mget( myTable[i,], org.Hs.egSYMBOL )

etc. This is the way it is supposed to be used and much faster than eny other alternative. However, maybe it would be worthwile to reshape myTable into a vector of gene names first:

v <- unique( as.vector( as.matrix( myTable ) ) )
v <- v[ v %in% ls( org.Hs.egALIAS2EG ) ]
eg <- unlist( mget( v, org.Hs.egALIAS2EG ) )
symbol <- unlist( mget( eg, org.Hs.egSYMBOL ) )

etc. The second line above makes sure that you are only looking up the symbols that actually are in the database. Now you can use the symbol table to modify the table again. Here is a way one can do it, assuming not all elements of myTable match. I'm copying the table to t for brevity:

t <- as.matrix( myTable )
names( symbol ) <- v
t[ !is.na( match( t, v ) ) ] <- symbol[ match( t, v ) ][ ! is.na( match( t, v ) ) ]

OK. That was assuming that we are working with a matrix (more or less) of characters. However, frankly, you only have a data frame with two columns, so no need really to automatize the code as if you had hundreds of columns. Let us write a little function. (it would be simpler if we could assume that all elements in your table can be found in org.Hs.egALIAS2EG)

convert2symbol <- function( x ) {
  v <- unique( as.character( x ) )
  v <- v[ v %in% ls( org.Hs.egALIAS2EG ) ]
  eg <- unlist( mget( v, org.Hs.egALIAS2EG ) )
  symbol <- unlist( mget( eg, org.Hs.egSYMBOL ) )
  m <- match( x, v )
  v[ ! is.na( m ) ] <- symbol[ m ][ ! is.na( m ) ]
  v
}

Now you can

myTable$LigandGene <- convert2symbol( myTable$LigandGene )

or

newTable <- apply( myTable, 2, convert2symbol )

As to why as.vector( data.frame ) doesn't work: data.frame is not a matrix. It is a list that is displayed in a fancy way and has the [] function defined on it.

You can use sapply and name several variables that are not the vector, such as your objects from the org.Hs.eg.db library:

library(org.Hs.eg.db)
myTable <- read.table("tab_delimited_file.txt", header=TRUE, sep="\t", as.is=TRUE)

myfunc <- function(idx,mytab,a2e,es){
            i = idx %/% nrow(mytab) + 1
            j = idx %% ncol(mytab) + 1
            repl <- a2e[[myTable[i,j]]][1];
            if (!is.null(repl)) {
              repl <- es[[repl]][1]
              if (!is.null(repl)) {
                return(repl)
              }
            }
            else {return("NA")}
          }

vec <- 0:(ncol(myTable)*nrow(myTable)-1)
out <- sapply(vec,mytab=myTable,a2e=org.Hs.egALIAS2EG,es=org.Hs.egSYMBOL,myfunc)
myTable <- matrix(out, nrow=nrow(myTable),ncol=ncol(myTable),byrow=T)

Just a quick warning: An alias can map to multiple Entrez Gene IDs.

So, your current solution assumes the first listed ID is correct (which may not be true).

# e.g. The alias "A1B" is assumed to map to "1" and not "6641"
mget("A1B", org.Hs.egALIAS2EG)
# $A1B
# [1] "1"    "6641"

If you check out the help for ?org.Hs.egALIAS2EG, you will see that it is never recommended to use Aliases or Symbols as primary gene identifiers.

## From the 'Details' section of the help:
# Since gene symbols are sometimes redundantly assigned in the literature, 
# users are cautioned that this map may produce multiple matching results 
# for a single gene symbol. Users should map back from the entrez gene IDs 
# produced to determine which result is the one they want when this happens.

# Because of this problem with redundant assigment of gene symbols, 
# is it never advisable to use gene symbols as primary identifiers.

Without manual curation, it is impossible to know which ID is "correct". Thus, your safest bet is to get all of the possible IDs and symbols for each alias in your table, while maintaining information about which ones were receptors and which were ligands:

# your example subset with "A1B" and "trash" added for complexity
myTable <- data.frame(
    ReceptorGene = c("A1B", "ACVR2B", "ACVR2B", "ACVR2B", "ACVR2B", "AMHR2", "BLR1", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A"),
    LigandGene = c("trash", "INHA", "INHBA", "INHBB", "INHBC", "AMH", "SCYB13", "BMP10", "BMP15", "BMP2", "BMP3", "BMP4"), 
    stringsAsFactors = FALSE
)

# unlist and rename
my.aliases <- unlist(myTable)
names(my.aliases) <- paste(names(my.aliases), my.aliases, sep = ".")

# determine which aliases have a corresponding Entrez Gene ID
has.key <- my.aliases %in% keys(org.Hs.egALIAS2EG)

# replace Aliases with character vectors of all possible entrez gene IDs 
my.aliases[has.key] <- sapply(my.aliases[has.key], function(x) {
    eg.ids <- unlist(mget(x, org.Hs.egALIAS2EG))
    symbols <- unlist(mget(eg.ids, org.Hs.egSYMBOL))
})

# my.aliases retains all pertinent information regarding the original alias
my.aliases[1:3]
# $ReceptorGene1.A1B
#       1    6641 
#  "A1BG" "SNTB1" 
# 
# $ReceptorGene2.ACVR2B
#       93 
# "ACVR2B" 
# 
# $ReceptorGene3.ACVR2B
#       93 
# "ACVR2B"

Once you know which Entrez Gene IDs are appropriate, you can store them as additional columns on your table.

myTable$receptor.id <- c("1", "93", "93", "93", "93", "269", "643", "657", "657", "657", "657", "657") 
myTable$ligand.id   <- c(NA, "3623", "3624", "3625", "3626", "268", "10563", "27302", "9210", "650", "651", "652")

Then, when you need to update to the latest symbol, you can just use the Entrez Gene IDs (which should never need to be updated).

has.key <- myTable$receptor.id %in% keys(org.Hs.egSYMBOL)
myTable$ReceptorGene[has.key] <- unlist(mget(myTable$receptor.id[has.key], org.Hs.egSYMBOL))

has.key <- myTable$ligand.id %in% keys(org.Hs.egSYMBOL)
myTable$LigandGene[has.key] <- unlist(mget(myTable$ligand.id[has.key], org.Hs.egSYMBOL))

head(myTable)
#   ReceptorGene LigandGene receptor.id ligand.id
# 1         A1BG      trash           1      <NA>
# 2       ACVR2B       INHA          93      3623
# 3       ACVR2B      INHBA          93      3624
# 4       ACVR2B      INHBB          93      3625
# 5       ACVR2B      INHBC          93      3626
# 6        AMHR2        AMH         269       268
Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top