Find transcription start sites with biomaRt

https://stackoverflow.com/questions/13012210

13-07-2021

Question

I am using biomaRt in R to query Ensembl's hsapiens database of human genes. I am using the function getBM to get every gene's name, start position and stop position, but I cannot find the right attribute for retrieving the TSS (transcription start site). Is that perhaps because it is considered the same as seqType = c("3utr", "5utr")?

Solution

A complete list of queryable attributes can be retrieved as a data frame using listAttributes. Then it's just a matter of searching it for the attributes you want.

mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)
grep("transcript", att$name, value=TRUE)

will get you a rather long list, beginning like this

 [1] "ensembl_transcript_id"                                        
 [2] "transcript_start"                                             
 [3] "transcript_end"                                               
 [4] "external_transcript_id"                                       
 [5] "transcript_db_name"                                           
 [6] "transcript_count"                                             
 [7] "transcript_biotype"                                           
 [8] "transcript_status"                                            
 [9] "clone_based_ensembl_transcript_name"                          
[10] "clone_based_vega_transcript_name"

Then you can go ahead and query using these names

getBM(attributes=c("transcript_start", "transcript_end"),
      filters="hgnc_symbol", values="foxp2", mart=mart)

and you get

   transcript_start transcript_end
1         113726382      114330960
2         113726494      114271639
3         113726615      114330155
4         113728221      114066565
5         113728221      114271650
6         114054329      114330218
7         114055052      114139783
8         114055052      114333827
9         114055110      114330155
10        114055113      114330200
11        114055275      114269037
12        114055374      114285885
13        114055378      114330012
14        114066555      114294198
15        114066557      114271754
16        114066557      114282629
17        114066570      114294198
18        114055052      114333823
19        114268613      114329981
20        113726615      114310038

If you want all the transcripts of all genes, remove the filters and values arguments, but be aware that you will get a lot of data coming your way.
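
As a rough sketch of what the unfiltered query might look like: the helper name fetch_all_transcripts is made up for this illustration and is not part of biomaRt, and the choice of attributes (including strand, which is needed later to tell which end of a transcript is the TSS) is an assumption about what is useful here.

```r
# Hypothetical helper (not part of biomaRt): fetch the coordinates of
# every transcript in the dataset by omitting `filters` and `values`.
fetch_all_transcripts <- function(mart) {
  biomaRt::getBM(
    attributes = c("ensembl_gene_id", "ensembl_transcript_id",
                   "chromosome_name", "strand",
                   "transcript_start", "transcript_end"),
    mart = mart
  )
}

# Usage (needs network access; the download is large):
# mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# all_tx <- fetch_all_transcripts(mart)
```

The function is only defined here, not called, so nothing is downloaded until you run the commented usage lines yourself.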

Other tips

There is now a specific attribute for the transcription start site that can be downloaded: transcription_start_site.

library("biomaRt")
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
attributes = listAttributes(ensembl, page = "structure")
attributes[grep("transcript", attributes$description, ignore.case = TRUE), ]
#                         name                                description
# 178    ensembl_transcript_id                      Ensembl Transcript ID
# 183         transcript_start                      Transcript Start (bp)
# 184           transcript_end                        Transcript End (bp)
# 185 transcription_start_site             Transcription Start Site (TSS)
# 186        transcript_length Transcript length (including UTRs and CDS)
# 195         transcript_count                           Transcript count
# 201                     rank                    Exon Rank in Transcript

As an example, here is the result for the gene BTC. Note that because it is on the reverse strand (strand == -1), the value for transcription_start_site is the same as the value for transcript_end. Basically, downloading transcription_start_site is a shortcut so that you don't have to determine which end of the transcript is the TSS based on which strand the gene is on.

tss <- getBM(attributes = c("transcription_start_site", "chromosome_name",
                            "transcript_start", "transcript_end",
                            "strand",  "ensembl_gene_id",
                            "ensembl_transcript_id", "external_gene_name"),
             filters = "external_gene_name", values = "BTC",
             mart = ensembl)
tss
# transcription_start_site chromosome_name transcript_start transcript_end strand
# 1                 75635873     HG706_PATCH         75612096       75635873     -1
# 2                 75660403     HG706_PATCH         75610476       75660403     -1
# 3                 75719896               4         75669969       75719896     -1
# 4                 75695366               4         75671589       75695366     -1
# ensembl_gene_id ensembl_transcript_id external_gene_name
# 1 ENSG00000261530       ENST00000567516                BTC
# 2 ENSG00000261530       ENST00000566356                BTC
# 3 ENSG00000174808       ENST00000395743                BTC
# 4 ENSG00000174808       ENST00000512743                BTC
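
If transcription_start_site is not available on the server or dataset version you are using, the same value can presumably be derived from the strand, as described above. A minimal base-R sketch, using the BTC rows from the output above typed in by hand (the column name tss_derived is made up for this illustration):

```r
# Hand-copied subset of the BTC query result shown above.
btc <- data.frame(
  ensembl_transcript_id    = c("ENST00000567516", "ENST00000566356",
                               "ENST00000395743", "ENST00000512743"),
  transcript_start         = c(75612096, 75610476, 75669969, 75671589),
  transcript_end           = c(75635873, 75660403, 75719896, 75695366),
  strand                   = c(-1, -1, -1, -1),
  transcription_start_site = c(75635873, 75660403, 75719896, 75695366)
)

# Forward strand (1): the TSS is the lower coordinate (transcript_start).
# Reverse strand (-1): the TSS is the higher coordinate (transcript_end).
btc$tss_derived <- ifelse(btc$strand == 1,
                          btc$transcript_start,
                          btc$transcript_end)

# Check that the derived column agrees with the attribute from BioMart.
all(btc$tss_derived == btc$transcription_start_site)
```

The same ifelse line should work on a full getBM result, provided strand, transcript_start and transcript_end were requested.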

I believe the "transcript_start" and "transcript_end" are the translation start and stop site, but not necessarily the TSS (transcription start site).

Looking at the "start_position" and "end_position" attributes, these seem to be the TSS (start_position for the + strand and end_position for the - strand): start_position is always the smallest transcript_start among a gene's transcripts on the + strand, and end_position is always the largest transcript_end among them on the - strand.
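
To make the "smallest start / largest end" idea concrete, here is a small base-R sketch that picks the most upstream TSS per gene from per-transcript coordinates. It reuses the BTC rows shown earlier on this page, typed in by hand; the name outer_tss is made up for this illustration. It does not check whether the result equals start_position or end_position for every gene, which remains an observation rather than something verified here.

```r
# Hand-copied rows from the BTC output earlier on this page.
tx <- data.frame(
  ensembl_gene_id  = c("ENSG00000261530", "ENSG00000261530",
                       "ENSG00000174808", "ENSG00000174808"),
  transcript_start = c(75612096, 75610476, 75669969, 75671589),
  transcript_end   = c(75635873, 75660403, 75719896, 75695366),
  strand           = c(-1, -1, -1, -1)
)

# Most upstream TSS per gene: the smallest transcript_start on the
# forward strand, the largest transcript_end on the reverse strand.
outer_tss <- sapply(split(tx, tx$ensembl_gene_id), function(g) {
  if (g$strand[1] == 1) min(g$transcript_start) else max(g$transcript_end)
})
outer_tss
```

Note that this yields a single position per gene, while each individual transcript still has its own TSS.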

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow