Question

I have a FASTQ quality string, which is presented as a series of ASCII characters. In this case, ASCII characters 64 to 126 (likely) represent scores of 0 to 62, presuming the data is Illumina. The quality string in question is:

feffefdfbefdfffcfdeTddaYddffbfcI``S_KKX_]]MR[D_TY[VTVXQ]`Q_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

How do I extract the numeric ASCII code of each character?

Thank you, San

EDIT: This string denotes the quality of a biological sequence that is made up of bases (the A, T, G and C of nucleic acids), with one quality character per base. A base quality is the Phred-scaled base error probability, which equals -10 log10 Pr{base is wrong}.


Solution

Well, as Marek said, you might find a function to convert Illumina quality scores in Bioconductor. You can also ask at biostar.stackexchange.com.

Using base functions, you can use charToRaw():

> x <- "feeffdbefc`\\KKX]_BBBB"
> charToRaw(x)
 [1] 66 65 65 66 66 64 62 65 66 63 60 5c 4b 4b 58 5d 5f 42 42 42 42
> as.numeric(charToRaw(x))
 [1] 102 101 101 102 102 100  98 101 102  99  96  92  75  75  88  93  95  66  66  66  66
> as.character(charToRaw(x))
 [1] "66" "65" "65" "66" "66" "64" "62" "65" "66" "63" "60" "5c" "4b" "4b" "58" "5d" "5f" "42" "42" "42" "42"

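Mind you, in a string literal typed into R you have to escape the backslash (as in the "\\" above), or you'll get into trouble. Whether that matters depends on how you read in your data: text read from a file, for example with readLines(), needs no escaping.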
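To get from the ASCII codes to the actual quality scores, you subtract the encoding offset. Here is a minimal sketch, assuming the offset of 64 mentioned in the question (older Illumina output; Sanger-style files use 33 instead). The helper names qual_to_phred() and phred_to_prob() are made up for illustration and do not come from any package:

```r
# Hypothetical helper: convert a FASTQ quality string to Phred scores.
# The offset is an assumption: 64 as stated in the question, 33 for Sanger-style encoding.
qual_to_phred <- function(qual, offset = 64L) {
  as.integer(charToRaw(qual)) - offset
}

# Hypothetical helper: error probability implied by a Phred score Q,
# obtained by inverting Q = -10 * log10(P).
phred_to_prob <- function(q) {
  10^(-q / 10)
}

x <- "feeffdbefc`\\KKX]_BBBB"

scores <- qual_to_phred(x)
scores
# 38 37 37 38 38 36 34 37 38 35 32 28 11 11 24 29 31 2 2 2 2

probs <- phred_to_prob(scores)
```

If the offset is wrong for your data you will typically see negative or implausibly large scores, so it is worth checking the range of the result before using it.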

Licensed under: CC-BY-SA with attribution