Question

When I convert a VCF file to PED format (with vcftools or with the 1000 Genomes VCF to PED converter), variants that don't have a dbSNP ID get their base-pair position as the ID. An example with a few variants:

1   rs35819278  0   23333187
1   23348003    0   23348003
1   23381893    0   23381893
1   rs18325622  0   23402111
1   rs23333532  0   23408301
1   rs55531117  0   23810772
1   23910834    0   23910834

However, I would like the variants without a dbSNP ID to get an ID in the format "chr:basepairposition". The example above would then look like:

1   rs35819278  0   23333187
1   chr1:23348003   0   23348003
1   chr1:23381893   0   23381893
1   rs18325622  0   23402111
1   rs23333532  0   23408301
1   rs55531117  0   23810772
1   chr1:23910834   0   23910834

It would be great if anyone could explain which command or script I can use to change this second column for the variants without a dbSNP ID.

Thanks!

Solution

This can be done with sed. Since tabs are involved, the exact syntax may vary a bit depending on which sed is installed on your system; the following should work with GNU sed on Linux:

sed 's/^\([0-9]*\)\t\([0-9]\)/\1\tchr\1:\2/' [.map filename] > [new filename]

This looks for lines starting with [number][tab][digit], and makes them start with [number][tab]chr[number]:[digit] instead, while leaving other lines unchanged.

OS X is a bit more painful: the sed that ships with it may not interpret \t, so you'll need to type a literal tab (Ctrl-V followed by Tab) or use [[:blank:]] to deal with the tab.
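
If the tab handling in sed is a nuisance, the same substitution can be sketched with awk, which should behave the same on Linux and OS X. This is only a sketch: the file names example.map and example.fixed.map are made up for illustration, and it assumes the .map file is tab-delimited with the variant ID in the second column.

```shell
# Build a small tab-delimited sample in .map layout (chr, ID, cM, bp).
# example.map is a hypothetical file name; use your own .map file instead.
printf '1\trs35819278\t0\t23333187\n1\t23348003\t0\t23348003\n' > example.map

# If column 2 consists only of digits, rewrite it as chr<column 1>:<column 2>.
# Lines whose ID starts with "rs" do not match and are printed unchanged.
awk 'BEGIN { FS = OFS = "\t" } $2 ~ /^[0-9]+$/ { $2 = "chr" $1 ":" $2 } 1' example.map > example.fixed.map

cat example.fixed.map
```

Unlike the sed pattern, this checks the whole second field rather than only its first character, so an ID that merely begins with a digit would be left alone.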

Other tips

This can also be done with plink2, using the --set-missing-var-ids option (https://www.cog-genomics.org/plink2/data#set_missing_var_ids):

plink --vcf [filename] \
    --keep-allele-order \
    --vcf-idspace-to _ \
    --double-id \
    --allow-extra-chr 0 \
    --split-x b37 no-fail \
    --set-missing-var-ids chr@:# \
    --make-bed \
    --out [prefix]

However, note that this method can assign the same ID to multiple variants (any two unnamed variants at the same position end up with the same chr:position ID), and plink2 will not tolerate variants with the same ID. To learn more about converting VCF files to plink, the following resource has further insights: http://apol1.blogspot.com/2014/11/best-practice-for-converting-vcf-files.html
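
One way to check whether this happened is to look for repeated IDs in the .bim file that --make-bed writes next to the .bed file. A minimal sketch, assuming a tab-delimited .bim with the variant ID in the second column; example.bim and duplicate_ids.txt are made-up names for illustration:

```shell
# Hypothetical sample in .bim layout: chr, ID, cM, bp, allele 1, allele 2.
# In practice, point cut at [prefix].bim instead.
printf '1\tchr1:23348003\t0\t23348003\tA\tG\n1\tchr1:23348003\t0\t23348003\tA\tT\n1\trs18325622\t0\t23402111\tC\tT\n' > example.bim

# List every variant ID (column 2) that occurs more than once.
cut -f2 example.bim | sort | uniq -d > duplicate_ids.txt

cat duplicate_ids.txt
```

An empty duplicate_ids.txt means every variant ID is unique.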

Licensed under: CC-BY-SA with attribution