Frage

The 1000 genome project provides us information about "variation" of thousands people's DNA sequence against the human reference DNA sequence. The variation is stored in VCF file
format. Basically, for each person in that project, we can get his/her DNA variation information from the VCF file, for example, the type of variation (e.g Insertion/deletion and SNP ) and the position of the variation relative to the reference. The reference is in FASTA format. By combining variation information of one person from the VCF file and the human reference in FASTA file, I want to construct the DNA sequence for that person.

My question is: Does it already exist some tools can perform the task pretty well,or I have to write the scripts by myself.

War es hilfreich?

Lösung

The perl script vcf-consensus from VCFtools seems close to what you are looking for:

vcf-consensus  
Apply VCF variants to a fasta file to create consensus sequence.

Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa
Options:
   -h, -?, --help         This help message.
   -H, --haplotype <int>  Apply only variants for the given haplotype (1,2)
   -s, --sample <name>    If not given, all variants are applied
Examples:
   samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa


The answers to the question New fasta sequence from reference fasta and variant calls file? posted on Biostar might also help.

Andere Tipps

You can use bcftools (https://github.com/samtools/bcftools) to perform this task:

bcftools consensus <file.vcf> \
  --fasta-ref <file> \
  --iupac-codes \
  --output <file> \
  --sample <name>

To install bcftools:

git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
sudo cp bcftools/bcftools /usr/local/bin/

You can also combine bcftools consensus with samtools faidx (http://www.htslib.org/) to extract specific intervals from the fasta file. See bcftools consensus for more information:

About:   Create consensus sequence by applying VCF variants to a reference
         fasta file.
Usage:   bcftools consensus [OPTIONS] <file.vcf>
Options:
    -f, --fasta-ref <file>     reference sequence in fasta format
    -H, --haplotype <1|2>      apply variants for the given haplotype
    -i, --iupac-codes          output variants in the form of IUPAC ambiguity codes
    -m, --mask <file>          replace regions with N
    -o, --output <file>        write output to a file [standard output]
    -c, --chain <file>         write a chain file for liftover
    -s, --sample <name>        apply variants of the given sample
Examples:
   # Get the consensus for one region. The fasta header lines are then expected
   # in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

Anyone still coming to this page, if you have a fasta reference genome and a bam file that you want to turn into the reference file by changing SNP's and N's, you may try this one-liner using samtools, bcftools and vcfutils.pl (ps for beginners: both samtools and bcftools can be compiled in a computing cluster or in Linux, if so just add the locations of each before the software name; vcfutils is already a perl script from bcftools)

samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq

d, --max-depth == -q, -min-MQ Minimum mapping quality for an alignment to be used == -Q, --min-BQ Minimum base quality for a base to be considered == (You can use different values of course, see http://www.htslib.org/doc/samtools.html)

Which generates a weird format that looks like fastq but isn't, so you can't convert it using a converter, but you can use the following sed command, which I wrote specific for this output:

sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq

In the end, make sure to compare your new fasta files to your reference to be sure that everything is fine.

EDIT BE CAREFUL WITH THE SED COMMAND IT MAY DELETE SOME OF YOUR READS IN DIFFERENT CASES OF QUALITY SCORING THAN I HAD

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top