Question

Goal: map each mutation location from file 1 to a region (feature) in file 2. To do this, the chromosome (e.g. chr1) and strand (+/-) must match before the mutation position from file 1 is compared against the regions in file 2.

Question: how can I use MapReduce or Disco to map a location to a region? In other words, how do I formulate the location -> chromosomal region lookup as a MapReduce job?

Description: I have two medium-sized files (10 GB) of two different file types that I want to process. I have already parsed these files in basic Python, but I will likely have to parse many larger, similar files in the future, so I wanted to try it with MapReduce (Hadoop/Pig, to be more specific) or Disco as a learning exercise.

I can run the nodes on an EC2 cluster, but ideally I would use a single-node Hadoop cluster (yes, I know that defeats the purpose) or something like Disco or Spark.

I like the idea of using Pig because it would reduce the job to processing the .csv files directly, but I have no idea how to use MapReduce to map something to a region rather than to a simple key/value pair.
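One common way to turn a range lookup into a key/value problem is to bin the genome into fixed-size windows, so that the key becomes (chromosome, strand, bin). The following is only a minimal sketch in plain Python that simulates the map, shuffle, and reduce steps locally; it does not use the Hadoop, Pig, or Disco APIs. The bin size, the tuple layouts, and all function names are assumptions made for illustration, and intervals are assumed to be inclusive at both ends.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # assumed window size; tune to typical region length


def map_mutation(chrom, strand, pos, sample_id):
    # A mutation falls into exactly one bin.
    yield (chrom, strand, pos // BIN_SIZE), ("mut", pos, sample_id)


def map_region(chrom, strand, start, end, gene_id):
    # A region is emitted once for every bin it overlaps.
    for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
        yield (chrom, strand, b), ("utr", start, end, gene_id)


def reduce_bin(key, values):
    # All records sharing (chrom, strand, bin) arrive together,
    # so only the position check is left to do here.
    values = list(values)
    regions = [v[1:] for v in values if v[0] == "utr"]
    for v in values:
        if v[0] == "mut":
            _, pos, sample_id = v
            for start, end, gene_id in regions:
                if start <= pos <= end:
                    yield sample_id, pos, gene_id


def run_local(mutations, regions):
    # Stand-in for the framework's shuffle: group mapper output by key.
    groups = defaultdict(list)
    for m in mutations:
        for k, v in map_mutation(*m):
            groups[k].append(v)
    for r in regions:
        for k, v in map_region(*r):
            groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reduce_bin(k, groups[k]))
    return out
```

Because each mutation is emitted to a single bin, a region that spans several bins is still matched at most once per mutation. With a real framework, the two map functions would read lines from the two input files and the grouping would be done by the shuffle phase instead of `run_local`.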

Here is a visual representation of what I was thinking of.

File info:

  1. The first file contains TCGA cancer SNP mutations. Some important fields include:

    • Chromosome location
    • Chromosome number
    • strand
    • sample id
    • the rest is not so important
  2. The second file contains 3' UTR sequences. Some important fields include:

    • Chromosome start location: int
    • Chromosome end location: int
    • Chromosome number: chrX
    • strand +/-
    • gene id
    • the rest is not so important
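Using the fields listed above, the matching rule from the goal (same chromosome, same strand, position inside the start/end range) can be sketched in plain Python as an in-memory baseline. This is only an illustration: the tuple layout, the function names, and the assumption that start and end are inclusive are mine, not taken from the actual files.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group regions by (chromosome, strand) and sort each group by start."""
    index = defaultdict(list)
    for chrom, strand, start, end, gene_id in regions:
        index[(chrom, strand)].append((start, end, gene_id))
    for group in index.values():
        group.sort()
    return index


def find_regions(index, chrom, strand, pos):
    """Return gene ids of regions on the same chromosome and strand containing pos."""
    group = index.get((chrom, strand), [])
    # Binary search for the first region whose start is greater than pos;
    # only regions before that point can contain pos.
    hi = bisect_right(group, (pos, float("inf"), ""))
    # Linear scan over the candidates; fine for a sketch, but an interval
    # tree would be a better fit for heavily overlapping regions.
    return [gene_id for start, end, gene_id in group[:hi] if start <= pos <= end]
```

For example, after `index = build_index(regions)`, calling `find_regions(index, "chr1", "+", 160)` returns the gene ids of every chr1 "+" region whose range includes position 160.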

Sample files are here: two sample files

Finally, Python is my language of choice for this, if it matters.

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow