mapping mutation to a chromosome location with mapreduce/PIG or Disco

https://stackoverflow.com/questions/16903711

30-05-2022
|

Frage

Goal: To map mutation location from file1 to a region or feature from file two. For this you need to make sure that chromosome (chr1) and strands (+/-) are the same before comparing chromosome location from file 1 to regions of file2.

Question: How to use mapreduce or Disco to map one location to a region.. . Aka formulate the location -> chromosomal region in a mapreduce method?

Description: I have two medium sized files (10gb) and two file types that I wanted to process. I already have these files parsed in basic python but I will likely have to parse many larger similar files in the future so I wanted to try it with mapreduce (hadoop/Pig to be more specific)or Disco to learn .

While I can run the nodes on an EC2 cluster ideally a one cluster hadoop (yes I know it defeats the purpose) or on something like Disco or Sparc.

I like the idea of using Pig because that would reduce the process to just processing the file from .csv files but I have no idea for how to use mapreduce for mapping something to a region instead of just a key/value pair

Here is a visual representation of what I was thinking of:

File info:

First file is TCGA cancer SNP mutations. Some important features include
- Chromosome location
- Chromosome number
- strand
- sample id
- the rest is not so important
3' UTR sequence.
- Chromosome start location: int
- Chromosome end location: int
- Chromosome number: chrX
- strand +/-
- gene id
- the rest is not so important

sample files are here:two sample files

Finally python is my language of choice for this if it matters..

Keine korrekte Lösung

Lizenziert unter: CC-BY-SA mit Zuschreibung

Nicht verbunden mit StackOverflow