mapping mutation to a chromosome location with mapreduce/PIG or Disco
-
30-05-2022 - |
Frage
Goal: To map mutation location from file1 to a region or feature from file two. For this you need to make sure that chromosome (chr1) and strands (+/-) are the same before comparing chromosome location from file 1 to regions of file2.
Question: How to use mapreduce or Disco to map one location to a region.. . Aka formulate the location -> chromosomal region in a mapreduce method?
Description: I have two medium sized files (10gb) and two file types that I wanted to process. I already have these files parsed in basic python but I will likely have to parse many larger similar files in the future so I wanted to try it with mapreduce (hadoop/Pig to be more specific)or Disco to learn .
While I can run the nodes on an EC2 cluster ideally a one cluster hadoop (yes I know it defeats the purpose) or on something like Disco or Sparc.
I like the idea of using Pig because that would reduce the process to just processing the file from .csv files but I have no idea for how to use mapreduce for mapping something to a region instead of just a key/value pair
Here is a visual representation of what I was thinking of:
File info:
First file is TCGA cancer SNP mutations. Some important features include
- Chromosome location
- Chromosome number
- strand
- sample id
- the rest is not so important
3' UTR sequence.
- Chromosome start location: int
- Chromosome end location: int
- Chromosome number: chrX
- strand +/-
- gene id
- the rest is not so important
sample files are here:two sample files
Finally python is my language of choice for this if it matters..
Keine korrekte Lösung