Question

I would like to filter a large tab separated file which contains coordinates based on another file, and am hoping to do it in a functional style. I would like to know an efficient/functional way to do the following task in F#.

Take one file with ranges (essentially a list of [group, start, end], and another file with specific coordinates (essentially [group, position]). Filter the second file by outputting only sites where the [group, position] is inside the intervals of the firsts file (e.g. if the first file contained [A,1,20] then [A,5] would be output, while [A,25] would not).

The file to be filtered is to large to hold in memory, does anyone know a snazzy functional way to do this?

Was it helpful?

Solution

This really depends on the scale of the problem, i.e. how many things do you expect to have in both of the groups. However, as a starting point, you could use Seq.groupBy to get a list of ranges for each group and then use Seq.exist to check whether a position is in any of the ranges.

For example, given the following inputs:

let ranges = [ (1, 0, 5); (1, 10, 15 ) ]
let coords = [ (1, 3); (1, 7); (1, 12) ]

You can create a dictionary that gives you a list of ranges for each group:

let byGroup =
  ranges 
  |> Seq.groupBy (fun (g, _, _) -> g) 
  |> Seq.map (fun (k, v) -> k, Seq.map (fun (_, s, e) -> (s, e)) v)
  |> dict

And then you can filter coordinates, and for each position, scan the appropriate ranges to see if the position belongs to any of them:

coords
|> Seq.filter (fun (g, pos) ->
  byGroup.[g] |> Seq.exists (fun (l, h) -> pos >= l && pos <= h))

This is probably not going to work if the number of ranges is very high, but it should be good enough for smaller examples. For bigger data, you'll probably want to create some sort of binary tree-like structure (where you can search for ranges), but this is more complex question, which is more about algorithm than about F#.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top