R pseudo-merge of rows of factors that fall within a certain numeric region

https://stackoverflow.com/questions/23252424

08-07-2023

Question

I have a complicated merge-like problem that you can hopefully shed some light on.

I have two data frames. The first contains wavelength regions denoted by number (1, 2, 3, etc.), with subregions found within those regions denoted by color (RED, BLUE, etc.). It also holds each subregion's wavelength midpoint (mid), start position (start) and end position (end).

>df1
sub_region  region  mid     start   end
RED         1       15      10      20
GREEN       3       3       1       5
BLUE        2       310     300     320
(etc... ~50,000 rows total)

The second contains descriptions of those colors (VERY, SLIGHTLY, etc.), a catalogue reference ID (GFHHTSTGGSH, GFDDDRDRDD, etc.), a region (1, 2, 3, etc.) that links each row to df1, and their own precise wavelength start and end positions, some of which fall within those of df1.

>df2
region  start   end     colorDescrip    refID
2       312     318     VERY            GFHHTSTGGSH
1       55      76      SLIGHTLY        GFDDDRDRDD
(etc... ~500,000 rows total)

I want to create a data frame (df3) in which the regions of df1 and df2 (1,2,3, etc) match AND in the matching region rows, the color description's (colorDescrip) start and end wavelength from df2 fall WITHIN the start and end wavelengths of df1 (such as row 1 of df2 with row 3 of df1). The resulting df3 needs to have only three columns: "sub_region", "colorDescrip" and "refID".

Here is what an example would look like. The only pair in the examples given that meets both criteria is row 1 of df2 matched with row 3 of df1:

>df3

sub_region    colorDescrip    refID
BLUE          VERY            GFHHTSTGGSH

Again, the regions match (both are region 2) and the start/end of "VERY" (312, 318) fall within the start/end wavelengths of "BLUE" (300, 320).

I am having a very hard time writing a script in R that can accomplish this task. Any help is very much appreciated.

Thank you in advance.

Solution

I believe this can be accomplished with a combination of two rolling joins, a feature of data.table.

Convert both data sets to data.tables and key them on region and start (the lower bound). With a forward rolling join, each row of df2 is matched, within its region, to the df1 row with the largest start that does not exceed its own start.

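library(data.table)
df1 <- data.table(df1, key=c("region", "start"))
df2 <- data.table(df2, key=c("region", "start"))
## roll=TRUE: last df1 start <= df2 start; nomatch=0 drops df2 rows with no match
df.start <- df1[df2, roll=TRUE, nomatch=0, allow.cartesian=TRUE]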

We do the same for the end positions, but reverse the rolling direction so that each row of df2 is matched to the df1 row with the smallest end that is not below its own end (the next upper bound).

setkey(df1, region, end)   ## reset the keys
setkey(df2, region, end)
df.end <- df1[df2, roll=-Inf, nomatch=0, allow.cartesian=TRUE]   ## nomatch=0 drops df2 rows with no match
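To make the two rolling directions concrete, here is a minimal sketch on toy data. The tables `x` and `q` and their columns `pos` and `label` are hypothetical and not part of the question; `on="pos"` is used instead of keys only to keep the example short.

```r
library(data.table)

# Hypothetical lookup table: two positions, each with a label.
x <- data.table(pos = c(10, 300), label = c("A", "B"))
# Hypothetical query positions to look up in x.
q <- data.table(pos = c(5, 15, 310))

# roll = TRUE rolls the previous observation forward:
# each query gets the row of x with the largest pos <= the query pos.
fwd <- x[q, on = "pos", roll = TRUE]

# roll = -Inf rolls the next observation backward:
# each query gets the row of x with the smallest pos >= the query pos.
bwd <- x[q, on = "pos", roll = -Inf]

print(fwd)  # label is NA for pos 5, because nothing in x lies at or below it
print(bwd)  # label is NA for pos 310, because nothing in x lies at or above it
```

Queries with nothing to roll from come back with NA, which is why it can be useful to pass nomatch=0 when only matched rows are wanted.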

The solution you want is the intersection of the two results: the pairs for which the same df1 row was matched on both start and end. In database terms, this is an inner join. We first need to set the keys so that they identify each combination uniquely.

setkey(df.start, sub_region, refID)
setkey(df.end, sub_region, refID)
df.start[df.end, list(sub_region, colorDescrip, refID), nomatch=0]   ## keep only the three requested columns

The last line returns the result you want, and you can save it as df3. The syntax can appear a bit cryptic if you have never seen it before, but data.table is worth looking into.
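As a check, the steps above can be run end to end on the sample rows from the question. This is only a sketch: the two tables are typed in by hand from the question, SLIGHTLY is spelled out in full, and nomatch=0 is added to the rolling joins on the assumption that colors with no candidate subregion should be dropped early.

```r
library(data.table)

# Sample rows copied by hand from the question.
df1 <- data.table(
  sub_region = c("RED", "GREEN", "BLUE"),
  region     = c(1, 3, 2),
  mid        = c(15, 3, 310),
  start      = c(10, 1, 300),
  end        = c(20, 5, 320)
)
df2 <- data.table(
  region       = c(2, 1),
  start        = c(312, 55),
  end          = c(318, 76),
  colorDescrip = c("VERY", "SLIGHTLY"),
  refID        = c("GFHHTSTGGSH", "GFDDDRDRDD")
)

# Step 1: roll forward on start (last df1 start <= df2 start, per region).
setkey(df1, region, start)
setkey(df2, region, start)
df.start <- df1[df2, roll = TRUE, nomatch = 0, allow.cartesian = TRUE]

# Step 2: roll backward on end (next df1 end >= df2 end, per region).
setkey(df1, region, end)
setkey(df2, region, end)
df.end <- df1[df2, roll = -Inf, nomatch = 0, allow.cartesian = TRUE]

# Step 3: keep the pairs that appear in both results.
setkey(df.start, sub_region, refID)
setkey(df.end, sub_region, refID)
df3 <- df.start[df.end, list(sub_region, colorDescrip, refID), nomatch = 0]
print(df3)
```

On these rows the SLIGHTLY entry survives step 1 (RED starts at 10, below 55) but not step 2 (no region 1 subregion ends at or above 76), so only the BLUE / VERY pair remains in df3.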

Edit: Noticed the part about region matching and updated the code to reflect that.

OTHER TIPS

Here's an attempt:

subset(merge(df1,df2,by="region"),
    start.y>=start.x & end.y<=end.x,
    select=c("sub_region","colorDescrip","refID"))
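One caveat with the merge-then-subset approach is that merging on region alone first builds every df1/df2 pairing within a region, which may be large at ~50,000 by ~500,000 rows. A possible alternative, sketched here on the question's sample rows, is a data.table non-equi join, which applies the interval conditions while joining. It assumes data.table 1.9.8 or later and treats the interval bounds as inclusive; both are assumptions, not something stated in the question.

```r
library(data.table)

# Sample rows copied by hand from the question.
df1 <- data.table(
  sub_region = c("RED", "GREEN", "BLUE"),
  region     = c(1, 3, 2),
  mid        = c(15, 3, 310),
  start      = c(10, 1, 300),
  end        = c(20, 5, 320)
)
df2 <- data.table(
  region       = c(2, 1),
  start        = c(312, 55),
  end          = c(318, 76),
  colorDescrip = c("VERY", "SLIGHTLY"),
  refID        = c("GFHHTSTGGSH", "GFDDDRDRDD")
)

# In on=, the left side of each condition is a df2 column and the right
# side is a df1 column: same region, df2 start >= df1 start, and
# df2 end <= df1 end. nomatch = 0 drops subregions with no color inside.
df3 <- df2[df1,
           .(sub_region, colorDescrip, refID),
           on = .(region, start >= start, end <= end),
           nomatch = 0]
print(df3)
```

Because every df1 row is tested against every df2 row of the same region, this form should also return all matches when subregions in df1 overlap one another.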

Licensed under: CC-BY-SA with attribution
