identify a column by its name for a specified value for each row, r

https://stackoverflow.com/questions/22136027

19-10-2022

Question

G'day everyone,

I have a data.frame with many columns of data; however, for each row I am only interested in a subset of these columns. I would like to use another specific value to identify the column I am interested in. I will then take the mean of the column of interest and the 5 previous columns.

My data.frame includes point location, month of collection and values extracted from a set of monthly rasters over 1996-2012 for each point. For each point I am interested in a six month average prior to the collection date, e.g. if I recorded a variable in 200106 (06/2001) I want the average of the rasters from 200101-200106.

Date of collection values are coded the same as the names of the columns that hold the values extracted for the same month.

Is there a way to identify the column I am interested in given the collection date I have?

My data.frame looks like:

    df <- data.frame(lat = c(-34, -34.5, -35, -35.5, -36, -36.5, -37),
                     lon = c(144, 144.5, 145, 145.5, 146, 146.5, 147),
                     dt = c('x200106', 'x200107', 'x200108', 'x200109', 'x200110', 'x200111', 'x200112'),
                     x200101 = c(1, 2, 3, 4, 5, 6, 7),
                     x200102 = c(10, 20, 30, 40, 50, 60, 70),
                     x200103 = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
                     x200104 = c(11, 12, 13, 14, 15, 16, 17),
                     x200105 = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5),
                     x200106 = c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7),
                     x200107 = c(21, 22, 23, 24, 25, 26, 27),
                     x200108 = c(10, 20, 30, 40, 50, 60, 70),
                     x200109 = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
                     x200110 = c(11, 12, 13, 14, 15, 16, 17),
                     x200111 = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5),
                     x200112 = c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7))

Given dt (date) can I get a six month average of the corresponding columns?

I have no idea how to proceed. I imagine a data transformation of some kind is needed, but I don't know where to begin. Any help would be greatly appreciated. Thanks lots!

Cheers, Adam

Solution

The main thing you want to do is reshape your data so that it is in long format, and convert the dates so that you can perform arithmetic on them. This is what we do here:

    library(reshape2)
    # Long format: one row per (point, month) pair
    df.mlt <- melt(df, id.vars=c("lat", "lon", "dt"))
    # Parse both the collection month and the old column names as first-of-month dates
    df.mlt[c("dt", "variable")] <- lapply(df.mlt[c("dt", "variable")], function(x) as.Date(paste0(x, "01"), format="x%Y%m%d"))
    library(data.table)
    # Keep the collection month and the five months before it, then average per point
    data.table(df.mlt)[(dt - variable) %between% c(0, 160), mean(value), by=list(lat, lon, dt)]

Look at df.mlt to see what I mean by long format (basically, the columns become rows). The second command converts the two columns dt and variable (variable holds the names of what used to be columns before the melt) into Date format. Finally, I use data.table to select the appropriate rows and to compute statistics on row groups (you could also use dplyr or other "split/apply/combine" style techniques). The date difference must be between 0 and 160 days, which I take as a proxy for "the collection month and the five before it": with monthly data, first-of-month dates five months apart differ by at most 153 days and dates six months apart by at least 181 days, so a cutoff in between should be safe. This produces:

     lat   lon         dt        V1
1: -34.0 144.0 2001-06-01  6.016667
2: -34.5 144.5 2001-07-01 11.700000
3: -35.0 145.0 2001-08-01 14.050000
4: -35.5 145.5 2001-09-01 16.400000
5: -36.0 146.0 2001-10-01 18.750000
6: -36.5 146.5 2001-11-01 21.100000
7: -37.0 147.0 2001-12-01 23.450000

update: apparently I can't count: these are averages of six months, as per your question.
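
If you would rather not reshape at all, the same window can also be picked by column position in base R. This is only a sketch and not what the answer above does: `window_mean` is a made-up helper name, and it assumes the month columns are named `x` plus YYYYMM with no month missing between the first and last column. It rebuilds the sample data from the question so that it runs on its own.

```r
# Sample data from the question
df <- data.frame(lat = c(-34, -34.5, -35, -35.5, -36, -36.5, -37),
                 lon = c(144, 144.5, 145, 145.5, 146, 146.5, 147),
                 dt = c('x200106', 'x200107', 'x200108', 'x200109', 'x200110', 'x200111', 'x200112'),
                 x200101 = c(1, 2, 3, 4, 5, 6, 7),
                 x200102 = c(10, 20, 30, 40, 50, 60, 70),
                 x200103 = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
                 x200104 = c(11, 12, 13, 14, 15, 16, 17),
                 x200105 = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5),
                 x200106 = c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7),
                 x200107 = c(21, 22, 23, 24, 25, 26, 27),
                 x200108 = c(10, 20, 30, 40, 50, 60, 70),
                 x200109 = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
                 x200110 = c(11, 12, 13, 14, 15, 16, 17),
                 x200111 = c(11.5, 12.5, 13.5, 14.5, 15.5, 16.5, 17.5),
                 x200112 = c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7))

# Hypothetical helper (not from the answer above): for each row, average the
# month column named in `date_col` together with the (window - 1) month
# columns before it.
# Assumption: month columns are named "x" + YYYYMM and have no gaps, so
# sorting the names puts them in chronological order.
window_mean <- function(df, date_col = "dt", window = 6) {
  month_cols <- sort(grep("^x[0-9]{6}$", names(df), value = TRUE))
  vals <- as.matrix(df[month_cols])
  # Position of each row's collection month among the month columns
  end <- match(as.character(df[[date_col]]), month_cols)
  # Do not reach back past the first available month
  start <- pmax(end - window + 1, 1)
  vapply(seq_len(nrow(df)), function(i) {
    if (is.na(end[i])) return(NA_real_)  # no column for this collection date
    mean(vals[i, start[i]:end[i]])
  }, numeric(1))
}

df$avg6 <- window_mean(df)
df[c("lat", "lon", "dt", "avg6")]
# first row: mean of x200101..x200106 = 6.016667
```

Rows whose window would reach back past the first month column are averaged over whatever columns exist, so check the early rows if your collection dates start close to the beginning of the raster series.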

Licensed under: CC-BY-SA with attribution
