How do I sample n values at random nearest to value y when the data aren't continuous?
08-10-2019
Question
I have a dataset that includes a list of species, their counts, and the day count from when the survey began. Since many days were not sampled, day is not continuous. So for example, there could be birds counted on day 5,6,9,10,15,34,39 and so on. I set the earliest date to be day 0.
Example data:
species counts day
Blue tit 234 0
Blue tit 24 5
Blue tit 45 6
Blue tit 32 9
Blue tit 6 10
Blue tit 98 15
Blue tit 40 34
Blue tit 57 39
Blue tit 81 43
..................
I need to bootstrap this data and get a resulting dataset where I specify when to start, what interval to proceed in and number of points to sample.
Example: Let's say I randomly pick day 5 as the start day, the interval as 30, and number of rows to sample as 2. It means that I will start at 5, add 30 to it, and look for 2 rows around 35 days (but not day 35 itself). In this case I will grab the two rows where day is 34 and 39.
Next I add 30 to 35 and look for two points around 65. Rinse, repeat till I get to the end of the dataset.
I've written this function to do the sampling but it has flaws (see below):
resample <- function(x, ...) x[sample.int(length(x), ...)]
locate_points<- function(dataz,l,n) #l is the interval, n is # points to sample. This is called by another function that specifies start time among other info.
{
tlength=0
i=1
while(tlength<n)
{
low=l-i
high=l+i
if(low<=min(dataz$day)) { low=min(dataz$day) }
if(high>=max(dataz$day)) { high=max(dataz$day) }
test=resample(dataz$day[dataz$day>low & dataz$day<high & dataz$day!=l])
tlength=length(test)
i=i+1
}
test=sort(test)
k=test[1:n]
return (k)
}
Two issues I need help with:
While my function does return the desired number of points, they are not centered around my search value. That makes sense: as the window gets wider I pick up more points, and when I sort those and take the first n, I get the lowest values rather than the ones nearest the search value.
Second, how do I get the actual rows out? For now I have another function that locates these rows using which, then rbind'ing those rows together. It seems like there should be a better way.
Thanks!
Solution
How about something like the following:
day = 1:1000
search = seq(from=5, to=max(day), by=30)
x = sort(setdiff(day, search))
pos = match(x[unlist(lapply(findInterval(search, x), seq, len=2))], day)
day[pos]
To get the rows from your data.frame just subset it:
rows = data[pos, ]
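As a minimal sketch, the same steps can be run on the sample rows from the question. The data frame name birds is an assumption made for illustration, and only the nine rows shown in the question are used. Note that the search sequence includes the start day itself, so neighbours of day 5 are returned as well as neighbours of day 35.

```r
# Sample rows from the question; `birds` is a hypothetical name
birds <- data.frame(
  species = "Blue tit",
  counts  = c(234, 24, 45, 32, 6, 98, 40, 57, 81),
  day     = c(0, 5, 6, 9, 10, 15, 34, 39, 43)
)

# Search values: start at day 5 and step by 30 (here 5 and 35)
search <- seq(from = 5, to = max(birds$day), by = 30)

# Candidate days, with the search values themselves removed
x <- sort(setdiff(birds$day, search))

# For each search value, take the candidate at or below it and the next one up
idx <- unlist(lapply(findInterval(search, x), seq, len = 2))
pos <- match(x[idx], birds$day)

# Subset the data frame directly, no which()/rbind() needed
rows <- birds[pos, ]
rows$day   # 0 6 34 39
```

To skip the neighbours of the start day, the sequence could begin at start + interval instead.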
This is maybe slightly cleaner than the unlist/lapply/seq combo:
pos = match(x[outer(c(0, 1), findInterval(search, x), `+`)], day)
Also note that if you want a larger window (say 4), it's just a matter of going back a bit:
pos = match(x[outer(-1:2, findInterval(search, x), `+`)], day)
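The windows above take a fixed number of neighbours on each side of the search value. If "nearest" is meant as smallest absolute distance in days, a different technique is to rank the candidates by distance instead. The following is only a sketch of that alternative, not part of the approach above; nearest_rows and birds are hypothetical names, and ties in distance are broken by row order.

```r
# Hypothetical helper: the n rows whose day is closest to `target`,
# excluding any row that falls exactly on `target`
nearest_rows <- function(df, target, n) {
  candidates <- which(df$day != target)
  dist <- abs(df$day[candidates] - target)
  keep <- candidates[order(dist)][seq_len(min(n, length(candidates)))]
  df[sort(keep), ]
}

# Sample rows from the question
birds <- data.frame(
  species = "Blue tit",
  counts  = c(234, 24, 45, 32, 6, 98, 40, 57, 81),
  day     = c(0, 5, 6, 9, 10, 15, 34, 39, 43)
)

# Start at day 5, step by 30, and skip the start day itself
targets <- seq(from = 5 + 30, to = max(birds$day), by = 30)
picked <- do.call(rbind, lapply(targets, nearest_rows, df = birds, n = 2))
picked$day   # 34 39
```

With this ranking the selected rows may all lie on one side of the search value when the data are sparse on the other side, which may or may not be what is wanted.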
OTHER TIPS
I loved Charles's solution, which works perfectly for the case n=2. Alas, it does not extend well to larger windows: it still has the problem described by the OP, namely that with larger windows the selection is not centered around the search value. Assuming n is even, I came up with the following solution, heavily based on Charles's idea.
The function also handles the borders. If there are 100 days and the next midpoint is, say, the second-to-last day, a window of 4 would mean selecting index 101, which gives NA. This function shifts the window so that all selected indices lie within the original data. As a side effect, depending on the values of the start (st), interval length (l) and window (n), values near the start and the end have a higher chance of being selected twice. The interval length should always be at least twice the window size.
The output of the function is the vector of indices of the bootstrap sample. It can be used like Charles's pos variable, on vectors and data frames.
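bboot <- function(day, st, l, n) {
  # day: observed days, st: start day, l: interval, n: window size (even)
  mid <- seq(st, max(day), by = l)   # search values
  x <- sort(setdiff(day, mid))       # candidate days, search values excluded
  lx <- length(x)                    # assumes lx >= n
  id <- sapply(mid, function(y) {
    m <- match(TRUE, x > y)          # index of the first candidate above y
    if (is.na(m)) m <- lx + 1        # no candidate above y: anchor past the end
    # shift the window so that all n indices lie within 1..lx
    from <- min(lx - n + 1, max(1, m - n / 2))
    seq(from = from, length.out = n)
  })
  pos <- match(x[id], day)           # positions in the original vector
  return(pos)
}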
Then
> day <- sample(1:100,50)
> sample.rownr <- bboot(day,10,20,6)
> sort(day)
[1] 3 4 5 7 9 10 13 15 16 18 19 21 22 24 25 26 27 28 29
[20] 30 31 32 35 36 38 40 45 49 51 52 54 55 58 59 62 65 69 72 73
[40] 74 80 84 87 88 91 92 94 97 98 99
> day[sample.rownr]
[1] 5 7 9 13 15 16 27 28 29 31 32 35 40 45 49 51 52 54 62
[20] 65 69 72 73 74 84 87 88 91 92 94
>
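The border handling described above can be looked at in isolation. This is a small sketch with made-up values (x, n and m are chosen only for illustration): a window placed naively near the end of the candidates runs past the last index and yields NA, while clamping the start of the window keeps all n indices inside the data.

```r
x <- c(3, 4, 5, 7, 9)   # five candidate days (made-up values)
n <- 4                  # window size
m <- 5                  # suppose the first candidate above the search value is the last one

# Naive window: n/2 indices below m and n/2 from m upwards
naive <- seq(m - n / 2, length.out = n)   # 3 4 5 6
x[naive]                                  # 5 7 9 NA, index 6 does not exist

# Clamped window: start no lower than 1 and no higher than length(x) - n + 1
from <- min(length(x) - n + 1, max(1, m - n / 2))
shifted <- seq(from, length.out = n)      # 2 3 4 5
x[shifted]                                # 4 5 7 9
```

The clamped window is no longer centered on the search value, which is why values near the ends can be selected more than once across neighbouring windows.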
Edit: regarding bootstrapping for time series, you should go through the CRAN task view on time series, especially the section about resampling. For irregular time series, the zoo package also offers other functionality that can come in handy.