Question

I have a table of more than 400,000 rows and about 200 columns. Each row has a column containing a position number that ranges from 0 to 140 and can be a decimal (e.g. 45.6345). I have binned the rows by position in increments of 5. My first bin contains all rows with positions in (0,5]. My last bin contains rows with positions in (135,140]. To bin the data, I used the following code.

#what is the upper edge of the last bin. Round the maximum position up to the next multiple of 5
maxposbin = 5 * ceiling(max(data$POS, na.rm=TRUE) / 5)
#what is the maximum position value
maxposvalue = max(data$POS, na.rm=TRUE)
#Assign the positions to a variable
posvalues = data$POS
#Cut the position values into bins by intervals of 5
posbin = cut(posvalues, breaks=seq(from=0,to=maxposbin, by=5))
#Make a frequency table to see how many rows are in each bin
posbinned = as.data.frame(table(posbin))
#Plot the frequency distribution
barplot(posbinned$Freq)

My posbinned table looks like this:

      posbin   Freq     binprob
1      (0,5]   8533 0.031925105
2     (5,10]   7318 0.037225597
3    (10,15]   9324 0.029216744
4    (15,20]  10576 0.025758029
5    (20,25]   7065 0.038558658
6    (25,30]   3178 0.085719609
7    (30,35]   5900 0.046172359
8    (35,40]   8132 0.033499375
9    (40,45]   8335 0.032683493
10   (45,50]  16409 0.016601677
11   (50,55]  20481 0.013300958
12   (55,60]  25978 0.010486447
13   (60,65] 161292 0.001688967
14   (65,70]  26063 0.010452247
15   (70,75]  11427 0.023839758
16   (75,80]  11232 0.024253643
17   (80,85]   5129 0.053113066
18   (85,90]  11180 0.024366451
19   (90,95]   4188 0.065047019
20  (95,100]   9871 0.027597702
21 (100,105]  13645 0.019964596
22 (105,110]  13294 0.020491719
23 (110,115]   8791 0.030988160
24 (115,120]   3583 0.076030398
25 (120,125]   4874 0.055891858
26 (125,130]   7304 0.037296949
27 (130,135]   2997 0.090896536
28 (135,140]   7376 0.036932879

I would like to select a defined number of rows across this data set based on probabilities that are assigned to each bin. My resulting sample should have an even distribution of samples across the positions (0 to 140). For instance, bin 13 has the highest number of rows in that bin and therefore it would be assigned the lowest probability that a row would be selected from that bin. Bin 27 has the lowest number of rows and should have the highest selection probability. Each bin should be represented approximately equally to every other bin in the resulting sample. I have assigned a probability to each bin and it is contained in the variable posbinned$binprob.

I calculated the bin probabilities relative to bin 27 which contains the fewest rows. For example bin 7 has about twice as many rows as bin 27 and therefore should be half as likely to get rows selected as bin 27. I then adjusted so the sum of the 28 bin probabilities equaled 1. I'm a little rough on my probability stats so maybe that wasn't the correct way to think about that?

How do I take a sample from 'data' without replacement using set probabilities which are defined by bin in the 'posbinned' table? Currently I don't have a table containing the positions and their corresponding bin (ex. (0,5]). I'm just not sure what the best way is to approach this.

Thank you.


Solution

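The first step is to identify the bin of each row in data. Since your bins are increments of 5 starting at (but not including) 0, this can be done with simple arithmetic. Note that a row whose POS is exactly 0 or NA falls outside every bin, so drop such rows first; otherwise the lookup and the sampling step below will fail.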

bin.number <- ceiling(data$POS / 5)
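Since the question mentions not having the positions paired with their bin labels, cut() can supply both the label and the same bin index. This is only a sketch on made-up positions; pos and breaks are illustrative names, and the upper limit of 140 is taken from the range given in the question.

```r
# Hypothetical positions standing in for data$POS
pos <- c(0.3, 5, 5.0001, 45.6345, 135.2, 140)

# Same breaks as in the question: (0,5], (5,10], ..., (135,140]
breaks <- seq(from = 0, to = 140, by = 5)

# Factor of bin labels, e.g. "(45,50]"
bin.label <- cut(pos, breaks = breaks)

# Integer bin index from plain arithmetic
bin.number <- ceiling(pos / 5)

# The factor codes and the arithmetic index agree for positions in (0, 140]
stopifnot(all(as.integer(bin.label) == bin.number))
```

If you want the label stored alongside each row, something like data$posbin <- cut(data$POS, breaks = breaks) should do it.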

Next, you'll want to access the bin frequency for each row:

bin.freq <- posbinned$Freq[bin.number]
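The per-row frequency matters because its reciprocal makes a natural sampling weight: if every row in a bin of n rows gets weight 1/n, the bin's total weight is n * (1/n) = 1 whatever its size, so all bins are equally likely to be drawn from. A small check on toy data (the pos vector is made up, and freq stands in for posbinned$Freq):

```r
# Toy positions: 4 rows in bin 1, 1 row in bin 2, 2 rows in bin 3
pos <- c(1, 2, 3, 4, 7, 12, 13)
bin.number <- ceiling(pos / 5)

# Rows per bin, playing the role of posbinned$Freq
freq <- as.vector(table(bin.number))

# Frequency of the bin each row belongs to, and its reciprocal as a weight
bin.freq <- freq[bin.number]
w <- 1 / bin.freq

# Total weight per bin: 1 for every bin, however many rows it holds
bin.weight <- tapply(w, bin.number, sum)
print(bin.weight)
```

This also suggests the normalised binprob column is not strictly needed: sample() only uses the relative sizes of the weights.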

Then, you'll want to sample without replacement, with probabilities proportional to one divided by the bin frequency:

num.to.sample <- 100    # Select the number of samples you want
rows <- sample(seq_len(nrow(data)), size=num.to.sample, replace=FALSE, prob=1/bin.freq)    # row indices; use data[rows, ] to get the sampled rows
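
To see the effect end to end, here is a sketch on simulated data rather than the real table. The data frame, the seed, and the sample size of 280 are all assumptions made for illustration; the positions are deliberately piled up in bin 13, loosely mimicking the question's distribution.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# Simulated stand-in for the real table: most positions fall in (60,65]
data <- data.frame(POS = c(runif(2000, 0, 140), runif(8000, 60, 65)))
data <- data[data$POS > 0, , drop = FALSE]  # a position of exactly 0 is outside (0,5]

# Bin index per row and number of rows per bin (bins 1..28)
bin.number <- ceiling(data$POS / 5)
freq <- tabulate(bin.number, nbins = 28)
bin.freq <- freq[bin.number]

# Weighted sample without replacement, weights proportional to 1 / bin frequency
num.to.sample <- 280
rows <- sample(seq_len(nrow(data)), size = num.to.sample,
               replace = FALSE, prob = 1 / bin.freq)

# Rows drawn from each bin: roughly even, instead of dominated by bin 13
sampled.counts <- tabulate(bin.number[rows], nbins = 28)
print(sampled.counts)
```

Because sample() draws without replacement one row at a time, the balance is approximate rather than exact, and a small bin can never contribute more rows than it contains. If you ask for a sample that is large relative to the smallest bins, expect the bigger bins to make up the difference.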
Licensed under: CC-BY-SA with attribution